Nutch ERROR parse.OutlinkExtractor - getOutlinks


Armel T. Nene-2
Hi guys,

 

I have been running Nutch successfully with most of the plug-ins enabled.
Lately, I have been trying to index some XML files that contain strings of
the form ftawi:xyz.

 

Nutch version 0.8.2-dev on MS Windows Server 2003

 

During outlink extraction I get the following error:

 

2007-04-17 21:52:51,598 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: ftawi
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
        at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
        at org.apache.nutch.parse.stellent.StellentParser.getParse(StellentParser.java:53)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
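
The exception itself can be reproduced outside Nutch. The following is a minimal sketch, assuming only the standard JDK; the class and method names (UnknownProtocolRepro, failureMessage) are hypothetical and are not part of Nutch. It shows that java.net.URL rejects any string whose scheme has no registered protocol handler, which is what happens to ftawi:xyz.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-alone reproduction, not Nutch code.
public class UnknownProtocolRepro {

    // Returns the exception message if the string cannot be turned into
    // a java.net.URL, or null if construction succeeds.
    public static String failureMessage(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // "ftawi" has no protocol handler in the JDK, so this fails.
        System.out.println(failureMessage("ftawi:xyz"));
        // "http" is a built-in protocol, so this succeeds and prints null.
        System.out.println(failureMessage("http://www.idna-solutions.com"));
    }
}
```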

 

I get the same error with every parser plug-in when it runs over the same
XML files. Is there a way to use a regular expression to tell Nutch which
kinds of URLs should be accepted as outlinks? Also, Nutch should not crash
when the URL in an outlink is not valid. Is there any other HTML parser in
Nutch that I can try?
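
To illustrate the kind of behaviour I am asking about, here is a minimal sketch in plain Java. It is not Nutch code and does not use any Nutch API: the class name OutlinkFilterSketch, its methods, and the list of allowed schemes are all assumptions made up for this example. Candidate strings are first checked against a regular expression, and anything that still fails URL construction is skipped instead of raising an error.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class OutlinkFilterSketch {

    // Assumption: only these schemes are treated as real links.
    private static final Pattern ALLOWED_SCHEME =
        Pattern.compile("^(?:https?|ftp|file):", Pattern.CASE_INSENSITIVE);

    // True if the candidate starts with an allowed scheme and can be
    // parsed by java.net.URL; false otherwise (never throws).
    public static boolean isAcceptable(String candidate) {
        if (candidate == null || !ALLOWED_SCHEME.matcher(candidate).lookingAt()) {
            return false;
        }
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Keeps only the acceptable candidates, preserving their order.
    public static List<String> keepAcceptable(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (isAcceptable(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
            "http://www.idna-solutions.com",
            "ftawi:xyz",
            "ftp://example.org/a.xml");
        // Prints: [http://www.idna-solutions.com, ftp://example.org/a.xml]
        System.out.println(keepAcceptable(found));
    }
}
```

With something like this, a string such as ftawi:xyz would simply be dropped from the outlinks rather than reported as an error.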

 

Awaiting your kind reply.

 

Regards,

 

Armel

 

===========================

Armel T. Nene

iDNA Solutions LTD

Tel: +44 (20) 7257 6124

Mobile: +44 (7886)950 483

Web: http://www.idna-solutions.com

Blog: http://blog.idna-solutions.com