[jira] Created: (NUTCH-274) Empty row in/at end of URL-list results in error

[jira] Created: (NUTCH-274) Empty row in/at end of URL-list results in error

Nick Burch (Jira)
Empty row in/at end of URL-list results in error
------------------------------------------------

         Key: NUTCH-274
         URL: http://issues.apache.org/jira/browse/NUTCH-274
     Project: Nutch
        Type: Bug

    Versions: 0.8-dev    
 Environment: nightly-2006-05-20
    Reporter: Stefan Neufeind
    Priority: Minor


This is minor - but it's a little unclean :-)

Reproduce: Have a URL-file with one URL followed by a newline, thus producing an empty line.

Outcome: The fetcher threads try to fetch two URLs. The first one is fine, but the second is empty and therefore fails protocol detection.


060521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching
060521 022639 fetch of  failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 10000
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; [hidden email])
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but
 its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but
not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640  map 0%  reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
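
The `no protocol:` failure in the log can be reproduced outside of Nutch. The following is a minimal sketch, assuming the empty line reaches `java.net.URL` unchanged; the class `EmptyUrlDemo` and the method `parseError` are made up for illustration and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class EmptyUrlDemo {

    /** Returns the parser's error message for a malformed URL, or null if the URL parses. */
    static String parseError(String spec) {
        try {
            new URL(spec); // java.net.URL rejects strings that carry no protocol
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The real seed URL from the report parses without an error.
        System.out.println(parseError("http://www.bild.de/"));
        // The empty line that follows it has no protocol and is rejected.
        System.out.println(parseError(""));
    }
}
```

This only shows where the exception message comes from; how Nutch wraps it in `ProtocolNotFound` is not covered here.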

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ]

Stefan Groschupf commented on NUTCH-274:
----------------------------------------

Should we fix this in Hadoop's TextInputFormat, so that it ignores empty lines, or in the Injector?


[jira] Updated: (NUTCH-274) Empty row in/at end of URL-list results in error

Nick Burch (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-274?page=all ]

Stefan Groschupf updated NUTCH-274:
-----------------------------------

    Attachment: ignoreEmpthyLineDuringInjectV1.patch

Ignore empty lines during injecting.
Thanks for spotting this, Stefan!
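
The patch itself is not reproduced in this thread, so the following is only a sketch of the idea, assuming the URL list is processed one line at a time. `SeedListFilter` and `nonEmptyUrls` are hypothetical names, not Nutch or Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeedListFilter {

    /** Returns the trimmed lines of a URL list, skipping empty and whitespace-only lines. */
    static List<String> nonEmptyUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // an empty line would otherwise be handed on as a URL
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        // One URL followed by an empty line, as in the report.
        List<String> lines = Arrays.asList("http://www.bild.de/", "");
        System.out.println(nonEmptyUrls(lines));
    }
}
```

Filtering at inject time keeps the empty entry out of the fetch list, so the fetcher never sees it.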
