[jira] Created: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol


Isabelle Giguere (Jira)
Generator throws java.io.IOException and dies on injected urls with no protocol
--------------------------------------------------------------------------------

                 Key: NUTCH-554
                 URL: https://issues.apache.org/jira/browse/NUTCH-554
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.0.0
         Environment: Linux(debian) Java 1.6
            Reporter: Brian Whitman


On Nutch trunk, injecting URLs with no protocol (like issues.apache.org/jira/ vs. https://issues.apache.org/jira/) causes the generator to fail with an IOException:

java.net.MalformedURLException: no protocol: www.variogr.am
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:187)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2007-09-15 11:11:26,986 FATAL crawl.Generator - Generator: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
        at org.apache.nutch.crawl.Generator.run(Generator.java:557)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.crawl.Generator.main(Generator.java:520)

To test:

# cat test/urls.txt
www.variogr.am
http://www.variogr.am/

# bin/nutch inject testcrawl/crawldb test/
(this goes fine)

# bin/nutch generate testcrawl/crawldb testcrawl/segments -topN 10
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: testcrawl/segments/20070915111125
Generator: filtering: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: java.io.IOException: Job failed!
 

This issue did not exist in earlier versions of Nutch, which ignored the malformed URL without crashing.
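
The first frames of the stack trace point at the java.net.URL constructor, which rejects any string that lacks a scheme. A minimal standalone sketch of that behavior, using the two test URLs from the report, might look like the following. The class and method names (NoProtocolDemo, parseError) are hypothetical and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not Nutch code: shows how java.net.URL
// treats a string with and without a protocol.
class NoProtocolDemo {

    /** Returns null if the string parses as a URL, otherwise the exception message. */
    static String parseError(String urlString) {
        try {
            new URL(urlString);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The two entries from test/urls.txt in the report.
        String[] candidates = {"www.variogr.am", "http://www.variogr.am/"};
        for (String candidate : candidates) {
            String error = parseError(candidate);
            System.out.println(candidate + " -> " + (error == null ? "ok" : error));
        }
    }
}
```

Only the entry without "http://" should be rejected, which matches the "no protocol" message in the trace above.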






--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

Isabelle Giguere (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Whitman updated NUTCH-554:
--------------------------------

    Attachment: genpatch.diff

Attaching a patch that seems to fix the problem for me. It just catches the MalformedURLException in the generator. I don't know why this exception would cause a fatal error in Nutch, but it did.
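
The contents of genpatch.diff are not shown in this thread, so the following is only a rough sketch of the general catch-and-skip pattern the comment describes, not the actual change to Generator.java. The class and method names (SkipMalformedUrls, parseValid) are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the Nutch patch: a malformed entry is
// logged and skipped instead of aborting the whole run.
class SkipMalformedUrls {

    /** Parses each candidate and keeps only those java.net.URL accepts. */
    static List<URL> parseValid(List<String> candidates) {
        List<URL> valid = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                valid.add(new URL(candidate));
            } catch (MalformedURLException e) {
                // Report the bad entry and continue with the next one.
                System.err.println("Skipping malformed URL: " + candidate
                        + " (" + e.getMessage() + ")");
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<URL> urls = parseValid(
                Arrays.asList("www.variogr.am", "http://www.variogr.am/"));
        System.out.println("Kept " + urls.size() + " URL(s): " + urls);
    }
}
```

Catching the exception per entry keeps one bad line in the seed list from failing the job, which is the behavior the report says earlier versions had.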



[jira] Resolved: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

Isabelle Giguere (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-554.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Andrzej Bialecki

Patch applied to trunk/ in rev. 577018. Thank you!


[jira] Closed: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

Isabelle Giguere (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-554.
-----------------------------------



[jira] Commented: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

Isabelle Giguere (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528658 ]

Hudson commented on NUTCH-554:
------------------------------

Integrated in Nutch-Nightly #211 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/211/])
