[jira] Created: (NUTCH-359) extraction of links will fail for whole page if one single link cannot be parsed


JIRA jira@apache.org
extraction of links will fail for whole page if one single link cannot be parsed
--------------------------------------------------------------------------------

                 Key: NUTCH-359
                 URL: http://issues.apache.org/jira/browse/NUTCH-359
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8
         Environment: Ubuntu Dapper
            Reporter: Renaud Richardet
            Priority: Minor
         Attachments: outlink.diff

When Nutch parses the outlinks of a fetched page, link extraction fails for the whole page if a single link cannot be parsed (e.g. java.net.MalformedURLException: unknown protocol). The attached patch continues extracting the remaining links on that page even if one fails.
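
To illustrate the behaviour described above, here is a minimal sketch of per-link error handling. It is not the content of outlink.diff and does not use Nutch's own classes; the class name OutlinkSketch and the method parseLinks are hypothetical, and plain java.net.URL stands in for whatever Nutch uses internally. The idea is simply to catch the exception inside the loop so that one bad link is skipped rather than aborting the whole page.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the code from outlink.diff.
class OutlinkSketch {

    // Parses each candidate link on its own, so an unparsable link is
    // skipped instead of aborting extraction for the whole page.
    static List<URL> parseLinks(List<String> candidates) {
        List<URL> parsed = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                parsed.add(new URL(candidate));
            } catch (MalformedURLException e) {
                // For example "unknown protocol: foo"; log it and move on.
                System.err.println("Skipping unparsable link " + candidate
                        + ": " + e.getMessage());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://example.com/", "foo://bad-link", "http://example.org/page");
        System.out.println(parseLinks(links).size() + " of " + links.size()
                + " links parsed");
    }
}
```

With the try/catch placed outside the loop instead, the first MalformedURLException would discard every link on the page, which is the failure this issue reports.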

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Commented: (NUTCH-359) extraction of links will fail for whole page if one single link cannot be parsed

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-359?page=comments#action_12433315 ]
           
Otis Gospodnetic commented on NUTCH-359:
----------------------------------------

Looks fine and simple (and has a small typo in the last comment).  Sami is doing 0.8.1 soon, so I won't mess with this now.

