[jira] Created: (NUTCH-504) NUTCH-443 broke parsing during fetching

[jira] Created: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)
NUTCH-443 broke parsing during fetching
---------------------------------------

                 Key: NUTCH-504
                 URL: https://issues.apache.org/jira/browse/NUTCH-504
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
             Fix For: 1.0.0


After NUTCH-443, if parsing is done during fetching and parsing fails for a URL, that URL doesn't get the segment name or similar properties in its metadata. As a result, the indexer fails, because it expects to see a segment name for every parse, even those that failed.
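
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Nutch's real classes: `ParseResult`, `SEGMENT_KEY`, the two `fetchAndParse*` methods, and the segment name are hypothetical stand-ins, and the buggy variant is only one plausible shape of the problem described above, not the actual code path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT Nutch's real classes.
class ParseMetadataSketch {
    // Hypothetical metadata key under which the segment name is stored.
    static final String SEGMENT_KEY = "segment.name";

    /** Minimal stand-in for a parse result: success flag, text, metadata. */
    static final class ParseResult {
        final boolean success;
        final String text;
        final Map<String, String> metadata = new HashMap<>();

        ParseResult(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    /** Hypothetical parser: fails on null or empty content and returns an empty parse. */
    static ParseResult parse(String content) {
        if (content == null || content.isEmpty()) {
            return new ParseResult(false, "");
        }
        return new ParseResult(true, content.trim());
    }

    /** Assumed buggy shape: metadata is attached only when the parse succeeds. */
    static ParseResult fetchAndParseBuggy(String content, String segment) {
        ParseResult result = parse(content);
        if (result.success) {
            result.metadata.put(SEGMENT_KEY, segment);
        }
        return result;
    }

    /** Assumed fixed shape: metadata is attached to every parse, failed or not. */
    static ParseResult fetchAndParseFixed(String content, String segment) {
        ParseResult result = parse(content);
        result.metadata.put(SEGMENT_KEY, segment);
        return result;
    }

    public static void main(String[] args) {
        String segment = "demo-segment"; // hypothetical segment name
        ParseResult buggy = fetchAndParseBuggy("", segment);
        ParseResult fixed = fetchAndParseFixed("", segment);
        System.out.println(buggy.metadata.containsKey(SEGMENT_KEY)); // prints false
        System.out.println(fixed.metadata.get(SEGMENT_KEY));         // prints demo-segment
    }
}
```

In the buggy variant, a consumer that reads the segment name unconditionally would find nothing for the failed parse, which matches the indexer failure reported here.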

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-504:
--------------------------------

    Attachment: parse_in_fetchers.patch

Patch for the problem. I think it would be nice to add a test case for this, but I am not sure how we can force a parse to fail so that we can test it properly (comments are welcome :).





[jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)
In reply to this post by Prajeeth Emanuel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507162 ]

Doğacan Güney commented on NUTCH-504:
-------------------------------------

Also, should we actually index documents even if their parses have failed? Since we replace a URL's parse with an empty parse when parsing fails anyway, it may be a good idea to skip such documents.



[jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)
In reply to this post by Prajeeth Emanuel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507168 ]

Andrzej Bialecki  commented on NUTCH-504:
-----------------------------------------

+1 - we should skip documents that failed to parse properly; in such cases we have no usable text anyway.



[jira] Updated: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)
In reply to this post by Prajeeth Emanuel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-504:
--------------------------------

    Attachment: NUTCH-504_v2.patch

New version.

* Includes the older patch.
* The indexer now filters out unsuccessful parses.
* Updated the TestFetcher unit test; TestFetcher now fails without this patch.
* Also added an http.robots.agents property to src/test/crawl-tests.xml. Without this, TestFetcher logs a FATAL RobotRuleParser error (which doesn't cause TestFetcher to fail but is still annoying).
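
The indexer-side filtering mentioned in the list above could look roughly like the following sketch. It is self-contained and does not reflect the contents of the patch: `ParsedDoc`, `selectIndexable`, and the example URLs are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT Nutch's real classes.
class IndexerFilterSketch {
    /** Minimal stand-in for a fetched document together with its parse status. */
    static final class ParsedDoc {
        final String url;
        final boolean parseSucceeded;
        final String text;

        ParsedDoc(String url, boolean parseSucceeded, String text) {
            this.url = url;
            this.parseSucceeded = parseSucceeded;
            this.text = text;
        }
    }

    /** Keeps only documents whose parse succeeded; failed parses carry no usable text. */
    static List<ParsedDoc> selectIndexable(List<ParsedDoc> docs) {
        List<ParsedDoc> indexable = new ArrayList<>();
        for (ParsedDoc doc : docs) {
            if (doc.parseSucceeded) {
                indexable.add(doc);
            }
        }
        return indexable;
    }

    public static void main(String[] args) {
        List<ParsedDoc> docs = new ArrayList<>();
        docs.add(new ParsedDoc("http://example.com/ok", true, "some text"));
        docs.add(new ParsedDoc("http://example.com/broken", false, ""));
        for (ParsedDoc doc : selectIndexable(docs)) {
            System.out.println(doc.url); // prints http://example.com/ok
        }
    }
}
```

Filtering before indexing means a failed parse never reaches code that expects complete metadata, which complements attaching the segment name at fetch time.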



[jira] Resolved: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)
In reply to this post by Prajeeth Emanuel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-504.
---------------------------------

    Resolution: Fixed
      Assignee: Doğacan Güney

Fixed in rev. 550196.



[jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

Prajeeth Emanuel (Jira)
In reply to this post by Prajeeth Emanuel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507783 ]

Hudson commented on NUTCH-504:
------------------------------

Integrated in Nutch-Nightly #128 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/128/])

