[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host


Michael Gibney (Jira)
if a 404 for a robots.txt is returned no page is fetched at all from the host
-----------------------------------------------------------------------------

         Key: NUTCH-298
         URL: http://issues.apache.org/jira/browse/NUTCH-298
     Project: Nutch
        Type: Bug

    Reporter: Stefan Groschupf
     Fix For: 0.8-dev


What happens:

If no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt.
If the HTTP response code is not 200 or 403 but, for example, 404, we do " robotRules = EMPTY_RULES; " (line: 402).
EMPTY_RULES is a RobotRuleSet created with the default constructor.
tmpEntries and entries are null and are never changed.
If we now try to fetch a page from a host that uses EMPTY_RULES, we call isAllowed in the RobotRuleSet.
In this case an NPE is thrown in this line:
 if (entries == null) {
        entries= new RobotsEntry[tmpEntries.size()];

Possible solution:
We can initialize tmpEntries by default and also remove the other null checks and initialisations.
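
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the Nutch source: the classes BuggyRuleSet, FixedRuleSet, and Entry, the addPrefix method, and the prefix-matching logic are simplified stand-ins invented for illustration. Only the field names tmpEntries and entries and the method name isAllowed come from the report. The buggy variant leaves tmpEntries null until a rule is added, so an empty rule set fails on the first isAllowed call; the fixed variant initializes the list eagerly, as proposed.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotRuleSetSketch {

    /** Hypothetical stand-in for a single robots.txt rule. */
    static final class Entry {
        final String prefix;
        final boolean allowed;

        Entry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    /** Mirrors the reported bug: tmpEntries stays null unless a rule is added. */
    static final class BuggyRuleSet {
        private List<Entry> tmpEntries; // null by default
        private Entry[] entries;        // null by default

        void addPrefix(String prefix, boolean allowed) {
            if (tmpEntries == null) {
                tmpEntries = new ArrayList<>();
            }
            tmpEntries.add(new Entry(prefix, allowed));
            entries = null; // force the array to be rebuilt
        }

        boolean isAllowed(String path) {
            if (entries == null) {
                // NullPointerException here when no rule was ever added
                entries = new Entry[tmpEntries.size()];
                entries = tmpEntries.toArray(entries);
            }
            for (Entry e : entries) {
                if (path.startsWith(e.prefix)) {
                    return e.allowed;
                }
            }
            return true; // no matching rule: allowed
        }
    }

    /** Proposed fix: initialize tmpEntries eagerly so no null checks are needed. */
    static final class FixedRuleSet {
        private final List<Entry> tmpEntries = new ArrayList<>();
        private Entry[] entries;

        void addPrefix(String prefix, boolean allowed) {
            tmpEntries.add(new Entry(prefix, allowed));
            entries = null; // force the array to be rebuilt
        }

        boolean isAllowed(String path) {
            if (entries == null) {
                entries = tmpEntries.toArray(new Entry[0]); // empty list gives an empty array
            }
            for (Entry e : entries) {
                if (path.startsWith(e.prefix)) {
                    return e.allowed;
                }
            }
            return true; // no matching rule: allowed
        }
    }

    public static void main(String[] args) {
        try {
            new BuggyRuleSet().isAllowed("/index.html");
        } catch (NullPointerException e) {
            System.out.println("buggy: NPE on empty rule set");
        }
        System.out.println("fixed: " + new FixedRuleSet().isAllowed("/index.html"));
    }
}
```

With the eager initialization, an empty rule set simply has no rules to match, so every path is allowed, which is the behaviour one would expect when robots.txt is missing.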


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

Michael Gibney (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
-----------------------------------

    Attachment: fixNpeRobotRuleSet.patch

Fixes the NPE in RobotRuleSet that happens when we use an empty RuleSet.



[jira] Commented: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-298?page=comments#action_12414647 ]

Stefan Neufeind commented on NUTCH-298:
---------------------------------------

Is the description line of this bug correct? I've been indexing pages from hosts without a robots.txt, and I just checked that those hosts return a 404 because robots.txt does not exist.



[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
-----------------------------------

    Summary: if a 404 for a robots.txt is returned a NPE is thrown  (was: if a 404 for a robots.txt is returned no page is fetched at all from the host)

Sorry, wrong description.



[jira] Resolved: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
     
Jerome Charron resolved NUTCH-298:
----------------------------------

    Resolution: Fixed

Committed + some unit tests to reproduce.
Thanks Stefan.
As you mentioned in a previous mail, I agree that the RobotRulesParser should be rewritten.

