Commented: (NUTCH-247) robot parser to restrict.

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474085 ]

Andrzej Bialecki  commented on NUTCH-247:
-----------------------------------------

Setting even a bogus agent name is an insignificant effort compared to the further complication of the code and configuration options ... I prefer the solution where Fetcher checks the agent name just before starting the job, regardless of the protocols and locations used on the fetchlist. Besides, you may generate arbitrary fetchlists, some of which may accidentally contain external URLs. Should you then have to change the config for each fetchlist?
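
As a rough illustration of the up-front check described above, here is a minimal sketch. It assumes a plain Map stands in for the real configuration object; the class AgentNameCheck, the method requireAgentName and the use of the key "http.agent.name" are hypothetical names chosen for this example, not the actual Fetcher code.

```java
import java.util.Map;

// Hypothetical helper: validates the agent name once, before any fetch job starts.
class AgentNameCheck {
    // Property key assumed for illustration; the thread itself only names 'http.robots.agents'.
    static final String AGENT_NAME_KEY = "http.agent.name";

    /** Returns the trimmed agent name, or throws if none is configured. */
    static String requireAgentName(Map<String, String> conf) {
        String name = conf.get(AGENT_NAME_KEY);
        if (name == null || name.trim().isEmpty()) {
            // Fail fast, independent of the protocols or URLs on the fetchlist.
            throw new IllegalArgumentException(
                "No agent name configured in '" + AGENT_NAME_KEY + "'");
        }
        return name.trim();
    }
}
```

Calling such a check once at job start would keep the decision independent of what any particular fetchlist happens to contain.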

> robot parser to restrict.
> -------------------------
>
>                 Key: NUTCH-247
>                 URL: https://issues.apache.org/jira/browse/NUTCH-247
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>         Assigned To: Dennis Kubes
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: agent-names.patch
>
>
> If the agent name and the robots agents are not properly configured, the robot rules parser uses LOG.severe to log the problem but also fixes it.
> Later on, the fetcher thread checks for severe errors and stops if there is one.
> RobotRulesParser:
>     if (agents.size() == 0) {
>       agents.add(agentName);
>       LOG.severe("No agents listed in 'http.robots.agents' property!");
>     } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
>       agents.add(0, agentName);
>       LOG.severe("Agent we advertise (" + agentName
>                  + ") not listed first in 'http.robots.agents' property!");
>     }
> Fetcher.FetcherThread:
>     if (LogFormatter.hasLoggedSevere())     // something bad happened
>       break;
> I suggest using warn or something similar instead of severe to log this problem.
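
The suggestion above could look roughly like the following sketch. It assumes java.util.logging (where the warning level is logged via Logger.warning) and wraps the quoted logic in a hypothetical class RobotAgentsSketch with a hypothetical method normalizeAgents; it is not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for the agent handling quoted from RobotRulesParser.
class RobotAgentsSketch {
    private static final Logger LOG =
        Logger.getLogger(RobotAgentsSketch.class.getName());

    /**
     * Returns a copy of agents with agentName guaranteed to be first.
     * Misconfiguration is repaired and logged as a warning, not as severe,
     * so a later "has logged severe" check would not stop the fetcher.
     */
    static List<String> normalizeAgents(List<String> agents, String agentName) {
        List<String> result = new ArrayList<>(agents);
        if (result.isEmpty()) {
            result.add(agentName);
            LOG.warning("No agents listed in 'http.robots.agents' property!");
        } else if (!result.get(0).equalsIgnoreCase(agentName)) {
            result.add(0, agentName);
            LOG.warning("Agent we advertise (" + agentName
                + ") not listed first in 'http.robots.agents' property!");
        }
        return result;
    }
}
```

The repair behaviour is the same as in the quoted snippet; only the log level differs.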

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.