Nutch ignoring robots.txt

David Smith-53

Good afternoon all,

My installation of Nutch appears to be ignoring the robots.txt for a site I'm crawling (http://www.gardenanimals.co.nz/). The site has a robots.txt that contains:

User-agent: *
Disallow: /bot-trap/
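
For what it's worth, the rule itself can be checked outside Nutch. Below is a minimal sketch using Python's standard urllib.robotparser, not Nutch's own robots handling, so it only shows how a generic parser reads these two lines. The helper name is_allowed is made up for illustration; the agent name "searchnz" and the URLs are the ones from this post.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted above, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""


def is_allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; this is the Python standard
    library's interpretation of the rules, not Nutch's.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.gardenanimals.co.nz/",
        "http://www.gardenanimals.co.nz/bot-trap/index.php",
    ):
        verdict = "allowed" if is_allowed("searchnz", url) else "disallowed"
        print(f"{url}: {verdict}")
```

Under this parser the bot-trap URL is disallowed for any agent, including searchnz, so a crawler honouring the file should not request it.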

The hadoop.log contains:
INFO  http.Http - protocol.plugin.check.blocking = true
INFO  http.Http - protocol.plugin.check.robots = true

so I assume I've configured Nutch to honour the robots.txt file. But as this entry from the crawldb shows

http://www.gardenanimals.co.nz/bot-trap/index.php       Version: 5
Status: 3 (db_gone)
Fetch time: Mon Sep 01 13:01:45 GMT+12:00 2008
Modified time: Thu Jan 01 12:00:00 GMT+12:00 1970
Retries since fetch: 0
Retry interval: 7.0 days
Score: 0.003542109
Signature: null
Metadata: _pst_:robots_denied(18), lastModified=0: http://www.gardenanimals.co.nz/bot-trap/index.php

Nutch has still gone and fetched a disallowed URL, thus triggering the bot-trap. I have no idea what I've misconfigured or failed to configure; any pointers would be greatly appreciated. Below is my actual nutch-site.xml file in case it helps.

Thanks
David


<configuration>
<property>
  <name>http.agent.name</name><value>searchnz</value>
</property>

<property>
  <name>http.robots.agents</name><value>searchnz,*</value>
</property>

<property>
  <name>http.agent.description</name><value>searchnz</value>
</property>

<property>
  <name>http.agent.url</name><value>http://www.searchnz.co.nz/</value>
</property>

<property>
  <name>http.agent.email</name><value>[hidden email]</value>
</property>

<property>
  <name>http.verbose</name><value>true</value>
</property>

<property>
  <name>http.robots.403.allow</name><value>false</value>
</property>

<property>
  <name>fetcher.threads.fetch</name><value>50</value>
</property>

<property>
  <name>db.default.fetch.interval</name><value>7</value>
</property>

<property>
  <name>plugin.includes</name><value>protocol-http|parse-(text|html)|urlfilter-prefix|urlfilter-suffix|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>urlfilter.prefix.file</name><value>urlfilter-prefix.txt</value>
</property>

<property>
  <name>urlfilter.suffix.file</name><value>urlfilter-suffix.txt</value>
</property>

</configuration>
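
In case it helps anyone reproduce this, here is a rough Python sketch for listing the agent- and robots-related properties from a nutch-site.xml, to confirm what the file actually sets. It is not part of Nutch. The function names load_properties and robots_settings are made up for illustration, and the inlined XML is only a shortened excerpt of the file above.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the nutch-site.xml above, inlined for a self-contained example.
NUTCH_SITE_XML = """<configuration>
<property>
  <name>http.agent.name</name><value>searchnz</value>
</property>
<property>
  <name>http.robots.agents</name><value>searchnz,*</value>
</property>
<property>
  <name>http.robots.403.allow</name><value>false</value>
</property>
<property>
  <name>fetcher.threads.fetch</name><value>50</value>
</property>
</configuration>"""


def load_properties(xml_text: str) -> dict:
    """Parse Hadoop-style <property><name/><value/></property> entries into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def robots_settings(props: dict) -> dict:
    """Keep only the http.agent.* and http.robots.* properties."""
    prefixes = ("http.agent.", "http.robots.")
    return {k: v for k, v in props.items() if k.startswith(prefixes)}


if __name__ == "__main__":
    for name, value in sorted(robots_settings(load_properties(NUTCH_SITE_XML)).items()):
        print(f"{name} = {value}")
```

To run it against the real file, replace the inlined string with the contents of conf/nutch-site.xml (the path is an assumption about a default install).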