robots.txt Disallow not respected


mabi
Hello,

I am crawling my website with Nutch 2.3.1, and somehow Nutch does not respect the Disallow rule in my website's robots.txt. I have the following very simple robots.txt file:

User-agent: *
Disallow: /wpblog/feed/

Still, the /wpblog/feed/ URL gets parsed and finally indexed.

Do I perhaps need to enable anything special in the nutch-site.xml config file?

Thanks,
Mabi





Re: robots.txt Disallow not respected

Sebastian Nagel
Hi,

I've tried to reproduce it, but it works as expected:

% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/

% cat test.txt
http://www.example.com/wpblog/feed/
http://www.example.com/wpblog/feed/index.html

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
not allowed:    http://www.example.com/wpblog/feed/
not allowed:    http://www.example.com/wpblog/feed/index.html


No extra steps are required to make Nutch respect the robots.txt rules;
the robots.txt only needs to be properly placed and readable.

Best,
Sebastian



Re: robots.txt Disallow not respected

Zoltán Zvara
Hi,

Check that the robots.txt is fetched and parsed correctly. Try switching the protocol plugin to protocol-httpclient, as sketched below.
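
For reference, the protocol plugin is selected via plugin.includes in nutch-site.xml. A minimal sketch, assuming a fairly typical plugin list (keep whatever other plugins your crawl already uses):

<!-- nutch-site.xml: swap protocol-http for protocol-httpclient;
     the rest of this plugin list is illustrative only -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>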

Z

Re: robots.txt Disallow not respected

mabi
Hi Sebastian,

I am already using the protocol-httpclient plugin, as I also require HTTPS. I checked the access.log of the website I am crawling, and it shows that Nutch did a GET on robots.txt, as you can see here:

123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"

What I also did was enable DEBUG logging in log4j.properties like this:

log4j.logger.org.apache.nutch=DEBUG

and then grep for "robots" in the hadoop.log file, but nothing could be found there either: no errors, nothing.
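
For reference, the grep was simply something like this (assuming the default local runtime layout where logs go to logs/hadoop.log):

% grep -i robots $NUTCH_HOME/logs/hadoop.log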

What else could I try or check?

Best,
M.


Re: robots.txt Disallow not respected

Sebastian Nagel
Hi,

Did you already test whether the robots.txt file is correctly parsed
and the rules are applied as expected? See the previous response.

If HTTPS or non-default ports are used: is the robots.txt also served
for other protocol/port combinations? See
   https://issues.apache.org/jira/browse/NUTCH-1752

Also note that content is not removed when the robots.txt is changed;
the robots.txt is only applied to a URL when it is (re)fetched. To be sure,
delete the web table (stored in HBase, etc.) and restart the crawl.
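
A quick way to check both points from the command line (the hostname is a placeholder; for Nutch 2.x on HBase the table is typically 'webpage', or '<crawlId>_webpage' if a crawl id was used):

% # is the same robots.txt served via https?
% curl -sI https://www.example.com/robots.txt
% curl -s https://www.example.com/robots.txt

% hbase shell
hbase> disable 'webpage'
hbase> drop 'webpage'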

Best,
Sebastian



Re: robots.txt Disallow not respected

mabi
Hi,

Yes, I tested the robots.txt manually using Nutch's org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail, and when I do that everything works correctly: the URLs which are disallowed get "not allowed" and the others "allowed". So I don't understand why this works but not my crawl.

I am using HTTPS for this website but have a 301 redirect that sends all HTTP traffic to HTTPS. I also tried deleting the whole HBase table as well as the Solr core, but that did not help either :(
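
For completeness, deleting all documents can also be done without dropping the core, something like this (the core name "nutch" is just a placeholder):

% curl 'http://localhost:8983/solr/nutch/update?commit=true' \
    -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'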

Regards,
M.


Re: robots.txt Disallow not respected

mabi
Sorry, my bad: I was using a Nutch build from a previous project that I had modified and recompiled to ignore the robots.txt file (as there is no flag to enable/disable that).

I confirm that the parsing of robots.txt works.



Re: robots.txt Disallow not respected

Sebastian Nagel
:)



Re: robots.txt Disallow not respected

Chris Mattmann
FWIW, in versions of Nutch post 1.10, there is a robots.whitelist property that you can use
to whitelist sites to ignore robots.txt explicitly.
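
A sketch of what that looks like in nutch-site.xml. I believe the exact property that NUTCH-1927 added is the one below (not literally "robots.whitelist"), but treat the name as an assumption and verify it, along with the placeholder host, against conf/nutch-default.xml for your version:

<!-- assumed name per NUTCH-1927: a comma-separated list of hostnames/IP
     addresses for which robots.txt parsing is skipped; use with care and
     only on sites you control -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>www.example.com</value>
</property>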

Cheers,
Chris



