RE: [VOTE] Release Apache Nutch 1.15 RC#1

RE: [VOTE] Release Apache Nutch 1.15 RC#1

Markus Jelsma-2
However, the test crawl is running fine in the background, with no errors. But just now, watching the fetcher, I noticed the crawl delay is not always respected. The only configuration change I made is setting the http.agent.* directives required to run.

2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/rqlNNVQgix (queue crawl delay=5000ms)
2018-08-01 11:47:41,319 INFO  fetcher.FetcherThread - FetcherThread 51 fetching http://planet.apache.org/ (queue crawl delay=5000ms)
2018-08-01 11:47:41,324 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2018-08-01 11:47:41,325 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://schema.org/Event (queue crawl delay=5000ms)
2018-08-01 11:47:41,515 INFO  fetcher.FetcherThread - FetcherThread 44 fetching http://people.apache.org/~jianhe (queue crawl delay=5000ms)
2018-08-01 11:47:41,532 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2018-08-01 11:47:41,533 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://en.wikipedia.org/wiki/Internet_marketing (queue crawl delay=5000ms)
2018-08-01 11:47:41,600 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.zip.asc (queue crawl delay=5000ms)
2018-08-01 11:47:41,607 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2018-08-01 11:47:41,608 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://twitter.com/i/directory/profiles/5 (queue crawl delay=5000ms)
2018-08-01 11:47:41,673 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Categories (queue crawl delay=5000ms)
2018-08-01 11:47:41,688 INFO  fetcher.FetcherThread - FetcherThread 52 fetching http://photomatt.net/ (queue crawl delay=5000ms)
2018-08-01 11:47:41,696 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://cy.wikipedia.org/wiki/Wicipedia:Cysylltwch_%C3%A2_ni (queue crawl delay=5000ms)
2018-08-01 11:47:41,752 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://mobile.twitter.com/david_kunz/followers (queue crawl delay=5000ms)
2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT (queue crawl delay=5000ms)
2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Q9BJ0FhzzF (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/wWIMOZ3wxg (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/dImmnEeXjb (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/IPPSdW6o52 (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Y85UlnueSC (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/TvZSGiZC9D (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/jG7BvlobXD (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/ZJmzbWVFrh (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/atVcrbCi5q (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://avro.apache.org/releases.html (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://issues.apache.org/jira/browse/HADOOP-15283 (queue crawl delay=5000ms)
2018-08-01 11:47:42,175 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=500, fetchQueues.getQueueCount=67
2018-08-01 11:47:42,225 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://www.aetna.com/ (queue crawl delay=5000ms)
2018-08-01 11:47:42,316 INFO  fetcher.FetcherThread - FetcherThread 49 fetching http://www.miredot.com/ (queue crawl delay=5000ms)
2018-08-01 11:47:42,357 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://xmlgraphics.apache.org/batik/ (queue crawl delay=5000ms)
2018-08-01 11:47:42,402 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://t.co/XgG7zomVs8 (queue crawl delay=5000ms)

I believe this problem should be addressed prior to release, therefore I withdraw my +1. Because this is not a breaking issue, I will not -1 this RC.
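
For anyone who wants to check their own fetcher log for the same pattern, here is a minimal sketch in Python. It assumes the log line format shown in the excerpt above and assumes fetch queues are keyed by host name (Nutch can also queue by IP or domain via fetcher.queue.mode, which this sketch ignores). The function name find_short_gaps is made up for illustration. A flagged pair only means two "fetching" lines for one host were logged closer together than the queue delay; the log line alone does not show whether an HTTP request was actually sent.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Matches FetcherThread lines in the format shown in the log excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\s+"
    r"fetcher\.FetcherThread - FetcherThread \d+ fetching "
    r"(?P<url>\S+) \(queue crawl delay=(?P<delay>\d+)ms\)"
)


def find_short_gaps(lines):
    """Return (host, gap_ms, delay_ms, url) for every fetch line logged sooner
    than the queue crawl delay after the previous line for the same host."""
    last_seen = {}
    short = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip normalizer and status lines
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        host = urlsplit(m["url"]).hostname  # assumption: queues are per host
        delay_ms = int(m["delay"])
        if host in last_seen:
            gap_ms = round((ts - last_seen[host]).total_seconds() * 1000)
            if gap_ms < delay_ms:
                short.append((host, gap_ms, delay_ms, m["url"]))
        last_seen[host] = ts
    return short


# Three lines copied from the excerpt above.
sample = [
    "2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT (queue crawl delay=5000ms)",
    "2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Q9BJ0FhzzF (queue crawl delay=5000ms)",
    "2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://avro.apache.org/releases.html (queue crawl delay=5000ms)",
]
for host, gap_ms, delay_ms, url in find_short_gaps(sample):
    print(f"{host}: {gap_ms} ms after previous line (delay {delay_ms} ms): {url}")
```

Run over a whole log file, e.g. find_short_gaps(open(path)) with path pointing at your own log, this gives a list of candidate hosts to look at more closely.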

Regards,
Markus

 
 
-----Original message-----

> From:Markus Jelsma <[hidden email]>
> Sent: Wednesday 1st August 2018 11:38
> To: [hidden email]; [hidden email]
> Subject: RE: [VOTE] Release Apache Nutch 1.15 RC#1
>
> All tests pass, crawler runs fine so far, +1 for 1.15!
>
> Regards,
> Markus
>


> -----Original message-----
> > From:Sebastian Nagel <[hidden email]>
> > Sent: Thursday 26th July 2018 17:05
> > To: [hidden email]
> > Cc: [hidden email]
> > Subject: [VOTE] Release Apache Nutch 1.15 RC#1
> >
> > Hi Folks,
> >
> > A first candidate for the Nutch 1.15 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/nutch/1.15/
> >
> > The release candidate is a zip and tar.gz archive of the binary and sources in:
> >   https://github.com/apache/nutch/tree/release-1.15
> >
> > The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
> >    555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> >
> > In addition, a staged maven repository is available here:
> >    https://repository.apache.org/content/repositories/orgapachenutch-1015/
> >
> > We addressed 119 Issues:
> >    https://s.apache.org/nczS
> >
> > Please vote on releasing this package as Apache Nutch 1.15.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Nutch PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Nutch 1.15.
> > [ ] -1 Do not release this package because…
> >
> > Cheers,
> > Sebastian
> > (On behalf of the Nutch PMC)
> >
> > P.S. Here is my +1.
> >
>

Re: [VOTE] Release Apache Nutch 1.15 RC#1

Sebastian Nagel-2
Hi Markus,

thanks for running a test crawl.

> i noticed the crawl delay is not always respected

Do you mean for the host t.co?

The host t.co disallows crawling in its robots.txt (https://t.co/robots.txt).
The first access fetches the robots.txt; all later fetches do not block because the host is not
accessed at all. That's by design.

But it could be a useful improvement to log this (or in general the status of a fetch).
It would double the logged lines but would help to understand what the fetcher is doing,
esp. regarding robots denied and redirects.
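
To illustrate the behaviour described above, here is a toy model in Python. It is not Nutch's actual queue code: the class HostQueue, its fields, and the simulated millisecond clock are all assumptions made for this sketch, and the initial robots.txt request is left out. It shows why robots-denied URLs can be logged back to back while allowed URLs are spaced by the crawl delay, and it records a status per URL in the spirit of the logging improvement suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class HostQueue:
    """Toy per-host fetch queue; hypothetical, not Nutch's implementation."""
    crawl_delay_ms: int
    next_fetch_ms: int = 0  # earliest simulated time the host may be contacted
    log: list = field(default_factory=list)

    def process(self, url, now_ms, robots_allowed):
        """Handle one URL and return the simulated time afterwards."""
        # Wait only if the host was actually contacted recently.
        now_ms = max(now_ms, self.next_fetch_ms)
        if not robots_allowed:
            # No request is sent, so the politeness timer is not restarted.
            self.log.append((now_ms, url, "robots_denied"))
            return now_ms
        self.log.append((now_ms, url, "fetched"))
        self.next_fetch_ms = now_ms + self.crawl_delay_ms
        return now_ms


# Host whose robots.txt denies everything: both URLs are handled at t=0.
denied = HostQueue(crawl_delay_ms=5000)
t = 0
for url in ["https://t.co/xEOAFfp7lT", "https://t.co/Q9BJ0FhzzF"]:
    t = denied.process(url, t, robots_allowed=False)

# Host that allows crawling (example.org is a placeholder): 5000 ms apart.
allowed = HostQueue(crawl_delay_ms=5000)
t = 0
for url in ["http://example.org/a", "http://example.org/b"]:
    t = allowed.process(url, t, robots_allowed=True)

for entry in denied.log + allowed.log:
    print(entry)
```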

Best,
Sebastian




RE: [VOTE] Release Apache Nutch 1.15 RC#1

Markus Jelsma-2
Hello Sebastian,

That is unfortunately not the only example:

2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
2018-08-01 11:42:11,289 INFO  fetcher.FetcherThread - FetcherThread 52 fetching https://about.twitter.com/about (queue crawl delay=5000ms)
2018-08-01 11:42:11,313 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=151, fetchQueues.getQueueCount=19
2018-08-01 11:42:11,509 INFO  fetcher.FetcherThread - FetcherThread 50 fetching http://www.apache.org/dyn/closer.cgi/nutch/ (queue crawl delay=4000ms)
2018-08-01 11:42:11,723 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://mobile.twitter.com/MrOrdnas (queue crawl delay=1000ms)
2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)

I also saw it fetching multiple URLs of our own site within the same millisecond, on multiple occasions. Wasn't there some work done regarding crawl delay for 1.15, or is this actually an older problem?

Regarding the logging, I agree. We already log failed fetches, so there is no reason not to log skipped fetches too.

Regards,
Markus
 

Re: [VOTE] Release Apache Nutch 1.15 RC#1

Sebastian Nagel-2
Hi Markus

> 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)

Ok, non-blocking because of:
User-agent: *
Disallow: /wiki/Special:
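
The effect of this rule can be checked with a few lines of Python. This is only a sketch: it uses Python's standard-library urllib.robotparser rather than the robots.txt parser Nutch uses, it feeds in just the two lines quoted above instead of the full Wikipedia robots.txt, and the agent name "my-test-crawler" is a made-up placeholder.

```python
from urllib.robotparser import RobotFileParser

# Only the two robots.txt lines quoted above; the real file is much longer.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /wiki/Special:",
])

AGENT = "my-test-crawler"  # hypothetical agent name, falls under "User-agent: *"

for url in (
    "https://en.wikipedia.org/wiki/Special:RecentChanges",
    "https://en.wikipedia.org/wiki/301_redirect",
):
    verdict = "allowed" if rules.can_fetch(AGENT, url) else "denied"
    print(f"{verdict}: {url}")
```

Under these two lines alone, the Special:RecentChanges URL is denied and the 301_redirect URL is allowed, so the rule accounts for the first log line but not the second.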

> 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
...
> 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)

That could be because of NUTCH-2623 (to be fixed in 1.16).

If you have more examples, let me know. Otherwise, let's re-test once NUTCH-2623
is fixed and the logging is improved. Could you open an issue for the improved logging?

Thanks,
Sebastian



Re: [VOTE] Release Apache Nutch 1.15 RC#1

Sebastian Nagel-2
Checked the logs from my test crawl using the RC package, the 5 sec. delay is respected:

2018-07-26 16:48:47,203 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/credits.html (queue crawl delay=5000ms)
2018-07-26 16:48:52,509 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/downloads.html (queue crawl delay=5000ms)
2018-07-26 16:48:57,591 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/index.html (queue crawl delay=5000ms)
2018-07-26 16:49:02,715 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/mailing_lists.html (queue crawl delay=5000ms)
2018-07-26 16:49:07,791 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/bot.html (queue crawl delay=5000ms)
2018-07-26 16:49:12,862 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/version_control.html (queue crawl delay=5000ms)
2018-07-26 16:49:17,928 INFO  fetcher.FetcherThread - FetcherThread 43 fetching http://nutch.apache.org/javadoc.html (queue crawl delay=5000ms)
2018-07-26 16:49:35,367 INFO  fetcher.FetcherThread - FetcherThread 42 fetching http://nutch.apache.org/apidocs/apidocs-2.1/index.html (queue crawl delay=5000ms)
2018-07-26 16:49:40,668 INFO  fetcher.FetcherThread - FetcherThread 42 fetching http://nutch.apache.org/apidocs/apidocs-1.9/index.html (queue crawl delay=5000ms)
2018-07-26 16:49:45,752 INFO  fetcher.FetcherThread - FetcherThread 42 fetching http://nutch.apache.org/miredot/1.12/index.html (queue crawl delay=5000ms)
2018-07-26 16:49:50,837 INFO  fetcher.FetcherThread - FetcherThread 42 fetching http://nutch.apache.org/apidocs/apidocs-2.3.1/index.html (queue crawl delay=5000ms)
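
For a longer log, the same comparison can be done mechanically by looking at the time between consecutive "fetching" lines for one host. The following is only a rough sketch and not part of Nutch: the helper name find_short_gaps is made up for illustration, it assumes fetch queues are keyed by host, and it measures gaps between log lines rather than between real requests. URLs that are logged but never requested (e.g. robots denied) will therefore still be reported.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Matches the FetcherThread lines shown in this thread; \s+ after "fetching"
# also accepts lines that a mail client wrapped before the URL.
FETCH_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\s+"
    r"fetcher\.FetcherThread - FetcherThread \d+ fetching\s+"
    r"(\S+) \(queue crawl delay=(\d+)ms\)"
)


def find_short_gaps(log_text):
    """Return (host, gap_ms, delay_ms, url) for every "fetching" line that
    follows the previous line for the same host sooner than the logged delay.

    Hypothetical helper: assumes one queue per host and a log in
    chronological order.
    """
    last_seen = {}  # host -> timestamp of the previous "fetching" line
    short_gaps = []
    for match in FETCH_LINE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        url = match.group(2)
        delay_ms = int(match.group(3))
        host = urlparse(url).hostname
        previous = last_seen.get(host)
        if previous is not None:
            # Integer milliseconds between the two log lines.
            gap_ms = (stamp - previous) // timedelta(milliseconds=1)
            if gap_ms < delay_ms:
                short_gaps.append((host, gap_ms, delay_ms, url))
        last_seen[host] = stamp
    return short_gaps


# Two t.co lines copied from the log posted earlier in this thread.
SAMPLE = """\
2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/wWIMOZ3wxg (queue crawl delay=5000ms)
"""

for host, gap_ms, delay_ms, url in find_short_gaps(SAMPLE):
    print(f"{host}: {gap_ms}ms after previous line (delay {delay_ms}ms) {url}")
```

Run over the lines above from my crawl it should report nothing, since every gap for nutch.apache.org is slightly over 5000ms.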

On 08/01/2018 12:55 PM, Sebastian Nagel wrote:

> Hi Markus
>
>> 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
>
> Ok, non-blocking because of:
> User-agent: *
> Disallow: /wiki/Special:
>
>> 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
> ...
>> 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)
>
> That could be because of NUTCH-2623 (to be fixed in 1.16).
>
> If you have more examples, let me know. Otherwise, let's re-test if NUTCH-2623
> is fixed and the logging is improved. Could you open an issue for improved logging?
>
> Thanks,
> Sebastian
>
> On 08/01/2018 12:45 PM, Markus Jelsma wrote:
>> Hello Sebastian,
>>
>> That is unfortunately not the only example:
>>
>> 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
>> 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
>> 2018-08-01 11:42:11,289 INFO  fetcher.FetcherThread - FetcherThread 52 fetching https://about.twitter.com/about (queue crawl delay=5000ms)
>> 2018-08-01 11:42:11,313 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=151, fetchQueues.getQueueCount=19
>> 2018-08-01 11:42:11,509 INFO  fetcher.FetcherThread - FetcherThread 50 fetching http://www.apache.org/dyn/closer.cgi/nutch/ (queue crawl delay=4000ms)
>> 2018-08-01 11:42:11,723 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://mobile.twitter.com/MrOrdnas (queue crawl delay=1000ms)
>> 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)
>>
>> I also saw it fetching multiple URLs of our own site within the same millisecond, on multiple occasions. Wasn't there some work done regarding crawl delay for 1.15 or is this actually an older problem?
>>
>> Regarding the logging, i agree. We already log failed fetches, no reason not to log skipped fetches too.
>>
>> Regards,
>> Markus
>>  
>> -----Original message-----
>>> From:Sebastian Nagel <[hidden email]>
>>> Sent: Wednesday 1st August 2018 12:31
>>> To: [hidden email]
>>> Subject: Re: [VOTE] Release Apache Nutch 1.15 RC#1
>>>
>>> Hi Markus,
>>>
>>> thanks for running a test crawl.
>>>
>>>> i noticed the crawl delay is not always respected
>>>
>>> Do you mean for the host t.co ?
>>>
>>> The host t.co disallows crawling in its robots.txt (https://t.co/robots.txt).
>>> The first access fetches the robots.txt, all later fetches do not block because the host is not
>>> accessed at all. That's by design.
>>>
>>> But it could be a useful improvement to log this (or in general the status of a fetch).
>>> It would double the logged lines but would help to understand what the fetcher is doing,
>>> esp. regarding robots denied and redirects.
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 08/01/2018 11:59 AM, Markus Jelsma wrote:
>>>> However, the test crawl ran/runs fine, in the background, no errors. But just now, watching the fetcher, i noticed the crawl delay is not always respected. The only configuration change i have is the http.agent.* directives to run.
>>>>
>>>> 2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/rqlNNVQgix (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,319 INFO  fetcher.FetcherThread - FetcherThread 51 fetching http://planet.apache.org/ (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,324 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
>>>> 2018-08-01 11:47:41,325 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://schema.org/Event (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,515 INFO  fetcher.FetcherThread - FetcherThread 44 fetching http://people.apache.org/~jianhe (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,532 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
>>>> 2018-08-01 11:47:41,533 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://en.wikipedia.org/wiki/Internet_marketing (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,600 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.zip.asc (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,607 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
>>>> 2018-08-01 11:47:41,608 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://twitter.com/i/directory/profiles/5 (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,673 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Categories (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,688 INFO  fetcher.FetcherThread - FetcherThread 52 fetching http://photomatt.net/ (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,696 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://cy.wikipedia.org/wiki/Wicipedia:Cysylltwch_%C3%A2_ni (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,752 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://mobile.twitter.com/david_kunz/followers (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Q9BJ0FhzzF (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/wWIMOZ3wxg (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/dImmnEeXjb (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/IPPSdW6o52 (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Y85UlnueSC (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/TvZSGiZC9D (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/jG7BvlobXD (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/ZJmzbWVFrh (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/atVcrbCi5q (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://avro.apache.org/releases.html (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://issues.apache.org/jira/browse/HADOOP-15283 (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:42,175 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=500, fetchQueues.getQueueCount=67
>>>> 2018-08-01 11:47:42,225 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://www.aetna.com/ (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:42,316 INFO  fetcher.FetcherThread - FetcherThread 49 fetching http://www.miredot.com/ (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:42,357 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://xmlgraphics.apache.org/batik/ (queue crawl delay=5000ms)
>>>> 2018-08-01 11:47:42,402 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://t.co/XgG7zomVs8 (queue crawl delay=5000ms)
>>>>
>>>> I believe this problem should be addressed prior to release, therefore i withdraw my +1. Because this is not a breaking issue, i will not -1 this RC.
>>>>
>>>> Regards,
>>>> Markus
>>>>
>>>>  
>>>>  
>>>> -----Original message-----
>>>>> From:Markus Jelsma <[hidden email]>
>>>>> Sent: Wednesday 1st August 2018 11:38
>>>>> To: [hidden email]; [hidden email]
>>>>> Subject: RE: [VOTE] Release Apache Nutch 1.15 RC#1
>>>>>
>>>>> All tests pass, crawler runs fine so far, +1 for 1.15!
>>>>>
>>>>> Regards,
>>>>> Markus
>>>>>
>>>>>  
>>>>>  
>>>>> -----Original message-----
>>>>>> From:Sebastian Nagel <[hidden email]>
>>>>>> Sent: Thursday 26th July 2018 17:05
>>>>>> To: [hidden email]
>>>>>> Cc: [hidden email]
>>>>>> Subject: [VOTE] Release Apache Nutch 1.15 RC#1
>>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> A first candidate for the Nutch 1.15 release is available at:
>>>>>>
>>>>>>    https://dist.apache.org/repos/dist/dev/nutch/1.15/
>>>>>>
>>>>>> The release candidate is a zip and tar.gz archive of the binary and sources in:
>>>>>>    https://github.com/apache/nutch/tree/release-1.15
>>>>>>
>>>>>> The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
>>>>>>     555d00ddc0371b05c5958bde7abb2a9db8c38ee2
>>>>>>
>>>>>> In addition, a staged maven repository is available here:
>>>>>>     https://repository.apache.org/content/repositories/orgapachenutch-1015/
>>>>>>
>>>>>> We addressed 119 Issues:
>>>>>>     https://s.apache.org/nczS
>>>>>>
>>>>>> Please vote on releasing this package as Apache Nutch 1.15.
>>>>>> The vote is open for the next 72 hours and passes if a majority of at
>>>>>> least three +1 Nutch PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Nutch 1.15.
>>>>>> [ ] -1 Do not release this package because…
>>>>>>
>>>>>> Cheers,
>>>>>> Sebastian
>>>>>> (On behalf of the Nutch PMC)
>>>>>>
>>>>>> P.S. Here is my +1.
>>>>>>
>>>>>
>>>
>>>
>


RE: [VOTE] Release Apache Nutch 1.15 RC#1

Markus Jelsma-2
In reply to this post by Markus Jelsma-2
Hello Sebastian,

It seems to happen only occasionally, and only in the early stages of the fetcher. I've got a very long tail fetcher running now with two hosts, and the problem doesn't show up. The only real examples i got are our own site, multiple times, and sporadic wiki lemmas. Created an issue [1] for the logging.

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-2630
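
Until such logging exists, one way to separate robots-denied URLs from real crawl delay violations is to test the logged URLs against the host's robots.txt rules. The snippet below is only a sketch using Python's standard urllib.robotparser: the helper name is_robots_denied is made up, the two rules are just the excerpt quoted below and not the complete robots.txt of en.wikipedia.org, and Nutch does its own robots.txt parsing in Java, so results may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def is_robots_denied(robots_lines, url, agent="*"):
    """Return True if the given robots.txt lines disallow url for agent.

    Hypothetical helper for log analysis; it does not fetch anything.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules passed in as a list of lines
    return not parser.can_fetch(agent, url)


# Excerpt of the rules quoted in this thread (assumed, incomplete).
RULES = [
    "User-agent: *",
    "Disallow: /wiki/Special:",
]

for url in (
    "https://en.wikipedia.org/wiki/Special:RecentChanges",
    "https://en.wikipedia.org/wiki/301_redirect",
):
    status = "robots denied" if is_robots_denied(RULES, url) else "allowed"
    print(f"{status}: {url}")
```

A URL reported as allowed and still logged within the delay window, like the 301_redirect one, is then a candidate for a real problem such as NUTCH-2623.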

 
 
-----Original message-----

> From:Sebastian Nagel <[hidden email]>
> Sent: Wednesday 1st August 2018 12:55
> To: [hidden email]
> Subject: Re: [VOTE] Release Apache Nutch 1.15 RC#1
>
> Hi Markus
>
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
>
> Ok, non-blocking because of:
> User-agent: *
> Disallow: /wiki/Special:
>
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
> ...
> > 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)
>
> That could be because of NUTCH-2623 (to be fixed in 1.16).
>
> If you have more examples, let me know. Otherwise, let's re-test if NUTCH-2623
> is fixed and the logging is improved. Could you open an issue for improved logging?
>
> Thanks,
> Sebastian
>
> On 08/01/2018 12:45 PM, Markus Jelsma wrote:
> > Hello Sebastian,
> >
> > That is unfortunately not the only example:
> >
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
> > 2018-08-01 11:42:11,289 INFO  fetcher.FetcherThread - FetcherThread 52 fetching https://about.twitter.com/about (queue crawl delay=5000ms)
> > 2018-08-01 11:42:11,313 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=7, fetchQueues.totalSize=151, fetchQueues.getQueueCount=19
> > 2018-08-01 11:42:11,509 INFO  fetcher.FetcherThread - FetcherThread 50 fetching http://www.apache.org/dyn/closer.cgi/nutch/ (queue crawl delay=4000ms)
> > 2018-08-01 11:42:11,723 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://mobile.twitter.com/MrOrdnas (queue crawl delay=1000ms)
> > 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)
> >
> > I also saw it fetching multiple URLs of our own site within the same millisecond, on multiple occasions. Wasn't there some work done regarding crawl delay for 1.15 or is this actually an older problem?
> >
> > Regarding the logging, i agree. We already log failed fetches, no reason not to log skipped fetches too.
> >
> > Regards,
> > Markus
> >  
> > -----Original message-----
> >> From:Sebastian Nagel <[hidden email]>
> >> Sent: Wednesday 1st August 2018 12:31
> >> To: [hidden email]
> >> Subject: Re: [VOTE] Release Apache Nutch 1.15 RC#1
> >>
> >> Hi Markus,
> >>
> >> thanks for running a test crawl.
> >>
> >>> i noticed the crawl delay is not always respected
> >>
> >> Do you mean for the host t.co ?
> >>
> >> The host t.co disallows crawling in its robots.txt (https://t.co/robots.txt).
> >> The first access fetches the robots.txt, all later fetches do not block because the host is not
> >> accessed at all. That's by design.
> >>
> >> But it could be a useful improvement to log this (or in general the status of a fetch).
> >> It would double the logged lines but would help to understand what the fetcher is doing,
> >> esp. regarding robots denied and redirects.
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >> On 08/01/2018 11:59 AM, Markus Jelsma wrote:
> >>> However, the test crawl ran/runs fine, in the background, no errors. But just now, watching the fetcher, i noticed the crawl delay is not always respected. The only configuration change i have is the http.agent.* directives to run.
> >>>
> >>> 2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/rqlNNVQgix (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,319 INFO  fetcher.FetcherThread - FetcherThread 51 fetching http://planet.apache.org/ (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,324 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> >>> 2018-08-01 11:47:41,325 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://schema.org/Event (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,515 INFO  fetcher.FetcherThread - FetcherThread 44 fetching http://people.apache.org/~jianhe (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,532 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> >>> 2018-08-01 11:47:41,533 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://en.wikipedia.org/wiki/Internet_marketing (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,600 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.zip.asc (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,607 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> >>> 2018-08-01 11:47:41,608 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://twitter.com/i/directory/profiles/5 (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,673 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Categories (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,688 INFO  fetcher.FetcherThread - FetcherThread 52 fetching http://photomatt.net/ (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,696 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://cy.wikipedia.org/wiki/Wicipedia:Cysylltwch_%C3%A2_ni (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,752 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://mobile.twitter.com/david_kunz/followers (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Q9BJ0FhzzF (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/wWIMOZ3wxg (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/dImmnEeXjb (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/IPPSdW6o52 (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Y85UlnueSC (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/TvZSGiZC9D (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/jG7BvlobXD (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/ZJmzbWVFrh (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/atVcrbCi5q (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://avro.apache.org/releases.html (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://issues.apache.org/jira/browse/HADOOP-15283 (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:42,175 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=500, fetchQueues.getQueueCount=67
> >>> 2018-08-01 11:47:42,225 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://www.aetna.com/ (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:42,316 INFO  fetcher.FetcherThread - FetcherThread 49 fetching http://www.miredot.com/ (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:42,357 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://xmlgraphics.apache.org/batik/ (queue crawl delay=5000ms)
> >>> 2018-08-01 11:47:42,402 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://t.co/XgG7zomVs8 (queue crawl delay=5000ms)
> >>>
> >>> I believe this problem should be addressed prior to release, therefore i withdraw my +1. Because this is not a breaking issue, i will not -1 this RC.
> >>>
> >>> Regards,
> >>> Markus
> >>>
> >>>  
> >>>  
> >>> -----Original message-----
> >>>> From:Markus Jelsma <[hidden email]>
> >>>> Sent: Wednesday 1st August 2018 11:38
> >>>> To: [hidden email]; [hidden email]
> >>>> Subject: RE: [VOTE] Release Apache Nutch 1.15 RC#1
> >>>>
> >>>> All tests pass, crawler runs fine so far, +1 for 1.15!
> >>>>
> >>>> Regards,
> >>>> Markus
> >>>>
> >>>>  
> >>>>  
> >>>> -----Original message-----
> >>>>> From:Sebastian Nagel <[hidden email]>
> >>>>> Sent: Thursday 26th July 2018 17:05
> >>>>> To: [hidden email]
> >>>>> Cc: [hidden email]
> >>>>> Subject: [VOTE] Release Apache Nutch 1.15 RC#1
> >>>>>
> >>>>> Hi Folks,
> >>>>>
> >>>>> A first candidate for the Nutch 1.15 release is available at:
> >>>>>
> >>>>>    https://dist.apache.org/repos/dist/dev/nutch/1.15/
> >>>>>
> >>>>> The release candidate is a zip and tar.gz archive of the binary and sources in:
> >>>>>    https://github.com/apache/nutch/tree/release-1.15
> >>>>>
> >>>>> The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
> >>>>>     555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> >>>>>
> >>>>> In addition, a staged maven repository is available here:
> >>>>>     https://repository.apache.org/content/repositories/orgapachenutch-1015/
> >>>>>
> >>>>> We addressed 119 Issues:
> >>>>>     https://s.apache.org/nczS
> >>>>>
> >>>>> Please vote on releasing this package as Apache Nutch 1.15.
> >>>>> The vote is open for the next 72 hours and passes if a majority of at
> >>>>> least three +1 Nutch PMC votes are cast.
> >>>>>
> >>>>> [ ] +1 Release this package as Apache Nutch 1.15.
> >>>>> [ ] -1 Do not release this package because…
> >>>>>
> >>>>> Cheers,
> >>>>> Sebastian
> >>>>> (On behalf of the Nutch PMC)
> >>>>>
> >>>>> P.S. Here is my +1.
> >>>>>
> >>>>
> >>
> >>
>
>