Nutch 1.12 gets stuck on same document


Nutch 1.12 gets stuck on same document

André Schild
Hello,

we see a problem where Nutch 1.12 gets stuck on a single document.
We only crawl one site, so only one fetcher is active.

The document is https://xxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf, which is ~18 MB in size.

We have these settings:

http.timeout=60000
http.content.limit=1412929
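
For reference, these live in conf/nutch-site.xml; a sketch in the standard property format, with the values above:

<property>
  <name>http.timeout</name>
  <value>60000</value> <!-- network timeout, in milliseconds -->
</property>
<property>
  <name>http.content.limit</name>
  <value>1412929</value> <!-- length limit for downloaded content, in bytes -->
</property>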

When we start a crawl, then we see this:

2017-02-01 10:53:56,924 INFO  fetcher.Fetcher - -activeThreads=50, spinWaiting=49, fetchQueues.totalSize=207, fetchQueues.getQueueCount=1
.
.
.
Then, 5 minutes later:
2017-02-01 10:58:56,924 WARN  fetcher.Fetcher - Aborting with 50 hung threads.
2017-02-01 10:58:56,924 WARN  fetcher.Fetcher - Thread #0 hung while processing https://xxxxxxxxxxx/824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf

It then tries to fetch the very same document again, aborts after another 5 minutes, and so on...

I could work around the hang by setting:

mapred.task.timeout=1200000

The fetcher then continued with the next document after ~6.5 minutes.
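
That value is 1,200,000 ms = 20 minutes. If I read the Fetcher code correctly, it aborts hung threads after half of mapred.task.timeout, which would explain both the 5-minute aborts under the 10-minute default and why the 20-minute value lets a ~6.5-minute fetch finish. A sketch of the override in conf/nutch-site.xml:

<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value> <!-- 20 min; the fetcher appears to abort hung threads after half of this -->
</property>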

While debugging I saw that, even with the content limit set, Nutch was still fetching the whole document via http(s), but somehow took longer than 5 minutes to process that fetch.
A wget from the server command line retrieved the same PDF in ~0.5 seconds.

It would be very useful if Nutch marked such fetch timeouts on the specific document/URL, continued with the next document/URL, and retried the failed ones at a later (or random) stage.
With the current behavior, the crawl can get stuck indefinitely...

Any thoughts on this?

André Schild

Aarboard AG <http://www.aarboard.ch/>
Egliweg 10
2560 Nidau
Switzerland
+41 32 332 97 14


RE: Nutch 1.12 gets stuck on same document

Markus Jelsma-2
It is probably not the fetcher but the parser that gets stuck on the document. The http.content.limit must be at least 18 MB or the parser will die trying to parse it. You might also want to take a look at memory consumption; there is a good chance the JVM gets stuck because of this PDF. Finally, parser.timeout also needs to be high enough, but that depends on CPU and available heap space.
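
Roughly, in conf/nutch-site.xml, that would be something like the following sketch (20971520 bytes = 20 MB; the 120 for parser.timeout is just an example value, the default is 30 seconds):

<property>
  <name>http.content.limit</name>
  <value>20971520</value> <!-- must exceed the ~18 MB PDF -->
</property>
<property>
  <name>parser.timeout</name>
  <value>120</value> <!-- seconds; raise as CPU and heap allow -->
</property>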

Markus

 

AW: Nutch 1.12 gets stuck on same document

André Schild
>It is probably not the fetcher but the parser that gets stuck on the document.
>The http.content.limit must be at least 18 MB or the parser will die trying to parse it.

It does not seem to die, but just logs this:
824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf skipped. Content of size 19375003 was truncated to 1409024

But I then do not find any results for that URL in Solr, so does the truncation drop the whole document rather than just indexing the first XY bytes?
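
If I understand it correctly, that would be parser.skip.truncated at work: it defaults to true, so a truncated fetch is skipped entirely rather than partially parsed. A sketch of overriding it in conf/nutch-site.xml:

<property>
  <name>parser.skip.truncated</name>
  <value>false</value> <!-- attempt to parse truncated content anyway -->
</property>

Though for PDF this rarely helps: the cross-reference table sits at the end of the file, so a truncated PDF usually cannot be parsed at all.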

>You might also want to take a look at memory consumption; there is a good chance the JVM gets stuck because of this PDF.
Yep, that’s known.

> Finally, parser.timeout also needs to be high enough, but that depends on CPU and available heap space.
Currently this works so far.

Thanks
André

RE: Nutch 1.12 gets stuck on same document

Markus Jelsma-2
It is truncated because http.content.limit is not high enough to accommodate the PDF. Increase the value for that setting to 20 MB and you're good to go, for that URL at least.

Markus
 

AW: Nutch 1.12 gets stuck on same document

André Schild
>It is truncated because http.content.limit is not high enough to accommodate the PDF. Increase the value for that setting to 20 MB and you're good to go, for that URL at least.

Thanks Markus

André