bug with generate performance

bug with generate performance

misc

Hello-

    I am almost certain I have found a nasty bug with nutch generate.

    Problem: Nutch generate can take many hours, even a day to complete (on a crawldb that has less than 2 million urls).

    I added debug code to Generator->Selector.map to see when map is called and returns, and observed interesting behavior, described here:

    1. Most of the time, when generate is run, urls are processed in chunky batches, usually about 40 at a time, followed by a 1 second delay.  I timed the delay, and it really is a 1 second delay (i.e. 30 batches took 30 seconds).  When this happens it takes hours to complete.

    2. Sometimes (randomly, as far as I can tell) when I run nutch, the urls are processed without delays.  It is an all-or-nothing event: either all urls process quickly without delay (in minutes) or, more likely, I get the chunky processing with many 1 second delays and the program takes hours to finish.  The one exception is....

    3. When the processing runs quickly I've seen the main thread end (I have some profiling going, so I know when a thread ends), and then, more often than not, a second thread begins where the first starts, chunky as usual.  Although I can sometimes get fast processing in one thread, it is almost impossible for me to get it in all threads, so overall processing is very slow (hours).

    4. I tried to put in more debug code to find the line where the delays occurred, but the last line printed to the log before a delay seemed random, leading me to believe that the log is not being flushed uniformly.

    5. The profiler I used seemed to imply that nearly 100% of the time was spent in java.lang.Thread.sleep.  I am not completely familiar with the profiler, so I am not completely sure I interpreted this correctly.

    I will keep debugging here, but perhaps someone here has some insight into what might be happening?

                        thanks
                            -J
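
One way to cross-check the profiler result from item 5 without relying on log flushing is to ask the JVM directly which threads are sitting in Thread.sleep and who called it. The following is only a sketch: the class name SleepFinder and its methods are made up for illustration and are not part of Nutch or Hadoop. It uses only the standard Thread.getAllStackTraces() call; the internal frame names under Thread.sleep differ between JDK versions, so the check matches any java.lang.Thread method whose name starts with "sleep".

```java
import java.util.Map;

// Hypothetical diagnostic helper, not part of Nutch.
class SleepFinder {
    /** True if the frame is Thread.sleep or one of its JDK-internal helpers. */
    static boolean isSleepFrame(StackTraceElement f) {
        return f.getClassName().equals("java.lang.Thread")
                && f.getMethodName().startsWith("sleep");
    }

    /** One line per live thread currently inside Thread.sleep, naming its caller. */
    static String sleepingThreads() {
        StringBuilder report = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            int i = 0;
            // Skip the sleep frames at the top of the stack.
            while (i < stack.length && isSleepFrame(stack[i])) {
                i++;
            }
            if (i == 0) {
                continue; // top frame is not a sleep call
            }
            String caller = i < stack.length ? stack[i].toString() : "(unknown caller)";
            report.append(e.getKey().getName())
                  .append(" sleeping, called from ")
                  .append(caller)
                  .append('\n');
        }
        return report.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo only: start a thread that sleeps, then report it.
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException ignored) {
                // demo thread, nothing to clean up
            }
        }, "demo-sleeper");
        sleeper.setDaemon(true);
        sleeper.start();
        Thread.sleep(200); // give the demo thread time to enter sleep
        System.out.print(sleepingThreads());
    }
}
```

Calling sleepingThreads() from a background timer during one of the 1 second pauses should show which code path requested the sleep, if a sleep really is the cause. The jstack tool shipped with the JDK gives the same information from outside the process.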

Re: bug with generate performance

Doğacan Güney-3
Hi,

Others have also reported a problem with generate performance. It
seems we have a problem here, but I cannot reproduce this behaviour so
I am not sure what causes it. Can you open a JIRA issue and enter your
comments there? Also, knowing how you are running generate would be very
helpful (what is generate.max.per.host? what is the -topN argument? etc.)


--
Doğacan Güney

Re: bug with generate performance

Andrzej Białecki-2
Doğacan Güney wrote:

> Others have also reported a problem with generate performance. It
> seems we have a problem here, but I cannot reproduce this behaviour so
> I am not sure what causes it. Can you open a JIRA issue and enter your
> comments there? Also, knowing how you are running generate would be very
> helpful (what is generate.max.per.host? what is the -topN argument? etc.)

Also the value of generate.max.per.host.by.ip - this could be a
DNS-related issue.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: bug with generate performance

misc
In reply to this post by Doğacan Güney-3

Hello-

    I've filed a bug report and included the extra information requested
(generate.max.per.host = -1; the problem is seen both with a small topN,
around 100, and with a large topN, around 1000000).

    I've since tried to run with a debugger, but the slowness went away
(ugh).  I also know that DNS lookups are not the problem: I ran with
Wireshark capturing and there were no DNS lookups.

                        thanks
                            -Jim



Two suggestions

misc
In reply to this post by Andrzej Białecki-2

Hello All-

    Two suggested (small) changes:

Change 1

    Use case: I want a list of all ".mov" files found during a crawl, but
don't want to actually download them and store them in the content database
(too much bandwidth and space!).

    Partial solution: filter them out with regex-urlfilter.  The problem is
that no record of the url being found in a parse is stored anywhere.

    Full proposed solution: Change code in ParseOutputFormat from

(line 173)

    toUrl = filters.filter(toUrl);   // filter the url
    if (toUrl == null) {
      continue;
    }

to (the new line 173)

    if (filters.filter(toUrl) == null) {   // filter the url
      LOG.debug("filtering out " + toUrl);
      continue;
    }

    This way, all filtered-out URLs are recorded in the log when the log
level is changed to debug.  This is also useful for verifying that stuff
isn't accidentally getting thrown away in a parse.

Change 2

    Add pdf to the default regex-urlfilter removal list.  There doesn't
seem to be any pdf parser (yet), and my output logs are filled with errors
about this.

                        thanks
                            -Jim


Re: Two suggestions

Susam Pal
On 10/6/07, misc <[hidden email]> wrote:-
> Change 2
>
>     Add pdf to the default regex-urlfilter removal list.  There doesn't
> seem to be any pdf parser (yet), and my output logs are filled with errors
> about this.
>
>                         thanks
>                             -Jim

Jim,

Have you tried parse-pdf?

Regards,
Susam Pal
http://susam.in/

Re: Two suggestions

misc

Hello-

    I didn't know about parse-pdf.  Thanks for the information.  Why doesn't
this come turned on by default?

    I am still interested in hearing about the other suggested change.

                        thanks
                            -Jim

