Fetcher Stops Reports Pushes CPU to 100%

Dennis Kubes
Has anybody seen behavior where a fetcher, during the reduce phase, will
stop reporting and push the CPU to 100%, staying that way until the task
times out?  I am seeing this on Fedora 5 minimal running Java 1.5_06 on
dual-core machines with 2 GB of memory.  I have tracked this
down and I think it has something to do with the Java Inflater class.
Anybody seen similar behavior?

Dennis

Re: Fetcher Stops Reports Pushes CPU to 100%

Andrzej Białecki-2
Dennis Kubes wrote:
> Has anybody seen behavior where a fetcher during the reduce phase will
> stop reporting and push the CPU to 100% and stay that way until the
> task times out.  I am seeing this on Fedora 5 minimal running Java
> 1.5_06 on dual core processor machines with 2G of memory.  I have
> tracked this down and I think this has something to do with the Java
> Inflater class.  Anybody seen similar behavior?

Could you do a kill -SIGQUIT on the task's JVM to get a thread dump? Could
you also re-run Fetcher on the same segment (you need to delete all parts
except crawl_generate), but this time with the -noParsing flag?
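
Where sending a signal to the task is awkward, roughly the same information can be collected from inside the JVM. The following is only a minimal sketch, assuming Java 5 or later; the class and method names are made up for illustration and are not part of Nutch or Hadoop.

```java
import java.util.Map;

class ThreadDumpSketch {
    /** Formats the stack of every live thread, similar in spirit to a SIGQUIT dump. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical usage: a watchdog thread could log this when progress stops.
        System.out.print(dumpAllThreads());
    }
}
```

A thread that is busy rather than blocked would normally show up as RUNNABLE, with the code it is spinning in at the top of its stack.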

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetcher Stops Reports Pushes CPU to 100%

Dennis Kubes
Do you think it is the parsing that is causing it?

I was looking at a smaller fetching run and the CPU gets pushed to 100%
there as well, but the reports keep happening.  The stall only seems to
happen when I run very large fetches (> 500K pages).  I just ran a 100K
fetch and it worked just fine.  Should I have some special settings for
larger fetches?

Dennis


Re: Fetcher Stops Reports Pushes CPU to 100%

Andrzej Białecki-2
Dennis Kubes wrote:
> Do you think it is the parsing that is causing it?

Just checking ... probably not. You could figure out from a thread dump
where it's spending time.


> I was looking at a smaller fetching run and the cpu gets pushed to
> 100% as well but the reports keep happening.  This only seems to
> happen when I run very large fetches (> 500K pages).  I just ran a
> 100K fetch and it worked just fine.  Should I have some special
> settings for larger fetches?

You could try tweaking the io.sort values (for example io.sort.mb and
io.sort.factor), if it times out during the sorting phase.


Re: Fetcher Stops Reports Pushes CPU to 100%

Dennis Kubes
I will start taking a look at some thread dumps.  It is not the sorting
phase.  It gets past the sort and gets through part of the reduce phase
(always the same percentage: when the job is restarted on the same
machine it gets to the same point before stalling again).  And this
is happening on multiple machines, so I don't think it is a machine
problem.  Again, I need to spend some time looking through thread dumps.

Dennis


Re: Fetcher Stops Reports Pushes CPU to 100%

Dennis Kubes
The thread dumps pointed me to the Regex URL Filter and greedy pattern
matching.  It seems that there is a standing "error" in the JVM where
the "wrong" regular expression will cause the program to hang and the
CPU to go to 100%, which is basically the behavior we are seeing.  And
this would make sense, as the error wouldn't appear unless the "right"
URL came up.  See this link for a complete explanation.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051

After reviewing the regular expressions in the regex-urlfilter.txt file,
here is what I think needs to be changed.

this: -.*(/.+?)/.*?\1/.*?\1/
changed to: -.*?(/.+?)/.*?\1/.*?\1/

I am currently testing this to see if it runs correctly without stalling
as before.  Problem is that I am not a regular expressions expert.  Will
changing this regex affect this expression in a negative way?
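
To help answer that question, here is a small sketch that runs both expressions against sample URLs. It assumes the filter applies each rule with Matcher.find() and that the leading "-" is the filter file's exclude marker rather than part of the regex; the URLs and class name are invented. In a backtracking matcher, switching .* to .*? changes the order in which candidates are tried, not whether a match exists, so both forms should exclude the same URLs.

```java
import java.util.regex.Pattern;

class LoopFilterCompare {
    // Original rule from regex-urlfilter.txt, without the leading '-' exclude marker.
    static final Pattern GREEDY = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    // Proposed variant with a lazy leading quantifier.
    static final Pattern LAZY = Pattern.compile(".*?(/.+?)/.*?\\1/.*?\\1/");

    /** True when the rule fires, i.e. the URL would be excluded (assumes find() semantics). */
    static boolean excluded(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a/b/a/c/a/d",      // the segment "/a" occurs three times
            "http://example.com/a/b/c/index.html"  // no repeated segment
        };
        for (String url : urls) {
            System.out.println(url
                + " greedy=" + excluded(GREEDY, url)
                + " lazy=" + excluded(LAZY, url));
        }
    }
}
```

With these inputs, both rules exclude the first URL and let the second through. The sketch only checks which URLs match; it says nothing about how long matching takes on very long URLs.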

Dennis


Re: Fetcher Stops Reports Pushes CPU to 100%

Lukáš Vlček
Hi,

This reminds me of a saying:
if you have a problem and you solve it with a regular expression, then
you have two problems...

I am not a regex guru, but I think those two expressions are different
because of the first question mark (lazy repetition). Depending on the
given string, they should behave differently.
Try giving some example strings you want to check.

Regards,
Lukas

On 6/10/06, Dennis Kubes <[hidden email]> wrote:

> The thread dumps pointed me to the Regex URL Filter and greedy pattern
> matching.  It seems that there is a standing "error" in the JVM where
> the "wrong" regular expression will cause the program to hang and the
> cpu to go to 100%.  Basically the behaviors that we are seeing.  And
> this would make sense as this error wouldn't appear unless the "right"
> url came up.  See this link for a complete explanation.
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051
>
> After reviewing the regular expressions in the regex-urlfilter.txt file,
> here is what I think needs to be changed.
>
> this: -.*(/.+?)/.*?\1/.*?\1/
> changed to: -.*?(/.+?)/.*?\1/.*?\1/
>
> I am currently testing this to see if it runs correctly without stalling
> as before.  Problem is that I am not a regular expressions expert.  Will
> changing this regex affect this expression in a negative way?

Re: Fetcher Stops Reports Pushes CPU to 100%

Andrzej Białecki-2
Dennis Kubes wrote:
> The thread dumps pointed me to the Regex URL Filter and greedy pattern
> matching.  It seems that there is a standing "error" in the JVM where
> the "wrong" regular expression will cause the program to hang and the
> cpu to go to 100%.  Basically the behaviors that we are seeing.  And
> this would make sense as this error wouldn't appear unless the "right"
> url came up.  See this link for a complete explanation.

Ah, that would explain why I don't see this behavior - one of the first
changes I make in my installations is to remove regex-urlfilter and
replace it with a suitable combination of prefix/suffix-urlfilter, or a
custom one ... Of course, we should solve this issue in our code, if
possible, but using different urlfilters is a quick workaround.
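
As an illustration of why prefix and suffix checks avoid the problem, here is a minimal sketch of such a filter. It is not the code of the Nutch prefix/suffix plugins; the class name, the rule lists and the convention of returning null for a rejected URL are assumptions made for the example. Each check is a plain string comparison, so there is no backtracking to run away.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PrefixSuffixFilterSketch {
    private final List<String> allowedPrefixes;
    private final List<String> rejectedSuffixes;

    PrefixSuffixFilterSketch(List<String> allowedPrefixes, List<String> rejectedSuffixes) {
        this.allowedPrefixes = allowedPrefixes;
        this.rejectedSuffixes = rejectedSuffixes;
    }

    /** Returns the URL unchanged if it passes, or null if it is rejected. */
    String filter(String url) {
        boolean prefixOk = false;
        for (String prefix : allowedPrefixes) {
            if (url.startsWith(prefix)) {
                prefixOk = true;
                break;
            }
        }
        if (!prefixOk) {
            return null;
        }
        // Compare suffixes case-insensitively so ".GIF" is treated like ".gif".
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : rejectedSuffixes) {
            if (lower.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        PrefixSuffixFilterSketch filter = new PrefixSuffixFilterSketch(
            Arrays.asList("http://example.com/"),
            Arrays.asList(".gif", ".jpg", ".zip"));
        System.out.println(filter.filter("http://example.com/docs/index.html"));
        System.out.println(filter.filter("http://example.com/img/logo.GIF"));
        System.out.println(filter.filter("http://other.example.org/"));
    }
}
```

A filter like this does not try to detect looping paths, which is what the problematic expression was for, so that check would need to be handled some other way.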


Re: Fetcher Stops Reports Pushes CPU to 100%

Dennis Kubes
I just completed a 500K run with no problems.  I had to comment out the
-.*(/.+?)/.*?\1/.*?\1/ filter to get it to work.  Even using
-.*?(/.+?)/.*?\1/.*?\1/ would stall it.  The good news is that it stalled
consistently, so it will take me a while but I can narrow down which URL
was causing it.  I don't think the other expressions in the default
regex-urlfilter file will cause any problems because they are not
greedy, but I would suggest that we look at either changing this regular
expression or removing it altogether from the default install.
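
If the expression stays in the default install, one possible safeguard is to give each match a time budget. The sketch below is not something Nutch does as far as this thread shows; it wraps the URL in a CharSequence that throws once a deadline has passed, relying on the matcher reading its input through charAt(). All class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class RegexGuardSketch {
    /** Thrown when matching exceeds its time budget. */
    static class RegexTimeout extends RuntimeException {
        RegexTimeout(String message) {
            super(message);
        }
    }

    /** CharSequence view that aborts character access once the deadline has passed. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        public char charAt(int index) {
            // The matcher calls this for every character it inspects, so a
            // runaway match hits the check soon after the deadline.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeout("regex exceeded its time budget");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        public String toString() {
            return inner.toString();
        }
    }

    /** Runs rule.find() on the URL, throwing RegexTimeout if it takes longer than budgetMillis. */
    static boolean findWithin(Pattern rule, String url, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1000000L;
        return rule.matcher(new DeadlineCharSequence(url, deadline)).find();
    }

    public static void main(String[] args) {
        Pattern rule = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
        System.out.println(findWithin(rule, "http://example.com/a/b/a/c/a/d", 1000));
    }
}
```

A caller could catch RegexTimeout and treat the URL as rejected, and log it, which would also make the offending URL easy to find. Checking the clock on every character costs some speed; checking every few thousand calls would be a reasonable refinement.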

Dennis
