Problem with fetch reduce phase


Ned Rockson
(sorry if this is a repost, I'm not sure if it sent last time).

I have a very strange, reproducible bug that shows up when running
fetch across any number of documents >10000.  I'm running 47 map tasks
and 47 reduce tasks on 24 nodes.  The map phase finishes fine, and so
does the majority of the reduce phase; however, there are always two
segments that perpetually hang in the reduce > reduce phase.  What
happens is the reducer gets to 85.xx% and then stops responding.  Once
10 minutes go by, a new worker starts the task, gets to the same
85.xx% (+/- 0.1%), and hangs.  The other consistent part is that it's
always segment 2 and segment 5 (out of 47 segments).

I figured I could fix it by simply copying data in from a different
segment and continuing on the next iteration, but lo and behold, the
exact same problem happens in segment 2 and segment 5.

I assume it's not an I/O problem, because all of the nodes involved in
these segments finish other reduce tasks in the same iteration with no
problems.  Furthermore, I have seen this happen persistently over the
last several iterations.  My last iteration had roughly 400,000
documents pulled down, and I saw the same behavior.

Does anyone have any suggestions?

--
Ned Rockson
Discovery Engine
795 Folsom Street
San Francisco, CA 94107

Re: Problem with fetch reduce phase

Doğacan Güney-3
Hi,

On 9/6/07, Ned Rockson <[hidden email]> wrote:

> Does anyone have any suggestions?

Fetcher doesn't do anything interesting in reduce (after all, it is
just an IdentityReducer), so this is very strange.

You may try adding some debug statements to the write method in
FetcherOutputFormat (if you are using trunk, the write method is at
around line 84) to figure out whether it is consistently getting stuck
on a particular URL (or group of URLs). If it always hangs on the same
URL, try fetching that URL alone and see whether it still fails.
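
Something along these lines is what I mean (just an illustrative sketch,
not the actual FetcherOutputFormat code; the names here are made up).
The idea is that the last URL printed before the hang identifies the
entry the reducer is stuck on:

    // Sketch only: print each key (URL) before the real output call.
    import java.util.Arrays;
    import java.util.List;

    public class WriteTraceSketch {

        // Stand-in for the body of FetcherOutputFormat's write(key, value).
        static void write(String url, String value) {
            System.err.println("about to write: " + url);  // the added debug line
            // ... the existing fetcher output calls would follow here ...
        }

        public static void main(String[] args) {
            List<String> urls = Arrays.asList(
                    "http://example.com/a", "http://example.com/b");
            for (String url : urls) {
                write(url, "dummy content");
            }
        }
    }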



--
Doğacan Güney

Re: Problem with fetch reduce phase

Andrzej Białecki-2
In reply to this post by Ned Rockson
Ned Rockson wrote:

> Does anyone have any suggestions?

Yes. Most likely this is a problem with urlfilter-regex getting stuck on
an abnormal URL (e.g. an extremely long URL, or a URL that contains
control characters).

Please check in the JobTracker UI which task is stuck and on which
machine it's executing. Log in to that machine, identify the pid of that
task process, and then generate a thread dump (using 'kill -SIGQUIT',
which does NOT quit the process). If the thread dump shows some threads
stuck in regex code, then it's likely that this is the problem.
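
As a toy illustration (this is not Nutch code), a regex with nested
quantifiers can take effectively forever on a long input that ultimately
fails to match; a thread dump taken while something like this runs shows
the thread busy inside java.util.regex, which is the pattern to look for:

    import java.util.regex.Pattern;

    public class RegexBacktrackDemo {
        public static void main(String[] args) {
            // Nested quantifiers such as (a+)+ backtrack exponentially when
            // the overall match fails; each extra 'a' roughly doubles the
            // time, so a few more characters make this effectively hang.
            Pattern p = Pattern.compile("(a+)+");
            StringBuilder s = new StringBuilder();
            for (int i = 0; i < 28; i++) s.append('a');
            s.append('!');  // guarantees that matches() fails
            long start = System.currentTimeMillis();
            boolean ok = p.matcher(s).matches();
            System.out.println("matched=" + ok + " after "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

An abnormally long URL hitting a pattern like that in urlfilter-regex
behaves the same way.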

The solution is to avoid urlfilter-regex, or to change the order of the
URL filters and put simpler filters in front of urlfilter-regex, in the
hope that they will eliminate abnormal URLs before they are passed to
urlfilter-regex.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Problem with fetch reduce phase

Ned Rockson
So I ran a thread dump and got what I consider to be pretty meaningless
output.  It doesn't seem to say I'm stuck in a regex filter, although
when I printed out the URLs being processed by the reducer, there was
one that had some unprintable characters in it.  Also, there were a lot
of URLs that were severely malformed, so I assume that could be a
problem; I'm going to look into it.  The last URL that was printed (on
each of the two tasks) looked pretty harmless though: a wiki entry and a
.js page, so I assume there must be a buffer that only writes when it
fills up.  Where is this buffer located, and would it be easy to dump it
to stdout rather than a file for debugging purposes?

Here is the thread dump:

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@4b0ab323" daemon prio=1
tid=0x00002aaaab72c6b0 nid=0x4ae8 waiting on condition
[0x0000000041367000..0x0000000041367b80] at
        java.lang.Thread.sleep(Native Method) at
        org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:458) at
        java.lang.Thread.run(Thread.java:595)
"Pinger for task_0018_r_000002_0" daemon prio=1 tid=0x00002aaaac2f1d80
nid=0x4ae5 waiting on condition
[0x0000000041165000..0x0000000041165c80] at
        java.lang.Thread.sleep(Native Method) at
        org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:1488)
at
        java.lang.Thread.run(Thread.java:595)
"IPC Client connection to 0.0.0.0/0.0.0.0:50050" daemon prio=1
tid=0x00002aaaac2d0670 nid=0x4ae4 in Object.wait()
[0x0000000041064000..0x0000000041064d00] at
java.lang.Object.wait(Native Method) - waiting on
<0x00002b141d61d130>- (aorg.apache.hadoop.ipc.Client$Connection) at
java.lang.Object.wait(Object.java:474) at
        org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:213)
- locked <0x00002b141d61d130> (a
                org.apache.hadoop.ipc.Client$Connection) at
        org.apache.hadoop.ipc.Client$Connection.run(Client.java:252)
"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1
tid=0x00002aaaac332a20 nid=0x4ae3 waiting on condition
[0x0000000040f63000..0x0000000040f63d80] at
        java.lang.Thread.sleep(Native Method) at
        org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:401)
"Low Memory Detector" daemon prio=1 tid=0x00002aaaac0025a0 nid=0x4ae1
runnable [0x0000000000000000..0x0000000000000000]
"CompilerThread1" daemon prio=1 tid=0x00002aaaac000ab0 nid=0x4ae0
waiting on condition [0x0000000000000000..0x0000000040c5f3e0]
"CompilerThread0" daemon prio=1 tid=0x00002aaab00f3290 nid=0x4adf
waiting on condition [0x0000000000000000..0x0000000040b5e460]
"AdapterThread" daemon prio=1 tid=0x00002aaab00f1c70 nid=0x4ade
waiting on condition [0x0000000000000000..0x0000000000000000]
"Signal Dispatcher" daemon prio=1 tid=0x00002aaab00f07b0 nid=0x4add
runnable [0x0000000000000000..0x0000000000000000]
"Finalizer" daemon prio=1 tid=0x00002aaab00dbd70 nid=0x4adc in
Object.wait() [0x000000004085c000..0x000000004085cd00] at
        java.lang.Object.wait(Native Method) - waiting on
<0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock) at
        java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked
<0x00002b141d606288> (a      java.lang.ref.ReferenceQueue$Lock) at
        java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132) at
        java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=1 tid=0x00002aaab00db290 nid=0x4adb in
Object.wait()

On 9/6/07, Andrzej Bialecki <[hidden email]> wrote:

> If the thread dump shows some threads stuck in regex code, then it's
> likely that this is the problem.

Re: Problem with fetch reduce phase

Doğacan Güney-3
On 9/7/07, Ned Rockson <[hidden email]> wrote:

> So I ran a thread dump and got what I consider to be pretty
> meaningless output.

I keep forgetting that people run fetch with parse. Can you try running
fetch with the "-noParsing" option? If the reduce phase has a problem
with URL filtering, this should solve it, as a no-parsing fetch's reduce
phase is just an identity reduce.
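
If I remember the Fetcher usage correctly, that would be something like:

    bin/nutch fetch <segment> -noParsing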


--
Doğacan Güney

Re: Problem with fetch reduce phase

Ned Rockson
Oh great, I didn't know that was an option.  How would I go about
running the parse by itself?

On 9/7/07, Doğacan Güney <[hidden email]> wrote:

> Can you try running fetch with the "-noParsing" option?

Re: Problem with fetch reduce phase

Doğacan Güney-3
On 9/7/07, Ned Rockson <[hidden email]> wrote:
> Oh great, I didn't know that was an option.  How would I go about
> running the parse by itself?

bin/nutch parse <segment>



--
Doğacan Güney

Re: Problem with fetch reduce phase

Emmanuel JOKE
In reply to this post by Ned Rockson
I had a similar problem once.  I reduced my number of reduce tasks to
1.5 * the number of nodes, and it solved my problem.
I suggest changing your conf and running the fetch with at most 36
reduce tasks.
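
For example (just a rough sketch against the stock Hadoop JobConf API;
in Nutch the same value normally comes from the mapred.reduce.tasks
property in your conf):

    // Illustrative only: cap the number of reduce tasks at 1.5 x the
    // number of nodes, i.e. 36 reduces for a 24-node cluster.
    import org.apache.hadoop.mapred.JobConf;

    public class ReduceTaskSetting {
        public static void main(String[] args) {
            int nodes = 24;
            int reduces = (int) (nodes * 1.5);  // 36
            JobConf conf = new JobConf();
            conf.setNumReduceTasks(reduces);    // same effect as mapred.reduce.tasks=36
            System.out.println("reduce tasks: " + conf.getNumReduceTasks());
        }
    }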
