All trackers exited on all nodes

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

All trackers exited on all nodes

Rod Taylor-2
Over night the trackers appeared to have difficulties finding a block
from the datanode and eventually exited. The datanode reports serving
the block mentioned below successfully on several earlier occasions.

Instead of the tracker exiting I would have expected the current task to
be aborted and for it to continue on taking others. Below are the logs
from the problem with the block through to where the tracker was no
longer running.


051014 072850 No node available for block blk_-6130354727587666838
051014 072850 Could not obtain block from any node:
java.io.IOException: No live nodes contain current block
051014 072900 No node available for block blk_-6130354727587666838
051014 072900 Could not obtain block from any node:
java.io.IOException: No live nodes contain current block
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/mapred-default.xml
051014 072910 parsing /home/sitesell/local/tracker/task_m_xlwltz/job.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072910 Task task_m_xlwltz timed out.  Killing.
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/mapred-default.xml
051014 072910 parsing /home/sitesell/local/tracker/task_m_xlwltz/job.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072910 task_m_1mewgo done; removing files.
Exception in thread "main" java.util.ConcurrentModificationException
        at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
        at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
        at
org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
        at
org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
051014 072911 task_m_xlwltz  parsing
file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072911 task_m_xlwltz  parsing
file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072911 task_m_xlwltz  Child starting
051014 072911 Server connection on port 64071 from 192.168.100.13:
starting
051014 072911 task_m_xlwltz  Client connection to 0.0.0.0:64071:
starting
051014 072911 task_m_xlwltz  parsing
file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072911 task_m_xlwltz  parsing
file:/opt/nutch-0.8_2/conf/mapred-default.xml
051014 072911 task_m_xlwltz
parsing /home/sitesell/local/tracker/task_m_xlwltz/job.xml
051014 072911 task_m_xlwltz  parsing
file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072911 task_m_xlwltz  Client connection to 192.168.100.14:5466:
starting
051014 072912 task_m_xlwltz
0.07144528% /user/sitesell/generate-temp-1200428493/part-00000:3924416
+230848
051014 072912 task_m_xlwltz
1.0% /user/sitesell/generate-temp-1200428493/part-00000:3924416+230848
051014 072912 Task task_m_xlwltz is done.
051014 072912 Task task_m_xlwltz is done.
051014 072912 Server connection on port 64071 from 192.168.100.13:
exiting
051014 072912 task_m_xlwltz done; removing files.

--
Rod Taylor <[hidden email]>

Reply | Threaded
Open this post in threaded view
|

Re: All trackers exited on all nodes

Doug Cutting-2
Rod Taylor wrote:
> Over night the trackers appeared to have difficulties finding a block
> from the datanode and eventually exited. The datanode reports serving
> the block mentioned below successfully on several earlier occasions.

Were all of the datanodes still alive?  If so, then the problem is
probably that heartbeats from the datanodes to the namenode were
delayed, so that the namenode assumed the datanodes were dead.  This can
sometimes happen when the namenode becomes severely loaded.  What is in
the namenode's logs around this time?  Does it report some datanodes as
lost?

> Instead of the tracker exiting I would have expected the current task to
> be aborted and for it to continue on taking others. Below are the logs
> from the problem with the block through to where the tracker was no
> longer running.
[ ... ]
> Exception in thread "main" java.util.ConcurrentModificationException
>         at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
>         at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
>         at
> org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
>         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
>         at
> org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)

I have seen this before.  It is a bug in the TaskTracker that I have not
yet had time to fix.  It is triggered by an unresponsive jobtracker,
which times out the tasktracker, assuming it is dead.  Then, when the
tasktracker's heartbeat arrives, the jobtracker does not recognize it,
which causes the tasktracker to close and restart.  The bug is in the
tasktracker's close method, which uses an iterator in an unsafe way.

It should instead use something like:

   while (tasks.size() != 0) {
     TaskInProgress tip = (TaskInProgress)tasks.first();
     tip.jobHasFinished();
   }

What's in the jobtracker logs around this time?  Did it report this
tasktracker as lost?

Was this all by chance at the start of an Indexer job?  That's when I've
seen this sort of thing.  This job has more input files than any other,
and my guess is that constructing the splits (which ties up the
jobtracker) can for some reason take longer than the timeout, although
it really shouldn't...

Doug


Reply | Threaded
Open this post in threaded view
|

Re: All trackers exited on all nodes

Rod Taylor-2
On Mon, 2005-10-17 at 11:06 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > Over night the trackers appeared to have difficulties finding a block
> > from the datanode and eventually exited. The datanode reports serving
> > the block mentioned below successfully on several earlier occasions.
>
> Were all of the datanodes still alive?  If so, then the problem is
> probably that heartbeats from the datanodes to the namenode were
> delayed, so that the namenode assumed the datanodes were dead.  This can

Yes. All of the datanodes, namenode and jobtracker were alive and I
don't recall them having any indication of errors in the logs but the
logs have been rotated since and I don't have them anymore.

The machine the namenode is running on does have very high load at
times. Do you recommend a separate box for the namenode and jobtracker
which runs strictly those items?

> sometimes happen when the namenode becomes severely loaded.  What is in
> the namenode's logs around this time?  Does it report some datanodes as
> lost?

Sorry, I don't have the logs anymore. I will be sure to check for it
next time the problem occurs.

> > Instead of the tracker exiting I would have expected the current task to
> > be aborted and for it to continue on taking others. Below are the logs
> > from the problem with the block through to where the tracker was no
> > longer running.
> [ ... ]
> > Exception in thread "main" java.util.ConcurrentModificationException
> >         at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
> >         at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
> >         at
> > org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
> >         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
> >         at
> > org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
>
> I have seen this before.  It is a bug in the TaskTracker that I have not
> yet had time to fix.  It is triggered by an unresponsive jobtracker,
> which times out the tasktracker, assuming it is dead.  Then, when the
> tasktracker's heartbeat arrives, the jobtracker does not recognize it,
> which causes the tasktracker to close and restart.  The bug is in the
> tasktracker's close method, which uses an iterator in an unsafe way.
>
> It should instead use something like:
>
>    while (tasks.size() != 0) {
>      TaskInProgress tip = (TaskInProgress)tasks.first();
>      tip.jobHasFinished();
>    }
>
> What's in the jobtracker logs around this time?  Did it report this
> tasktracker as lost?

The jobtracker did not indicate such a thing (via an exception anyway).
Tasktracker connections seem to be established and disconnected from
fairly frequently. Perhaps this is what you mean?

> Was this all by chance at the start of an Indexer job?  That's when I've
> seen this sort of thing.  This job has more input files than any other,

Middle of a fetch actually.

--
Rod Taylor <[hidden email]>

Reply | Threaded
Open this post in threaded view
|

Re: All trackers exited on all nodes

Doug Cutting-2
Rod Taylor wrote:
> The machine the namenode is running on does have very high load at
> times. Do you recommend a separate box for the namenode and jobtracker
> which runs strictly those items?

That would be optimal, but it shouldn't be required.  If a tasktracker
or datanode is sluggish then its impact is small, but if the jobtracker
or namenode become sluggish the impact is systemic.  That said, so long
as these don't crash, things should work.  The problem is that the code
paths for recovery when namenodes and jobtrackers are sluggish have not
been tested as much.

>>What's in the jobtracker logs around this time?  Did it report this
>>tasktracker as lost?
>
> The jobtracker did not indicate such a thing (via an exception anyway).
> Tasktracker connections seem to be established and disconnected from
> fairly frequently. Perhaps this is what you mean?

No, there's a "lost tracker" message when the jobtracker times out a
tasktracker.  These are bad, since the jobtracker then assumes that all
of the temporary map data at that tasktracker is gone, and re-schedules
those map tasks.

Doug
Reply | Threaded
Open this post in threaded view
|

Re: All trackers exited on all nodes

Rod Taylor-2
On Mon, 2005-10-17 at 11:46 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > The machine the namenode is running on does have very high load at
> > times. Do you recommend a separate box for the namenode and jobtracker
> > which runs strictly those items?
>
> That would be optimal, but it shouldn't be required.  If a tasktracker
> or datanode is sluggish then its impact is small, but if the jobtracker
> or namenode become sluggish the impact is systemic.  That said, so long

Machine ordered. Thanks.

> >>What's in the jobtracker logs around this time?  Did it report this
> >>tasktracker as lost?
> >
> > The jobtracker did not indicate such a thing (via an exception anyway).
> > Tasktracker connections seem to be established and disconnected from
> > fairly frequently. Perhaps this is what you mean?
>
> No, there's a "lost tracker" message when the jobtracker times out a
> tasktracker.  These are bad, since the jobtracker then assumes that all
> of the temporary map data at that tasktracker is gone, and re-schedules
> those map tasks.

Ahh. I recall seeing those in the past when tasktrackers actually were
dying (you fixed that bug). No, this type of action was not taking
place.

--
Rod Taylor <[hidden email]>