[jira] Created: (HADOOP-1475) local filecache disappears

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (HADOOP-1475) local filecache disappears

Tim Allison (Jira)
local filecache disappears
--------------------------

                 Key: HADOOP-1475
                 URL: https://issues.apache.org/jira/browse/HADOOP-1475
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.13.0
            Reporter: Christian Kunz
            Priority: Blocker


All our jobs on a 600 node cluster fail. Symptom is that the local filecache disappears.

It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.

Side issue is;
why do we get so many lost tasktrackers which then resume the heartbeat (a kind of 'bogus' lost tasktracker)?. We lost tasktrackers:
13 in the 1st hour of the job
18 in the 2nd hour
33 in the 3rd hour
Then the job failed.

E.g. all the tasktrackers lost in the first 2 hours of the job got logged sometime later with a 'Status from unknown Tracker' in the jobtracker log and got reinitialized.

I attach some jobracker log messages showing how the heartbeat of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?



2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070

2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075

2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082

2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085

2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098

2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106

2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108

2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112

2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114

2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116

2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117

2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120

2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174




--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (HADOOP-1475) local filecache disappears

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-1475:
-----------------------------------

    Component/s: mapred

> local filecache disappears
> --------------------------
>
>                 Key: HADOOP-1475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1475
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> All our jobs on a 600 node cluster fail. Symptom is that the local filecache disappears.
> It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.
> Side issue is;
> why do we get so many lost tasktrackers which then resume the heartbeat (a kind of 'bogus' lost tasktracker)?. We lost tasktrackers:
> 13 in the 1st hour of the job
> 18 in the 2nd hour
> 33 in the 3rd hour
> Then the job failed.
> E.g. all the tasktrackers lost in the first 2 hours of the job got logged sometime later with a 'Status from unknown Tracker' in the jobtracker log and got reinitialized.
> I attach some jobracker log messages showing how the heartbeat of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?
> 2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
> 2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070
> 2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
> 2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075
> 2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
> 2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082
> 2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
> 2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085
> 2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
> 2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098
> 2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
> 2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106
> 2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
> 2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108
> 2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
> 2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112
> 2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
> 2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114
> 2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
> 2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116
> 2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
> 2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117
> 2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
> 2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120
> 2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
> 2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (HADOOP-1475) local filecache disappears

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1475:
----------------------------------

    Attachment: dist-cache-purge.patch

This patch clears the cache before the task tracker reinitializes itself. This prevents the file cache from thinking it has files that it in fact doesn't have, because the backing files have been deleted in the reinitialization.

> local filecache disappears
> --------------------------
>
>                 Key: HADOOP-1475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1475
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Christian Kunz
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: dist-cache-purge.patch
>
>
> All our jobs on a 600 node cluster fail. Symptom is that the local filecache disappears.
> It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.
> Side issue is;
> why do we get so many lost tasktrackers which then resume the heartbeat (a kind of 'bogus' lost tasktracker)?. We lost tasktrackers:
> 13 in the 1st hour of the job
> 18 in the 2nd hour
> 33 in the 3rd hour
> Then the job failed.
> E.g. all the tasktrackers lost in the first 2 hours of the job got logged sometime later with a 'Status from unknown Tracker' in the jobtracker log and got reinitialized.
> I attach some jobracker log messages showing how the heartbeat of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?
> 2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
> 2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070
> 2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
> 2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075
> 2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
> 2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082
> 2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
> 2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085
> 2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
> 2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098
> 2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
> 2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106
> 2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
> 2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108
> 2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
> 2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112
> 2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
> 2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114
> 2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
> 2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116
> 2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
> 2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117
> 2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
> 2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120
> 2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
> 2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (HADOOP-1475) local filecache disappears

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1475:
----------------------------------

    Fix Version/s: 0.14.0
         Assignee: Owen O'Malley
           Status: Patch Available  (was: Open)

> local filecache disappears
> --------------------------
>
>                 Key: HADOOP-1475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1475
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Christian Kunz
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: dist-cache-purge.patch
>
>
> All our jobs on a 600 node cluster fail. Symptom is that the local filecache disappears.
> It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.
> Side issue is;
> why do we get so many lost tasktrackers which then resume the heartbeat (a kind of 'bogus' lost tasktracker)?. We lost tasktrackers:
> 13 in the 1st hour of the job
> 18 in the 2nd hour
> 33 in the 3rd hour
> Then the job failed.
> E.g. all the tasktrackers lost in the first 2 hours of the job got logged sometime later with a 'Status from unknown Tracker' in the jobtracker log and got reinitialized.
> I attach some jobracker log messages showing how the heartbeat of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?
> 2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
> 2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070
> 2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
> 2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075
> 2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
> 2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082
> 2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
> 2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085
> 2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
> 2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098
> 2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
> 2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106
> 2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
> 2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108
> 2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
> 2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112
> 2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
> 2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114
> 2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
> 2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116
> 2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
> 2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117
> 2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
> 2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120
> 2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
> 2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (HADOOP-1475) local filecache disappears

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505751 ]

Hadoop QA commented on HADOOP-1475:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12359977/dist-cache-purge.patch against trunk revision r547811.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/294/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/294/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> local filecache disappears
> --------------------------
>
>                 Key: HADOOP-1475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1475
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Christian Kunz
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: dist-cache-purge.patch
>
>
> All our jobs on a 600 node cluster fail. Symptom is that the local filecache disappears.
> It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.
> Side issue is;
> why do we get so many lost tasktrackers which then resume the heartbeat (a kind of 'bogus' lost tasktracker)?. We lost tasktrackers:
> 13 in the 1st hour of the job
> 18 in the 2nd hour
> 33 in the 3rd hour
> Then the job failed.
> E.g. all the tasktrackers lost in the first 2 hours of the job got logged sometime later with a 'Status from unknown Tracker' in the jobtracker log and got reinitialized.
> I attach some jobracker log messages showing how the heartbeat of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?
> 2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
> 2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070
> 2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
> 2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075
> 2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
> 2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082
> 2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
> 2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085
> 2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
> 2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098
> 2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
> 2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106
> 2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
> 2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108
> 2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
> 2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112
> 2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
> 2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114
> 2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
> 2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116
> 2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
> 2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117
> 2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
> 2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120
> 2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
> 2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (HADOOP-1475) local filecache disappears

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1475:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Owen!

http://svn.apache.org/viewvc?view=rev&rev=549200

> local filecache disappears
> --------------------------
>
>                 Key: HADOOP-1475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1475
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Christian Kunz
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: dist-cache-purge.patch
>
>
> All our jobs on a 600 node cluster fail. Symptom is that the local filecache disappears.
> It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.
> Side issue is;
> why do we get so many lost tasktrackers which then resume the heartbeat (a kind of 'bogus' lost tasktracker)?. We lost tasktrackers:
> 13 in the 1st hour of the job
> 18 in the 2nd hour
> 33 in the 3rd hour
> Then the job failed.
> E.g. all the tasktrackers lost in the first 2 hours of the job got logged sometime later with a 'Status from unknown Tracker' in the jobtracker log and got reinitialized.
> I attach some jobracker log messages showing how the heartbeat of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?
> 2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
> 2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070
> 2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
> 2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075
> 2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
> 2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082
> 2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
> 2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085
> 2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
> 2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098
> 2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
> 2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106
> 2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
> 2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108
> 2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
> 2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112
> 2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
> 2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114
> 2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
> 2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116
> 2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
> 2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117
> 2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
> 2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120
> 2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
> 2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.