Where is the temporary output data of a map or reduce task?

xeonmailinglist-gmail
With MapReduce v2 (YARN), the output of a map or reduce task is saved to the local disk or to HDFS only when all the tasks finish.

Since tasks end at different times, I was expecting the data to be written as each task finishes. For example, task 0 finishes and its output is written while task 1 and task 2 are still running. Then task 2 finishes and its output is written while task 1 is still running. Finally, task 1 finishes and the last output is written. But this does not happen. The outputs only appear on the local disk or in HDFS when all the tasks have finished.

I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?


I have set these parameters in `mapred-site.xml`:

    <property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
    <property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>

I still can't find where the intermediate output or the final output is saved as the tasks produce it.

I have listed all directories with `hdfs dfs -ls -R /`, and in the `tmp` directory I have only found the job configuration files:

    drwx------   - root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
    -rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
    -rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
    -rw-r--r--  10 root supergroup     112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
    -rw-r--r--  10 root supergroup       6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
    -rw-r--r--   1 root supergroup        797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
    -rw-r--r--   1 root supergroup      88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
    -rw-r--r--   1 root supergroup     439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
    -rw-r--r--   1 root supergroup     105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
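
To check whether any new file shows up while the tasks are still running, the recursive listing can be filtered by modification time instead of read by eye. This is only a minimal sketch, assuming the listing keeps the same eight-column format as above; `HdfsEntry`, `parse_ls_line` and `files_modified_since` are hypothetical helper names of mine, not part of Hadoop.

```python
from typing import List, NamedTuple, Optional


class HdfsEntry(NamedTuple):
    permissions: str
    size: int
    modified: str  # "YYYY-MM-DD HH:MM", exactly as printed in the listing
    path: str

    @property
    def is_dir(self) -> bool:
        # Directory entries start with "d" in the permissions column.
        return self.permissions.startswith("d")


def parse_ls_line(line: str) -> Optional[HdfsEntry]:
    """Parse one line of `hdfs dfs -ls -R` output, or return None if it is not an entry."""
    # Columns: permissions, replication, owner, group, size, date, time, path.
    # maxsplit=7 keeps a path containing spaces in one piece.
    parts = line.strip().split(None, 7)
    if len(parts) != 8 or not parts[4].isdigit():
        return None  # blank lines, "Found N items" headers, and similar
    perms, _replication, _owner, _group, size, date, time, path = parts
    return HdfsEntry(perms, int(size), f"{date} {time}", path)


def files_modified_since(listing: str, since: str) -> List[HdfsEntry]:
    """Return the regular files whose timestamp is at or after `since`.

    `since` uses the listing's own "YYYY-MM-DD HH:MM" format, so a plain
    string comparison orders the timestamps correctly.
    """
    entries = (parse_ls_line(line) for line in listing.splitlines())
    return [
        e for e in entries
        if e is not None and not e.is_dir and e.modified >= since
    ]
```

The idea would be to save the output of `hdfs dfs -ls -R /` to a text file a few times during the job and pass its contents to `files_modified_since` together with the job start time, so that only files written after the job started are shown.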

Where is the output saved? I mean the output that is stored as the tasks produce it, not the final output that appears when all map or reduce tasks finish.