Nutch inject fails on reduce


Evgeny Zhulenev
Hi.

I'm trying to run Nutch (from trunk) with Hadoop. When I inject URLs into
Nutch with the following command: bin/nutch inject crawl/crawldb
/user/nutch/urls/urllist.txt, I get an exception.

Hadoop.log:

2008-04-03 17:24:36,359 WARN  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2008-04-03 17:24:41,241 WARN  mapred.TaskTracker - Error running child
java.lang.NullPointerException
    at java.util.Hashtable.get(Hashtable.java:334)
    at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1020)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:259)
    at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

As far as I can tell from the Hadoop source code, it fails on:

Long penaltyEnd = penaltyBox.get(loc.getHost());

where loc is MapOutputLocation loc = (MapOutputLocation) locIt.next(); -
the location of a map task's output. Here loc.getHost() returns null. Why
could that happen?
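For what it's worth, java.util.Hashtable rejects null keys, so a null host from loc.getHost() is enough to produce exactly this NullPointerException. A minimal standalone sketch (not Hadoop code; the penaltyBox name and host value are just borrowed from the snippet and logs above for illustration):

```java
import java.util.Hashtable;

public class NullHostDemo {
    public static void main(String[] args) {
        // Same structure as the Hadoop snippet: host -> end of penalty period
        Hashtable<String, Long> penaltyBox = new Hashtable<>();
        penaltyBox.put("linux.100", 1207229077454L);

        String host = null; // simulates loc.getHost() returning null
        try {
            Long penaltyEnd = penaltyBox.get(host);
            System.out.println("penaltyEnd = " + penaltyEnd);
        } catch (NullPointerException e) {
            // Hashtable.get() hashes the key first, so a null key fails here,
            // matching the Hashtable.get frame in the stack trace
            System.out.println("NullPointerException, as in the stack trace");
        }
    }
}
```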

I also found the following in /logs/history/**job_name**.log:

Job JOBID="job_200804031723_0001" JOBNAME="inject
/user/nutch/urls/urllist.txt" USER="nutch" SUBMIT_TIME="1207229071595"
JOBCONF="/nutch/filesystem/mapreduce/system/job_200804031723_0001/job.xml"
Job JOBID="job_200804031723_0001" LAUNCH_TIME="1207229071699" TOTAL_MAPS="1"
TOTAL_REDUCES="1"
MapAttempt TASK_TYPE="MAP" TASKID="tip_200804031723_0001_m_000000"
TASK_ATTEMPT_ID="task_200804031723_0001_m_000000_0"
START_TIME="1207229074040" HOSTNAME="tracker_linux.100:localhost/
127.0.0.1:26784"
MapAttempt TASK_TYPE="MAP" TASKID="tip_200804031723_0001_m_000000"
TASK_ATTEMPT_ID="task_200804031723_0001_m_000000_0" TASK_STATUS="SUCCESS"
FINISH_TIME="1207229077454" HOSTNAME="tracker_linux.100:localhost/
127.0.0.1:26784"
Task TASKID="tip_200804031723_0001_m_000000" TASK_TYPE="MAP"
TASK_STATUS="SUCCESS" FINISH_TIME="1207229077454" COUNTERS="Map-Reduce
Framework.Map input records=2,Map-Reduce Framework.Map output
records=2,Map-Reduce Framework.Map input bytes=48,Map-Reduce Framework.Map
output bytes=106,Map-Reduce Framework.Combine input records=0,Map-Reduce
Framework.Combine output records=0"

As you can see, the map task finished successfully. The 48 input bytes match
the size of my urllist.txt file; I don't know what the 106 output bytes
correspond to. But it seems strange that both Combine input and output
records are 0. Could that be the problem, making it impossible to locate the
map task's output for the reduce phase? Or am I wrong and everything here is
fine?