I'm trying to run Nutch (from trunk) with Hadoop. When I inject URLs into
Nutch with the following command: bin/nutch inject crawl/crawldb
/user/nutch/urls/urllist.txt I get an exception.
2008-04-03 17:24:36,359 WARN regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2008-04-03 17:24:41,241 WARN mapred.TaskTracker - Error running child
As far as I can tell from the Hadoop source code, it fails on:
Long penaltyEnd = penaltyBox.get(loc.getHost());
where loc is MapOutputLocation loc = (MapOutputLocation)locIt.next(); - the
location of the output of a Map task. And loc.getHost() returns null. Why
does it return null?
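If I understand the failure correctly, a null host would be enough to blow up that line: assuming penaltyBox is a Hashtable (I'm not certain of its exact type in trunk), looking up a null key throws a NullPointerException, since Hashtable calls hashCode() on the key. A minimal sketch of what I think happens (the map contents and variable names here are made up for illustration):

```java
import java.util.Hashtable;

public class PenaltyBoxNullDemo {
    public static void main(String[] args) {
        // Hypothetical stand-in for the penaltyBox map: host -> penalty end time.
        Hashtable<String, Long> penaltyBox = new Hashtable<>();
        penaltyBox.put("datanode1", 12345L);

        String host = null; // what loc.getHost() apparently returns in my case
        try {
            // Hashtable.get(null) throws NPE because it calls key.hashCode()
            Long penaltyEnd = penaltyBox.get(host);
            System.out.println("penaltyEnd = " + penaltyEnd);
        } catch (NullPointerException e) {
            System.out.println("NPE: Hashtable does not accept null keys");
        }
    }
}
```

So the real question is upstream: why is the MapOutputLocation constructed without a host in the first place?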
I also found the following information in the log /logs/history/**job_name**.log:
As you can see, the Map task finished successfully. Input bytes = 48 - the
size of my urllist.txt file. 106 output bytes - I don't know what that is.
But it's strange that after Combine both input and output records = 0. Maybe
that is the problem? Because of this, it's impossible to locate the Map task
output for the Reduce job? Or am I wrong and everything here is OK?