how to minimize reduce operations when using single machine

AJ Chen-2
I am using the 0.9-dev code with the local file system to crawl on a single
machine. After fetching pages, Nutch spends a huge amount of time in the
"reduce > sort" and "reduce > reduce" phases. This seems unnecessary, since
everything runs on the local file system. I'm not familiar with the map-reduce
code, but I guess it may be possible to control the number of map and reduce
operations. Is it possible to configure Nutch to break the fetch job into only
a few sub-operations, so that there are only one or a few map and reduce
operations? What setting or code can be changed to minimize the time spent on
map-reduce operations when crawling on a single machine?
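
To make the question concrete, this is the kind of override I have in mind for
conf/hadoop-site.xml. It is only a sketch: the property names are the ones I
believe are listed in hadoop-default.xml, the values are guesses, and I have
not verified that any of them changes how long the fetch job spends in the
sort and reduce phases.

```xml
<!-- conf/hadoop-site.xml: untested sketch, values are assumptions -->
<configuration>
  <!-- "local" should run jobs in-process instead of through a job tracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
  <!-- requested number of map tasks per job (as far as I know, only a hint) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>
  <!-- requested number of reduce tasks per job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
</configuration>
```

If these are the wrong knobs, or if the local runner ignores them, a pointer to
the right ones would be appreciated.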

Thanks,
AJ