Best performance approach for single MP machine?

Doug Cook
Hi,

I've recently switched to 0.8 from 0.7, and after some initial fits and starts, I'm past the "get it working at all" stage to the "get reasonable performance" stage.

I've got a single machine with 4 CPUs and a lot of memory. URL fetching works great because it's (mostly) multithreaded. But as soon as I hit the reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and the phase can take days, leaving me vulnerable to losing everything should a process fail.

Wait! you say. That's just what Hadoop is for! I'm all ears. I'd love some help getting my configuration right. I've seen examples/tutorials of configurations for multiple machines; am I just "faking" multiple machines on my single node (will that work?) or is there a cleaner, simpler approach?

Alternatively, I was all excited to get an easy improvement from -numFetchers by running 4 fetchers simultaneously to use all my CPUs, but it looks like -numFetchers has gone away. There was a patch for 0.8, but at a quick glance it doesn't seem to have made it into the mainline source, and I don't see the value of merging it in myself if there's a cleaner Hadoop-based approach.

Many thanks for any help.

Doug

Re: Best performance approach for single MP machine?

hk-
 
http://www.mail-archive.com/nutch-user@.../msg02394.html

"
Teruhiko Kurosaka wrote:

    Can I use MapReduce to run Nutch on a multi CPU system?
     

Yes.


    I want to run the index job on two (or four) CPUs
    on a single system.  I'm not trying to distribute the job
    over multiple systems.

    If the MapReduce is the way to go,
    do I just specify config parameters like these:
    mapred.tasktracker.tasks.maximum=2
    mapred.job.tracker=localhost:9001
    mapred.reduce.tasks=2 (or 1?)

    and
    bin/start-all.sh

    ?
     

That should work. You'd probably want to set the default number of map
tasks to be a multiple of the number of CPUs, and the number of reduce
tasks to be exactly the number of cpus.

Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker


    Must I use NDFS for MapReduce?
     

No.

Doug

"
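
For a 4-CPU box like the one in the original question, the settings discussed in the quoted exchange could be collected in hadoop-site.xml (the file asked about later in this thread) roughly as follows. This is a minimal sketch, not a tested configuration: mapred.map.tasks is assumed to be the property behind "the default number of map tasks" (the quoted mail does not name it), and the values 8 and 4 are illustrative picks that follow the "multiple of the number of CPUs" and "exactly the number of CPUs" advice.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Run the job tracker locally; host and port as given in the quoted mail -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <!-- Allow up to one concurrent task per CPU (4 CPUs assumed) -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- ASSUMPTION: property name not given in the thread; 8 = 2 x 4 CPUs -->
  <property>
    <name>mapred.map.tasks</name>
    <value>8</value>
  </property>

  <!-- Exactly the number of CPUs, per the advice above -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
</configuration>
```

With something like this in place, the daemons would be started with the two nutch-daemon.sh commands quoted above rather than with start-all.sh.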







Re: Best performance approach for single MP machine?

Doug Cook
Thanks, Håvard (and Doug, in the original email).

Those pointers, plus a few other tips from elsewhere, did the trick. I'm now up and running with all CPUs.

One thing I found along the way was that if I did not set mapred.child.heap.size, I would run out of heap space in initialization of inject with even a small URL list. Is this normal? If so, why not have a reasonable default for heap.size? If this is not normal, is it indicative of something else I might have misconfigured?

In any case, I'm running now, just curious (and would like for others to avoid having to "discover" this).

-Doug
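
For readers who hit the same out-of-heap error, the override described above might look like this in hadoop-site.xml. This is a sketch under assumptions: the 512m value is purely illustrative (the post does not say what value was used), and the JVM-style size syntax is assumed rather than confirmed here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- ASSUMPTION: illustrative value only; tune to the memory available -->
  <property>
    <name>mapred.child.heap.size</name>
    <value>512m</value>
  </property>
</configuration>
```

If each task runs in its own child JVM, as is expected here, the total heap in use is roughly this value times the number of concurrent tasks, so it is worth checking the product against the machine's physical memory.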

Re: Best performance approach for single MP machine?

Thomas Delnoij-3
Hi Doug,

Could you post your hadoop-site.xml? I would like to
accomplish the same.

Rgrds. Thomas

On 7/21/06, Doug Cook <[hidden email]> wrote:

>
> Thanks, Håvard (and Doug, in the original email).
>
> Those pointers, plus a few other tips from elsewhere, did the trick. I'm now
> up and running with all CPUs.