need help to speed up map-reduce

AJ Chen-2
Sorry for repeating this question, but I have to find a solution; otherwise
the crawling is too slow to be practical.  I'm using nutch 0.9-dev on one
Linux server to crawl millions of pages.  The fetching itself is reasonable,
but the map-reduce operations are killing the performance. For example,
fetching takes 10 hours and map-reduce takes another 10 hours, which makes the
overall crawl very slow. Can anyone share experience on how to speed
up map-reduce for single-server crawling?  A single server uses the local file
system, so it should spend very little time doing map and reduce,
shouldn't it?

Thanks,
--
AJ Chen, PhD
http://web2express.org

Re: need help to speed up map-reduce

Doug Cook
I've been planning to spend some time looking at this, but haven't gotten round to it yet -- I see the same (serious) performance problems on a single machine setup -- reduce takes quite a bit longer than the fetch (map) operation in my case, and this is on a very fast 4-CPU machine with a ton of memory. It just doesn't seem like it should take this long. I'm using 0.8 + some patches & local mods.

If you find some things, please let me know. Likewise, when I get round to it, I will post my findings.

Thanks,

Doug



Re: need help to speed up map-reduce

Uroš Gruber-2
I was talking about this slowness months ago, so I'm glad someone else has
the same problem. We also have a single machine, and the reduce task takes
hours to complete. The funny thing is that the CPU is loaded 100%, but when we
search on this server there is no difference in speed. Still, it
would be great if things went faster.

When fetching, I get 20 to 30 pages per second. But then I have to wait for
the reduce task to finish. I tried debug logging, and the only thing I can see
is a gap of about 1 to 3 seconds between reduce log messages. I know that
map/reduce is meant to be used with multiple nodes.

regards

Uros
