mapred -numFetchers gone?

mapred -numFetchers gone?

Rod Taylor-2
I used to use -numFetchers to break a single fetch into multiple blocks,
allowing easy retries as well as some overlap with generate and
update.

For example:

generate -numFetchers 4 (blocks 1 through 4)
fetch block1 & fetch block2   (2 threads)
updatedb block1 block2 & fetch block3   (2 threads)
generate -numFetchers 4 (blocks 5 through 8) & fetch block4  (2 threads)
fetch block5 & fetch block6 (2 threads)
updatedb block3 block4 block5 block6 & fetch block7 (2 threads)
generate -numFetchers 4 & fetch block8 (2 threads)

That is, I would make the generate/update cycle dependent on the success
of 50% of the queued fetchers, which meant the other 50% of the fetchers
were available to retrieve data while the previous group was going
through the update/generate phase.

I managed to sustain 30Mb/sec this way (sustained meaning 24/7
downloading) until I hit about 150M pages.

With -numFetchers gone it appears I require a generate/update for each
fetch, which serializes the process.

--
Rod Taylor <[hidden email]>

Re: mapred -numFetchers gone?

Michael Ji
I believe it would not be difficult to change the code in the
generate step to break a single fetchlist into
multiple segments.

Michael Ji,


Re: mapred -numFetchers gone?

Doug Cutting-2
In reply to this post by Rod Taylor-2
Rod Taylor wrote:
> With -numFetchers gone it appears I require a generate/update for each
> fetch which serializes the process.

That's correct.  It would be possible to implement something like the
former behaviour by (as before) setting a page's nextFetch date to a week
out when it is added to a fetchlist.  But in mapreduce, dbupdate and
generate are much faster, both because the crawldb doesn't hold links (and
is thus a lot smaller) and because the crawldb update is distributed, so
the downtime between fetcher cycles is much shorter and this technique may
not be required.  Previously dbupdate took nearly as long as the fetches,
so parallelizing them made a big difference.  But now, in my experience,
the dbupdate/generate overhead is more like 10-20%.  With mapreduce,
what percent of the time do you find that you're not fetching?
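[Editorial note: a minimal sketch of the nextFetch-deferral technique Doug describes, using a hypothetical in-memory crawldb (a plain dict mapping URL to next-fetch time) rather than Nutch's actual on-disk structures. When generate emits a page into a fetchlist, its nextFetch date is pushed a week out, so an overlapping generate run skips pages whose fetch is still outstanding.]

```python
from datetime import datetime, timedelta

ONE_WEEK = timedelta(days=7)

def generate(crawldb, now, limit):
    """Select up to `limit` due pages and defer their nextFetch.

    crawldb: dict mapping url -> nextFetch datetime (a stand-in for
    the real crawldb). Deferring nextFetch means a second generate
    run won't re-select pages that are already queued for fetching.
    """
    due = [url for url, next_fetch in crawldb.items() if next_fetch <= now]
    fetchlist = sorted(due)[:limit]
    for url in fetchlist:
        crawldb[url] = now + ONE_WEEK  # overlapping generates skip this page
    return fetchlist

now = datetime(2005, 9, 30)
db = {"a": now, "b": now, "c": now, "d": now}
first = generate(db, now, 2)   # picks "a", "b" and defers them
second = generate(db, now, 2)  # picks "c", "d" -- not "a", "b" again
```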

Doug

Re: mapred -numFetchers gone?

Rod Taylor-2
On Fri, 2005-09-30 at 21:31 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > With -numFetchers gone it appears I require a generate/update for each
> > fetch which serializes the process.

> parallelizing these made a big difference.  But now, in my experience,
> the dbupdate/generate overhead is more like 10-20%.  With mapreduce,
> what percent of the time do you find that you're not fetching?

At this moment I have an overloaded router causing communication
problems between systems, so I get a ton of socket timeouts, which can
cause the reduce percentage complete to go backward.

I'll get back to you when I have a few hundred million more pages and a
corrected network.

--
Rod Taylor <[hidden email]>