Guide to speeding up Map Reduce on single machine setup

Benjamin Higgins
I'd like to know all the known techniques for speeding up MapReduce
on a single-machine setup.

So far, I know of this patch:

http://issues.apache.org/jira/browse/NUTCH-395

I have also read that changing hadoop-site.xml can help, but I don't know
what changes to make.

Please add anything you've found that will help.  I am considering going
back to 0.7 if I can't get Nutch to go faster.  In my case I am also
crawling just a single site.

Ben

Re: Guide to speeding up Map Reduce on single machine setup

Doug Cook

If you are doing a lot of URL filtering with regular expressions, this can take a massive amount of time in the reduce phase. Some speedups may be possible, depending on your usage patterns; some are as simple as config changes, while others will take a patch (which I haven't contributed back yet, but will).

Let me know if you do a lot of filtering, and I'll post a longer list of suggestions.

     -Doug

Re: Guide to speeding up Map Reduce on single machine setup

Benjamin Higgins
I don't do much filtering, but I'd appreciate any tips regardless.

The Generator also takes a long time after logging this message:

"Generator: Selecting best-scoring urls due for fetch."

The sad thing is that I want to fetch ALL the URLs that I have.  Is there some
way to simply skip this step of selecting the best-scoring URLs first?


Re: Guide to speeding up Map Reduce on single machine setup

Zaheed Haque
In reply to this post by Doug Cook
On 11/21/06, Doug Cook <[hidden email]> wrote:

>
>
> If you are doing a lot of URL filtering with regular expressions, this can
> take a massive amount of time in reduce. There may be some speedups
> possible, depending upon your usage patterns; some are as simple as config
> changes, others will take a patch (which I haven't contributed back yet, but
> will).
>
> Let me know if you do a lot of filtering, and I'll post a longer list of
> suggestions.

Yes, I'd like to know, please.
