RE: Best and economical way of setting hadoop cluster for distributed crawling


Markus Jelsma-2
Hello Sachin,

Nutch can run on Amazon AWS without trouble, and probably on any
Hadoop-based provider. This is the most expensive option you have.

Cheaper would be to rent some servers and install Hadoop yourself; getting
it up and running by hand on some servers will take the better part of a day.

The cheapest and easiest, and in almost all cases the best option, is not to
run Nutch on Hadoop and to stay local. A local Nutch can easily handle a
couple of million URLs. So unless you want to crawl many different domains
and expect 10M+ URLs, stay local.

When we first started our business almost a decade ago we rented VPSs first
and then physical machines. This ran fine for some years, but when we had
the option to make some good investments, we bought our own hardware and
have been scaling up the cluster ever since. And with the previous and most
recent AMD-based servers, processing power has become increasingly cheap.

If you need to scale up for the long term, getting your own hardware is
indeed the best option.

Regards,
Markus

-----Original message-----

> From: Sachin Mittal <[hidden email]>
> Sent: Tuesday 22nd October 2019 15:59
> To: [hidden email]
> Subject: Best and economical way of setting hadoop cluster for distributed crawling
>
> Hi,
> I have been running Nutch in local mode and so far I have a good
> understanding of how it all works.
>
> I wanted to start with distributed crawling using some public cloud
> provider.
>
> I just wanted to know if fellow users have any experience in setting up
> Nutch for distributed crawling.
>
> From the Nutch wiki I have some idea of what the hardware requirements
> should be.
>
> I just wanted to know which of the public cloud providers (IaaS or PaaS)
> are good for setting up Hadoop clusters on. Basically, ones on which it is
> easy to set up/manage the cluster and which are easy on the budget.
>
> Please let me know if you folks have any insights based on your experiences.
>
> Thanks and Regards
> Sachin
>

Re: Best and economical way of setting hadoop cluster for distributed crawling

Sachin Mittal
Hi,
I understood the point.
I would also like to run Nutch on my local machine.

So far I am running in standalone mode with the default crawl script, where
the fetch time limit is 180 minutes.
What I have observed is that it usually fetches, parses and indexes 1800
web pages.
I am basically fetching the entire page, and the fetch process is the one
that takes the most time.

I have an i7 processor with 16GB of RAM.

How can I increase the throughput here?
What I have understood is that in local mode there is only one thread
doing the fetch. Is that right?

I guess I would need multiple threads running in parallel.
Would running Nutch in pseudo-distributed mode be an answer here?
It would then run multiple fetchers and I could increase my throughput.

Please let me know.

Thanks
Sachin

Re: Best and economical way of setting hadoop cluster for distributed crawling

Sebastian Nagel-2
Hi Sachin,

> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.

This means 10 pages per minute.

How are the 1800 pages distributed over hosts?

The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host, the crawler is
waiting 50 sec. of every minute (10 pages x 5 sec.) and fetching is done
in the remaining 10 sec.

If you have explicit permission to access the host(s) aggressively, you can
decrease the delay (fetcher.server.delay) or even fetch in parallel from
the same host (fetcher.threads.per.queue). Otherwise, please keep the delay
as is and be patient and polite! You also risk getting blocked by the web
admin.
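
For illustration, a sketch of how these could be overridden for a single
run via the crawl script's -D property options (directory names are
placeholders; option handling may differ between Nutch versions, so check
bin/crawl --help for yours):

  # Only with explicit permission from the host owner!
  # Lower the per-host delay to 1 second and allow 2 parallel
  # fetches per host:
  bin/crawl -i \
    -D fetcher.server.delay=1.0 \
    -D fetcher.threads.per.queue=2 \
    -s urls/ crawl/ 5

The same properties can also be set permanently in conf/nutch-site.xml.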

> What I have understood here is that in local mode there is only one
> thread doing the fetch?

No. The number of parallel threads used in bin/crawl is 50.
 --num-threads <num_threads>
    Number of threads for fetching / sitemap processing [default: 50]
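
For reference, a sketch of an invocation with the thread count set
explicitly (the seed and crawl directory names are placeholders):

  # 50 fetcher threads (the default), 10 crawl rounds:
  bin/crawl --num-threads 50 -i -s urls/ crawl/ 10

Note that with a single host more threads won't help, because all URLs of
that host go through one politeness queue.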

I can only second Markus: local mode is sufficient unless you're crawling
- significantly more than 10M URLs
- from 1000+ domains

With fewer domains/hosts there's nothing to distribute, because all
URLs of one domain/host are processed in one fetcher task to ensure
politeness.

Best,
Sebastian


RE: Best and economical way of setting hadoop cluster for distributed crawling

Markus Jelsma-2
In reply to this post by Markus Jelsma-2
Hello Sachin,

You might want to check out the fetcher.* settings in your configuration. They control how many threads run in total, how URLs are queued per host, what the delay between successive fetches is, how many threads per queue, etc.
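
A quick way to list these knobs with their default values (a sketch,
assuming a standard Nutch 1.x runtime layout where the defaults live in
conf/nutch-default.xml):

  # Show each fetcher.* property with its default value and the start
  # of its description:
  grep -B1 -A2 '<name>fetcher\.' conf/nutch-default.xml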

Keep in mind: if you do not own the server or have no explicit permission, it is wise not to overdo it (the default settings are recommended); you can easily bring down a website using Nutch, even in local mode.

Regards,
Markus

Re: Best and economical way of setting hadoop cluster for distributed crawling

Sachin Mittal
In reply to this post by Sebastian Nagel-2
OK, understood.
I am using the Nutch defaults and they are set optimally, especially for
polite crawling.
I am indeed crawling just one host right now, and given the defaults the
throughput is what it should be.

Yes, one need not be aggressive here, just patient.

I think nowhere in the near future will I have over 10M URLs to crawl from
1000s of hosts, so local crawling is just fine in my case.
So I will just continue the way it is right now.

Thanks
Sachin