Crawling the entire web -- what's involved?

Crawling the entire web -- what's involved?

Chris-10
This is a big picture question on what kind of money and effort it would
require to do a full web crawl. By "full web crawl" I mean fetching the
top four billion or so pages and keeping them reasonably fresh, with
most pages no more than a month out of date.

I know this is a huge undertaking. I just want to get ballpark numbers
on the required number of servers and required bandwidth.

Also, is it even possible to do with Nutch? How much custom coding would
be required? Are there other crawlers that may be appropriate, like
Heritrix?

We're looking into doing a giant text mining app. We'd like to have a
large database of web pages available for analysis. All we need to do is
fetch and store the pages. We're not talking about running a search
engine on top of it.



Re: Crawling the entire web -- what's involved?

ken-32
I believe the last count of the number of servers that Google has is 200,000+.
That should give you an indication of the magnitude of crawling the whole web.


Re: Crawling the entire web -- what's involved?

waterwheel
In reply to this post by Chris-10
Well, just very roughly:
4 billion pages x 20K per page / 1000K per meg / 1000 megs per gig =
80,000 gigs of data transfer every month.

A 100 Mbps connection / 8 megabits per megabyte * 60 seconds in a minute *
60 minutes in an hour * 24 hours in a day * 30 days in a month = 32,400
gigs per month.

So you'd need about three full 100 Mbps connections running at 100%
capacity, 24/7.  Which, as you noted, is a huge undertaking.
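Spelled out as a tiny Python sketch of the same back-of-the-envelope
arithmetic (the 20K average page size and 100% link utilization are just
the assumptions above, not measurements):

    # Back-of-the-envelope bandwidth estimate for refreshing the crawl monthly.
    PAGES = 4_000_000_000        # "top four billion or so pages"
    AVG_PAGE_KB = 20             # assumed average page size
    LINK_MBPS = 100              # one 100 Mbps connection

    total_gb_per_month = PAGES * AVG_PAGE_KB / 1000 / 1000          # ~80,000 GB
    gb_per_link_month = LINK_MBPS / 8 * 60 * 60 * 24 * 30 / 1000    # ~32,400 GB
    links_needed = total_gb_per_month / gb_per_link_month           # ~2.5

    print(round(total_gb_per_month), round(gb_per_link_month), round(links_needed, 1))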

As a second indicator of the scale, IIRC Doug Cutting posted a while ago
that he downloaded and indexed 50 million pages in a day or two with
about 10 servers.

We download about 100,000 pages per hour on a dedicated 10 Mbps
connection.  Nutch will definitely fill more than a 10 Mbps connection
though; I scaled the system back to only use 10 Mbps before I went broke :).

Hopefully those three points will give you an indication of the scale.  Of
course you still have the problem of storing 80,000 gigs of data, and the
even bigger problem of doing something with it.

You'll have to investigate further, but the next hurdle is that the data
likely isn't easily accessible in the format you want.

What you may consider doing is renting someone else's data.  I think
Alexa or Ask (one of those two) has made their entire database available
for a fee based on CPU cycles or something, though the data was three
months old.  Or you could try a smaller search engine like Gigablast
that might share its data with you for a fee.

-g.



It's possible Nutch 0.8 will do this, since it's set up for distributed
computing.


Re: Crawling the entire web -- what's involved?

Uroš Gruber-2
Insurance Squared Inc. wrote:
> Well, just very roughly:
> 4 billion pages x 20K per page / 1000K per meg / 1000 megs per gig =
> 80,000 gigs of data transfer every month.
>
> A 100 Mbps connection / 8 megabits per megabyte * 60 seconds in a minute *
> 60 minutes in an hour * 24 hours in a day * 30 days in a month = 32,400
> gigs per month.
> So you'd need about three full 100 Mbps connections running at 100%
> capacity, 24/7.  Which, as you noted, is a huge undertaking.
True, but I doubt you need to re-download every page continuously. For
that, Nutch would need to send an If-Modified-Since header and skip
pages for which the server responds 304 Not Modified. I saw some patches
for this around, but I think they are not in trunk yet.
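A minimal sketch of that conditional-fetch idea in plain Python (not
Nutch code; the URL and timestamp are placeholders):

    # Conditional re-fetch: send If-Modified-Since; a 304 means the cached
    # copy is still fresh and the page need not be downloaded again.
    import urllib.request
    import urllib.error
    from email.utils import formatdate

    url = "http://example.com/"          # placeholder URL
    last_fetch_epoch = 1_150_000_000     # when we last fetched this page

    req = urllib.request.Request(url)
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))

    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()           # 200 OK: page changed, store the new copy
    except urllib.error.HTTPError as e:
        if e.code == 304:
            body = None                  # 304 Not Modified: keep the cached copy
        else:
            raise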
> As a second indicator of the scale, IIRC Doug Cutting posted a while
> ago that he downloaded and indexed 50 million pages in a day or two
> with about 10 servers.
> We download about 100,000 pages per hour on a dedicated 10 Mbps
> connection.  Nutch will definitely fill more than a 10 Mbps connection
> though; I scaled the system back to only use 10 Mbps before I went broke
> :).
>
Could you please send your config info and what hardware you use for
crawling? We've managed only 10,000 pages per hour, sometimes less, on 100 Mbit/s.

--
Uros


Re: Crawling the entire web -- what's involved?

waterwheel


> Could you please send your config info and what hardware you use for
> crawling? We've managed only 10,000 pages per hour, sometimes less, on 100 Mbit/s.

For that I'm using a Dell 1750 with dual Xeons and 8 gigs of RAM,
though I can get the same with only a single P4 processor.  You've
likely got one of two issues.  First, you don't actually have a 100 Mbps
connection; somewhere there's a bottleneck.  Second, watch the limit
on the size of the files you crawl.  I think we limit our file size to
64K.  If that limit is too big, you end up spending all day downloading
10 MB PDFs; that'll really slow things down.
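For what it's worth, here's roughly how those limits and thread settings
look in nutch-site.xml (property names as in the stock nutch-default.xml
for the 0.8 branch; the values are only illustrative, not a recommendation):

    <!-- Illustrative nutch-site.xml overrides; tune the values to your bandwidth. -->
    <configuration>
      <property>
        <name>http.content.limit</name>
        <value>65536</value>            <!-- truncate anything larger than 64K -->
      </property>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>100</value>              <!-- total fetcher threads -->
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>1</value>                <!-- politeness: one thread per host -->
      </property>
      <property>
        <name>fetcher.server.delay</name>
        <value>1.0</value>              <!-- seconds between requests to one host -->
      </property>
    </configuration>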



Re: Crawling the entire web -- what's involved?

Uroš Gruber-2
Insurance Squared Inc. wrote:

> For that I'm using a Dell 1750 with dual Xeons and 8 gigs of RAM,
> though I can get the same with only a single P4 processor.  You've
> likely got one of two issues.  First, you don't actually have a 100 Mbps
> connection; somewhere there's a bottleneck.  Second, watch the limit
> on the size of the files you crawl.  I think we limit our file size to
> 64K.  If that limit is too big, you end up spending all day downloading
> 10 MB PDFs; that'll really slow things down.
>
Nice server. We've added more disk power, but I think the CPU is the real
bottleneck. When doing MapReduce, the server is running at 97%.

file.content.limit is set to 65536, and http.content.limit is the same. Can
you post your nutch-site.xml values? I'm especially curious about the number
of threads (total, per server), limits, delays, etc.

Thanks

--
Uros