Resources required for whole web crawl?

Resources required for whole web crawl?

Chris-10
This is a big question...

What kind of resources are required for doing a crawl of the whole web? I'm just looking for ballpark numbers -- servers, bandwidth, cost, etc.

Assumptions:

"Whole web" means roughly the same number of pages crawled by second or third-tier search engines (which is what we're thinking about building). I'm not sure how many pages that is. 10 billion, maybe?

Timeframe: the crawl should take about as long as the minor search engines take -- maybe a month or two? Fast-changing sites would be refreshed more frequently, static sites less so.

Cost: we could get a rough idea if we knew the number of servers, amount of disk per server, and the required bandwidth. It's not too tough to find the cost of renting the cabinets in a data center to do this.

Another big cost would be the engineers to build and maintain it. Perhaps two or three people, full time, supplemented by 24x7 data center support?

I know I'm leaving out a lot of variables, but I'm really just looking for order-of-magnitude numbers. Replies from people who have actually done it, with their actual experiences, would be greatly appreciated.
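
For a rough sense of what those assumptions imply, here is a quick back-of-envelope sketch in Python. The 20 KB average page size is an assumption I'm plugging in, not a confirmed figure; the 60-day window just matches the "month or two" guess above:

# Back-of-envelope crawl estimate. Page count and timeframe are from the
# assumptions above; the average page size is a further assumption.
pages = 10e9              # "whole web" guess: 10 billion pages
days = 60                 # crawl window: roughly two months
avg_page_kb = 20          # assumed average fetched page size, in KB

pages_per_sec = pages / (days * 24 * 3600)
mbps = pages_per_sec * avg_page_kb * 8 / 1000
raw_tb = pages * avg_page_kb / 1e9

print(f"~{pages_per_sec:,.0f} pages/sec sustained")
print(f"~{mbps:,.0f} Mbps of sustained fetch bandwidth")
print(f"~{raw_tb:,.0f} TB of raw fetched HTML")

With those inputs you land around 2,000 pages/sec, roughly 300 Mbps of sustained fetch bandwidth, and about 200 TB of raw HTML, so the totals are very sensitive to the average page size you assume.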

Re: Resources required for whole web crawl?

Dennis Kubes-2
10 billion pages = roughly 2,000 search servers + ~500-1,000 processing machines +
a 100 Mbps line.  Ballpark $3-4 million for the servers, plus $30-50K+ a
month in bandwidth and electricity costs for the data centers.  I am assuming 5M
pages per search server.
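
To make the arithmetic explicit, here is a small sketch using only the figures above (the per-box price is just the implied average, not something Dennis quotes):

# Arithmetic behind the sizing above.
pages = 10e9
pages_per_search_server = 5e6                     # stated assumption

search_servers = int(pages / pages_per_search_server)        # 2,000
proc_low, proc_high = 500, 1000                               # stated range
capex_low, capex_high = 3_000_000, 4_000_000                  # "$3-4 million"

print(f"{search_servers} search servers")
print(f"{search_servers + proc_low} to {search_servers + proc_high} machines total")
print(f"implied ${capex_low // (search_servers + proc_high)} "
      f"to ${capex_high // (search_servers + proc_low)} per box")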

Search servers would need little disk space but lots of RAM, say 8 GB+ each.
Processing machines would need 500 GB+ disks, more likely the newer 2-3x 1 TB
disks, plus 8 GB of ECC memory and multi-core (probably quad-core) processors.
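
As a rough storage cross-check on those specs (the average page size and the 3x overhead factor are assumptions, not figures from the reply):

# Does the processing fleet have room for the crawl data?
pages = 10e9
avg_page_kb = 20          # assumed average page size
overhead = 3              # assumed factor for raw pages + parsed data + indexes

raw_tb = pages * avg_page_kb / 1e9     # ~200 TB of raw HTML
stored_tb = raw_tb * overhead          # ~600 TB with overhead

tb_per_machine = 2.5                   # "2-3x 1T disks" per processing machine
machines_for_storage = stored_tb / tb_per_machine   # ~240 machines

print(f"~{raw_tb:.0f} TB raw, ~{stored_tb:.0f} TB stored")
print(f"~{machines_for_storage:.0f} machines needed for storage alone")

Storage alone would fit on a couple hundred of those machines; the 500-1,000 figure presumably reflects the fetch/parse/index CPU work as much as the disk.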

At a previous company we built a 100M-page system for $35K in hardware and
$2,500 a month in hosting charges.  Second-tier search engines now have
somewhere around 4B pages.
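
Scaling those 100M-page figures linearly is naive, but it gives a sanity check against the ballpark above:

# Naive linear extrapolation from the 100M-page system.
base_pages, base_hw, base_hosting = 100e6, 35_000, 2_500

for target in (4e9, 10e9):             # second-tier size vs. the 10B guess
    scale = target / base_pages
    print(f"{target / 1e9:.0f}B pages: ~${base_hw * scale / 1e6:.1f}M hardware, "
          f"~${base_hosting * scale / 1e3:.0f}K/month hosting")

The hardware side extrapolates to roughly $1.4M at 4B pages and $3.5M at 10B, close to the $3-4M ballpark; the hosting side clearly doesn't extrapolate cleanly, since the linear number comes out far above the $30-50K/month quoted above.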

Hope this helps.

Dennis