Hardware Requirements

Stefan Will-2
Hi all,

I'm building a vertical search index of about 300-500 million web pages
(mostly articles). So I'm trying to figure out what kind of hardware I need
for both backend crawl/index build servers and the frontend search servers.
I would assume I'll need to spider about 5 million pages per day.
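
As a rough sanity check on the crawl numbers above, here is a minimal sketch
of what 5 million pages per day means as a sustained fetch rate, and how long
one full pass over the corpus would take at that rate. The function names are
just for illustration, and the sketch assumes the crawl runs evenly around
the clock with no re-fetching.

```python
SECONDS_PER_DAY = 86_400

def pages_per_second(pages_per_day):
    """Sustained fetch rate needed to hit a daily crawl target."""
    return pages_per_day / SECONDS_PER_DAY

def days_for_full_pass(total_pages, pages_per_day):
    """Days needed to fetch every page once at the given daily rate."""
    return total_pages / pages_per_day

CRAWL_PER_DAY = 5_000_000  # figure from the post

print(round(pages_per_second(CRAWL_PER_DAY), 1))       # 57.9
print(days_for_full_pass(300_000_000, CRAWL_PER_DAY))  # 60.0
print(days_for_full_pass(500_000_000, CRAWL_PER_DAY))  # 100.0
```

So at that rate a complete crawl of the 300-500 million pages would take
roughly 60 to 100 days.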

From my own experience with Lucene I think that a single processor should
be able to serve about 10 million pages at an acceptable rate (~5 queries
per second). So my current assumption is that 5 dual quad-core servers with
32 GB of RAM each and a total of 5 TB of disk on a SAN should be about what
I need to process queries. Does that seem about right? Or should I opt for
local drives (if yes, how many? Striped?) or more or fewer cores per
server?
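
To make the arithmetic behind that estimate explicit, here is a minimal
sketch. It assumes that "a single processor" means one core, that a dual
quad-core server therefore has 8 of them, and that capacity scales linearly
with core count. Those are my working assumptions, not measured results.

```python
import math

PAGES_PER_CORE = 10_000_000  # my estimate from Lucene experience
CORES_PER_SERVER = 8         # dual quad-core, assuming 1 "processor" = 1 core
SERVERS = 5

def search_capacity(servers=SERVERS,
                    cores_per_server=CORES_PER_SERVER,
                    pages_per_core=PAGES_PER_CORE):
    """Total pages the search tier can serve under linear scaling."""
    return servers * cores_per_server * pages_per_core

def servers_needed(total_pages,
                   cores_per_server=CORES_PER_SERVER,
                   pages_per_core=PAGES_PER_CORE):
    """Smallest whole number of servers that covers total_pages."""
    return math.ceil(total_pages / (cores_per_server * pages_per_core))

print(search_capacity())            # 400000000
print(servers_needed(300_000_000))  # 4
print(servers_needed(500_000_000))  # 7
```

Under these assumptions 5 servers cover about 400 million pages, which is
inside the 300-500 million range but short of its upper end.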

I am unsure about what I need for the Hadoop cluster for building/updating
the index though. According to this (pre-Hadoop) article:

http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=144

I'd need a single processor server with 1 GB of RAM and 1 TB of storage
across 8 drives on a RAID controller to handle 100 million pages. But that
is pre-Hadoop. What is the current best practice? 1 GB of RAM also sounds
rather small to me and I would think you'd need more than one processor per
100 million pages. Does it make sense to get a few servers with many cores,
or a lot of servers with single or dual core processors? And what specs?

I'd love to hear what kind of hardware other people are running this size of
index on.

Thanks,
Stefan