I'm building a vertical search index of about 300-500 million web pages
(mostly articles). So I'm trying to figure out what kind of hardware I need
for both backend crawl/index build servers and the frontend search servers.
I would assume I'll need to spider about 5 million pages per day.
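Just to spell out the crawl arithmetic behind that figure (these are my own round numbers, not measurements):

```python
# Sustained fetch rate implied by 5M pages/day, and how long a
# full crawl of the target corpus would take at that rate.
pages_per_day = 5_000_000
seconds_per_day = 86_400

print(f"{pages_per_day / seconds_per_day:.0f} fetches/sec sustained")  # ~58/sec

for corpus in (300_000_000, 500_000_000):
    print(f"{corpus:,} pages -> {corpus / pages_per_day:.0f} days per full crawl")
```

So even the low end of the corpus means roughly two months per complete pass, which is part of why I want to get the crawler hardware right.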
From my own experience with Lucene I think that a single processor should
be able to serve about 10 million pages at an acceptable rate (~5 queries
per second). So my current assumption is that 5 dual quad core servers with
32 GB of RAM each and a total of 5 TB of Disk on a SAN should be about what
I need to process queries. Does that seem about right? Or should I opt for
local drives (if yes, how many? Striped?) or more or fewer cores per server?
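To sanity-check my own arithmetic, here is the back-of-envelope calculation behind the 5-server figure (the per-core capacity is just my guess from the Lucene experience above):

```python
import math

# Assumption (mine, from Lucene experience): one core serves ~10M pages
# at an acceptable rate (~5 queries/sec).
pages_per_core = 10_000_000
cores_per_server = 8  # dual quad-core

for pages_total in (300_000_000, 500_000_000):
    cores = math.ceil(pages_total / pages_per_core)
    servers = math.ceil(cores / cores_per_server)
    print(f"{pages_total:,} pages -> {cores} cores -> {servers} servers")
```

That works out to 4 servers at the low end and 7 at the high end, so my 5-server plan only covers roughly the 400-million-page midpoint.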
I am unsure about what I need for the Hadoop cluster for building/updating
the index though. According to this (pre-Hadoop) article:
I'd need a single-processor server with 1 GB of RAM and 1 TB of storage
across 8 drives on a RAID controller to handle 100 million pages. But that
is pre-Hadoop. What is the current best practice? 1 GB of RAM also sounds
rather small to me, and I would think you'd need more than one processor per
100 million pages. Does it make sense to get a few servers with many cores,
or a lot of servers with single- or dual-core processors? And what specs?
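For reference, naively scaling the article's per-100-million-page box to my corpus looks like this (pure multiplication, ignoring that Hadoop changes the picture entirely):

```python
import math

# The (pre-Hadoop) article's figures per 100M pages:
ram_gb_per_unit = 1
storage_tb_per_unit = 1
drives_per_unit = 8

pages = 500_000_000  # upper end of my target corpus
units = math.ceil(pages / 100_000_000)  # "article-sized" units needed

print(f"{units * ram_gb_per_unit} GB RAM, "
      f"{units * storage_tb_per_unit} TB storage, "
      f"{units * drives_per_unit} drives")
```

A total of 5 GB of RAM across the whole build cluster for 500 million pages is what makes me suspect those numbers are badly out of date.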
I'd love to hear what kind of hardware other people are running this size of
index on.