I once again need some advice. I have four dual-processor, quad-core 1.8 GHz Xeon servers; each server has 4 GB of RAM and runs Linux. I am using Nutch from SVN (build #334, I think) with Hadoop DFS. I need to know which parameters I can set to get optimal performance from these servers. I have a seed list of about 10,000 URLs (the ignore-external-links option will be set to true). My goal is to crawl in the shortest time possible. Furthermore, I intend to run a single crawl (depth 5) and thus end up with one index.
What advice would you give on this approach, and on which Nutch/Hadoop variables and parameters to set and to what values?