I'm wondering if anyone would be willing to share the
optimizations/configurations they've used for the whole-web crawling
strategy. I'm using a dual-CPU system with 4GB of RAM and the
performance has been lacking. This is for a large academic domain with
several (hundreds of) sub-domains, and I'm treating it as a whole-web
crawl.
1) What JVM are you using for SMP (Fedora Core 4)? Is there a JVM (with
OS) where the underlying thread management will take full advantage of
both CPUs? It appears the Sun JVM is locking Nutch onto one CPU.
2) What have you done for memory management? 4GB of RAM lets the JVM
grab a large slice of memory, but with segments of the top 10K - 50K
URLs the box grinds to a halt.
3) How are you scripting the processes of fetch, dedup, analyze,
refetch, etc.? The useful scripts from the wiki are a good starting
point but I'm wondering if there is a more advanced/optimized
configuration someone is using.
3a) Specifically, how are you handling/scripting the creation, fetching,
and merging of segments? What segment sizes? Are you using topN or
another method?