New to nutch, seem to be problems




    My configuration and stats are at the end of this email.  I have set up Nutch to crawl, starting from 100,000 URLs.  The first pass (of 100,000 items) went well, but problems started after that.

    1. Generate takes many hours to complete.  Whether I generate 1 million or 1,000 items, it takes about 5 hours.  Is this normal?

    2. Fetch works great until it is done, then it freezes up indefinitely.  It can fetch 1,000,000 pages in about 12 hours, and all the fetched content is in /tmp, but then it just sits there without returning to the command line.  I let it sit for about 12 hours and eventually broke down and cancelled it.  If I then try to update the database, it of course fails.

    3. Fetch2 runs very slowly.  Even though I am using 80 threads, I only download one object every 5 to 10 seconds.  From the log, I can see that 79 or 80 threads are almost always spinWaiting.

    4. I can't tell if fetch2 freezes like fetch does, as I haven't been able to wait the many days it will take to go through a full fetch with fetch2.
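
    For a rough sense of scale, the figures in items 2 to 4 can be turned into throughput numbers.  This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes a 1,000,000-page segment and that the 5 to 10 seconds per page I see with fetch2 would hold for the whole run.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def pages_per_second(pages, hours):
    """Average fetch rate for a run that fetched `pages` pages in `hours` hours."""
    return pages / (hours * SECONDS_PER_HOUR)


def days_to_fetch(pages, seconds_per_page):
    """Days needed to fetch `pages` pages at a steady `seconds_per_page`."""
    return pages * seconds_per_page / SECONDS_PER_DAY


# Old fetcher: about 1,000,000 pages in about 12 hours (item 2).
fetch_rate = pages_per_second(1_000_000, 12)

# fetch2: one page every 5 to 10 seconds (item 3), assumed constant.
fetch2_days_best = days_to_fetch(1_000_000, 5)
fetch2_days_worst = days_to_fetch(1_000_000, 10)

print(f"fetch:  {fetch_rate:.1f} pages/second")
print(f"fetch2: {fetch2_days_best:.0f} to {fetch2_days_worst:.0f} days for 1,000,000 pages")
```

    At those rates fetch averages roughly 23 pages per second, while fetch2 would need somewhere around two to four months for the same segment, which is why I haven't been able to wait for it to finish.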


    Core Duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.

    The machine has a dedicated 1 Gb Ethernet connection to the web, so bandwidth certainly isn't the problem.

    I have tested on Nutch 0.9 and on the newest daily build, from 2007-08-28.

    I seeded with 100,000 URLs from the Open Directory.  I first ran a pass to load all 100,000, then ran a second pass with topN=1 million (10 times larger than the first set of URLs).  The first pass had no problems; the second pass (and beyond) is where the problems began.