I've spent hours searching around trying to solve this and it's starting to
drive me a little nuts. You all might be my last hope for staying sane.
I have one small site I'm trying to crawl. The site is a handful of
different JSPs that are essentially templates for people's profiles; the
individual profile pages are generated by passing a URI parameter. Nutch is
actually doing a fine job of crawling the smaller pages, but the main index
is causing trouble.
The main index is a single alphabetized list of 772 links. Nutch fetches
roughly the first 90-110 of them (usually all the A's and B's) but that's
it. I got really excited when I found that the db.max.outlinks.per.page
setting defaults to 100. However, changing it to -1 or a high value doesn't
fix the problem. When I change it to a small value, like 15, the fetcher
grabs even fewer links, so the setting is definitely taking effect.
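For reference, this is the override I currently have in conf/nutch-site.xml.
My reading of the property description in nutch-default.xml is that a
negative value means "process all outlinks", so that part is an assumption
on my end:

    <property>
      <name>db.max.outlinks.per.page</name>
      <!-- Shipped default is 100. A negative value should mean no limit,
           per the description in nutch-default.xml. -->
      <value>-1</value>
    </property>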
There is also a setting for the maximum number of bytes to fetch per page.
If your main index page is large, maybe it's just getting cut off because
of that. The property is http.content.limit, I believe; look for it in
nutch-default.xml.
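If I'm right about the property, a page with 772 links could easily blow
past the default limit (65536 bytes, if memory serves). Something like this
in conf/nutch-site.xml should lift the cap; the -1 meaning "no truncation"
is my reading of the description in nutch-default.xml:

    <property>
      <name>http.content.limit</name>
      <!-- Nonnegative values truncate downloaded content at that many
           bytes; a negative value should disable truncation entirely. -->
      <value>-1</value>
    </property>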