While testing Nutch, I've discovered several issues with hangs
inside specific parsers, and realized that the Fetcher code has no
concept of a per-thread timeout. From experience doing whole-web
crawls, I've found this to be an essential feature for long-term
stability (read: hands-off production crawling for large indices).
As I'm new to this codebase, does the idea of a fetch-thread
timeout (not just an HTTP timeout) for a bad parser already exist? If
so, how would I set it? If not (and looking at the code, I believe
this to be the case), any issue with adding it?
I saw a mention from Doug Cutting on nutch-general on Oct 29th, 2005:
"Also, the mapred fetcher has been changed to succeed even when threads
hang. Perhaps we should change the 0.7 fetcher similarly? I think we
should probably go even farther, and kill threads which take longer than
a timeout to process a url. Thread.stop() is theoretically unsafe, but
I've used it in the past for this sort of thing and never traced
subsequent problems back to it... "
I would agree with Doug on this being "unsafe," but I've used it on
large sites as well. At the very least, restarting the fetcher (can
this be done?) after a hang would help get through the fetch list.
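For what it's worth, here is a minimal sketch of the kind of watchdog I have in mind. This is not Nutch's actual Fetcher API; the class and method names (FetchWatchdog, runWithTimeout) are hypothetical. It interrupts a worker that exceeds its deadline and falls back to Thread.stop() only as a last resort, per Doug's caveat:

```java
// Hypothetical watchdog sketch -- not Nutch's actual Fetcher code.
public class FetchWatchdog {

    // Runs `task` on its own thread; returns true if it finished
    // within timeoutMs, false if it had to be interrupted/stopped.
    public static boolean runWithTimeout(Runnable task, long timeoutMs)
            throws InterruptedException {
        Thread worker = new Thread(task, "fetch-worker");
        worker.setDaemon(true);      // don't keep the JVM alive on a hang
        worker.start();
        worker.join(timeoutMs);      // wait up to timeoutMs for completion
        if (worker.isAlive()) {
            worker.interrupt();      // polite first: interrupt the parser
            worker.join(1000);       // give it a moment to unwind
            if (worker.isAlive()) {
                // Theoretically unsafe, as Doug notes, but it gets
                // past a truly hung parser. Deprecated in modern JVMs.
                worker.stop();
            }
            return false;            // timed out
        }
        return true;                 // completed in time
    }

    public static void main(String[] args) throws InterruptedException {
        // A fast task completes within the deadline.
        System.out.println(runWithTimeout(() -> { }, 1000));

        // A "parser" that blocks far past the deadline gets interrupted.
        System.out.println(runWithTimeout(() -> {
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                // watchdog interrupted us; exit the task
            }
        }, 200));
    }
}
```

The same shape would presumably wrap the per-URL fetch-and-parse step inside each fetcher thread, so one hung parser costs a timeout rather than the whole crawl.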