Nutch merge problem after fetch is aborted with hung threads.
Re-posting to dev list after no response in user list.
---------- Forwarded message ----------
From: Lukas Vlcek <[hidden email]>
Date: Jan 19, 2006 8:42 AM
Subject: Nutch merge problem after fetch is aborted with hung threads.
To: [hidden email]
I am facing an interesting problem. I am crawling in iterative cycles
and it works fine until one of the fetch cycles is prematurely
terminated due to a timeout, which results in this message being
written to the log file: [Aborting with 3 hung threads.] (I am using
3 threads). Let's say that this fetch fetched only 101 pages (out of
500) before it was terminated.
The problem is that I then see only 101 pages in the merged index, no
matter how many pages were fetched in previous cycles. It seems to me
that it is not possible to build a healthy merged index if one of
the fetch cycles is aborted.
If I open the index with Luke, it shows that the total number of
documents is only 101.
Here are details:
My script looks like the following example:
--- start ---
So once the fetch operation is terminated, the rest of the tasks are
executed anyway (updatedb, indexing ...). It also seems to me that in
this case it doesn't matter whether I execute merge at the end of
every cycle or just once after the desired crawl depth is reached.
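
To illustrate the control flow in question, here is a rough sketch of a
cycle that stops as soon as one step fails instead of running the
remaining tasks anyway. It is only an illustration under assumptions:
`run_cycle` is a hypothetical helper, the commands are placeholders
rather than real Nutch invocations, and it assumes that an aborted
fetch reports a non-zero exit status, which I have not confirmed.

```python
import subprocess
import sys


def run_cycle(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Do not run the later steps (updatedb, indexing, merge)
            # on a segment whose fetch did not finish cleanly.
            print(f"step '{name}' exited with {result.returncode}; "
                  "skipping the rest of this cycle")
            break
        completed.append(name)
    return completed


# Placeholder commands standing in for the real steps (assumption):
# one always succeeds, the other always fails.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(1)"]

demo = run_cycle([
    ("generate", ok),
    ("fetch", fail),      # simulates the aborted fetch
    ("updatedb", ok),
    ("index", ok),
])
```

With the simulated failure in the fetch step, only the generate step is
reported as completed and the later tasks are never started.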