Wrong 'Next Fetch' Date


Two days ago I posted the message below to the nutch-user list.
Because nobody has answered yet, I think this is more of a developer than a
user issue
(to me it looks like a bug).
I would like to discuss it with a Nutch developer.


Just a few days ago we started to use Nutch (0.7.1).
It's really nice and I would like to see it evolve.

Here's my issue/question:

While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
failed with: java.lang.Exception:
org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry
That seems to be ok and indicates some network problems.

The problem is that the entry in the Webdb shows the following:

Page 4: Version: 4
URL: http://www.test-domain.de/crawl_html/page_2.html
ID: b360ec931855b0420776909bd96557c0
Next fetch: Sun Aug 17 07:12:55 CET 292278994
Retries since fetch: 0
Retry interval: 0 days

The 'Next fetch' date is set to the year 292278994.
I probably won't live to see that refetch. ;)
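
The year 292278994 is also the year Java reports for a timestamp of Long.MAX_VALUE milliseconds since the epoch, so the field may hold a "never" sentinel rather than a corrupted date. That is an assumption about the cause, not something confirmed here from the Nutch source. A minimal sketch to check the arithmetic (the class and method names are hypothetical, and it uses only the JDK, not Nutch):

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper class for illustration; not part of Nutch.
class NextFetchCheck {

    // Returns the calendar year (UTC) for a timestamp in epoch milliseconds.
    static int yearOf(long epochMillis) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        // Largest timestamp a Java long can hold, in milliseconds.
        System.out.println(yearOf(Long.MAX_VALUE));
    }
}
```

If the printed year matches the one in the WebDB dump, the entry was most likely written with the maximum long value as its next fetch time.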

What's wrong here? I hope it's not my lifespan. ;)
A page that couldn't be crawled because of network problems
should be refetched with the next crawl (i.e. its next fetch date set to the
next day).
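
The expected behaviour can be sketched as follows. This is only an illustration of the rule described above, under the assumption that "next crawl" means one day later; the class, constant, and method are hypothetical and are not Nutch API:

```java
import java.util.Date;

// Hypothetical sketch of the expected retry rule; not Nutch code.
class RetryPolicySketch {

    // One day in milliseconds (assumed retry delay).
    static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

    // For a transient failure (e.g. a network problem), schedule the
    // next fetch one day after the failed attempt.
    static long nextFetchAfterTransientFailure(long nowMillis) {
        return nowMillis + ONE_DAY_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("Retry at: " + new Date(nextFetchAfterTransientFailure(now)));
    }
}
```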

I'm just using the standard API of Nutch 0.7.1, like this:

WebDBWriter webdb = new WebDBWriter(fileSystem, new File(dbPath));
UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, true, -1);
tool.updateForSegment(fileSystem, lseg);