-refetchonly investigation

Piotr Kosiorowski
Hello,
I started to investigate the -refetchonly flag because of some questions on
the nutch-user mailing list. I was sure it worked as described in one of the
emails on the list:
"-refetchonly generates you an segment(FetchList) that only contains the
urls that need to be refetched based on your refetch interval.
Right, new discovered links are not in the fetchlist that will be
generated by using this option."

But after reading the code and performing some experiments, it looks like
this is not true.

I have inserted 1 url into WebDB - http://lucene.apache.org/nutch/.
I have generated the segment, fetched it, and updated the db.
There are 21 pages in WebDB after the update.
When I do:
bin/nutch generate db segments/ -refetchonly
a new segment is created that contains 20 pages in its fetchlist.
The http://lucene.apache.org/nutch/ page is missing, as it should be,
because its nextFetchTime is greater than now. But all the other, newly
discovered pages are generated into the fetchlist.

They are not fetched when I run "bin/nutch fetch" because they all have
the fetch flag set to false, so the fetcher does not even try to fetch them.
During update they are handled as "pageContentsUnchanged".
So in fact they are not fetched but their nextFetchTime is updated - I
am not sure why such a feature might be useful.
They also take space in the segment, so this affects fetchlists generated
with the -topN option.
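
To make the observed behavior concrete, here is a toy model of it. It is
only a sketch and not the real generator code: Page, generate_refetchonly
and the fetched_before field are hypothetical names, Python is used just
for brevity, and -topN is simplified to plain truncation in list order.
It also assumes that pages which are due for a refetch get fetch=true.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    next_fetch_time: int   # hypothetical stand-in for nextFetchTime in WebDB
    fetched_before: bool   # hypothetical: False for newly discovered pages


@dataclass
class FetchListEntry:
    url: str
    fetch: bool            # models the FetchListEntry.fetch flag


def generate_refetchonly(pages, now, top_n=None):
    """Toy model of the behavior observed with -refetchonly."""
    entries = []
    for page in pages:
        if page.next_fetch_time > now:
            continue  # not due yet, so it is left out of the fetchlist
        # Observed: new pages are still emitted, only with fetch=False.
        entries.append(FetchListEntry(page.url, fetch=page.fetched_before))
    # Simplification: -topN modeled as truncation, with no scoring.
    return entries if top_n is None else entries[:top_n]
```

With one already fetched page that is not yet due and 20 new pages, this
model returns 20 entries, all with fetch=False, which matches the
experiment above. With a top_n limit, those unfetchable entries can fill
the list before any page that is really due for a refetch gets in.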

So in my opinion this behavior is not correct.
I would suggest the following steps:
1) If we simply skip such pages during fetchlist generation, everything
should run without problems and users would get the expected behavior - I
can prepare such a patch (after finishing with the others on my nutch patch
list :)).
2) http://issues.apache.org/jira/browse/NUTCH-49 - the patch presented
there will have exactly the same problem (but working in the opposite
direction) - while preparing the patch for 1) I can take it into account.
3) FetchListEntry.fetch field - I cannot find anything else this field is
responsible for right now. I will look deeper, but at the moment I think
this field can be removed from this object, making the fetchlist smaller
on disk (always a good thing) and removing the handling of this field from
the fetcher and updatedb.
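
A rough sketch of what step 1) would mean, again as a toy model and not
a patch: Page, fetched_before and generate_refetchonly_skipping are
hypothetical names, and -topN is simplified to truncation in list order.
Pages that were never fetched are skipped outright, so every remaining
entry is one to fetch and no per-entry fetch flag is needed (which is
the point of step 3).

```python
from typing import NamedTuple


class Page(NamedTuple):
    url: str
    next_fetch_time: int   # hypothetical stand-in for nextFetchTime in WebDB
    fetched_before: bool   # hypothetical: False for newly discovered pages


def generate_refetchonly_skipping(pages, now, top_n=None):
    """Toy model of the proposed behavior: emit only pages due for refetch."""
    due = [
        page.url
        for page in pages
        if page.fetched_before and page.next_fetch_time <= now
    ]
    # Simplification: -topN modeled as truncation, with no scoring.
    return due if top_n is None else due[:top_n]
```

For the experiment above (one fetched page that is not yet due, plus 20
new pages) this returns an empty list, so nothing would be generated and
no nextFetchTime would be touched during update.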

Maybe I am missing some important aspects of this issue so please
correct me if I am wrong before I start coding.

Regards,
Piotr



Re: -refetchonly investigation

Doug Cutting-2
Piotr Kosiorowski wrote:
> I started to investigate -refetchonly flag because of some questions on
> nutch-user mailing list. I was sure it works as described in one of the
> emails on the list:
> "-refetchonly generates you an segment(FetchList) that only contains the
> urls that need to be refetched based on your refetch interval.
> Right, new discovered links are not in the fetchlist that will be
> generated by using this option."

The original rationale for the "-refetchonly" option was to permit
indexing of all of the urls known to the database, with anchor text,
but without fetching them.  Thus one can, e.g., provide an index of 10M
urls while only actually fetching 1M urls.  I have never actually used
this feature myself.  I don't know whether other folks have ever used
it successfully, nor whether such a feature is in fact desired.

Doug