[jira] Created: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content


[jira] Created: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
Adaptive re-fetch interval. Detecting umodified content
-------------------------------------------------------

         Key: NUTCH-61
         URL: http://issues.apache.org/jira/browse/NUTCH-61
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Reporter: Andrzej Bialecki
 Assigned to: Andrzej Bialecki  


Currently Nutch doesn't automatically adjust its re-fetch period, regardless of whether individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule that tries to adapt the period between consecutive fetches to the period of content changes.

Also, these patches implement a check for whether the content has changed since the last fetch; protocol plugins are also changed to make use of this information, so that unmodified content doesn't have to be fetched and processed.
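
To make the idea concrete, here is a minimal Java sketch of one way such an adaptive schedule could behave. It is not the code from the attached patch: the class name `AdaptiveIntervalSketch`, the rates, and the bounds are all assumptions chosen for illustration. The interval shrinks when a page was found modified, grows when it was not, and is clamped between a minimum and a maximum.

```java
// Illustrative sketch only: not the NUTCH-61 patch.
// All names and constants below are assumptions.
final class AdaptiveIntervalSketch {
    static final float INC_RATE = 0.2f;                 // grow by 20% when unchanged (assumed)
    static final float DEC_RATE = 0.2f;                 // shrink by 20% when changed (assumed)
    static final float MIN_INTERVAL = 60f;              // one minute, in seconds (assumed)
    static final float MAX_INTERVAL = 365f * 24 * 3600; // one year, in seconds (assumed)

    /** Returns the next fetch interval in seconds. */
    static float nextInterval(float currentInterval, boolean modified) {
        float next = modified
                ? currentInterval * (1.0f - DEC_RATE)
                : currentInterval * (1.0f + INC_RATE);
        // Clamp so the interval can neither collapse to zero nor grow without bound.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }

    /** Returns the next fetch time in milliseconds since the epoch. */
    static long nextFetchTime(long fetchTimeMillis, float intervalSeconds) {
        return fetchTimeMillis + (long) (intervalSeconds * 1000.0);
    }

    public static void main(String[] args) {
        float interval = 7f * 24 * 3600; // start at one week
        boolean[] observations = {false, false, true, true, true};
        for (boolean modified : observations) {
            interval = nextInterval(interval, modified);
            System.out.printf("modified=%b -> interval=%.0f s%n", modified, interval);
        }
    }
}
```

A multiplicative adjustment is only one possible choice; it reacts quickly to pages that start changing often, at the cost of some oscillation around the true change period.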

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-61?page=all ]

Andrzej Bialecki  updated NUTCH-61:
-----------------------------------

    Attachment: 20050606.diff

The first round:

* change Page to use a 4-byte float, representing fetchInterval in seconds.

* implement a pluggable FetchSchedule, which adjusts fetchInterval and nextFetchTime

* change FetchListTool and UpdateDatabaseTool to use them. NOTE: it appears there was a bug in FetchListTool, where the fetchlist entries recorded in segments would have their fetchTime increased by 1 week. This is not needed; only pages in WebDB need this.

* improve status reporting throughout all plugins.

* change plugins to detect if the content is unchanged. If possible, plugins will not fetch such content, but in any case they will set their status accordingly.
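
The last bullet can be illustrated with a content signature. This is a minimal sketch, assuming an MD5 digest of the raw bytes is stored with each page and compared on the next fetch. The `FetchStatus` enum and the `UnmodifiedCheckSketch` class are hypothetical names, not Nutch APIs, and the actual patch may detect unchanged content differently (for example through protocol-level checks that avoid the download entirely).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical status values; the real plugins report their own status codes.
enum FetchStatus { FETCHED, NOT_MODIFIED }

final class UnmodifiedCheckSketch {
    /** Computes an MD5 signature of the raw content bytes. */
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Compares new content against the signature stored at the last fetch. */
    static FetchStatus check(byte[] previousSignature, byte[] newContent) {
        boolean same = previousSignature != null
                && Arrays.equals(previousSignature, signature(newContent));
        return same ? FetchStatus.NOT_MODIFIED : FetchStatus.FETCHED;
    }

    public static void main(String[] args) {
        byte[] v1 = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "hello, world".getBytes(StandardCharsets.UTF_8);
        byte[] stored = signature(v1);
        System.out.println(check(stored, v1));
        System.out.println(check(stored, v2));
    }
}
```

A signature comparison still requires downloading the content, so it only saves the processing step; skipping the fetch itself needs support from the protocol.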



Re: [jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

Andrzej Białecki-2
Andrzej Bialecki (JIRA) wrote:

> * improve status reporting throughout all plugins.

Please note that this is an incompatible change between the
ProtocolStatus implemented in the NUTCH-54 patchset and the one here, so if you
created some segments in between, you will need to refetch them.

I didn't feel that the last version of ProtocolStatus had been in use
long enough (3 days?) to merit changing it into a VersionedWritable (and adding
another byte to its size).

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [jira] Created: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

luti
In reply to this post by JIRA jira@apache.org
Dear Andrzej,

This is a very interesting patch, but I have some questions:
- If a page isn't modified, you don't refetch it.
- If you don't refetch the page, is it still in the old segments?
- If it's in the old segments, the segment data will keep growing; how
can one determine which segments can be deleted?
- If it's in the old segments, the usable index will grow larger and
larger. Since there is a limitation (optimally 2 KB of RAM per page), will
this decrease performance or increase RAM usage?

Sorry for my performance question; this patch is very interesting and useful.

Thanks for your answer,
    Ferenc



Re: [jira] Created: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

Andrzej Białecki-2
[hidden email] wrote:
> Dear Andrzej,
>
> This is a very interesting patch, but I have some questions:
> - If a page isn't modified, you don't refetch it.

Correct - that's the whole point of this patch.

> - If you don't refetch the page, is it still in the old segments?

Yes, the page should be in some old segment. This question brings an
interesting dilemma - should I add an option to forcefully refetch a
page, in case you lost the old segment data? Hmmm... In the current code
there is an option "-adddays", but with adjustable interval this doesn't
make much sense.

> - If it's in the old segments, the segment data will keep growing; how
> can one determine which segments can be deleted?

Well, there is no good answer to that even with the current code... You
can use the mergesegs tool to keep only the latest versions of pages. But I
agree, this patch makes the problem of handling old segments more serious:
how to "phase out" older segments.

> - If it's in the old segments, the usable index will grow larger and
> larger. Since there is a limitation (optimally 2 KB of RAM per page), will
> this decrease performance or increase RAM usage?

The index (Lucene index) will not be larger - the deduplication process
takes care of that. Only the latest version of the content will show up
in the index, and for identical content only the one reachable via the
shortest URL.
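
The rule described here can be sketched as two passes: keep the newest record per URL, then keep the shortest URL per content hash. This is a simplified in-memory illustration under stated assumptions, not the actual deduplication tool; the `PageRecord` type, its fields, and the `DedupSketch` class are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type; field names are assumptions for illustration.
final class PageRecord {
    final String url;
    final String contentHash;
    final long fetchTime;

    PageRecord(String url, String contentHash, long fetchTime) {
        this.url = url;
        this.contentHash = contentHash;
        this.fetchTime = fetchTime;
    }
}

final class DedupSketch {
    static List<PageRecord> deduplicate(List<PageRecord> records) {
        // Pass 1: for each URL keep only the most recently fetched version.
        Map<String, PageRecord> latestByUrl = new HashMap<>();
        for (PageRecord r : records) {
            latestByUrl.merge(r.url, r, (a, b) -> a.fetchTime >= b.fetchTime ? a : b);
        }
        // Pass 2: among records with identical content keep the shortest URL.
        Map<String, PageRecord> shortestByHash = new HashMap<>();
        for (PageRecord r : latestByUrl.values()) {
            shortestByHash.merge(r.contentHash, r,
                    (a, b) -> a.url.length() <= b.url.length() ? a : b);
        }
        return new ArrayList<>(shortestByHash.values());
    }

    public static void main(String[] args) {
        List<PageRecord> in = Arrays.asList(
                new PageRecord("http://example.com/a", "h1", 100L),
                new PageRecord("http://example.com/a", "h2", 200L),
                new PageRecord("http://example.com/a/index.html", "h2", 150L));
        for (PageRecord r : deduplicate(in)) {
            System.out.println(r.url + " @ " + r.fetchTime);
        }
    }
}
```

Note that this only keeps the index small; the old segments themselves still hold the superseded versions until they are merged or removed.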

>
> Sorry for my performance question; this patch is very interesting and useful.

Thanks for the good, thought-provoking questions!


[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361131 ]

raghavendra prabhu commented on NUTCH-61:
-----------------------------------------

Will the same thing work for a filesystem?

For a file system, we can get the modification date directly and store it in the db.

The plugins will look at the content date, and if it is different they will index it.

Otherwise they will not fetch it.

This can be a solution for file-based content.

(The thing is, it does away entirely with the fetch interval and makes the decision based only on the file modification date.)
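
A minimal sketch of the check being proposed, assuming the last known modification time is stored per file in the db as milliseconds since the epoch. `FileModifiedSketch` and `needsRefetch` are hypothetical names and not part of any Nutch file protocol plugin.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileModifiedSketch {
    /**
     * Returns true if the file's modification time differs from the value
     * recorded at the previous fetch (milliseconds since the epoch).
     */
    static boolean needsRefetch(Path file, long storedModifiedMillis) throws IOException {
        long current = Files.getLastModifiedTime(file).toMillis();
        return current != storedModifiedMillis;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("nutch-sketch", ".txt");
        try {
            long stored = Files.getLastModifiedTime(tmp).toMillis();
            System.out.println("unchanged file needs refetch: " + needsRefetch(tmp, stored));
            System.out.println("stale record needs refetch: " + needsRefetch(tmp, stored - 1000));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The comparison uses inequality rather than "newer than", so a file restored from backup with an older timestamp is also picked up.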



[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361133 ]

Andrzej Bialecki  commented on NUTCH-61:
----------------------------------------

This patch already supports this. Anyway, it needs to be significantly re-worked to fit into the current development version.



[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361302 ]

byron miller commented on NUTCH-61:
-----------------------------------

Is there a patch modified for the current branch, or should I take a stab at this?



[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361311 ]

Andrzej Bialecki  commented on NUTCH-61:
----------------------------------------

I'm working on this; the patch will be available in a couple of days. I could then use your help with review and testing... ;-)



[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361346 ]

byron miller commented on NUTCH-61:
-----------------------------------

Most definitely! I'll be happy to give it a whirl!



[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-61?page=all ]

Andrzej Bialecki  updated NUTCH-61:
-----------------------------------

    Attachment: 20051230.txt

Updated version for the latest mapred branch.



[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-61?page=all ]

Andrzej Bialecki  updated NUTCH-61:
-----------------------------------

    Attachment: 20060227.txt

This patch is updated to the current trunk/. The default configuration works as before, and uses DefaultFetchSchedule.

If there are no major objections I will commit it shortly.



[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368050 ]

Jerome Charron commented on NUTCH-61:
-------------------------------------

Not an objection, but a simple comment.
Why not make FetchSchedule a new ExtensionPoint, with DefaultFetchSchedule and AdaptiveFetchSchedule as fetch schedule plugins?




[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368051 ]

Andrzej Bialecki  commented on NUTCH-61:
----------------------------------------

I contemplated this for a while, and then decided against it.

The main reason was that currently most of the "pluggable" extensions that result in running a single selected plugin are handled using a simple Factory pattern, as opposed to the ChainedFilter pattern, where we use extension points.

I guess the original reason was that implementations would almost always consist of a single class, so it didn't make sense to complicate it and require the whole plugin infrastructure ... It would be the same in this case (just a single class), so I followed the same pattern.

It's easy to change this to use an extension point, if people prefer it this way.
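
For readers unfamiliar with the distinction, here is a minimal sketch of the simple Factory approach: a single implementation class is named in the configuration and instantiated by reflection, with no plugin infrastructure involved. The interface signature, the method bodies, and the property name `fetch.schedule.class` are assumptions made for this example; only the class names DefaultFetchSchedule and AdaptiveFetchSchedule come from this thread, and the real classes in the patch will differ.

```java
import java.util.Properties;

// Hypothetical interface; the signature is an assumption for this sketch.
interface FetchSchedule {
    float nextInterval(float currentInterval, boolean modified);
}

// Keeps the interval fixed, mirroring the "works as before" default.
class DefaultFetchSchedule implements FetchSchedule {
    public float nextInterval(float currentInterval, boolean modified) {
        return currentInterval;
    }
}

// Shrinks or grows the interval by an assumed 20%.
class AdaptiveFetchSchedule implements FetchSchedule {
    public float nextInterval(float currentInterval, boolean modified) {
        return modified ? currentInterval * 0.8f : currentInterval * 1.2f;
    }
}

final class FetchScheduleFactory {
    static final String KEY = "fetch.schedule.class"; // assumed property name

    /** Instantiates the single implementation selected in the configuration. */
    static FetchSchedule get(Properties conf) {
        String className = conf.getProperty(KEY, DefaultFetchSchedule.class.getName());
        try {
            return (FetchSchedule) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(get(conf).getClass().getSimpleName());
        conf.setProperty(KEY, AdaptiveFetchSchedule.class.getName());
        System.out.println(get(conf).getClass().getSimpleName());
    }
}
```

The trade-off is the one described above: a factory is enough when exactly one implementation is active at a time, while extension points pay off when several implementations are chained.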



[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-61?page=all ]

Andrzej Bialecki  updated NUTCH-61:
-----------------------------------

    Attachment: nutch-61-417287.patch

This patch, besides bringing it up-to-date with the trunk/, also adds a maximum cap on fetch interval and a better strategy for merging records in Injector.

