revisit time as a function of content type

revisit time as a function of content type

Christopher Laux
Hi all,

Thanks for the last answer. I have a more advanced question, if you don't mind:

What is the easiest way to make revisit times depend on the HTTP/HTML
content type? For example, I want to revisit "application/rss+xml" pages
every 12 hours, while "text/html" etc. can remain at 30 days.

Do I have to modify the generate and update functions or could plugins
handle this?

Thanks,
Chris

Re: revisit time as a function of content type

reinhard
Implement your own schedule class and set the property in nutch-site.xml.
In nutch-default.xml you have:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>

The source of DefaultFetchSchedule shows how to implement your own schedule class.
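
To make the 12 hours versus 30 days idea concrete, here is a minimal sketch of the lookup such a schedule class might use. It is plain Java with no Nutch dependency: `ContentTypeIntervals` and `intervalSecondsFor` are hypothetical names, not part of Nutch. In a real setup the lookup would presumably be called from a custom class registered under db.fetch.schedule.class. How the content type reaches that class (for instance through the CrawlDatum metadata) is not covered in this thread and is left open here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): maps a content type to a revisit interval. */
final class ContentTypeIntervals {
    static final int TWELVE_HOURS_SECONDS = 12 * 60 * 60;      // 43,200 s
    static final int THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;  // 2,592,000 s

    // Content types that should be revisited more often than the default.
    private static final Map<String, Integer> INTERVALS = new HashMap<>();
    static {
        INTERVALS.put("application/rss+xml", TWELVE_HOURS_SECONDS);
    }

    /** Returns the revisit interval in seconds; unknown or missing types get 30 days. */
    static int intervalSecondsFor(String contentType) {
        if (contentType == null) {
            return THIRTY_DAYS_SECONDS;
        }
        // Drop parameters such as "; charset=utf-8" and normalise case before the lookup.
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return INTERVALS.getOrDefault(base, THIRTY_DAYS_SECONDS);
    }
}
```

Keeping the mapping in one small class means further content types can be added without touching the scheduling logic itself.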



Re: revisit time as a function of content type

Christopher Laux
Thanks for that hint, which answers my original question. For even
better performance, I would prefer to set the CrawlDatum's fetchInterval
depending on the parsed contents of, say, an XML feed file: if the latest
entries are temporally close together, I want a shorter fetchInterval
than if they lie far apart. Where would be the right place to set that?

Cheers,
Chris
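
The heuristic described above (entries close together mean a shorter interval) can be sketched independently of where it ends up in Nutch. The following is only an illustration under stated assumptions: `FeedIntervalEstimator`, the one-hour floor, and the use of the mean gap between the newest entries are all hypothetical choices, and the thread does not say where such a value should be written into the CrawlDatum.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of Nutch): derives a revisit interval from feed entry times. */
final class FeedIntervalEstimator {
    static final int MIN_INTERVAL_SECONDS = 60 * 60;            // assumed floor: 1 hour
    static final int MAX_INTERVAL_SECONDS = 30 * 24 * 60 * 60;  // ceiling: 30 days

    /**
     * @param entryTimesEpochSeconds publication times of the feed entries, in any order
     * @param recentEntries          how many of the newest entries to look at
     * @return the mean gap between those entries, clamped to [MIN, MAX] seconds
     */
    static int estimateIntervalSeconds(List<Long> entryTimesEpochSeconds, int recentEntries) {
        // With fewer than two timestamps there is no gap to measure: fall back to the ceiling.
        if (entryTimesEpochSeconds == null
                || entryTimesEpochSeconds.size() < 2
                || recentEntries < 2) {
            return MAX_INTERVAL_SECONDS;
        }
        List<Long> sorted = new ArrayList<>(entryTimesEpochSeconds);
        Collections.sort(sorted, Collections.reverseOrder()); // newest first
        int n = Math.min(recentEntries, sorted.size());
        // Mean gap between n consecutive entries = (newest - oldest) / (n - 1).
        long span = sorted.get(0) - sorted.get(n - 1);
        long meanGap = span / (n - 1);
        return (int) Math.max(MIN_INTERVAL_SECONDS, Math.min(MAX_INTERVAL_SECONDS, meanGap));
    }
}
```

For example, three entries published two hours apart give an interval of 7,200 seconds, while a feed whose entries are months apart is capped at the 30-day default.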


(In reply to reinhard schwab's message of Wed, Oct 6, 2010 at 10:14 AM.)