Date first indexed

Thomas Delnoij-3
I have worked through the WritingPluginExample
(http://wiki.apache.org/nutch/WritingPluginExample) example.
Now I am wondering if the following makes any sense. I would like
to store the date (yyyymmdd) on which a Page was first added to the Index. I
thought I could create a plugin that would add a date_indexed field. My
concern is what happens after the fetch interval, when the Page is
refetched.

What happens

- if the Page content has changed? Is the Page updated (i.e. deleted and
re-added) in the index, and would date_indexed be recalculated? (That would
be OK.)
- if the Page hasn't changed? Is the Page also updated? (That would break the
meaning of the date_indexed field; not OK.)

Or does this depend on how I organize my generate/fetch/update/index cycle,
i.e. if I merge my indexes or recreate them from scratch?

Rgrds, Thomas

Re: Date first indexed

Stefan Groschupf-2
Hey,
Maybe the freshly added CrawlDatum.setMetaData can help you store
such information.
However, you would need to hack the Nutch code somehow, since this is not
stored so far and there is not yet an extension point for such a task.

HTH
Stefan


---------------------------------------------
George Orwell was an Optimist
blog: http://www.find23.org
company: http://www.media-style.com



Re: Date first indexed

Thomas Delnoij-3
I am still using 0.7.1 - I think the CrawlDatum.setMetaData is only part of
the trunk.

Is it not possible to just "hack" the MoreIndexingFilter and calculate the
date_indexed field there (similar to how the lastModified field is
calculated), and add a DateIndexedQueryFilter to the
org.apache.nutch.searcher.more package?

Rgrds, Thomas
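
As a rough illustration of the field value discussed above, here is a minimal Java sketch that formats a date as a fixed-width yyyyMMdd string. It is not Nutch code: the class name DateIndexedField is hypothetical, date_indexed is simply the field name proposed in this thread, and the sample date is illustrative. Note that in Java date patterns the month is upper-case MM; lower-case mm means minutes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateIndexedField {
    // Field name proposed in the thread; assumed here, not a built-in Nutch field.
    static final String FIELD = "date_indexed";

    // Upper-case MM is month-of-year; lower-case mm would be minute-of-hour.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Formats a date as an eight-digit string such as 20060217.
    static String format(LocalDate date) {
        return date.format(FORMAT);
    }

    public static void main(String[] args) {
        // Illustrative date only.
        System.out.println(FIELD + "=" + format(LocalDate.of(2006, 2, 17)));
    }
}
```

Because the string is fixed-width with the year first, sorting the values lexicographically also sorts them chronologically, which is convenient for range queries on such a field.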


Re: Date first indexed

Thomas Delnoij-3
Ah, after reading up on the new metadata facility in JIRA, I think I
understand better what you mean - the metadata are added to the WebDB and
persisted across refetches. This way it is possible for the complete index
to be recreated from scratch while maintaining the first indexed date, which
otherwise would be lost, right?

Rgrds, Thomas
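
The idea in this post can be sketched as a "set only if absent" update on per-page metadata. The sketch below is plain Java and does not use the Nutch API: a Map stands in for whatever metadata is persisted with the page, and the class name FirstIndexedDate, the key date_first_indexed and the method stamp are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class FirstIndexedDate {
    // Hypothetical metadata key; the thread does not name one.
    static final String KEY = "date_first_indexed";

    // Stores the date only when the key is missing, so a later refetch
    // leaves the original value untouched. Returns the stored value.
    static String stamp(Map<String, String> metadata, String today) {
        metadata.putIfAbsent(KEY, today);
        return metadata.get(KEY);
    }

    public static void main(String[] args) {
        // Stand-in for metadata that survives across refetches.
        Map<String, String> metadata = new HashMap<>();
        System.out.println(stamp(metadata, "20060217")); // first fetch
        System.out.println(stamp(metadata, "20060319")); // refetch: value is kept
    }
}
```

As long as the metadata outlives the index, the index can be rebuilt from scratch and the field repopulated from the stored value instead of being recomputed at indexing time.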


Re: Date first indexed

Raghavendra Prabhu
What you mentioned is correct.

It is stored in the db.

Rgds
Prabhu



Re: Date first indexed

Stefan Groschupf-2
In reply to this post by Thomas Delnoij-3
> Ah, after reading up on the new metadata facility in JIRA, I think I
> understand better what you mean - the metadata are added to the WebDB and
> persisted across refetches. This way it is possible for the complete index
> to be recreated from scratch while maintaining the first indexed date, which
> otherwise would be lost, right?

Yes. Welcome, you would be the first user besides me to use the
new metadata support. :-D