Find "latest" document (before a certain date)


Per Lindberg-2
Hi!

I have an index containing the following fields

   "id" (not to be confused with the internal Lucene id)
   "version"
   "date"

The combination of "id" and "version" is unique,
i.e. there may be several versions of each document
with the same id.

The "date" field indicates when the version was created.

There is also a tokenized "content" field.

Now, I want to search the content, and return only the
LATEST found document with each id. To complicate
things a bit, I want the latest before a given date. In other
words, for each id pick only the one with the highest date
less than x.

I guess I *could* sort the result hits on id and version,
and then write code to scan through the hits and pick out
the latest for each id. But I have a hunch that there is a
simpler way.
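
A minimal sketch of that sort-and-scan idea, in plain Java rather than against the Lucene API. The `Hit` class and the `latestBefore` method are hypothetical stand-ins for whatever the search returns, and the sketch assumes that a higher version of an id always has a later date:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a search hit; not a Lucene class. */
final class Hit {
    final String id;
    final int version;
    final long date; // any value that sorts chronologically, e.g. 20070828

    Hit(String id, int version, long date) {
        this.id = id;
        this.version = version;
        this.date = date;
    }
}

final class SortAndScan {
    /** For each id, returns the highest version whose date is strictly less than x. */
    static List<Hit> latestBefore(List<Hit> hits, long x) {
        // Sort a copy on id, then version, so all versions of an id are adjacent.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Hit h) -> h.id).thenComparingInt(h -> h.version));

        List<Hit> result = new ArrayList<>();
        Hit candidate = null; // best hit seen so far for the current id
        for (Hit h : sorted) {
            if (candidate != null && !candidate.id.equals(h.id)) {
                result.add(candidate); // the id changed, so the previous id is settled
                candidate = null;
            }
            if (h.date < x) {
                candidate = h; // a later version of the same id replaces an earlier one
            }
        }
        if (candidate != null) {
            result.add(candidate);
        }
        return result;
    }
}
```

Ids that have no version before x simply do not appear in the result.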

Perhaps some homebrew Filter, or other trick. The query
syntax does not seem to support a question like "for each
value of the id field among the found hits, give me the one
with the highest date less than x"...

Cheers,
Per Lindberg



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Find "latest" document (before a certain date)

Karl Wettin

On 28 Aug 2007, at 17:48, Per Lindberg wrote:

>
> Now, I want to search the content, and return only the
> LATEST found document with each id. To complicate
> things a bit, I want the latest before a given date. In other
> words, for each id pick only the one with the highest date
> less than x.

Given you added documents with version time stamp in chronological
order, how about using a RangeQuery and pick the hit with the
greatest document number?
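
Roughly, the picking step could look like the sketch below. It is plain Java, not Lucene code: `NumberedHit` is a hypothetical stand-in for a hit, `docNumber` models the internal document number, and the date check stands in for what the range restriction would do inside the search. It only gives the right answer under the assumption stated above, that versions were added in chronological order:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a hit; docNumber models the internal document number. */
final class NumberedHit {
    final int docNumber;
    final String id;
    final long date;

    NumberedHit(int docNumber, String id, long date) {
        this.docNumber = docNumber;
        this.id = id;
        this.date = date;
    }
}

final class LatestByDocNumber {
    /** For each id, keeps the hit with the greatest document number among those dated before x. */
    static Map<String, NumberedHit> latestBefore(List<NumberedHit> hits, long x) {
        Map<String, NumberedHit> latest = new LinkedHashMap<>();
        for (NumberedHit h : hits) {
            if (h.date >= x) {
                continue; // in a real search the range restriction on "date" would drop these
            }
            // Keep whichever of the stored hit and the new hit was added later.
            latest.merge(h.id, h, (a, b) -> a.docNumber >= b.docNumber ? a : b);
        }
        return latest;
    }
}
```

This is a single pass with one map entry per id, so no sorting of the hits is needed.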


--
karl




Re: Find "latest" document (before a certain date)

Per Lindberg-2
 

> From: Karl Wettin [mailto:[hidden email]]
> On 28 Aug 2007, at 17:48, Per Lindberg wrote:
>
> > Now, I want to search the content, and return only the
> > LATEST found document with each id. To complicate
> > things a bit, I want the latest before a given date. In other
> > words, for each id pick only the one with the highest date
> > less than x.
>
> Given you added documents with version time stamp in chronological
> order, how about using a RangeQuery and pick the hit with the
> greatest document number?

Yep, that did the trick! There seems to be no Filter that can do
the final picking of the highest date, so I had to do that after the
search.

I use IndexSearcher.search with a RangeFilter,
I presume that's just as efficient as a RangeQuery?

Thanks!
Per




Re: Find "latest" document (before a certain date)

Karl Wettin
In reply to this post by Per Lindberg-2

On 29 Aug 2007, at 12:29, Per Lindberg wrote:

>> how about using a RangeQuery and pick the hit with the
>> greatest document number?
>
> Yep, that did the trick! There seems to be no Filter that can do
> the final picking of the highest date, so I had to do that after the
> search.
>
> I use IndexSearcher.search with a RangeFilter,
> I presume that's just as efficient as a RangeQuery?

It depends, especially on how you reuse the filter.

Benchmark to be sure
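
To illustrate why reuse matters: a filter has to work out which documents fall inside the range, and that work can be kept around if the same filter instance is used for the next search. Below is a minimal sketch of the idea in plain Java. `CachedRangeFilter`, its `dates` array, and the `computations` counter are all hypothetical and only model the behaviour; Lucene has its own `CachingWrapperFilter` for this purpose, which is not shown here.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a date filter that remembers its result per upper bound. */
final class CachedRangeFilter {
    private final long[] dates; // dates[doc] = date of document number doc (stand-in for the index)
    private final Map<Long, BitSet> cache = new HashMap<>();
    int computations = 0; // counts how often the index had to be walked

    CachedRangeFilter(long[] dates) {
        this.dates = dates.clone();
    }

    /** Returns a bit set with one bit per document whose date is strictly less than upperBound. */
    synchronized BitSet bits(long upperBound) {
        BitSet cached = cache.get(upperBound);
        if (cached == null) {
            cached = new BitSet(dates.length);
            for (int doc = 0; doc < dates.length; doc++) {
                if (dates[doc] < upperBound) {
                    cached.set(doc);
                }
            }
            cache.put(upperBound, cached);
            computations++;
        }
        return (BitSet) cached.clone(); // hand out a copy so callers cannot alter the cached bits
    }
}
```

A filter created anew for every request starts with an empty cache, so it gains nothing from this. Note also that the sketch never evicts entries and would have to be discarded when the index changes.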


--
kalle



Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

Per Lindberg-2

> From: Karl Wettin [mailto:[hidden email]]

> On 29 Aug 2007, at 12:29, Per Lindberg wrote:
>
> >> how about using a RangeQuery and pick the hit with the
> >> greatest document number?
> >
> > Yep, that did the trick! There seems to be no Filter that can do
> > the final picking of the highest date, so I had to do that after the
> > search.
> >
> > I use IndexSearcher.search with a RangeFilter,
> > I presume that's just as efficient as a RangeQuery?
>
> It depends, especially on how you reuse the filter.

For each search request (it's a webapp) I currently create
a new IndexSearcher, new Filter and new Sort, call
searcher.search(query, filter, sorter) and later searcher.close().

The literature says that it is desirable to cache the IndexSearcher,
but there's no mention of the memory cost! Since it is said to
take a long time to create, I presume that the IndexSearcher
reads the index files and keeps a lot of stuff in memory, so
the thought of caching one for each HttpSession gives me bad vibes.

(Keeping files open is another concern: the file locking scheme in
NTFS can prevent Tomcat from doing a hot redeploy if the webapp
has open files.)

> Benchmark to be sure

So far searches with Lucene feel astonishingly fast! Yay! :-)





Re: Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

Karl Wettin

On 29 Aug 2007, at 14:32, Per Lindberg wrote:

> For each search request (it's a webapp) I currently create
> a new IndexSearcher, new Filter and new Sort, call
> searcher.search(query, filter, sorter) and later searcher.close().

You really want to reuse the IndexSearcher until new data has
been added to the index. I suppose the same thing goes for filters
and perhaps even sorts?

Start here:

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
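
As a rough sketch of "reuse until new data has been added", in plain Java: the holder below keeps one shared instance and opens a new one only when a version number changes. `SearcherHolder`, `indexVersion`, and `opener` are hypothetical names; how the version is obtained (for example a counter bumped after each indexing run) is an assumption left to the application.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Holds one shared searcher and replaces it only when the index version changes. */
final class SearcherHolder<S> {
    private final LongSupplier indexVersion; // hypothetical: reports the current index version
    private final Supplier<S> opener;        // hypothetical: opens a new searcher on the index
    private S current;
    private long openedAtVersion;

    SearcherHolder(LongSupplier indexVersion, Supplier<S> opener) {
        this.indexVersion = indexVersion;
        this.opener = opener;
    }

    /** Returns the shared searcher, reopening it first if the index has changed since it was opened. */
    synchronized S get() {
        long version = indexVersion.getAsLong();
        if (current == null || version != openedAtVersion) {
            current = opener.get();
            openedAtVersion = version;
        }
        return current;
    }
}
```

The sketch deliberately does not close the replaced searcher, because other requests may still be searching with it; real code has to decide when closing it is safe.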


--
kalle




Re: Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

Patrek
In reply to this post by Per Lindberg-2
Hi,

Answers in the text.
> For each search request (it's a webapp) I currently create
> a new IndexSearcher, new Filter and new Sort, call
> searcher.search(query, filter, sorter) and later searcher.close().
>
> The literature says that it is desirable to cache the IndexSearcher,
> but there's no mention of the memory cost! Since it is said to
> take a long time to create, I presume that the IndexSearcher
> reads the index files and keeps a lot of stuff in memory, so
> the thought of caching one for each HttpSession gives me bad vibes.

Why don't you put it into the context scope
[servletContext.setAttribute("index", indexSearcher)]?

You can have it initialized upon startup with init() and cleanup on
shutdown with destroy()

Hope this helps.

Patrick

>
> (Also keeping open files; the file locking scheme in NTFS
> can prevent Tomcat from doing hot redeploy if the webapp
> has open files).
>
> > Benchmark to be sure
>
> So far searches with Lucene feel astonishingly fast! Yay! :-)

Re: Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

Per Lindberg-2
In reply to this post by Karl Wettin
Kalle and Patrick: many thanks for the suggestions!

Caching the IndexSearcher in the ServletContext sounds like a very good idea.
However, I have to index a number of databases, each with a different Lucene
index. So keeping an IndexSearcher for each may come with a prohibitive
memory cost. But as far as I can tell, speed is not a problem; creating a new
IndexSearcher for each new search is outweighed by HTTP protocol latency.
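
If the per-search cost ever does become noticeable, one possible middle ground between "one searcher per index, forever" and "a new searcher per search" is to keep only the few most recently used searchers open. A minimal sketch in plain Java follows; `BoundedSearcherCache`, `maxOpen`, and `opener` are hypothetical names, and the searcher type is modelled as any `AutoCloseable`:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Keeps at most maxOpen searchers open, closing the least recently used one when full. */
final class BoundedSearcherCache<S extends AutoCloseable> {
    private final Function<String, S> opener; // hypothetical: opens a searcher for a database name
    private final Map<String, S> cache;

    BoundedSearcherCache(final int maxOpen, Function<String, S> opener) {
        this.opener = opener;
        // accessOrder = true keeps the entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, S>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, S> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                closeQuietly(eldest.getValue());
                return true;
            }
        };
    }

    /** Returns the cached searcher for the database, opening one if none is cached. */
    synchronized S get(String database) {
        S searcher = cache.get(database);
        if (searcher == null) {
            searcher = opener.apply(database);
            cache.put(database, searcher); // may evict and close the least recently used entry
        }
        return searcher;
    }

    synchronized int openCount() {
        return cache.size();
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // Sketch only: a real application would log this.
        }
    }
}
```

The memory cost is then bounded by `maxOpen` rather than by the number of databases. The sketch closes an evicted searcher immediately, which is only safe if no other request is still using it.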

Thanks again!



