Finding the highest term in a field

Daniel Noll-3-2
Hi all.

If I want to find the lowest term in a field, I can do something like this:

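    public Date computeEarliestDate(IndexReader reader) throws IOException, ParseException {
        // Position the enum on the first term at or after ("date", "00000000").
        TermEnum terms = reader.terms(new Term("date", "00000000"));
        try {
            Term first = terms.term();
            if (first == null || !"date".equals(first.field())) {
                return new Date(); // placeholder; substitute some date before all data
            }
            // dateFormat is a DateFormat field defined elsewhere in the class.
            return dateFormat.parse(first.text());
        } finally {
            terms.close(); // release the enum even if parsing fails
        }
    }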

But what if I want to find the highest?  TermEnum can't step backwards.

I am working under these constraints:
    * It can't involve iterating every value in the TermEnum because
the number of documents is too large for that to be efficient.
    * It has to work with existing text indexes, so I can't cheat by
having another field which sorts in the other direction.

Is my best option a sort of binary search: seek a TermEnum to
different candidate dates until I find a date that still has a term at
or after it, while the following day has no terms at or after it?
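A minimal sketch of that binary search, under loud assumptions: every term in the field is an 8-digit, zero-padded numeric string (so string order matches numeric order), and a `TreeSet<String>` with `ceiling` stands in for seeking a TermEnum via `reader.terms(new Term("date", text))`. The class and method names are hypothetical, and a real implementation would also need to check that the term returned by the seek still belongs to the "date" field.

```java
import java.util.TreeSet;

final class HighestTermSearch {
    /** Stand-in for a TermEnum seek: first term >= text, or null if none. */
    static String seek(TreeSet<String> terms, String text) {
        return terms.ceiling(text);
    }

    /** Returns the highest 8-digit term, or null if the field has no terms. */
    static String findHighest(TreeSet<String> terms) {
        if (seek(terms, "00000000") == null) {
            return null;
        }
        int lo = 0;         // invariant: some term >= format(lo) exists
        int hi = 99999999;  // invariant: no term is greater than format(hi)
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2; // round up so the range always shrinks
            if (seek(terms, String.format("%08d", mid)) != null) {
                lo = mid;       // something lives at or above mid
            } else {
                hi = mid - 1;   // nothing at or above mid
            }
        }
        // lo is now the largest key with a term at or after it, i.e. the highest term.
        return String.format("%08d", lo);
    }
}
```

With a key space of 100,000,000 values this needs roughly 27 seeks regardless of how many terms the field holds.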

Daniel


--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Finding the highest term in a field

Yonik Seeley-2-2
On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll <[hidden email]> wrote:
> But what if I want to find the highest?  TermEnum can't step backwards.

I've also wanted to do the same. It's coming with the new flexible
indexing patch:
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764020#action_12764020

-Yonik
http://www.lucidimagination.com

Re: Finding the highest term in a field

Daniel Noll-3-2
On Thu, Nov 19, 2009 at 16:01, Yonik Seeley <[hidden email]> wrote:
> On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll <[hidden email]> wrote:
>> But what if I want to find the highest?  TermEnum can't step backwards.
>
> I've also wanted to do the same. It's coming with the new flexible
> indexing patch:
> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764020#action_12764020

This sounds interesting.

I take it the existing numeric fields can't already do stuff like
this?  (We don't have access to them yet anyway for backwards
compatibility reasons, otherwise I would have looked into it.  But
next major version...)

For now I am writing a routine which subdivides the term space until
the remaining range looks small enough to iterate instead of seeking
(the crossover seems to be somewhere around 100,000 to 1,000,000 terms,
but the hard part is guessing how many terms fall on either side of
each split).
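One way such a seek-then-iterate routine might look, as a sketch only: a `TreeSet<String>` of 8-digit, zero-padded terms stands in for the term dictionary, `ceiling` stands in for a TermEnum seek, and the hypothetical `maxSpan` parameter bounds the width of the remaining key range rather than the number of terms in it. It therefore sidesteps, rather than solves, the problem of guessing how many terms sit on each side of a split.

```java
import java.util.TreeSet;

final class HybridHighestTerm {
    /**
     * Narrows the key range by seeking until it is at most maxSpan wide,
     * then walks the remaining terms in order and returns the last one.
     */
    static String findHighest(TreeSet<String> terms, int maxSpan) {
        if (terms.ceiling("00000000") == null) {
            return null; // no terms at all
        }
        int lo = 0;         // invariant: some term >= format(lo) exists
        int hi = 99999999;  // invariant: no term is greater than format(hi)

        // Seek phase: halve the key range while it is still too wide.
        while (hi - lo > maxSpan) {
            int mid = lo + (hi - lo + 1) / 2;
            if (terms.ceiling(String.format("%08d", mid)) != null) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }

        // Iteration phase: step forward from lo, like calling next() on a TermEnum.
        String last = null;
        for (String term : terms.tailSet(String.format("%08d", lo), true)) {
            last = term;
        }
        return last;
    }
}
```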

Daniel

Re: Finding the highest term in a field

Yonik Seeley-2-2
On Thu, Nov 19, 2009 at 1:04 AM, Daniel Noll <[hidden email]> wrote:
> I take it the existing numeric fields can't already do stuff like
> this?

Nope, it's a fundamental limitation of the current TermEnums.

-Yonik
http://www.lucidimagination.com

RE: Finding the highest term in a field

Uwe Schindler
Hi Daniel, hi Yonik,

With NumericFields it would be possible to reach the very last position
in the TermEnum faster. First iterate over the lowest-precision terms
until the end is reached. That gives you the prefix of the last term. You
can then position the TermEnum on the first term with the same prefix at
the next higher precision and iterate again. Repeat this until you reach
the highest precision. Depending on the precStep value, you can find the
end much faster. E.g. with the default precStep of 4, each precision
needs to enumerate a theoretical maximum of 16 terms before moving to the
next higher precision. With 32-bit ints you need to do this 8 times, so
you iterate over at most 16*8 = 128 terms (and in practice never that many).
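The descent described above can be sketched on a simplified model. This is not the NumericField or NumericUtils API: the term dictionary is modeled as one sorted set per shift, each holding `value >>> shift`, only non-negative ints are handled, and all names (`NumericDescent`, `index`, `findMax`) are hypothetical. The `visited` counter is there only to show how few terms are touched.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class NumericDescent {
    static final int PRECISION_STEP = 4;

    /** Model only: for each shift, the set of prefixes a trie-encoded field would hold. */
    static Map<Integer, NavigableSet<Integer>> index(int... values) {
        Map<Integer, NavigableSet<Integer>> byShift = new TreeMap<>();
        for (int shift = 0; shift < 32; shift += PRECISION_STEP) {
            NavigableSet<Integer> terms = new TreeSet<>();
            for (int v : values) {
                terms.add(v >>> shift);
            }
            byShift.put(shift, terms);
        }
        return byShift;
    }

    /** Walks from the lowest precision to the highest; visited[0] counts terms touched. */
    static Integer findMax(Map<Integer, NavigableSet<Integer>> byShift, int[] visited) {
        Integer prefix = null; // last term found at the previous (coarser) precision
        for (int shift = 32 - PRECISION_STEP; shift >= 0; shift -= PRECISION_STEP) {
            // Seek to the first term sharing the known prefix (or to the start at the top level).
            int from = (prefix == null) ? 0 : prefix << PRECISION_STEP;
            Integer last = null;
            for (int term : byShift.get(shift).tailSet(from, true)) {
                if (prefix != null && (term >>> PRECISION_STEP) != prefix) {
                    break; // left the prefix; cannot happen when prefix is the maximum
                }
                last = term;
                visited[0]++;
            }
            if (last == null) {
                return null; // field has no terms
            }
            prefix = last;
        }
        return prefix; // at shift 0 the term is the full value
    }
}
```

For example, `findMax(index(5, 1000, 123456789, 42), new int[1])` walks 8 precision levels and enumerates at most 16 terms on each, which is the 16*8 bound mentioned above.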

Implementing this requires a good understanding of NumericFields, but it
is possible with a very simple algorithm (simpler than the range splitter
in NumericUtils). If you like, I could help you implement it.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]




