possible TermInfosReader speedup

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

possible TermInfosReader speedup

Earwin Burrfoot
Currently, when we're seeking a given Term, it does a binary search
across all term space, including terms belonging to other fields.
I propose augmenting fields file with two pointers (firstTerm,
lastTerm) for each field. That reduces range we need to search, and
instead of comparing Terms we only need to compare values.
How does that sound?

Also, on the other topic - how hard is it to boost
TermEnum.skipTo(term) speed to IndexReader.terms(term) level? Would be
nice for TrieRangeFilter and probably some other filters.

--
Kirill Zakharenko/Кирилл Захаренко ([hidden email])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Michael McCandless-2
On Wed, Apr 8, 2009 at 3:46 PM, Earwin Burrfoot <[hidden email]> wrote:

> Currently, when we're seeking a given Term, it does a binary search
> across all term space, including terms belonging to other fields.
> I propose augmenting fields file with two pointers (firstTerm,
> lastTerm) for each field. That reduces range we need to search, and
> instead of comparing Terms we only need to compare values.
> How does that sound?

That sounds great!  Wanna make a patch?

> Also, on the other topic - how hard is it to boost
> TermEnum.skipTo(term) speed to IndexReader.terms(term) level? Would be
> nice for TrieRangeFilter and probably some other filters.

I think all that's needed is to implement SegmentTermEnum.skipTo,
calling something like tis.terms(Term) but instead of returning a
cloned SegmentTermEnum, overwrite the one passed in?

Does TrieRangeFilter use TermEnum.skipTo?  If so, we should certainly fix this.

Also LUCENE-1458 has a more efficient terms index/dict implementation,
but it's probably still a ways off at this point... so if we can make
baby steps in the meantime, that'd be great.

See also this, for historical context:

  http://markmail.org/message/2e7kpvyi3bqtgjwt#query:lucene%20termenum%20skipto+page:1+mid:lb46mbbgpgbnnuxk+state:results

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Earwin Burrfoot
On Thu, Apr 9, 2009 at 00:14, Michael McCandless
<[hidden email]> wrote:
> On Wed, Apr 8, 2009 at 3:46 PM, Earwin Burrfoot <[hidden email]> wrote:
>
>> Currently, when we're seeking a given Term, it does a binary search
>> across all term space, including terms belonging to other fields.
>> I propose augmenting fields file with two pointers (firstTerm,
>> lastTerm) for each field. That reduces range we need to search, and
>> instead of comparing Terms we only need to compare values.
>> How does that sound?
> That sounds great!  Wanna make a patch?
Can try. But I'm not at all comfortable with these parts of Lucene,
will probably need help, at least with tests.

>> Also, on the other topic - how hard is it to boost
>> TermEnum.skipTo(term) speed to IndexReader.terms(term) level? Would be
>> nice for TrieRangeFilter and probably some other filters.
> I think all that's needed is to implement SegmentTermEnum.skipTo,
> calling something like tis.terms(Term) but instead of returning a
> cloned SegmentTermEnum, overwrite the one passed in?
I bet at least MultiSegmentReader.MultiTermEnum should be affected
too? (I'm looking at 2.3.2 sources)

> Does TrieRangeFilter use TermEnum.skipTo?  If so, we should certainly fix this.
It doesn't, but only because skipTo is so obviously slow + I have
another filter in my project that could use skipTo.

Refer to: https://issues.apache.org/jira/browse/LUCENE-1470?focusedCommentId=12651318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12651318
Uwe> I am fine with calling IndexReader.terms(Term) to use the cache
and faster seeking. The cost of creating new instances of TermEnums is
less than doing disk reads.
But other people (like me) might use mmapped indexes, so cost(new
TermEnum)/cost(index read) relation looks different for us.

> See also this, for historical context:
>  http://markmail.org/message/2e7kpvyi3bqtgjwt#query:lucene%20termenum%20skipto+page:1+mid:lb46mbbgpgbnnuxk+state:results
Darn! And api-wise it looks like a legitimate method :)

--
Kirill Zakharenko/Кирилл Захаренко ([hidden email])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: possible TermInfosReader speedup

Uwe Schindler
> >> Also, on the other topic - how hard is it to boost
> >> TermEnum.skipTo(term) speed to IndexReader.terms(term) level? Would be
> >> nice for TrieRangeFilter and probably some other filters.
> > I think all that's needed is to implement SegmentTermEnum.skipTo,
> > calling something like tis.terms(Term) but instead of returning a
> > cloned SegmentTermEnum, overwrite the one passed in?
> I bet at least MultiSegmentReader.MultiTermEnum should be affected
> too? (I'm looking at 2.3.2 sources)
>
> > Does TrieRangeFilter use TermEnum.skipTo?  If so, we should certainly
> fix this.
> It doesn't, but only because skipTo is so obviously slow + I have
> another filter in my project that could use skipTo.
>
> Refer to: https://issues.apache.org/jira/browse/LUCENE-
> 1470?focusedCommentId=12651318&page=com.atlassian.jira.plugin.system.issue
> tabpanels%3Acomment-tabpanel#action_12651318
> Uwe> I am fine with calling IndexReader.terms(Term) to use the cache
> and faster seeking. The cost of creating new instances of TermEnums is
> less than doing disk reads.

I am fascinated; you remember my question... :-)

Yes, if seekTo would work more performant, I could easily use it in
TrieRange and would be happy as noted before. Currently, a new TermEnum is
created on each sub-range. When TrieRange was committed and therefore
updated, for me it was (and still is) not clear, why skipTo may not be as
fast as a new TermEnum.

> But other people (like me) might use mmapped indexes, so cost(new
> TermEnum)/cost(index read) relation looks different for us.
>
> > See also this, for historical context:
> >
>  http://markmail.org/message/2e7kpvyi3bqtgjwt#query:lucene%20termenum%20sk
> ipto+page:1+mid:lb46mbbgpgbnnuxk+state:results
> Darn! And api-wise it looks like a legitimate method :)

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Earwin Burrfoot
On Thu, Apr 9, 2009 at 02:01, Uwe Schindler <[hidden email]> wrote:

>> >> Also, on the other topic - how hard is it to boost
>> >> TermEnum.skipTo(term) speed to IndexReader.terms(term) level? Would be
>> >> nice for TrieRangeFilter and probably some other filters.
>> > I think all that's needed is to implement SegmentTermEnum.skipTo,
>> > calling something like tis.terms(Term) but instead of returning a
>> > cloned SegmentTermEnum, overwrite the one passed in?
>> I bet at least MultiSegmentReader.MultiTermEnum should be affected
>> too? (I'm looking at 2.3.2 sources)
>>
>> > Does TrieRangeFilter use TermEnum.skipTo?  If so, we should certainly
>> fix this.
>> It doesn't, but only because skipTo is so obviously slow + I have
>> another filter in my project that could use skipTo.
>>
>> Refer to: https://issues.apache.org/jira/browse/LUCENE-
>> 1470?focusedCommentId=12651318&page=com.atlassian.jira.plugin.system.issue
>> tabpanels%3Acomment-tabpanel#action_12651318
>> Uwe> I am fine with calling IndexReader.terms(Term) to use the cache
>> and faster seeking. The cost of creating new instances of TermEnums is
>> less than doing disk reads.
>
> I am fascinated; you remember my question... :-)
I don't, I retired from that issue comments earlier :)
But today I was borrowing parts of your code for my version of
rangefilter (which we discussed at the very beginning) and stumbled
upon obviously missed skipTo opportunity. Then I checked the
mailing-list and found there your supporting voice.

> Yes, if seekTo would work more performant, I could easily use it in
> TrieRange and would be happy as noted before. Currently, a new TermEnum is
> created on each sub-range. When TrieRange was committed and therefore
> updated, for me it was (and still is) not clear, why skipTo may not be as
> fast as a new TermEnum.
Check Michael's link below, this method (and its ugly implementation)
is a random offspring of some ancient bugfix. Nobody loved it, and it
grew in neglect.

>> But other people (like me) might use mmapped indexes, so cost(new
>> TermEnum)/cost(index read) relation looks different for us.
>>
>> > See also this, for historical context:
>> >
>>  http://markmail.org/message/2e7kpvyi3bqtgjwt#query:lucene%20termenum%20sk
>> ipto+page:1+mid:lb46mbbgpgbnnuxk+state:results
>> Darn! And api-wise it looks like a legitimate method :)
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



--
Kirill Zakharenko/Кирилл Захаренко ([hidden email])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Michael Busch
In reply to this post by Earwin Burrfoot
On 4/8/09 2:08 PM, Earwin Burrfoot wrote:
On Thu, Apr 9, 2009 at 00:14, Michael McCandless
[hidden email] wrote:
  
On Wed, Apr 8, 2009 at 3:46 PM, Earwin Burrfoot [hidden email] wrote:

    
Currently, when we're seeking a given Term, it does a binary search
across all term space, including terms belonging to other fields.
I propose augmenting fields file with two pointers (firstTerm,
lastTerm) for each field. That reduces range we need to search, and
instead of comparing Terms we only need to compare values.
How does that sound?
      
That sounds great!  Wanna make a patch?
    
Can try. But I'm not at all comfortable with these parts of Lucene,
will probably need help, at least with tests.

  

I was thinking about doing this as part of LUCENE-1195. However, I doubt that the net win will be very noticeable here. A common scenario is that you have an index with one big body field that has a lot of unique terms, plus several metafields that contribute less unique terms. Even if all metafields together would contribute the same amount of additional unique terms as the body field, this proposed change would only save one term comparison per body term lookup. The reason is the O(log(n)) of the in-memory binary search.

The story is a bit different for looking up terms on the smaller metafields. Here you could probably save more term comparisons. But I still think the improvement here would at the end be in the noise. I mean how long do e.g. 30 in-memory term comparisons take compared to all the disk seeks, sequential I/Os, VInts decodings, etc. that every search needs to do? And you probably never have more than 2^30 unique terms in your index.

So I doubt this improvement will be noticeable, but I would be happy if you proved me wrong and this was indeed a long hanging fruit.

  Michael
Reply | Threaded
Open this post in threaded view
|

RE: possible TermInfosReader speedup

Uwe Schindler
In reply to this post by Earwin Burrfoot
> > Yes, if skipTo would work more performant, I could easily use it in
> > TrieRange and would be happy as noted before. Currently, a new TermEnum
> is
> > created on each sub-range. When TrieRange was committed and therefore
> > updated, for me it was (and still is) not clear, why skipTo may not be
> as
> > fast as a new TermEnum.
> Check Michael's link below, this method (and its ugly implementation)
> is a random offspring of some ancient bugfix. Nobody loved it, and it
> grew in neglect.
>
> >> But other people (like me) might use mmapped indexes, so cost(new
> >> TermEnum)/cost(index read) relation looks different for us.
> >>
> >> > See also this, for historical context:
> >> >
> >>
>  http://markmail.org/message/2e7kpvyi3bqtgjwt#query:lucene%20termenum%20sk
> >> ipto+page:1+mid:lb46mbbgpgbnnuxk+state:results
> >> Darn! And api-wise it looks like a legitimate method :)

I think, we should do what was suggested in this thread: Remove it or
deprecate it, if it is nowhere used internally to prevent people (like me in
the past) to try to use it.

Maybe put an additional warning in the JavaDocs in addition to deprecation:
"This method is not effective. It is recommeneded to create a new TermEnum
with IndexReader.terms(Term) instead of skipping."

Or we fix it.

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Michael McCandless-2
On Thu, Apr 9, 2009 at 4:24 AM, Uwe Schindler <[hidden email]> wrote:

> I think, we should do what was suggested in this thread: Remove it or
> deprecate it, if it is nowhere used internally to prevent people (like me in
> the past) to try to use it.
>
> Maybe put an additional warning in the JavaDocs in addition to deprecation:
> "This method is not effective. It is recommeneded to create a new TermEnum
> with IndexReader.terms(Term) instead of skipping."

I agree, let's deprecate it and add that warning, at least, for 2.9.
Or if someone can work out the patch to make it behave like
IndexReader.terms(Term), that's even better!  I'll open an issue &
mark 2.9.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Michael McCandless-2
OK I opened https://issues.apache.org/jira/browse/LUCENE-1592.

Mike

On Thu, Apr 9, 2009 at 4:36 AM, Michael McCandless
<[hidden email]> wrote:

> On Thu, Apr 9, 2009 at 4:24 AM, Uwe Schindler <[hidden email]> wrote:
>
>> I think, we should do what was suggested in this thread: Remove it or
>> deprecate it, if it is nowhere used internally to prevent people (like me in
>> the past) to try to use it.
>>
>> Maybe put an additional warning in the JavaDocs in addition to deprecation:
>> "This method is not effective. It is recommeneded to create a new TermEnum
>> with IndexReader.terms(Term) instead of skipping."
>
> I agree, let's deprecate it and add that warning, at least, for 2.9.
> Or if someone can work out the patch to make it behave like
> IndexReader.terms(Term), that's even better!  I'll open an issue &
> mark 2.9.
>
> Mike
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: possible TermInfosReader speedup

Mike Klaas
In reply to this post by Michael Busch

On 8-Apr-09, at 11:13 PM, Michael Busch wrote:

>
> I was thinking about doing this as part of LUCENE-1195. However, I  
> doubt that the net win will be very noticeable here. A common  
> scenario is that you have an index with one big body field that has  
> a lot of unique terms, plus several metafields that contribute less  
> unique terms. Even if all metafields together would contribute the  
> same amount of additional unique terms as the body field, this  
> proposed change would only save one term comparison per body term  
> lookup. The reason is the O(log(n)) of the in-memory binary search.
>
> The story is a bit different for looking up terms on the smaller  
> metafields. Here you could probably save more term comparisons. But  
> I still think the improvement here would at the end be in the noise.  
> I mean how long do e.g. 30 in-memory term comparisons take compared  
> to all the disk seeks, sequential I/Os, VInts decodings, etc. that  
> every search needs to do? And you probably never have more than 2^30  
> unique terms in your index.
>
> So I doubt this improvement will be noticeable, but I would be happy  
> if you proved me wrong and this was indeed a long hanging fruit.

I had a similar thought.  It is hard to improve logarithmic  
algorithms, since you need to reduce N exponentially to get a linear  
speed up.  OTOH, I have a few indices with several fields with  
multiple-millions of terms each (I don't think that is uncommon.  Even  
indexing a docId per doc can cause this kind of situation).   Also, in  
your scenario, even if the main term search isn't much accelerated,  
the metadata field lookups might be much faster, which could add up to  
some sort of win.

Like Michael, I'm doubtful, but see no reason not to try it!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]