Performance Improvement for Search using PriorityQueue

classic Classic list List threaded Threaded
41 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
No I haven't done that (to be honest, I don't know how to do that ... :-) ).
That's the reason I ran both tests multiple times and reported the last run.

On Dec 10, 2007 10:24 PM, Mike Klaas <[hidden email]> wrote:

> On 10-Dec-07, at 12:11 PM, Shai Erera wrote:
>
> > Actually, queries on large indexes are not necessarily I/O bound.
> > It depends
> > on how much of the posting list is being read into memory at once.
> > I'm not
> > that familiar with the inner-most of Lucene, but let's assume a
> > posting
> > element takes 4 bytes for docId and 2 more bytes per position in a
> > document
> > (that's without compression, I'm sure Lucene does some compression
> > on the
> > doc Ids). So, I think I won't miss by much by guessing that at most a
> > posting element takes 10 bytes. Which means that 1M posting
> > elements take
> > 10MB (this is considered a very long posting list).
> > Therefore if you read it into memory in chunks (16, 32, 64 KB),
> > most of the
> > time the query spends in the CPU, computing the scores, PQ etc. The
> > real IO
> > operations only involve reading fragments of the posting into
> > memory. In
> > todays hardware, reading 10MB into memory is pretty fast.
> > So I wouldn't be surprised here (unless I misunderstood you).
>
> My experience is that queries against indices which haven't been
> warmed into the os disk cache to be many times slower (this is
> especially true if the prox file is used at all).
>
> I initially assumed that you had cleared the os disk cache between
> the runs of the two algorithms, and were seeing a difference in
> uncached query performance.  I assume though from your comments that
> this isn't the case at all.
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Michael Busch
Shai Erera wrote:
> No I haven't done that (to be honest, I don't know how to do that ... :-) ).
> That's the reason I ran both tests multiple times and reported the last run.
>

Reboot your machine ;-) That's what I usually do - if there's another
way I'd like to know as well!

-Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Daniel Naber-8
On Montag, 10. Dezember 2007, Michael Busch wrote:

> Reboot your machine ;-) That's what I usually do - if there's another
> way I'd like to know as well!

On Linux (kernel 2.6.16 and later), call:

sync ; echo 3 > /proc/sys/vm/drop_caches

Regards
 Daniel

--
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
In reply to this post by Michael Busch
I see :-). It's just that "clear up the disk cache" sounds so professional,
I assumed there's a way to do it that I don't know of ... :-)
Thanks, I'll report again with a larger index measurement.

On Dec 10, 2007 11:06 PM, Michael Busch <[hidden email]> wrote:

> Shai Erera wrote:
> > No I haven't done that (to be honest, I don't know how to do that ...
> :-) ).
> > That's the reason I ran both tests multiple times and reported the last
> run.
> >
>
> Reboot your machine ;-) That's what I usually do - if there's another
> way I'd like to know as well!
>
> -Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Robert Engels
In reply to this post by Michael Busch
On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches
On "OS X", use "purge", which is part of the CHUD tools.
On Windows, I think you're hosed.

On Dec 10, 2007, at 3:06 PM, Michael Busch wrote:

> Shai Erera wrote:
>> No I haven't done that (to be honest, I don't know how to do  
>> that ... :-) ).
>> That's the reason I ran both tests multiple times and reported the  
>> last run.
>>
>
> Reboot your machine ;-) That's what I usually do - if there's another
> way I'd like to know as well!
>
> -Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
Thanks for the info. Too bad I use Windows ...

On Dec 10, 2007 11:16 PM, robert engels <[hidden email]> wrote:

> On linux, use "drop_caches", see http://linux-mm.org/Drop_Caches
> On "OS X", use "purge", which is part of the CHUD tools.
> On Windows, I think you're hosed.
>
> On Dec 10, 2007, at 3:06 PM, Michael Busch wrote:
>
> > Shai Erera wrote:
> >> No I haven't done that (to be honest, I don't know how to do
> >> that ... :-) ).
> >> That's the reason I ran both tests multiple times and reported the
> >> last run.
> >>
> >
> > Reboot your machine ;-) That's what I usually do - if there's another
> > way I'd like to know as well!
> >
> > -Michael
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Mike Klaas
On 10-Dec-07, at 1:20 PM, Shai Erera wrote:

> Thanks for the info. Too bad I use Windows ...

Just allocate a bunch of memory and free it.  This linux, but  
something similar should work on windows:

$ vmstat -S M
procs -----------memory----------
r  b   swpd   free   buff  cache
0  0      0     45    372    786

$ python -c '"a"*2000000000'

$ vmstat -S M
procs -----------memory----------
r  b   swpd   free   buff  cache
0  0    463   1761      0      6

-Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
Hi

Back from the experiments lab with more results. I've used two indexes (1
and 10 million documents) and ran over the two 2000 queries. Each run was
executed 4 times and I paste here the average of the latest 3 (to eliminate
any caching that is done by the OS and to mimic systems that are already
working and therefore have some data in the OS cache). Following are the
results:

Current TopDocCollector + PQ
--------------------------------------------
Index Size         1M              10M
Avg. Time           8.519ms     289.232ms
Avg. Allocations  77.38          97.35
Avg. # results      51,113        461,019

Modified TopDocCollector + PQ
----------------------------------------------
Index Size         1M              10M
Avg. Time           9.619ms     298.197ms
Avg. Allocations  9.92           10.12
Avg. # results      51,113        461,019

Basically the results haven't changed from yesterday. There isn't any
significant difference in the execution time of both versions. The only
difference is the number of allocations.
Although the number of allocations is very small (100 for 461,000 results),
I think it should not be neglected. On systems that rely solely on memory
(such as powerful systems that are able to keep entire indexes in-memory),
the number of object allocations may be significant.

The way I see it we can do either of the following:
1. Add the method to PQ and change TDC implementation to reuse ScoreDocs. We
gain only in the number of allocations. Basically, we don't lose anything by
doing that, we only gain.
2. Add the method to PQ for applications that require it and not change
TDC's implementation. For example, applications that want to show the 10
most recent documents from a very large collection need to run a
MatchAllDocsQuery with some sorting. They may create a lot more instances of
ScoreDoc.
3. Do nothing.

If you think I should run more tests, please let me know - I already have
the two indexes and any further tests can be performed quite immediately.

Thanks,

Shai

On Dec 10, 2007 11:46 PM, Mike Klaas <[hidden email]> wrote:

> On 10-Dec-07, at 1:20 PM, Shai Erera wrote:
>
> > Thanks for the info. Too bad I use Windows ...
>
> Just allocate a bunch of memory and free it.  This linux, but
> something similar should work on windows:
>
> $ vmstat -S M
> procs -----------memory----------
> r  b   swpd   free   buff  cache
> 0  0      0     45    372    786
>
> $ python -c '"a"*2000000000'
>
> $ vmstat -S M
> procs -----------memory----------
> r  b   swpd   free   buff  cache
> 0  0    463   1761      0      6
>
> -Mike
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Timo Nentwig-7
In reply to this post by Paul Elschot
On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> The current TopDocCollector only allocates a ScoreDoc when the given
> score causes a new ScoreDoc to be added into the queue, but it does

I actually wrote my own HitCollector and now wonder about TopDocCollector:

  public void collect(int doc, float score) {
    if (score > 0.0f) {
      totalHits++;
      if (hq.size() < numHits || score >= minScore) {
        hq.insert(new ScoreDoc(doc, score));
        minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
      }
    }
  }

1) How can there be hits with score=0.0?
2) I don't understand minScore: inserts only document having a higher score
than the lowest score already in queue?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
For (1) - I can't explain it but I've run into documents with 0.0f scores.
For (2) - this is a simple logic - if the lowest score in the queue is 'x'
and you want to top docs only, then there's no point in attempting to insert
a document with score lower than 'x' (it will not be added).
Maybe I didn't understand your question correctly though ...

On Dec 11, 2007 2:25 PM, Timo Nentwig <[hidden email]> wrote:

> On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> > The current TopDocCollector only allocates a ScoreDoc when the given
> > score causes a new ScoreDoc to be added into the queue, but it does
>
> I actually wrote my own HitCollector and now wonder about TopDocCollector:
>
>  public void collect(int doc, float score) {
>    if (score > 0.0f) {
>      totalHits++;
>      if (hq.size() < numHits || score >= minScore) {
>        hq.insert(new ScoreDoc(doc, score));
>        minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
>      }
>    }
>  }
>
> 1) How can there be hits with score=0.0?
> 2) I don't understand minScore: inserts only document having a higher
> score
> than the lowest score already in queue?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Nadav Har'El
In reply to this post by Shai Erera
On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue":
> Hi
>
> Lucene's PQ implements two methods: put (assumes the PQ has room for the
> object) and insert (checks whether the object can be inserted etc.). The
> implementation of insert() requires the application that uses it to allocate
> a new object every time it calls insert. Specifically, it cannot reuse the
> objects that were removed from the PQ.

I've read this entire thread, and would like to add my comments about three
independent issues, which I think that can and perhaps should be considered
separately:

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue
   he couldn't just subclass PriorityQueue in his application, but rather
   was forced to modify PriorityQueue itself. Why? just because one field
   of that classi - "heap" - was defined "private" instead of "protected".

   Is there a special reason for that? If not, can we make the trivial change
   to make PriorityQueue's fields protected, to allow Shai and others (see the
   next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside
   Lucene, is also a public class that advanced users often find useful when
   implementing features like new collectors, new queries, and so on.
   Unfortunately, my experience exactly matches Shai's: In the two occasions
   where I used a PriorityQueue, I found that I needed such a
   insertWithOverflow() method. If this feature is so useful (I can't believe
   that Shai and me are the only ones who found it useful), I think it would
   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
   inside Lucene.

   Just to make it sound more interesting, let me give you an example where
   I needed (and implemented) an "insertWithOverflow()": I was implementing a
   faceted search capability over Lucene. It calculated a count for each
   facet value, and then I used a PriorityQueue to find the 10 best values.
   The problem is that I also needed an "other" aggregator, which was supposed
   to aggregate (in various ways) all the facets except the 10 best ones. For
   that, I needed to know which facets dropped off the priorityqueue.

3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
   to be used in TopDocCollector. I have to admit I don't know how much
   of a benefit this will be in the "typical" case. But I do know that
   there's no such thing as a "typical" case...
   I can easily think of a quite typical "worst case" though: Consider a
   collection indexed in order of document age (a pretty typical scenario
   for a long-running index), and then you do a sorting query
   (TopFieldDocCollector), asking it to bring the 10 newest documents.
   In that case, each and every document will have a new DocScore created -
   which is the worst-case that Shai feared.
   It would be nice if Shai or someone else could provide a measurement in
   that case.

P.S. When looking now at PriorityQueue's code, I found two tiny
performance improvements that could be easily made to it - I wonder if
there's any reason not to do them:

 1. Insert can use heap[1] directly instead of calling top(). After all,
    this is done in an if() that already ensures that size>0.

 2. Regardless, top() could return heap[1] always, without any if(). After
    all, the heap array is initialized to all nulls, so when size==0, heap[1]
    is null anyway.



> PriorityQueue change (added insertWithOverflow method)
> -----------------------------------------------------------------------------------
>     /**
>      * insertWithOverflow() is similar to insert(), except its return value:
> it
>      * returns the object (if any) that was dropped off the heap because it
> was
>      * full. This can be the given parameter (in case it is smaller than the
>      * full heap's minimum, and couldn't be added) or another object that
> was
>      * previously the smallest value in the heap and now has been replaced
> by a
>      * larger one.
>      */
>     public Object insertWithOverflow(Object element) {
>         if (size < maxSize) {
>             put(element);
>             return null;
>         } else if (size > 0 && !lessThan(element, top())) {
>             Object ret = heap[1];
>             heap[1] = element;
>             adjustTop();
>             return ret;
>         } else {
>             return element;
>         }
>     }
> [Very similar to insert(), only it returns the object that was kicked out of
> the Queue, or null]

--
Nadav Har'El                        |       Tuesday, Dec 11 2007, 3 Tevet 5768
IBM Haifa Research Lab              |-----------------------------------------
                                    |A professor is one who talks in someone
http://nadav.harel.org.il           |else's sleep.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Peter Keegan
See my similar request from last March:
http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550

Peter

On Dec 11, 2007 11:54 AM, Nadav Har'El <[hidden email]> wrote:

> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for
> Search using PriorityQueue":
> > Hi
> >
> > Lucene's PQ implements two methods: put (assumes the PQ has room for the
> > object) and insert (checks whether the object can be inserted etc.). The
> > implementation of insert() requires the application that uses it to
> allocate
> > a new object every time it calls insert. Specifically, it cannot reuse
> the
> > objects that were removed from the PQ.
>
> I've read this entire thread, and would like to add my comments about
> three
> independent issues, which I think that can and perhaps should be
> considered
> separately:
>
> 1. When Shai wanted to add the insertWithOverflow() method to
> PriorityQueue
>   he couldn't just subclass PriorityQueue in his application, but rather
>   was forced to modify PriorityQueue itself. Why? just because one field
>   of that classi - "heap" - was defined "private" instead of "protected".
>
>   Is there a special reason for that? If not, can we make the trivial
> change
>   to make PriorityQueue's fields protected, to allow Shai and others (see
> the
>   next point) to add functionality on top of PriorityQueue?
>
> 2. PriorityQueue, in addition to being used in about a dozen places inside
>   Lucene, is also a public class that advanced users often find useful
> when
>   implementing features like new collectors, new queries, and so on.
>   Unfortunately, my experience exactly matches Shai's: In the two
> occasions
>   where I used a PriorityQueue, I found that I needed such a
>   insertWithOverflow() method. If this feature is so useful (I can't
> believe
>   that Shai and me are the only ones who found it useful), I think it
> would
>   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
>   inside Lucene.
>
>   Just to make it sound more interesting, let me give you an example where
>   I needed (and implemented) an "insertWithOverflow()": I was implementing
> a
>   faceted search capability over Lucene. It calculated a count for each
>   facet value, and then I used a PriorityQueue to find the 10 best values.
>   The problem is that I also needed an "other" aggregator, which was
> supposed
>   to aggregate (in various ways) all the facets except the 10 best ones.
> For
>   that, I needed to know which facets dropped off the priorityqueue.
>
> 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
>   to be used in TopDocCollector. I have to admit I don't know how much
>   of a benefit this will be in the "typical" case. But I do know that
>   there's no such thing as a "typical" case...
>   I can easily think of a quite typical "worst case" though: Consider a
>   collection indexed in order of document age (a pretty typical scenario
>   for a long-running index), and then you do a sorting query
>   (TopFieldDocCollector), asking it to bring the 10 newest documents.
>   In that case, each and every document will have a new DocScore created -
>   which is the worst-case that Shai feared.
>   It would be nice if Shai or someone else could provide a measurement in
>   that case.
>
> P.S. When looking now at PriorityQueue's code, I found two tiny
> performance improvements that could be easily made to it - I wonder if
> there's any reason not to do them:
>
>  1. Insert can use heap[1] directly instead of calling top(). After all,
>    this is done in an if() that already ensures that size>0.
>
>  2. Regardless, top() could return heap[1] always, without any if(). After
>    all, the heap array is initialized to all nulls, so when size==0,
> heap[1]
>    is null anyway.
>
>
>
> > PriorityQueue change (added insertWithOverflow method)
> >
> -----------------------------------------------------------------------------------
> >     /**
> >      * insertWithOverflow() is similar to insert(), except its return
> value:
> > it
> >      * returns the object (if any) that was dropped off the heap because
> it
> > was
> >      * full. This can be the given parameter (in case it is smaller than
> the
> >      * full heap's minimum, and couldn't be added) or another object
> that
> > was
> >      * previously the smallest value in the heap and now has been
> replaced
> > by a
> >      * larger one.
> >      */
> >     public Object insertWithOverflow(Object element) {
> >         if (size < maxSize) {
> >             put(element);
> >             return null;
> >         } else if (size > 0 && !lessThan(element, top())) {
> >             Object ret = heap[1];
> >             heap[1] = element;
> >             adjustTop();
> >             return ret;
> >         } else {
> >             return element;
> >         }
> >     }
> > [Very similar to insert(), only it returns the object that was kicked
> out of
> > the Queue, or null]
>
> --
> Nadav Har'El                        |       Tuesday, Dec 11 2007, 3 Tevet
> 5768
> IBM Haifa Research Lab
>  |-----------------------------------------
>                                    |A professor is one who talks in
> someone
> http://nadav.harel.org.il           |else's sleep.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2
In reply to this post by Nadav Har'El

I agree that even though we don't see gains on the queries tested,  
there are in theory cases where there could be a great many  
allocations that would be saved.

I think we should do Shai's suggested option 1 (add the method and  
change TDC to call it), change heap to be protected not private, plus  
the 2 tiny performance gains Nadav suggests below?  Shai can you open  
a Jira issue & attach a patch for these changes?  Thanks!

Mike

Nadav Har'El wrote:

> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance  
> Improvement for Search using PriorityQueue":
>> Hi
>>
>> Lucene's PQ implements two methods: put (assumes the PQ has room  
>> for the
>> object) and insert (checks whether the object can be inserted  
>> etc.). The
>> implementation of insert() requires the application that uses it  
>> to allocate
>> a new object every time it calls insert. Specifically, it cannot  
>> reuse the
>> objects that were removed from the PQ.
>
> I've read this entire thread, and would like to add my comments  
> about three
> independent issues, which I think that can and perhaps should be  
> considered
> separately:
>
> 1. When Shai wanted to add the insertWithOverflow() method to  
> PriorityQueue
>    he couldn't just subclass PriorityQueue in his application, but  
> rather
>    was forced to modify PriorityQueue itself. Why? just because one  
> field
>    of that classi - "heap" - was defined "private" instead of  
> "protected".
>
>    Is there a special reason for that? If not, can we make the  
> trivial change
>    to make PriorityQueue's fields protected, to allow Shai and  
> others (see the
>    next point) to add functionality on top of PriorityQueue?
>
> 2. PriorityQueue, in addition to being used in about a dozen places  
> inside
>    Lucene, is also a public class that advanced users often find  
> useful when
>    implementing features like new collectors, new queries, and so on.
>    Unfortunately, my experience exactly matches Shai's: In the two  
> occasions
>    where I used a PriorityQueue, I found that I needed such a
>    insertWithOverflow() method. If this feature is so useful (I  
> can't believe
>    that Shai and me are the only ones who found it useful), I think  
> it would
>    be nice to add it to Lucene's PriorityQueue, even if it isn't  
> (yet) used
>    inside Lucene.
>
>    Just to make it sound more interesting, let me give you an  
> example where
>    I needed (and implemented) an "insertWithOverflow()": I was  
> implementing a
>    faceted search capability over Lucene. It calculated a count for  
> each
>    facet value, and then I used a PriorityQueue to find the 10 best  
> values.
>    The problem is that I also needed an "other" aggregator, which  
> was supposed
>    to aggregate (in various ways) all the facets except the 10 best  
> ones. For
>    that, I needed to know which facets dropped off the priorityqueue.
>
> 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
>    to be used in TopDocCollector. I have to admit I don't know how  
> much
>    of a benefit this will be in the "typical" case. But I do know that
>    there's no such thing as a "typical" case...
>    I can easily think of a quite typical "worst case" though:  
> Consider a
>    collection indexed in order of document age (a pretty typical  
> scenario
>    for a long-running index), and then you do a sorting query
>    (TopFieldDocCollector), asking it to bring the 10 newest documents.
>    In that case, each and every document will have a new DocScore  
> created -
>    which is the worst-case that Shai feared.
>    It would be nice if Shai or someone else could provide a  
> measurement in
>    that case.
>
> P.S. When looking now at PriorityQueue's code, I found two tiny
> performance improvements that could be easily made to it - I wonder if
> there's any reason not to do them:
>
>  1. Insert can use heap[1] directly instead of calling top(). After  
> all,
>     this is done in an if() that already ensures that size>0.
>
>  2. Regardless, top() could return heap[1] always, without any if
> (). After
>     all, the heap array is initialized to all nulls, so when  
> size==0, heap[1]
>     is null anyway.
>
>
>
>> PriorityQueue change (added insertWithOverflow method)
>> ---------------------------------------------------------------------
>> --------------
>>     /**
>>      * insertWithOverflow() is similar to insert(), except its  
>> return value:
>> it
>>      * returns the object (if any) that was dropped off the heap  
>> because it
>> was
>>      * full. This can be the given parameter (in case it is  
>> smaller than the
>>      * full heap's minimum, and couldn't be added) or another  
>> object that
>> was
>>      * previously the smallest value in the heap and now has been  
>> replaced
>> by a
>>      * larger one.
>>      */
>>     public Object insertWithOverflow(Object element) {
>>         if (size < maxSize) {
>>             put(element);
>>             return null;
>>         } else if (size > 0 && !lessThan(element, top())) {
>>             Object ret = heap[1];
>>             heap[1] = element;
>>             adjustTop();
>>             return ret;
>>         } else {
>>             return element;
>>         }
>>     }
>> [Very similar to insert(), only it returns the object that was  
>> kicked out of
>> the Queue, or null]
>
> --
> Nadav Har'El                        |       Tuesday, Dec 11 2007, 3  
> Tevet 5768
> IBM Haifa Research Lab              
> |-----------------------------------------
>                                     |A professor is one who talks  
> in someone
> http://nadav.harel.org.il           |else's sleep.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2
In reply to this post by Peter Keegan

I think we can't make lessThan public since that would cause  
subclasses to fail to compile (ie this breaks backwards compatibility)?

Adding "wouldBeInserted()" seems OK?

Mike

Peter Keegan wrote:

> See my similar request from last March:
> http://www.nabble.com/FieldSortedHitQueue-enhancement- 
> to9733550.html#a9733550
>
> Peter
>
> On Dec 11, 2007 11:54 AM, Nadav Har'El <[hidden email]>  
> wrote:
>
>> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance  
>> Improvement for
>> Search using PriorityQueue":
>>> Hi
>>>
>>> Lucene's PQ implements two methods: put (assumes the PQ has room  
>>> for the
>>> object) and insert (checks whether the object can be inserted  
>>> etc.). The
>>> implementation of insert() requires the application that uses it to
>> allocate
>>> a new object every time it calls insert. Specifically, it cannot  
>>> reuse
>> the
>>> objects that were removed from the PQ.
>>
>> I've read this entire thread, and would like to add my comments about
>> three
>> independent issues, which I think that can and perhaps should be
>> considered
>> separately:
>>
>> 1. When Shai wanted to add the insertWithOverflow() method to
>> PriorityQueue
>>   he couldn't just subclass PriorityQueue in his application, but  
>> rather
>>   was forced to modify PriorityQueue itself. Why? just because one  
>> field
>>   of that classi - "heap" - was defined "private" instead of  
>> "protected".
>>
>>   Is there a special reason for that? If not, can we make the trivial
>> change
>>   to make PriorityQueue's fields protected, to allow Shai and  
>> others (see
>> the
>>   next point) to add functionality on top of PriorityQueue?
>>
>> 2. PriorityQueue, in addition to being used in about a dozen  
>> places inside
>>   Lucene, is also a public class that advanced users often find  
>> useful
>> when
>>   implementing features like new collectors, new queries, and so on.
>>   Unfortunately, my experience exactly matches Shai's: In the two
>> occasions
>>   where I used a PriorityQueue, I found that I needed such a
>>   insertWithOverflow() method. If this feature is so useful (I can't
>> believe
>>   that Shai and me are the only ones who found it useful), I think it
>> would
>>   be nice to add it to Lucene's PriorityQueue, even if it isn't  
>> (yet) used
>>   inside Lucene.
>>
>>   Just to make it sound more interesting, let me give you an  
>> example where
>>   I needed (and implemented) an "insertWithOverflow()": I was  
>> implementing
>> a
>>   faceted search capability over Lucene. It calculated a count for  
>> each
>>   facet value, and then I used a PriorityQueue to find the 10 best  
>> values.
>>   The problem is that I also needed an "other" aggregator, which was
>> supposed
>>   to aggregate (in various ways) all the facets except the 10 best  
>> ones.
>> For
>>   that, I needed to know which facets dropped off the priorityqueue.
>>
>> 3. Finally, Shai asked for this new  
>> PriorityQueue.insertWithOverflow()
>>   to be used in TopDocCollector. I have to admit I don't know how  
>> much
>>   of a benefit this will be in the "typical" case. But I do know that
>>   there's no such thing as a "typical" case...
>>   I can easily think of a quite typical "worst case" though:  
>> Consider a
>>   collection indexed in order of document age (a pretty typical  
>> scenario
>>   for a long-running index), and then you do a sorting query
>>   (TopFieldDocCollector), asking it to bring the 10 newest documents.
>>   In that case, each and every document will have a new DocScore  
>> created -
>>   which is the worst-case that Shai feared.
>>   It would be nice if Shai or someone else could provide a  
>> measurement in
>>   that case.
>>
>> P.S. When looking now at PriorityQueue's code, I found two tiny
>> performance improvements that could be easily made to it - I  
>> wonder if
>> there's any reason not to do them:
>>
>>  1. Insert can use heap[1] directly instead of calling top().  
>> After all,
>>    this is done in an if() that already ensures that size>0.
>>
>>  2. Regardless, top() could return heap[1] always, without any if
>> (). After
>>    all, the heap array is initialized to all nulls, so when size==0,
>> heap[1]
>>    is null anyway.
>>
>>
>>
>>> PriorityQueue change (added insertWithOverflow method)
>>>
>> ---------------------------------------------------------------------
>> --------------
>>>     /**
>>>      * insertWithOverflow() is similar to insert(), except its  
>>> return
>> value:
>>> it
>>>      * returns the object (if any) that was dropped off the heap  
>>> because
>> it
>>> was
>>>      * full. This can be the given parameter (in case it is  
>>> smaller than
>> the
>>>      * full heap's minimum, and couldn't be added) or another object
>> that
>>> was
>>>      * previously the smallest value in the heap and now has been
>> replaced
>>> by a
>>>      * larger one.
>>>      */
>>>     public Object insertWithOverflow(Object element) {
>>>         if (size < maxSize) {
>>>             put(element);
>>>             return null;
>>>         } else if (size > 0 && !lessThan(element, top())) {
>>>             Object ret = heap[1];
>>>             heap[1] = element;
>>>             adjustTop();
>>>             return ret;
>>>         } else {
>>>             return element;
>>>         }
>>>     }
>>> [Very similar to insert(), only it returns the object that was  
>>> kicked
>> out of
>>> the Queue, or null]
>>
>> --
>> Nadav Har'El                        |       Tuesday, Dec 11 2007,  
>> 3 Tevet
>> 5768
>> IBM Haifa Research Lab
>>  |-----------------------------------------
>>                                    |A professor is one who talks in
>> someone
>> http://nadav.harel.org.il           |else's sleep.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
Hi,
I will open an issue and create the patch. One thing I'm not sure of is the
wouldBeInserted method you mentioned - in what context should it be used?
And ... lessThan shouldn't be public, it can stay protected.


On 12/11/07, Michael McCandless <[hidden email]> wrote:

>
>
> I think we can't make lessThan public since that would cause
> subclasses to fail to compile (ie this breaks backwards compatibility)?
>
> Adding "wouldBeInserted()" seems OK?
>
> Mike
>
> Peter Keegan wrote:
>
> > See my similar request from last March:
> > http://www.nabble.com/FieldSortedHitQueue-enhancement-
> > to9733550.html#a9733550
> >
> > Peter
> >
> > On Dec 11, 2007 11:54 AM, Nadav Har'El <[hidden email]>
> > wrote:
> >
> >> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance
> >> Improvement for
> >> Search using PriorityQueue":
> >>> Hi
> >>>
> >>> Lucene's PQ implements two methods: put (assumes the PQ has room
> >>> for the
> >>> object) and insert (checks whether the object can be inserted
> >>> etc.). The
> >>> implementation of insert() requires the application that uses it to
> >> allocate
> >>> a new object every time it calls insert. Specifically, it cannot
> >>> reuse
> >> the
> >>> objects that were removed from the PQ.
> >>
> >> I've read this entire thread, and would like to add my comments about
> >> three
> >> independent issues, which I think that can and perhaps should be
> >> considered
> >> separately:
> >>
> >> 1. When Shai wanted to add the insertWithOverflow() method to
> >> PriorityQueue
> >>   he couldn't just subclass PriorityQueue in his application, but
> >> rather
> >>   was forced to modify PriorityQueue itself. Why? just because one
> >> field
> >>   of that classi - "heap" - was defined "private" instead of
> >> "protected".
> >>
> >>   Is there a special reason for that? If not, can we make the trivial
> >> change
> >>   to make PriorityQueue's fields protected, to allow Shai and
> >> others (see
> >> the
> >>   next point) to add functionality on top of PriorityQueue?
> >>
> >> 2. PriorityQueue, in addition to being used in about a dozen
> >> places inside
> >>   Lucene, is also a public class that advanced users often find
> >> useful
> >> when
> >>   implementing features like new collectors, new queries, and so on.
> >>   Unfortunately, my experience exactly matches Shai's: In the two
> >> occasions
> >>   where I used a PriorityQueue, I found that I needed such a
> >>   insertWithOverflow() method. If this feature is so useful (I can't
> >> believe
> >>   that Shai and me are the only ones who found it useful), I think it
> >> would
> >>   be nice to add it to Lucene's PriorityQueue, even if it isn't
> >> (yet) used
> >>   inside Lucene.
> >>
> >>   Just to make it sound more interesting, let me give you an
> >> example where
> >>   I needed (and implemented) an "insertWithOverflow()": I was
> >> implementing
> >> a
> >>   faceted search capability over Lucene. It calculated a count for
> >> each
> >>   facet value, and then I used a PriorityQueue to find the 10 best
> >> values.
> >>   The problem is that I also needed an "other" aggregator, which was
> >> supposed
> >>   to aggregate (in various ways) all the facets except the 10 best
> >> ones.
> >> For
> >>   that, I needed to know which facets dropped off the priorityqueue.
> >>
> >> 3. Finally, Shai asked for this new
> >> PriorityQueue.insertWithOverflow()
> >>   to be used in TopDocCollector. I have to admit I don't know how
> >> much
> >>   of a benefit this will be in the "typical" case. But I do know that
> >>   there's no such thing as a "typical" case...
> >>   I can easily think of a quite typical "worst case" though:
> >> Consider a
> >>   collection indexed in order of document age (a pretty typical
> >> scenario
> >>   for a long-running index), and then you do a sorting query
> >>   (TopFieldDocCollector), asking it to bring the 10 newest documents.
> >>   In that case, each and every document will have a new DocScore
> >> created -
> >>   which is the worst-case that Shai feared.
> >>   It would be nice if Shai or someone else could provide a
> >> measurement in
> >>   that case.
> >>
> >> P.S. When looking now at PriorityQueue's code, I found two tiny
> >> performance improvements that could be easily made to it - I
> >> wonder if
> >> there's any reason not to do them:
> >>
> >>  1. Insert can use heap[1] directly instead of calling top().
> >> After all,
> >>    this is done in an if() that already ensures that size>0.
> >>
> >>  2. Regardless, top() could return heap[1] always, without any if
> >> (). After
> >>    all, the heap array is initialized to all nulls, so when size==0,
> >> heap[1]
> >>    is null anyway.
> >>
> >>
> >>
> >>> PriorityQueue change (added insertWithOverflow method)
> >>>
> >> ---------------------------------------------------------------------
> >> --------------
> >>>     /**
> >>>      * insertWithOverflow() is similar to insert(), except its
> >>> return
> >> value:
> >>> it
> >>>      * returns the object (if any) that was dropped off the heap
> >>> because
> >> it
> >>> was
> >>>      * full. This can be the given parameter (in case it is
> >>> smaller than
> >> the
> >>>      * full heap's minimum, and couldn't be added) or another object
> >> that
> >>> was
> >>>      * previously the smallest value in the heap and now has been
> >> replaced
> >>> by a
> >>>      * larger one.
> >>>      */
> >>>     public Object insertWithOverflow(Object element) {
> >>>         if (size < maxSize) {
> >>>             put(element);
> >>>             return null;
> >>>         } else if (size > 0 && !lessThan(element, top())) {
> >>>             Object ret = heap[1];
> >>>             heap[1] = element;
> >>>             adjustTop();
> >>>             return ret;
> >>>         } else {
> >>>             return element;
> >>>         }
> >>>     }
> >>> [Very similar to insert(), only it returns the object that was
> >>> kicked
> >> out of
> >>> the Queue, or null]
> >>
> >> --
> >> Nadav Har'El                        |       Tuesday, Dec 11 2007,
> >> 3 Tevet
> >> 5768
> >> IBM Haifa Research Lab
> >>  |-----------------------------------------
> >>                                    |A professor is one who talks in
> >> someone
> >> http://nadav.harel.org.il           |else's sleep.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2

Shai Erera wrote:

> Hi,
> I will open an issue and create the patch. One thing I'm not sure  
> of is the
> wouldBeInserted method you mentioned - in what context should it be  
> used?
> And ... lessThan shouldn't be public, it can stay protected.

Sorry, this is a method Peter suggested (see below) in order to add  
de-duping logic on top of the PQ.

We should do it separately (it's unrelated).

Peter can you open an issue for that one?  Thanks!

Mike

>
> On 12/11/07, Michael McCandless <[hidden email]> wrote:
>>
>>
>> I think we can't make lessThan public since that would cause
>> subclasses to fail to compile (ie this breaks backwards  
>> compatibility)?
>>
>> Adding "wouldBeInserted()" seems OK?
>>
>> Mike
>>
>> Peter Keegan wrote:
>>
>>> See my similar request from last March:
>>> http://www.nabble.com/FieldSortedHitQueue-enhancement-
>>> to9733550.html#a9733550
>>>
>>> Peter
>>>
>>> On Dec 11, 2007 11:54 AM, Nadav Har'El <[hidden email]>
>>> wrote:
>>>
>>>> On Mon, Dec 10, 2007, Shai Erera wrote about "Performance
>>>> Improvement for
>>>> Search using PriorityQueue":
>>>>> Hi
>>>>>
>>>>> Lucene's PQ implements two methods: put (assumes the PQ has room
>>>>> for the
>>>>> object) and insert (checks whether the object can be inserted
>>>>> etc.). The
>>>>> implementation of insert() requires the application that uses  
>>>>> it to
>>>> allocate
>>>>> a new object every time it calls insert. Specifically, it cannot
>>>>> reuse
>>>> the
>>>>> objects that were removed from the PQ.
>>>>
>>>> I've read this entire thread, and would like to add my comments  
>>>> about
>>>> three
>>>> independent issues, which I think that can and perhaps should be
>>>> considered
>>>> separately:
>>>>
>>>> 1. When Shai wanted to add the insertWithOverflow() method to
>>>> PriorityQueue
>>>>   he couldn't just subclass PriorityQueue in his application, but
>>>> rather
>>>>   was forced to modify PriorityQueue itself. Why? just because one
>>>> field
>>>>   of that classi - "heap" - was defined "private" instead of
>>>> "protected".
>>>>
>>>>   Is there a special reason for that? If not, can we make the  
>>>> trivial
>>>> change
>>>>   to make PriorityQueue's fields protected, to allow Shai and
>>>> others (see
>>>> the
>>>>   next point) to add functionality on top of PriorityQueue?
>>>>
>>>> 2. PriorityQueue, in addition to being used in about a dozen
>>>> places inside
>>>>   Lucene, is also a public class that advanced users often find
>>>> useful
>>>> when
>>>>   implementing features like new collectors, new queries, and so  
>>>> on.
>>>>   Unfortunately, my experience exactly matches Shai's: In the two
>>>> occasions
>>>>   where I used a PriorityQueue, I found that I needed such a
>>>>   insertWithOverflow() method. If this feature is so useful (I  
>>>> can't
>>>> believe
>>>>   that Shai and me are the only ones who found it useful), I  
>>>> think it
>>>> would
>>>>   be nice to add it to Lucene's PriorityQueue, even if it isn't
>>>> (yet) used
>>>>   inside Lucene.
>>>>
>>>>   Just to make it sound more interesting, let me give you an
>>>> example where
>>>>   I needed (and implemented) an "insertWithOverflow()": I was
>>>> implementing
>>>> a
>>>>   faceted search capability over Lucene. It calculated a count for
>>>> each
>>>>   facet value, and then I used a PriorityQueue to find the 10 best
>>>> values.
>>>>   The problem is that I also needed an "other" aggregator, which  
>>>> was
>>>> supposed
>>>>   to aggregate (in various ways) all the facets except the 10 best
>>>> ones.
>>>> For
>>>>   that, I needed to know which facets dropped off the  
>>>> priorityqueue.
>>>>
>>>> 3. Finally, Shai asked for this new
>>>> PriorityQueue.insertWithOverflow()
>>>>   to be used in TopDocCollector. I have to admit I don't know how
>>>> much
>>>>   of a benefit this will be in the "typical" case. But I do know  
>>>> that
>>>>   there's no such thing as a "typical" case...
>>>>   I can easily think of a quite typical "worst case" though:
>>>> Consider a
>>>>   collection indexed in order of document age (a pretty typical
>>>> scenario
>>>>   for a long-running index), and then you do a sorting query
>>>>   (TopFieldDocCollector), asking it to bring the 10 newest  
>>>> documents.
>>>>   In that case, each and every document will have a new DocScore
>>>> created -
>>>>   which is the worst-case that Shai feared.
>>>>   It would be nice if Shai or someone else could provide a
>>>> measurement in
>>>>   that case.
>>>>
>>>> P.S. When looking now at PriorityQueue's code, I found two tiny
>>>> performance improvements that could be easily made to it - I
>>>> wonder if
>>>> there's any reason not to do them:
>>>>
>>>>  1. Insert can use heap[1] directly instead of calling top().
>>>> After all,
>>>>    this is done in an if() that already ensures that size>0.
>>>>
>>>>  2. Regardless, top() could return heap[1] always, without any if
>>>> (). After
>>>>    all, the heap array is initialized to all nulls, so when  
>>>> size==0,
>>>> heap[1]
>>>>    is null anyway.
>>>>
>>>>
>>>>
>>>>> PriorityQueue change (added insertWithOverflow method)
>>>>>
>>>> -------------------------------------------------------------------
>>>> --
>>>> --------------
>>>>>     /**
>>>>>      * insertWithOverflow() is similar to insert(), except its
>>>>> return
>>>> value:
>>>>> it
>>>>>      * returns the object (if any) that was dropped off the heap
>>>>> because
>>>> it
>>>>> was
>>>>>      * full. This can be the given parameter (in case it is
>>>>> smaller than
>>>> the
>>>>>      * full heap's minimum, and couldn't be added) or another  
>>>>> object
>>>> that
>>>>> was
>>>>>      * previously the smallest value in the heap and now has been
>>>> replaced
>>>>> by a
>>>>>      * larger one.
>>>>>      */
>>>>>     public Object insertWithOverflow(Object element) {
>>>>>         if (size < maxSize) {
>>>>>             put(element);
>>>>>             return null;
>>>>>         } else if (size > 0 && !lessThan(element, top())) {
>>>>>             Object ret = heap[1];
>>>>>             heap[1] = element;
>>>>>             adjustTop();
>>>>>             return ret;
>>>>>         } else {
>>>>>             return element;
>>>>>         }
>>>>>     }
>>>>> [Very similar to insert(), only it returns the object that was
>>>>> kicked
>>>> out of
>>>>> the Queue, or null]
>>>>
>>>> --
>>>> Nadav Har'El                        |       Tuesday, Dec 11 2007,
>>>> 3 Tevet
>>>> 5768
>>>> IBM Haifa Research Lab
>>>>  |-----------------------------------------
>>>>                                    |A professor is one who talks in
>>>> someone
>>>> http://nadav.harel.org.il           |else's sleep.
>>>>
>>>> -------------------------------------------------------------------
>>>> --
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>
> --
> Regards,
>
> Shai Erera


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Timo Nentwig-7
In reply to this post by Shai Erera
On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
> For (1) - I can't explain it but I've run into documents with 0.0f scores.
> For (2) - this is a simple logic - if the lowest score in the queue is 'x'
> and you want to top docs only, then there's no point in attempting to
> insert a document with score lower than 'x' (it will not be added).

Sure. I didn't notice that score is passed as parameter and was surprised that
subsequent calls to collect() are supposed to be guaranteed to have a lower
score.

Ok, stupid question :)

> Maybe I didn't understand your question correctly though ...
>
> On Dec 11, 2007 2:25 PM, Timo Nentwig <[hidden email]> wrote:
> > On Monday 10 December 2007 09:15:12 Paul Elschot wrote:
> > > The current TopDocCollector only allocates a ScoreDoc when the given
> > > score causes a new ScoreDoc to be added into the queue, but it does
> >
> > I actually wrote my own HitCollector and now wonder about
> > TopDocCollector:
> >
> >  public void collect(int doc, float score) {
> >    if (score > 0.0f) {
> >      totalHits++;
> >      if (hq.size() < numHits || score >= minScore) {
> >        hq.insert(new ScoreDoc(doc, score));
> >        minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
> >      }
> >    }
> >  }
> >
> > 1) How can there be hits with score=0.0?
> > 2) I don't understand minScore: inserts only document having a higher
> > score
> > than the lowest score already in queue?
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Yonik Seeley-2
On Dec 11, 2007 1:21 PM, Timo Nentwig <[hidden email]> wrote:
> On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
> > For (1) - I can't explain it but I've run into documents with 0.0f scores.
> > For (2) - this is a simple logic - if the lowest score in the queue is 'x'
> > and you want to top docs only, then there's no point in attempting to
> > insert a document with score lower than 'x' (it will not be added).
>
> Sure. I didn't notice that score is passed as parameter and was surprised that
> subsequent calls to collect() are supposed to be guaranteed to have a lower
> score.

One is not guaranteed this... collect() generally goes in docid order,
and scores are unordered.

If you are only gathering the top 10 docs by score, you can compare
the current score to the lowest of the top 10 you currently have to
determine if you should bother inserting into the queue.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
Hi

I created https://issues.apache.org/jira/browse/LUCENE-1089 and added a
patch.
I noticed that we can replace the calls to insert() with
insertWithOverflow() in several other places, like QualityQueriesFinder,
FuzzyQuery and TopFieldDocCollector. I wasn't sure if that should be handled
as part of this issue, or a different one.

On Dec 11, 2007 8:32 PM, Yonik Seeley <[hidden email]> wrote:

> On Dec 11, 2007 1:21 PM, Timo Nentwig <[hidden email]> wrote:
> > On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
> > > For (1) - I can't explain it but I've run into documents with 0.0fscores.
> > > For (2) - this is a simple logic - if the lowest score in the queue is
> 'x'
> > > and you want to top docs only, then there's no point in attempting to
> > > insert a document with score lower than 'x' (it will not be added).
> >
> > Sure. I didn't notice that score is passed as parameter and was
> surprised that
> > subsequent calls to collect() are supposed to be guaranteed to have a
> lower
> > score.
>
> One is not guaranteed this... collect() generally goes in docid order,
> and scores are unordered.
>
> If you are only gathering the top 10 docs by score, you can compare
> the current score to the lowest of the top 10 you currently have to
> determine if you should bother inserting into the queue.
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera
Reply | Threaded
Open this post in threaded view
|

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2

I think it's fine to include those changes with this issue.

Mike

Shai Erera wrote:

> Hi
>
> I created https://issues.apache.org/jira/browse/LUCENE-1089 and  
> added a
> patch.
> I noticed that we can replace the calls to insert() with
> insertWithOverflow() in several other places, like  
> QualityQueriesFinder,
> FuzzyQuery and TopFieldDocCollector. I wasn't sure if that should  
> be handled
> as part of this issue, or a different one.
>
> On Dec 11, 2007 8:32 PM, Yonik Seeley <[hidden email]> wrote:
>
>> On Dec 11, 2007 1:21 PM, Timo Nentwig <[hidden email]> wrote:
>>> On Tuesday 11 December 2007 14:32:12 Shai Erera wrote:
>>>> For (1) - I can't explain it but I've run into documents with  
>>>> 0.0fscores.
>>>> For (2) - this is a simple logic - if the lowest score in the  
>>>> queue is
>> 'x'
>>>> and you want to top docs only, then there's no point in  
>>>> attempting to
>>>> insert a document with score lower than 'x' (it will not be added).
>>>
>>> Sure. I didn't notice that score is passed as parameter and was
>> surprised that
>>> subsequent calls to collect() are supposed to be guaranteed to  
>>> have a
>> lower
>>> score.
>>
>> One is not guaranteed this... collect() generally goes in docid  
>> order,
>> and scores are unordered.
>>
>> If you are only gathering the top 10 docs by score, you can compare
>> the current score to the lowest of the top 10 you currently have to
>> determine if you should bother inserting into the queue.
>>
>> -Yonik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>
> --
> Regards,
>
> Shai Erera


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123