Performance Improvement for Search using PriorityQueue

Performance Improvement for Search using PriorityQueue

Shai Erera
Hi

Lucene's PQ implements two methods: put() (assumes the PQ has room for the
object) and insert() (checks whether the object can be inserted, etc.). The
implementation of insert() requires the calling application to allocate a
new object on every call; specifically, it cannot reuse the objects that
were removed from the PQ.
I've measured how much search performance would gain from an insert that
returns the overflowed object, by changing TopDocCollector.

PriorityQueue change (added insertWithOverflow method)
-----------------------------------------------------------------------------------
    /**
     * insertWithOverflow() is similar to insert(), except for its return
     * value: it returns the object (if any) that was dropped off the heap
     * because it was full. This can be the given parameter (in case it is
     * smaller than the full heap's minimum, and couldn't be added) or
     * another object that was previously the smallest value in the heap
     * and has now been replaced by a larger one.
     */
    public Object insertWithOverflow(Object element) {
        if (size < maxSize) {
            put(element);
            return null;
        } else if (size > 0 && !lessThan(element, top())) {
            Object ret = heap[1];
            heap[1] = element;
            adjustTop();
            return ret;
        } else {
            return element;
        }
    }
[Very similar to insert(), except that it returns the object that was
kicked out of the queue, or null.]
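To make the overflow contract easy to play with in isolation, here is a minimal, self-contained sketch of a bounded min-heap with the same insertWithOverflow() semantics. MiniQueue and its plain int elements are hypothetical stand-ins for illustration, not Lucene classes; the `element >= top()` test mirrors Lucene's `!lessThan(element, top())`.

```java
// Hypothetical stand-in for Lucene's PriorityQueue, holding ints instead of
// ScoreDocs, to demonstrate the insertWithOverflow() return-value contract.
public class MiniQueue {
    private final int[] heap;   // heap[1..size]; heap[1] is the minimum
    private int size;
    private final int maxSize;

    public MiniQueue(int maxSize) {
        this.maxSize = maxSize;
        this.heap = new int[maxSize + 1];
    }

    public int top() { return heap[1]; }
    public int size() { return size; }

    // Returns null while the queue has room; otherwise returns whichever
    // element fell out: the argument itself if it is below the current
    // minimum, or the old minimum that the argument displaced.
    public Integer insertWithOverflow(int element) {
        if (size < maxSize) {
            put(element);
            return null;
        } else if (size > 0 && element >= top()) {  // mirrors !lessThan(element, top())
            int ret = heap[1];
            heap[1] = element;
            downHeap();
            return ret;
        } else {
            return element;
        }
    }

    private void put(int element) {       // add, then bubble up
        heap[++size] = element;
        int i = size;
        while (i > 1 && heap[i] < heap[i / 2]) {
            int t = heap[i]; heap[i] = heap[i / 2]; heap[i / 2] = t;
            i /= 2;
        }
    }

    private void downHeap() {             // restore heap order from the root
        int i = 1;
        while (true) {
            int smallest = i, l = 2 * i, r = 2 * i + 1;
            if (l <= size && heap[l] < heap[smallest]) smallest = l;
            if (r <= size && heap[r] < heap[smallest]) smallest = r;
            if (smallest == i) break;
            int t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
            i = smallest;
        }
    }

    public static void main(String[] args) {
        MiniQueue q = new MiniQueue(3);
        System.out.println(q.insertWithOverflow(5));  // null: room left
        System.out.println(q.insertWithOverflow(2));  // null
        System.out.println(q.insertWithOverflow(8));  // null
        System.out.println(q.insertWithOverflow(1));  // 1: too small, bounced back
        System.out.println(q.insertWithOverflow(9));  // 2: old minimum evicted
        System.out.println(q.top());                  // 5: new minimum
    }
}
```

The three return cases (null, the bounced argument, the evicted minimum) are exactly what lets a caller recycle the returned object instead of allocating a new one.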

TopDocCollector's current implementation of collect()
----------------------------------------------------------------------------
  public void collect(int doc, float score) {
    if (score > 0.0f) {
      totalHits++;
      if (hq.size() < numHits || score >= minScore) {
        hq.insert(new ScoreDoc(doc, score));
        minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
      }
    }
  }
[See how it allocates a new ScoreDoc every time this method is called]

TopDocCollector's new implementation of collect()
-----------------------------------------------------------------------------
  public void collect(int doc, float score) {
      if (score == 0.0f) {
          return;
      }
      totalHits++;
      if (hq.size() < numHits || score >= minScore) {
          if (sd == null) {
              sd = new ScoreDoc(doc, score);
          } else {
              sd.doc = doc;
              sd.score = score;
          }
          sd = (ScoreDoc) hq.insertWithOverflow(sd);
          minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
      }
  }
[sd is a class member of the collector, of type ScoreDoc]

Now for the performance tests. I've done two tests:
1. Calling TopDocCollector.collect 1,000,000, 10,000,000 and 100,000,000
times using both implementations. Here are the results:
                              1M           10M          100M
Current Collector        218 ms      1,907 ms     20,000 ms
Modified Collector       141 ms      1,297 ms     12,672 ms
As you can see, the new implementation is 30-40% faster.

2. I wrote some tasks in the benchmark package that make use of the new PQ
while executing a real search over an index with 1M small documents (10
characters each). Here are the results:
                          Current TopDocCollector
--------------------------------------------------------------------------------------------
------------> Report Sum By (any) Name (5 about 6 out of 6)
Operation          round  runCnt  recsPerRun      rec/s  elapsedSec  avgUsedMem  avgTotalMem
CreateIndex            0       1           1       10.6        0.09   2,219,112    4,389,376
AddDocs_1000000        0       1     1000000   52,075.2       19.20  34,497,176   52,689,408
CloseIndex             0       2           1        0.1       16.62  26,178,972   51,378,688
OpenIndex              0       1           1    1,000.0        0.00  48,771,304   50,067,968
--> MySearch           0       1           1        5.3        0.19  48,771,304   50,067,968


                          Modified TopDocCollector
--------------------------------------------------------------------------------------------
------------> Report Sum By (any) Name (5 about 6 out of 6)
Operation          round  runCnt  recsPerRun      rec/s  elapsedSec  avgUsedMem  avgTotalMem
CreateIndex            0       1           1       10.6        0.09   2,207,008    4,389,376
AddDocs_1000000        0       1     1000000   50,955.4       19.62  32,531,992   44,825,088
CloseIndex             0       2           1        0.1       16.84  57,853,148   61,929,984
OpenIndex              0       1           1    1,000.0        0.00  57,792,136   61,929,984
--> MySearch           0       1           1        7.1        0.14  57,939,856   61,929,984
Notice the time difference in search (0.14s modified vs. 0.19s current).
Here too, an improvement of roughly 25-30%.

One thing that I wasn't able to show directly, but I think is pretty clear:
the new implementation performs far fewer object allocations. Consider the
typical search for the top 10 results over an index with 1M documents. The
current implementation can allocate up to 1M (!) ScoreDoc instances, while
the new one allocates at most 11 (10 held by the PQ plus 1 reusable spare).
On heavily loaded systems, this results in far less work for the GC.
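The allocation arithmetic above can be sanity-checked with a small simulation. This is hypothetical code, built on java.util.PriorityQueue rather than Lucene's PQ, purely to count allocations under both collection patterns; AllocCount and Hit are made-up names.

```java
import java.util.PriorityQueue;

// Counts object allocations for the allocate-per-hit pattern vs. the
// overflow-reuse pattern when collecting the top k of n scores.
public class AllocCount {
    static class Hit {
        static int allocated = 0;
        int doc; float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; allocated++; }
    }

    // Old pattern: a new Hit for every doc that passes the threshold.
    static int oldPattern(float[] scores, int k) {
        Hit.allocated = 0;
        PriorityQueue<Hit> pq = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (int i = 0; i < scores.length; i++) {
            if (pq.size() < k || scores[i] >= pq.peek().score) {
                pq.add(new Hit(i, scores[i]));
                if (pq.size() > k) pq.poll();   // evicted Hit is discarded
            }
        }
        return Hit.allocated;
    }

    // Reuse pattern: one spare Hit recycled through every eviction.
    static int reusePattern(float[] scores, int k) {
        Hit.allocated = 0;
        PriorityQueue<Hit> pq = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        Hit spare = null;
        for (int i = 0; i < scores.length; i++) {
            if (pq.size() < k || scores[i] >= pq.peek().score) {
                if (spare == null) spare = new Hit(i, scores[i]);
                else { spare.doc = i; spare.score = scores[i]; }
                pq.add(spare);
                spare = pq.size() > k ? pq.poll() : null;  // recycle the evictee
            }
        }
        return Hit.allocated;
    }

    public static void main(String[] args) {
        // Worst case: ever-increasing scores, so every doc enters the queue.
        float[] scores = new float[1000];
        for (int i = 0; i < scores.length; i++) scores[i] = i;
        System.out.println(oldPattern(scores, 10));   // 1000: one Hit per doc
        System.out.println(reusePattern(scores, 10)); // 11: k in the queue + 1 spare
    }
}
```

With ascending scores every document displaces the current minimum, so the old pattern allocates n objects while the reuse pattern stays at k + 1, matching the 1M-vs-11 figure above.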

I would like to suggest reflecting this new implementation in PriorityQueue
and also modifying TopDocCollector to make use of the new method. Several
ways to do it:
1. Add insertWithOverflow() to PQ (I dislike the name and am open to better
ones), deprecate insert() (say, in 2.3), and in the release after that
(2.4?) rename insertWithOverflow() to insert(). A complex process, but it
won't break existing applications' code.
2. Change insert()'s signature and state that applications moving to 2.3
(for example) need to change their code as well (in fact, it won't even
compile).
3. Any other alternatives?

And, of course, review the classes in the Lucene code that use PQ and
modify them as necessary.

A very long email for a very simple (but important) performance improvement.

Shai

Re: Performance Improvement for Search using PriorityQueue

Paul Elschot
The current TopDocCollector only allocates a ScoreDoc when the given
score causes a new ScoreDoc to be added into the queue, but it does
not reuse anything that overflows out of the queue.
So reusing the overflow from the queue should reduce object
allocations, especially for indexes that tend to have better-scoring
docs at the end. I wouldn't expect a 30% improvement out of that,
but it would help, if only to reduce occasional performance
deteriorations.

Regards,
Paul Elschot


On Monday 10 December 2007 08:11:50 Shai Erera wrote:




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Performance Improvement for Search using PriorityQueue

Shai Erera
I agree w.r.t. the current implementation; however, in the worst case (which
is what we tend to consider when analyzing algorithms), it will allocate a
ScoreDoc per result. With the overflow reuse, it will not allocate those
objects, no matter what input it gets.
Also, notice that there is a performance hit in the current implementation
of TopDocCollector: it first checks whether the document should be inserted,
and if so, the PQ performs the same check again.
So, if you change the current implementation to always attempt an insert,
you'd gain some performance improvement as well.
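That folding of the two checks into one can be sketched roughly as follows. This is hypothetical illustration code (assumed names, built on java.util.PriorityQueue rather than Lucene's PQ, and the eviction logic stands in for insertWithOverflow): the threshold comparison happens in exactly one place, and the overflow object is still reused.

```java
import java.util.PriorityQueue;

// Sketch of a collector whose collect() makes the keep-or-drop decision in a
// single place while recycling one spare ScoreDoc across calls.
public class SingleCheckCollector {
    static class ScoreDoc {
        int doc; float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final PriorityQueue<ScoreDoc> hq =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    private final int numHits;
    private ScoreDoc spare;   // the one reusable overflow object
    int totalHits;

    SingleCheckCollector(int numHits) { this.numHits = numHits; }

    void collect(int doc, float score) {
        if (score == 0.0f) return;
        totalHits++;
        // Fill the spare, allocating only when we have none to recycle.
        if (spare == null) spare = new ScoreDoc(doc, score);
        else { spare.doc = doc; spare.score = score; }
        if (hq.size() < numHits) {
            hq.add(spare);
            spare = null;                   // the queue kept it
        } else if (score >= hq.peek().score) {
            ScoreDoc evicted = hq.poll();   // old minimum comes back to us
            hq.add(spare);
            spare = evicted;                // reuse it on the next call
        }
        // else: score too low; spare stays ours and is overwritten next call
    }

    float minScore() { return hq.peek().score; }

    public static void main(String[] args) {
        SingleCheckCollector c = new SingleCheckCollector(3);
        for (int i = 1; i <= 5; i++) c.collect(i - 1, i);
        System.out.println(c.minScore());   // 3.0: queue holds scores 3, 4, 5
        System.out.println(c.totalHits);    // 5
    }
}
```

Compared with the current collector, there is no separate minScore pre-check followed by a second lessThan() inside the queue; the one comparison both gates entry and produces the reusable evictee.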

On Dec 10, 2007 10:15 AM, Paul Elschot <[hidden email]> wrote:



--
Regards,

Shai Erera

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless

Have you done any tests on real queries to see what impact this
improvement has in practice? Or, to measure how many ScoreDocs are
"typically" allocated?

Mike

Shai Erera wrote:





Re: Performance Improvement for Search using PriorityQueue

Shai Erera
No - I didn't try to populate an index with real data and run real queries
(what is "real", after all?). But I know from experience with indexes of
several million documents that queries can hit several hundred thousand
results (one query even hit 2.5M documents). This is typical in search:
users type 2.3 terms per query on average, so the chance of hitting a query
with a huge result set is not that small (I'm not saying this is the most
common case; I agree that most searches don't process that many documents).
However, this change improves performance from the algorithmic point of
view - you allocate at most numRequestedHits+1 ScoreDoc objects, no matter
how many documents your query processes.
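To make that bound concrete, here is a minimal, self-contained sketch of the reuse pattern (illustrative only: the class names mirror the thread, but java.util.PriorityQueue stands in for Lucene's PriorityQueue, and this is not the actual patch):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative sketch of a bounded top-N collector that allocates at most
// numHits ScoreDoc objects and reuses whatever falls out of the queue.
// Not Lucene's code; ties are handled slightly differently than the patch.
class ScoreDoc {
    int doc;
    float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

class ReusingTopDocCollector {
    private final int numHits;
    // min-heap: smallest score on top, like Lucene's HitQueue ordering
    private final PriorityQueue<ScoreDoc> hq =
        new PriorityQueue<>(Comparator.comparingDouble((ScoreDoc sd) -> sd.score));
    int totalHits;
    int allocations; // instrumentation for the sketch only

    ReusingTopDocCollector(int numHits) { this.numHits = numHits; }

    void collect(int doc, float score) {
        if (score <= 0.0f) return;
        totalHits++;
        if (hq.size() < numHits) {
            allocations++;                       // only the first numHits allocate
            hq.offer(new ScoreDoc(doc, score));
        } else if (score > hq.peek().score) {
            ScoreDoc sd = hq.poll();             // reuse the evicted smallest entry
            sd.doc = doc;
            sd.score = score;
            hq.offer(sd);
        }
    }

    int queueSize() { return hq.size(); }
}
```

However many times collect() is called, `allocations` never exceeds numHits.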

On Dec 10, 2007 10:59 AM, Michael McCandless <[hidden email]>
wrote:

>
> Have you done any tests on real queries to see what impact this
> improvement has in practice?  Or, to measure how many ScoreDocs are
> "typically" allocated?
>
> Mike
>
> Shai Erera wrote:
>
> > I agree w.r.t. the current implementation, however in the worst case
> > (as we
> > tend to consider when talking about computer algorithms), it will
> > allocate a
> > ScoreDoc per result. With the overflow reuse, it will not allocate
> > those
> > objects, no matter what's the input it gets.
> > Also, notice that there is a performance hit with the current
> > implementation
> > of TopDocCollector: it first checks that the document should be
> > inserted and
> > if so, PQ does the same check again.
> > So, if you change the current implementation to always attempt an
> > insert,
> > you'd gain some performance improvements as well.
> >
> > On Dec 10, 2007 10:15 AM, Paul Elschot <[hidden email]> wrote:
> >
> >> The current TopDocCollector only allocates a ScoreDoc when the given
> >> score causes a new ScoreDoc to be added into the queue, but it does
> >> not reuse anything that overflows out of the queue.
> >> So, reusing the overflow out of the queue should reduce object
> >> allocations. especially for indexes that tend to have better scoring
> >> docs at the end. I wouldn't expect a 30% improvement out of that,
> >> but it would help, if only to reduce occasional performance
> >> deteriorations.
> >>
> >> Regards,
> >> Paul Elschot
> >>
> >>
> >> On Monday 10 December 2007 08:11:50 Shai Erera wrote:
> >>> Hi
> >>>
> >>> Lucene's PQ implements two methods: put (assumes the PQ has room
> >>> for the
> >>> object) and insert (checks whether the object can be inserted
> >>> etc.). The
> >>> implementation of insert() requires the application that uses it to
> >> allocate
> >>> a new object every time it calls insert. Specifically, it cannot
> >>> reuse
> >> the
> >>> objects that were removed from the PQ.
> >>> I've done some measurements on the performance that search would
> >>> gain by
> >>> using that method, through the change of TopDocCollector.
> >>>
> >>> PriorityQueue change (added insertWithOverflow method)
> >>>
> >> ---------------------------------------------------------------------
> >> --------------
> >>>     /**
> >>>      * insertWithOverflow() is similar to insert(), except its
> >>> return
> >> value:
> >>> it
> >>>      * returns the object (if any) that was dropped off the heap
> >>> because
> >> it
> >>> was
> >>>      * full. This can be the given parameter (in case it is
> >>> smaller than
> >> the
> >>>      * full heap's minimum, and couldn't be added) or another object
> >> that
> >>> was
> >>>      * previously the smallest value in the heap and now has been
> >> replaced
> >>> by a
> >>>      * larger one.
> >>>      */
> >>>     public Object insertWithOverflow(Object element) {
> >>>         if (size < maxSize) {
> >>>             put(element);
> >>>             return null;
> >>>         } else if (size > 0 && !lessThan(element, top())) {
> >>>             Object ret = heap[1];
> >>>             heap[1] = element;
> >>>             adjustTop();
> >>>             return ret;
> >>>         } else {
> >>>             return element;
> >>>         }
> >>>     }
> >>> [Very similar to insert(), only it returns the object that was
> >>> kicked
> >> out of
> >>> the Queue, or null]
> >>>
> >>> TopDocCollector's current implementation of collect()
> >>>
> >> ---------------------------------------------------------------------
> >> -------
> >>>   public void collect(int doc, float score) {
> >>>     if (score > 0.0f) {
> >>>       totalHits++;
> >>>       if (hq.size() < numHits || score >= minScore) {
> >>>         hq.insert(new ScoreDoc(doc, score));
> >>>         minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
> >>>       }
> >>>     }
> >>>   }
> >>> [See how it allocates a new ScoreDoc every time this method is
> >>> called]
> >>>
> >>> TopDocCollector's new implementation of collect()
> >>>
> >> ---------------------------------------------------------------------
> >> --------
> >>>   public void collect(int doc, float score) {
> >>>       if (score == 0.0f) {
> >>>           return;
> >>>       }
> >>>       totalHits++;
> >>>       if (hq.size() < numHits || score >= minScore) {
> >>>           if (sd == null) {
> >>>               sd = new ScoreDoc(doc, score);
> >>>           } else {
> >>>               sd.doc = doc;
> >>>               sd.score = score;
> >>>           }
> >>>           sd = (ScoreDoc) hq.insertWithOverflow(sd);
> >>>           minScore = ((ScoreDoc)hq.top()).score; // maintain
> >>> minScore
> >>>       }
> >>>   }
> >>> [sd is a class member of the collector, of type ScoreDoc]
> >>>
> >>> Now for the performance tests. I've done two tests:
> >>> 1. Calling TopDocCollector.collect 1,000,000, 10,000,000 and
> >>> 100,000,000
> >>> times using both implementations. Here are the results:
> >>>                               1M         10M          100M
> >>> Current Collector      218 ms   1,907ms    20,000ms
> >>> Modified Collector    141 ms    1,297ms   12,672ms
> >>> As you can see, the new implementation is 30-40% faster.
> >>>
> >>> 2. Wrote some tasks in the benchmark package that make use of
> >>> the new
> >> PQ
> >>> while executing a real search over an index with 1M documents, of
> >>> small
> >> size
> >>> (10 characters). Here are the results:
> >>>                           Current TopDocCollector
> >>>
> >> -------------------------------------------------------------------------
> >>> ------------> Report Sum By (any) Name (5 about 6 out of 6)
> >>> Operation        round  runCnt  recsPerRun      rec/s  elapsedSec  avgUsedMem  avgTotalMem
> >>> CreateIndex          0       1           1       10.6        0.09   2,219,112    4,389,376
> >>> AddDocs_1000000      0       1     1000000   52,075.2       19.20  34,497,176   52,689,408
> >>> CloseIndex           0       2           1        0.1       16.62  26,178,972   51,378,688
> >>> OpenIndex            0       1           1    1,000.0        0.00  48,771,304   50,067,968
> >>> --> MySearch         0       1           1        5.3        0.19  48,771,304   50,067,968
> >>>
> >>>
> >>>                           Modified TopDocCollector
> >>>
> >> -------------------------------------------------------------------------
> >>> ------------> Report Sum By (any) Name (5 about 6 out of 6)
> >>> Operation        round  runCnt  recsPerRun      rec/s  elapsedSec  avgUsedMem  avgTotalMem
> >>> CreateIndex          0       1           1       10.6        0.09   2,207,008    4,389,376
> >>> AddDocs_1000000      0       1     1000000   50,955.4       19.62  32,531,992   44,825,088
> >>> CloseIndex           0       2           1        0.1       16.84  57,853,148   61,929,984
> >>> OpenIndex            0       1           1    1,000.0        0.00  57,792,136   61,929,984
> >>> --> MySearch         0       1           1        7.1        0.14  57,939,856   61,929,984
> >>> Notice the time difference in search (0.14 modified vs. 0.19
> >>> current).
> >> Here
> >>> too, a 30% improvement.
> >>>
> >>> One thing that I wasn't able to show, but I think it's pretty
> >>> much clear
> >> -->
> >>> the new implementation causes a lot less object allocations.
> >>> Consider
> >> the
> >>> typical search for 10 results over an index with 1M documents. The
> >> current
> >>> implementation will allocate 1M (!) ScoreDoc instances, while the
> >>> new
> >> one
> >>> will allocate 11 (10 for the PQ and 1 for reusing). On heavily
> >>> loaded
> >>> systems, this will result in far less work for the GC.
> >>>
> >>> I would like to suggest to reflect this new implementation in
> >> PriorityQueue
> >>> and also modify TopDocCollector to make use of this new method.
> >>> Several
> >> ways
> >>> to do it:
> >>> 1. Add insertWithOverflow to PQ (I hate the name and I'm willing to
> >> accept
> >>> better ones), make insert() deprecated (let's say in 2.3) and
> >>> release
> >> after
> >>> that (2.4?) rename insertWithOverflow to insert(). A complex
> >>> process,
> >> but it
> >>> won't break existing applications' code.
> >>> 2. Change insert's signature and state that applications that
> >>> move to
> >>> 2.3(for example), need to change their code as well (actually it
> >>> won't
> >>> compile).
> >>> 3. Any other alternatives?
> >>>
> >>> And .. of course, review the classes in the Lucene code that use
> >>> PQ and
> >>> modify them as necessary.
> >>>
> >>> A very long email for a very simple (but important) performance
> >> improvement.
> >>>
> >>> Shai
> >>>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2
Shai Erera wrote:

> No - I didn't try to populate an index with real data and run real  
> queries
> (what is "real" after all?). I know from my experience of indexes with
> several millions of documents where there are queries with several  
> hundred
> thousands results (one query even hit 2.5 M documents). This is  
> typical in
> search: users type on average 2.3 terms in a query. The chances  
> you'd hit a
> query with huge result set are not that small in such cases (I'm  
> not saying
> this is the most common case though, I agree that most of the  
> searches don't
> process that many documents).

Agreed: many queries do hit a great many results. But I agree with Paul:
it's not clear how this "typically" translates into how many ScoreDocs
actually get created.

> However, this change will improve performance from the algorithm  
> point of
> view - you allocate as many as numRequestedHits+1 no matter how many
> documents your query processes.

It's definitely a good step forward: not creating extra garbage in hot
spots is worthwhile, so I think we should make this change. Still, I'm
wondering how much it helps in practice.

I think benchmarking on "real" use cases (vs synthetic tests) is
worthwhile: it keeps you focused on what really counts, in the end.

In this particular case there are at least 2 things it could show us:

   * How many ScoreDocs really get created, or, what percentage of
     hits actually result in an insertion into the PQ?

   * How much does this save, as a percentage of the overall time
     spent searching?

Mike
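One rough way to get at the first number is to count insertions directly. The sketch below (illustrative only, not part of any patch in this thread) feeds randomly ordered scores into a top-10 min-heap; for randomly ordered scores one expects roughly numHits * ln(N/numHits) insertions, whereas monotonically increasing scores force every single hit to insert:

```java
import java.util.PriorityQueue;
import java.util.Random;

// Illustrative experiment: how many of N hits actually enter a top-numHits
// queue when scores arrive in random order?
class InsertionCounter {
    static long countInsertions(int numHits, int totalDocs, long seed) {
        PriorityQueue<Float> hq = new PriorityQueue<>(); // min-heap of scores
        Random r = new Random(seed);
        long insertions = 0;
        for (int i = 0; i < totalDocs; i++) {
            float score = r.nextFloat();
            if (hq.size() < numHits) {
                hq.offer(score);
                insertions++;
            } else if (score > hq.peek()) {
                hq.poll();               // evict current minimum
                hq.offer(score);
                insertions++;
            }
        }
        return insertions;
    }
}
```

For numHits=10 and N=1,000,000 randomly ordered scores this lands around 10 * ln(100,000) ≈ 125 insertions, i.e. a tiny fraction of the hits; the allocation saving matters most when scores trend upward over doc order.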



Re: Performance Improvement for Search using PriorityQueue

Shai Erera
Do you have a dataset and queries I can test on?

On Dec 10, 2007 1:16 PM, Michael McCandless <[hidden email]>
wrote:

> Shai Erera wrote:
>
> > No - I didn't try to populate an index with real data and run real
> > queries
> > (what is "real" after all?). I know from my experience of indexes with
> > several millions of documents where there are queries with several
> > hundred
> > thousands results (one query even hit 2.5 M documents). This is
> > typical in
> > search: users type on average 2.3 terms in a query. The chances
> > you'd hit a
> > query with huge result set are not that small in such cases (I'm
> > not saying
> > this is the most common case though, I agree that most of the
> > searches don't
> > process that many documents).
>
> Agreed: many queries do hit a great many results.  But I agree with
> Paul:
> it's not clear how this "typically" translates into how many ScoreDocs
> get created?
>
> > However, this change will improve performance from the algorithm
> > point of
> > view - you allocate as many as numRequestedHits+1 no matter how many
> > documents your query processes.
>
> It's definitely a good step forward: not creating extra garbage in hot
> spots is worthwhile, so I think we should make this change.  Still I'm
> wondering how much this helps in practice.
>
> I think benchmarking on "real" use cases (vs synthetic tests) is
> worthwhile: it keeps you focused on what really counts, in the end.
>
> In this particular case there are at least 2 things it could show us:
>
>   * How many ScoreDocs really get created, or, what %tg of hits
>     actually result in an insertion into the PQ?
>
>   * How much is this savings as a %tg of the overall time spent
>     searching?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera

Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2

I don't offhand.  Working on the indexing side is so much easier :)

You mentioned your experience with large indices & large result sets  
-- is that something you could draw on?

There have also been discussions lately about finding real search  
logs we could use for exactly this reason, though I don't think  
that's come to a good solution yet.

As a simple test you could break Wikipedia into smallish docs (~4K  
each = ~2.1 million docs), build the index, and make up a set of  
queries, or randomly pick terms for queries?  Obviously the queries  
aren't "real", but it's at least a step closer.... progress not  
perfection.

Or, if you have access to TREC...

Mike

Shai Erera wrote:

> Do you have a dataset and queries I can test on?
>
> On Dec 10, 2007 1:16 PM, Michael McCandless  
> <[hidden email]>
> wrote:
>
>> Shai Erera wrote:
>>
>>> No - I didn't try to populate an index with real data and run real
>>> queries
>>> (what is "real" after all?). I know from my experience of indexes  
>>> with
>>> several millions of documents where there are queries with several
>>> hundred
>>> thousands results (one query even hit 2.5 M documents). This is
>>> typical in
>>> search: users type on average 2.3 terms in a query. The chances
>>> you'd hit a
>>> query with huge result set are not that small in such cases (I'm
>>> not saying
>>> this is the most common case though, I agree that most of the
>>> searches don't
>>> process that many documents).
>>
>> Agreed: many queries do hit a great many results.  But I agree with
>> Paul:
>> it's not clear how this "typically" translates into how many  
>> ScoreDocs
>> get created?
>>
>>> However, this change will improve performance from the algorithm
>>> point of
>>> view - you allocate as many as numRequestedHits+1 no matter how many
>>> documents your query processes.
>>
>> It's definitely a good step forward: not creating extra garbage in  
>> hot
>> spots is worthwhile, so I think we should make this change.  Still  
>> I'm
>> wondering how much this helps in practice.
>>
>> I think benchmarking on "real" use cases (vs synthetic tests) is
>> worthwhile: it keeps you focused on what really counts, in the end.
>>
>> In this particular case there are at least 2 things it could show us:
>>
>>   * How many ScoreDocs really get created, or, what %tg of hits
>>     actually result in an insertion into the PQ?
>>
>>   * How much is this savings as a %tg of the overall time spent
>>     searching?
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>
> --
> Regards,
>
> Shai Erera




Re: Performance Improvement for Search using PriorityQueue

Shai Erera
I have access to TREC. I can try this.
W.r.t. the large indexes - I don't have access to the data, just scenarios
our customers ran into in the past.
Does the benchmark package include code to crawl Wikipedia? If not, do you
have such code? I don't want to write it from scratch for this specific
task.

On Dec 10, 2007 1:50 PM, Michael McCandless <[hidden email]>
wrote:

>
> I don't offhand.  Working on the indexing side is so much easier :)
>
> You mentioned your experience with large indices & large result sets
> -- is that something you could draw on?
>
> There have also been discussions lately about finding real search
> logs we could use for exactly this reason, though I don't think
> that's come to a good solution yet.
>
> As a simple test you could break Wikipedia into smallish docs (~4K
> each = ~2.1 million docs), build the index, and make up a set of
> queries, or randomly pick terms for queries?  Obviously the queries
> aren't "real", but it's at least a step closer.... progress not
> perfection.
>
> Or, if you have access to TREC...
>
> Mike
>
> Shai Erera wrote:
>
> > Do you have a dataset and queries I can test on?
> >
> > On Dec 10, 2007 1:16 PM, Michael McCandless
> > <[hidden email]>
> > wrote:
> >
> >> Shai Erera wrote:
> >>
> >>> No - I didn't try to populate an index with real data and run real
> >>> queries
> >>> (what is "real" after all?). I know from my experience of indexes
> >>> with
> >>> several millions of documents where there are queries with several
> >>> hundred
> >>> thousands results (one query even hit 2.5 M documents). This is
> >>> typical in
> >>> search: users type on average 2.3 terms in a query. The chances
> >>> you'd hit a
> >>> query with huge result set are not that small in such cases (I'm
> >>> not saying
> >>> this is the most common case though, I agree that most of the
> >>> searches don't
> >>> process that many documents).
> >>
> >> Agreed: many queries do hit a great many results.  But I agree with
> >> Paul:
> >> it's not clear how this "typically" translates into how many
> >> ScoreDocs
> >> get created?
> >>
> >>> However, this change will improve performance from the algorithm
> >>> point of
> >>> view - you allocate as many as numRequestedHits+1 no matter how many
> >>> documents your query processes.
> >>
> >> It's definitely a good step forward: not creating extra garbage in
> >> hot
> >> spots is worthwhile, so I think we should make this change.  Still
> >> I'm
> >> wondering how much this helps in practice.
> >>
> >> I think benchmarking on "real" use cases (vs synthetic tests) is
> >> worthwhile: it keeps you focused on what really counts, in the end.
> >>
> >> In this particular case there are at least 2 things it could show us:
> >>
> >>   * How many ScoreDocs really get created, or, what %tg of hits
> >>     actually result in an insertion into the PQ?
> >>
> >>   * How much is this savings as a %tg of the overall time spent
> >>     searching?
> >>
> >> Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Regards,

Shai Erera

Re: Performance Improvement for Search using PriorityQueue

Doron Cohen
In reply to this post by Michael McCandless-2
I have a TREC index of 25M docs that I can try this
with. Shai, if you can create an issue and upload a
patch that I (and others) can experiment with, I
will try a few queries on this index with and
without your patch...

Doron

Michael McCandless <[hidden email]> wrote on 10/12/2007
13:50:53:

>
> I don't offhand.  Working on the indexing side is so much easier :)
>
> You mentioned your experience with large indices & large result sets
> -- is that something you could draw on?
>
> There have also been discussions lately about finding real search
> logs we could use for exactly this reason, though I don't think
> that's come to a good solution yet.
>
> As a simple test you could break Wikipedia into smallish docs (~4K
> each = ~2.1 million docs), build the index, and make up a set of
> queries, or randomly pick terms for queries?  Obviously the queries
> aren't "real", but it's at least a step closer.... progress not
> perfection.
>
> Or, if you have access to TREC...
>
> Mike
>
> Shai Erera wrote:
>
> > Do you have a dataset and queries I can test on?
> >
> > On Dec 10, 2007 1:16 PM, Michael McCandless
> > <[hidden email]>
> > wrote:
> >
> >> Shai Erera wrote:
> >>
> >>> No - I didn't try to populate an index with real data and run real
> >>> queries
> >>> (what is "real" after all?). I know from my experience of indexes
> >>> with
> >>> several millions of documents where there are queries with several
> >>> hundred
> >>> thousands results (one query even hit 2.5 M documents). This is
> >>> typical in
> >>> search: users type on average 2.3 terms in a query. The chances
> >>> you'd hit a
> >>> query with huge result set are not that small in such cases (I'm
> >>> not saying
> >>> this is the most common case though, I agree that most of the
> >>> searches don't
> >>> process that many documents).
> >>
> >> Agreed: many queries do hit a great many results.  But I agree with
> >> Paul:
> >> it's not clear how this "typically" translates into how many
> >> ScoreDocs
> >> get created?
> >>
> >>> However, this change will improve performance from the algorithm
> >>> point of
> >>> view - you allocate as many as numRequestedHits+1 no matter how many
> >>> documents your query processes.
> >>
> >> It's definitely a good step forward: not creating extra garbage in
> >> hot
> >> spots is worthwhile, so I think we should make this change.  Still
> >> I'm
> >> wondering how much this helps in practice.
> >>
> >> I think benchmarking on "real" use cases (vs synthetic tests) is
> >> worthwhile: it keeps you focused on what really counts, in the end.
> >>
> >> In this particular case there are at least 2 things it could show us:
> >>
> >>   * How many ScoreDocs really get created, or, what %tg of hits
> >>     actually result in an insertion into the PQ?
> >>
> >>   * How much is this savings as a %tg of the overall time spent
> >>     searching?
> >>
> >> Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>




Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2
In reply to this post by Shai Erera

OK, sounds like a plan, thanks!

Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the  
Wikipedia XML export file.

Mike

On Dec 10, 2007, at 7:03 AM, Shai Erera wrote:

> I have access to TREC. I can try this.
> W.r.t the large indexes - I don't have access to the data, just  
> scenarios
> our customers ran into in the past.
> Does the benchmark package include code to crawl Wikipedia? If
> not, do you
> have such code? I don't want to write it from scratch for this  
> specific
> task.
>
> On Dec 10, 2007 1:50 PM, Michael McCandless  
> <[hidden email]>
> wrote:
>
>>
>> I don't offhand.  Working on the indexing side is so much easier :)
>>
>> You mentioned your experience with large indices & large result sets
>> -- is that something you could draw on?
>>
>> There have also been discussions lately about finding real search
>> logs we could use for exactly this reason, though I don't think
>> that's come to a good solution yet.
>>
>> As a simple test you could break Wikipedia into smallish docs (~4K
>> each = ~2.1 million docs), build the index, and make up a set of
>> queries, or randomly pick terms for queries?  Obviously the queries
>> aren't "real", but it's at least a step closer.... progress not
>> perfection.
>>
>> Or, if you have access to TREC...
>>
>> Mike
>>
>> Shai Erera wrote:
>>
>>> Do you have a dataset and queries I can test on?
>>>
>>> On Dec 10, 2007 1:16 PM, Michael McCandless
>>> <[hidden email]>
>>> wrote:
>>>
>>>> Shai Erera wrote:
>>>>
>>>>> No - I didn't try to populate an index with real data and run real
>>>>> queries
>>>>> (what is "real" after all?). I know from my experience of indexes
>>>>> with
>>>>> several millions of documents where there are queries with several
>>>>> hundred
>>>>> thousands results (one query even hit 2.5 M documents). This is
>>>>> typical in
>>>>> search: users type on average 2.3 terms in a query. The chances
>>>>> you'd hit a
>>>>> query with huge result set are not that small in such cases (I'm
>>>>> not saying
>>>>> this is the most common case though, I agree that most of the
>>>>> searches don't
>>>>> process that many documents).
>>>>
>>>> Agreed: many queries do hit a great many results.  But I agree with
>>>> Paul:
>>>> it's not clear how this "typically" translates into how many
>>>> ScoreDocs
>>>> get created?
>>>>
>>>>> However, this change will improve performance from the algorithm
>>>>> point of
>>>>> view - you allocate as many as numRequestedHits+1 no matter how  
>>>>> many
>>>>> documents your query processes.
>>>>
>>>> It's definitely a good step forward: not creating extra garbage in
>>>> hot
>>>> spots is worthwhile, so I think we should make this change.  Still
>>>> I'm
>>>> wondering how much this helps in practice.
>>>>
>>>> I think benchmarking on "real" use cases (vs synthetic tests) is
>>>> worthwhile: it keeps you focused on what really counts, in the end.
>>>>
>>>> In this particular case there are at least 2 things it could  
>>>> show us:
>>>>
>>>>   * How many ScoreDocs really get created, or, what %tg of hits
>>>>     actually result in an insertion into the PQ?
>>>>
>>>>   * How much is this savings as a %tg of the overall time spent
>>>>     searching?
>>>>
>>>> Mike
>>>>
>>>> -------------------------------------------------------------------
>>>> --
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Shai Erera
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>
> --
> Regards,
>
> Shai Erera




Re: Performance Improvement for Search using PriorityQueue

Paul Elschot
In reply to this post by Shai Erera
On Monday 10 December 2007 09:19:43 Shai Erera wrote:
> I agree w.r.t. the current implementation, however in the worst case (as we
> tend to consider when talking about computer algorithms), it will allocate a
> ScoreDoc per result. With the overflow reuse, it will not allocate those
> objects, no matter what's the input it gets.
> Also, notice that there is a performance hit with the current implementation
> of TopDocCollector: it first checks that the document should be inserted and
> if so, PQ does the same check again.
> So, if you change the current implementation to always attempt an insert,
> you'd gain some performance improvements as well.

One could consider a specialized priority queue, somewhat like
ScorerDocQueue, with an insert operation that takes the arguments of the
collect() call instead of a ScoreDoc object.
I don't know where HitQueue is currently used, but that could be the place
to inline PriorityQueue instead of inheriting from it. Otherwise,
TopDocCollector could be used for inlining PriorityQueue.

Although practical tests are always good, an extra performance test with
the worst case of slowly increasing score values should be quite helpful,
both for comparing with the current implementation and for getting the
last bits of performance out of this.

Regards,
Paul Elschot
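A sketch of such a specialized queue (illustrative only; the name and API are made up for this message, and ScorerDocQueue's actual interface may differ) could keep doc and score in parallel primitive arrays, so collect() passes primitives and no ScoreDoc is ever allocated:

```java
// Illustrative sketch of Paul's idea: a fixed-size min-heap over parallel
// primitive arrays. insert(doc, score) takes the collect() arguments
// directly, so no per-hit object is created. Not Lucene code.
class PrimitiveHitQueue {
    private final int maxSize;
    private final int[] docs;
    private final float[] scores;
    private int size;

    PrimitiveHitQueue(int maxSize) {
        this.maxSize = maxSize;
        docs = new int[maxSize + 1];    // 1-based heap; slot 0 unused
        scores = new float[maxSize + 1];
    }

    int size() { return size; }
    float topScore() { return scores[1]; } // smallest retained score

    /** Insert (doc, score); when full, the smallest entry is overwritten. */
    void insert(int doc, float score) {
        if (size < maxSize) {
            int i = ++size;
            while (i > 1 && score < scores[i >> 1]) { // up-heap
                docs[i] = docs[i >> 1];
                scores[i] = scores[i >> 1];
                i >>= 1;
            }
            docs[i] = doc;
            scores[i] = score;
        } else if (size > 0 && score > scores[1]) {
            docs[1] = doc;
            scores[1] = score;
            downHeap();
        }
    }

    private void downHeap() {
        int i = 1;
        int d = docs[1];
        float s = scores[1];
        int child = 2;
        while (child <= size) {
            if (child < size && scores[child + 1] < scores[child]) child++;
            if (s <= scores[child]) break;
            docs[i] = docs[child];
            scores[i] = scores[child];
            i = child;
            child = i << 1;
        }
        docs[i] = d;
        scores[i] = s;
    }
}
```

This trades generality for zero per-hit allocation; reading results back out would need an extraction step (e.g. repeated delete-min) that the sketch omits.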


>
> On Dec 10, 2007 10:15 AM, Paul Elschot <[hidden email]> wrote:
>
> > The current TopDocCollector only allocates a ScoreDoc when the given
> > score causes a new ScoreDoc to be added into the queue, but it does
> > not reuse anything that overflows out of the queue.
> > So, reusing the overflow out of the queue should reduce object
> > allocations. especially for indexes that tend to have better scoring
> > docs at the end. I wouldn't expect a 30% improvement out of that,
> > but it would help, if only to reduce occasional performance
> > deteriorations.
> >
> > Regards,
> > Paul Elschot
> >
> >
> > On Monday 10 December 2007 08:11:50 Shai Erera wrote:
> > > Hi
> > >
> > > Lucene's PQ implements two methods: put (assumes the PQ has room for the
> > > object) and insert (checks whether the object can be inserted etc.). The
> > > implementation of insert() requires the application that uses it to
> > allocate
> > > a new object every time it calls insert. Specifically, it cannot reuse
> > the
> > > objects that were removed from the PQ.
> > > I've done some measurements on the performance that search would gain by
> > > using that method, through the change of TopDocCollector.
> > >
> > > PriorityQueue change (added insertWithOverflow method)
> > >
> > -----------------------------------------------------------------------------------
> > >     /**
> > >      * insertWithOverflow() is similar to insert(), except its return
> > value:
> > > it
> > >      * returns the object (if any) that was dropped off the heap because
> > it
> > > was
> > >      * full. This can be the given parameter (in case it is smaller than
> > the
> > >      * full heap's minimum, and couldn't be added) or another object
> > that
> > > was
> > >      * previously the smallest value in the heap and now has been
> > replaced
> > > by a
> > >      * larger one.
> > >      */
> > >     public Object insertWithOverflow(Object element) {
> > >         if (size < maxSize) {
> > >             put(element);
> > >             return null;
> > >         } else if (size > 0 && !lessThan(element, top())) {
> > >             Object ret = heap[1];
> > >             heap[1] = element;
> > >             adjustTop();
> > >             return ret;
> > >         } else {
> > >             return element;
> > >         }
> > >     }
> > > [Very similar to insert(), only it returns the object that was kicked
> > out of
> > > the Queue, or null]
> > >
> > > TopDocCollector's current implementation of collect()
> > >
> > ----------------------------------------------------------------------------
> > >   public void collect(int doc, float score) {
> > >     if (score > 0.0f) {
> > >       totalHits++;
> > >       if (hq.size() < numHits || score >= minScore) {
        hq.insert(new ScoreDoc(doc, score));
        minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
      }
    }
  }
[See how it allocates a new ScoreDoc every time this method is called]

TopDocCollector's new implementation of collect()
-----------------------------------------------------------------------------
  public void collect(int doc, float score) {
      if (score == 0.0f) {
          return;
      }
      totalHits++;
      if (hq.size() < numHits || score >= minScore) {
          if (sd == null) {
              sd = new ScoreDoc(doc, score);
          } else {
              sd.doc = doc;
              sd.score = score;
          }
          sd = (ScoreDoc) hq.insertWithOverflow(sd);
          minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
      }
  }
[sd is a class member of the collector, of type ScoreDoc]

Now for the performance tests. I've done two tests:
1. Calling TopDocCollector.collect 1,000,000, 10,000,000 and 100,000,000
times using both implementations. Here are the results:

                         1M          10M          100M
Current Collector    218 ms     1,907 ms     20,000 ms
Modified Collector   141 ms     1,297 ms     12,672 ms

As you can see, the new implementation is 30-40% faster.

2. I wrote some tasks in the benchmark package that make use of the new PQ
while executing a real search over an index with 1M small documents (10
characters each). Here are the results:

                          Current TopDocCollector
-------------------------------------------------------------------------------
------------> Report Sum By (any) Name (5 about 6 out of 6)
Operation        round  runCnt  recsPerRun      rec/s  elapsedSec  avgUsedMem  avgTotalMem
CreateIndex          0       1           1       10.6        0.09   2,219,112    4,389,376
AddDocs_1000000      0       1     1000000   52,075.2       19.20  34,497,176   52,689,408
CloseIndex           0       2           1        0.1       16.62  26,178,972   51,378,688
OpenIndex            0       1           1    1,000.0        0.00  48,771,304   50,067,968
--> MySearch         0       1           1        5.3        0.19  48,771,304   50,067,968

                          Modified TopDocCollector
-------------------------------------------------------------------------------
------------> Report Sum By (any) Name (5 about 6 out of 6)
Operation        round  runCnt  recsPerRun      rec/s  elapsedSec  avgUsedMem  avgTotalMem
CreateIndex          0       1           1       10.6        0.09   2,207,008    4,389,376
AddDocs_1000000      0       1     1000000   50,955.4       19.62  32,531,992   44,825,088
CloseIndex           0       2           1        0.1       16.84  57,853,148   61,929,984
OpenIndex            0       1           1    1,000.0        0.00  57,792,136   61,929,984
--> MySearch         0       1           1        7.1        0.14  57,939,856   61,929,984

Notice the time difference in search (0.14s modified vs. 0.19s current). Here
too, a 30% improvement.

One thing that I wasn't able to show, but I think is pretty clear: the new
implementation causes far fewer object allocations. Consider the typical
search for 10 results over an index with 1M documents. The current
implementation will allocate 1M (!) ScoreDoc instances, while the new one
will allocate 11 (10 for the PQ and 1 for reusing). On heavily loaded
systems, this will result in far less work for the GC.

I would like to suggest reflecting this new implementation in PriorityQueue
and also modifying TopDocCollector to make use of the new method. Several
ways to do it:
1. Add insertWithOverflow to PQ (I hate the name and I'm willing to accept
better ones), deprecate insert() (let's say in 2.3), and in the release
after that (2.4?) rename insertWithOverflow to insert(). A complex process,
but it won't break existing applications' code.
2. Change insert's signature and state that applications that move to 2.3
(for example) need to change their code as well (actually it won't compile).
3. Any other alternatives?

And .. of course, review the classes in the Lucene code that use PQ and
modify them as necessary.

A very long email for a very simple (but important) performance improvement.

Shai



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
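The reuse pattern described in the post above can be sketched as a minimal, self-contained Java program. The classes here are simplified stand-ins (a JDK `PriorityQueue` instead of Lucene's, and a hypothetical `collectWithReuse` helper), not Lucene's actual API; the point is only to show that the evicted object is recycled, so allocations stay at numHits + 1:

```java
import java.util.PriorityQueue;

public class ReuseDemo {

    // Simplified stand-in for Lucene's ScoreDoc.
    static class ScoreDoc {
        int doc;
        float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * Collects the top numHits scores out of totalDocs synthetic documents,
     * reusing the object evicted from the queue (the "insertWithOverflow"
     * idea). Returns how many ScoreDoc instances were allocated.
     */
    static int collectWithReuse(int totalDocs, int numHits) {
        // Min-heap ordered by score, so peek() is the smallest of the top hits.
        PriorityQueue<ScoreDoc> hq =
            new PriorityQueue<>(numHits, (a, b) -> Float.compare(a.score, b.score));
        int allocations = 0;
        ScoreDoc spare = null;                     // the single reusable instance
        for (int doc = 0; doc < totalDocs; doc++) {
            float score = (doc * 31 % 997) / 997f; // arbitrary synthetic scores
            if (hq.size() == numHits && score <= hq.peek().score) {
                continue;                          // cannot enter the top numHits
            }
            if (spare == null) {
                spare = new ScoreDoc(doc, score);
                allocations++;
            } else {
                spare.doc = doc;                   // reuse the evicted object
                spare.score = score;
            }
            hq.offer(spare);                       // insert, then evict the minimum
            spare = hq.size() > numHits ? hq.poll() : null;
        }
        return allocations;
    }

    public static void main(String[] args) {
        // Prints "allocations: 11", i.e. numHits + 1, instead of one
        // allocation per qualifying insertion.
        System.out.println("allocations: " + collectWithReuse(1_000_000, 10));
    }
}
```

The first numHits insertions each allocate (there is nothing to reuse yet), the first overflow allocates the one extra instance, and from then on the evicted object is recycled forever.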


Re: Performance Improvement for Search using PriorityQueue

Shai Erera
Hi

Well, I have results from a 1M index for now (the index contains 100K
documents duplicated 10 times, so it's not the final test I'll run, but it
still shows something). I ran 2000 short queries (2.4 keywords on average)
on a 1M docs index, after 50 queries warm-up. Following are the results:

Current TopDocCollector:
------------------------------------
num queries: 2000
numDocs=1000000
total time: 15910 ms
avg time: 7.955 ms
avg. allocations: 79.7445
total allocation time: 0
avg. num results: 54841.26

Modified TopDocCollector:
-------------------------------------
num queries: 2000
numDocs=1000000
total time: 15909 ms
avg time: 7.9545 ms
avg. allocations: 9.8565
total allocation time: 0
avg. num results: 54841.26

As you can see, the actual allocation time is really negligible and there
isn't much difference in the avg. running times of the queries. However, the
*current* runs performed a lot worse at the beginning, before the OS cache
warmed up.
The only significant difference is the number of allocations: the modified
TDC and PQ allocate ~90% (!) fewer objects. This is significant, especially
in heavily loaded systems.

Tomorrow I will run the same test on a 10M (unique) documents index, but I
think the picture is getting clear ...


On Dec 10, 2007 6:15 PM, Paul Elschot <[hidden email]> wrote:

> On Monday 10 December 2007 09:19:43 Shai Erera wrote:
> > I agree w.r.t. the current implementation, however in the worst case (as
> > we tend to consider when talking about computer algorithms), it will
> > allocate a ScoreDoc per result. With the overflow reuse, it will not
> > allocate those objects, no matter what input it gets.
> > Also, notice that there is a performance hit with the current
> > implementation of TopDocCollector: it first checks that the document
> > should be inserted and, if so, PQ does the same check again.
> > So, if you change the current implementation to always attempt an
> > insert, you'd gain some performance improvements as well.
>
> One could consider a specialized priority queue, somewhat like
> ScorerDocQueue, with an insert operation that takes the arguments
> of the collect() call instead of a ScoreDoc object.
> I don't know where HitQueue is currently used, but that could be
> the place to inline PriorityQueue instead of inheriting from it.
> Otherwise TopDocCollector could be used for inlining PriorityQueue.
>
> Although practical tests are always good, an extra performance test
> with the worst case of slowly increasing score values should be quite
> helpful for comparing with the current implementation and to get the last
> bits of performance out of this.
>
> Regards,
> Paul Elschot
>
>
> >
> > On Dec 10, 2007 10:15 AM, Paul Elschot <[hidden email]> wrote:
> >
> > > The current TopDocCollector only allocates a ScoreDoc when the given
> > > score causes a new ScoreDoc to be added into the queue, but it does
> > > not reuse anything that overflows out of the queue.
> > > So, reusing the overflow out of the queue should reduce object
> > > allocations, especially for indexes that tend to have better scoring
> > > docs at the end. I wouldn't expect a 30% improvement out of that,
> > > but it would help, if only to reduce occasional performance
> > > deteriorations.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > >
> > > On Monday 10 December 2007 08:11:50 Shai Erera wrote:
> > > > [...]


--
Regards,

Shai Erera
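Paul's suggested worst case above, slowly increasing scores so that nearly every document displaces the queue's current minimum, is exactly where the allocate-per-insert collector degenerates to one allocation per document. A hedged sketch of both variants (hypothetical simplified code using a JDK `PriorityQueue` and `float[]` elements, not Lucene's classes):

```java
import java.util.PriorityQueue;

public class WorstCaseDemo {
    // Elements are float[]{score, doc}; the comparator orders by score,
    // so peek() is the current minimum of the top hits.

    /** Mimics the current collector: a fresh object on every qualifying hit. */
    static int allocateAlways(int totalDocs, int numHits) {
        PriorityQueue<float[]> hq =
            new PriorityQueue<>(numHits, (a, b) -> Float.compare(a[0], b[0]));
        int allocations = 0;
        for (int doc = 0; doc < totalDocs; doc++) {
            float score = doc;                         // slowly increasing scores
            if (hq.size() < numHits || score >= hq.peek()[0]) {
                hq.offer(new float[] { score, doc });  // fresh object each time
                allocations++;
                if (hq.size() > numHits) hq.poll();    // overflow becomes garbage
            }
        }
        return allocations;
    }

    /** Mimics the proposed collector: the overflow object is reused. */
    static int reuseOverflow(int totalDocs, int numHits) {
        PriorityQueue<float[]> hq =
            new PriorityQueue<>(numHits, (a, b) -> Float.compare(a[0], b[0]));
        int allocations = 0;
        float[] spare = null;
        for (int doc = 0; doc < totalDocs; doc++) {
            float score = doc;
            if (hq.size() == numHits && score <= hq.peek()[0]) continue;
            if (spare == null) { spare = new float[2]; allocations++; }
            spare[0] = score;                          // reuse the evicted object
            spare[1] = doc;
            hq.offer(spare);
            spare = hq.size() > numHits ? hq.poll() : null;
        }
        return allocations;
    }

    public static void main(String[] args) {
        // In this worst case every document qualifies: the allocate-always
        // variant creates one object per document (1,000,000), while the
        // reuse variant stays at numHits + 1 (11).
        System.out.println("allocate-always: " + allocateAlways(1_000_000, 10));
        System.out.println("reuse-overflow:  " + reuseOverflow(1_000_000, 10));
    }
}
```

This also illustrates Shai's point about the double check: the collector tests `score >= min` and the queue then repeats an equivalent comparison, which a combined insert-with-overflow operation avoids.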

Re: Performance Improvement for Search using PriorityQueue

Mike Klaas
On 10-Dec-07, at 11:31 AM, Shai Erera wrote:

> As you can see, the actual allocation time is really negligible and  
> there
> isn't much difference in the avg. running times of the queries.  
> However, the
> *current* runs performed a lot worse at the beginning, before the  
> OS cache
> warmed up.

This surprises me.  I would have thought that the query execution at
this point would be IO-bound, hence _less_ susceptible to a little
extra gc.

> The only significant difference is the number of allocations - the  
> modified
> TDC and PQ allocate ~90% (!) less objects. This is significant,  
> especially
> in heavy loaded systems.

I agree; this is desirable.

nice work,
-Mike





Re: Performance Improvement for Search using PriorityQueue

Michael McCandless-2
In reply to this post by Shai Erera

Shai Erera wrote:

> Hi
>
> Well, I have results from a 1M index for now (the index contains 100K
> documents duplicated 10 times, so it's not the final test I'll run,  
> but it
> still shows something). I ran 2000 short queries (2.4 keywords on  
> average)
> on a 1M docs index, after 50 queries warm-up. Following are the  
> results:
>
> Current TopDocCollector:
> ------------------------------------
> num queries: 2000
> numDocs=1000000
> total time: 15910 ms
> avg time: 7.955 ms
> avg. allocations: 79.7445
> total allocation time: 0
> avg. num results: 54841.26

Avg number of allocations per query is ~80?  I.e., there were only ~80
inserts into the PQ, meaning it had very quickly accumulated the top
scoring docs and then doesn't change much after that?  This seems
surprisingly low?  Am I mis-reading this or something?

> Modified TopDocCollector:
> -------------------------------------
> num queries: 2000
> numDocs=1000000
> total time: 15909 ms
> avg time: 7.9545 ms
> avg. allocations: 9.8565
> total allocation time: 0
> avg. num results: 54841.26
>
> As you can see, the actual allocation time is really negligible and  
> there
> isn't much difference in the avg. running times of the queries.  
> However, the
> *current* runs performed a lot worse at the beginning, before the  
> OS cache
> warmed up.

I'm also baffled by that difference, especially if we are only  
talking about 80 allocations per query to begin with.  Did you flush  
OS cache before running "Modified" to make sure it wasn't just using  
the cache?

> The only significant difference is the number of allocations - the  
> modified
> TDC and PQ allocate ~90% (!) less objects. This is significant,  
> especially
> in heavy loaded systems.

But, 80 allocations is really not very many to begin with?

Mike



Re: Performance Improvement for Search using PriorityQueue

Robert Engels
In reply to this post by Shai Erera
As has been pointed out on many threads, a modern JVM really doesn't
need you to recycle objects, especially small, short-lived ones.

It is actually less efficient in many cases (since the variables need  
to be reinitialized).

Using object pools (except when pooling external resources) is a  
waste of time.

On Dec 10, 2007, at 1:31 PM, Shai Erera wrote:

> [...]




Re: Performance Improvement for Search using PriorityQueue

Shai Erera
In reply to this post by Mike Klaas
Actually, queries on large indexes are not necessarily I/O bound. It depends
on how much of the posting list is read into memory at once. I'm not that
familiar with the innermost parts of Lucene, but let's assume a posting
element takes 4 bytes for the docId and 2 more bytes per position in a
document (that's without compression; I'm sure Lucene does some compression
on the doc IDs). So I don't think I'll miss by much by guessing that a
posting element takes at most 10 bytes, which means that 1M posting elements
take 10MB (and that is considered a very long posting list).
Therefore, if you read it into memory in chunks (16, 32, 64 KB), the query
spends most of its time in the CPU, computing the scores, maintaining the
PQ, etc. The real I/O operations only involve reading fragments of the
posting list into memory, and on today's hardware, reading 10MB is pretty
fast.
So I wouldn't be surprised here (unless I misunderstood you).
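The arithmetic behind this estimate can be sketched quickly. The 10-byte figure and the 64 KB chunk size are the thread's own back-of-envelope assumptions, not Lucene's actual on-disk encoding (which is compressed and smaller):

```java
// Back-of-envelope check of the posting-list estimate above.
// 10 bytes per posting element is an uncompressed upper-bound guess.
long postings = 1_000_000L;                    // a very long posting list
long bytesPerPosting = 10;                     // ~4 (docId) + positions, uncompressed
long totalBytes = postings * bytesPerPosting;  // 10,000,000 bytes, i.e. ~10 MB
long chunkSize = 64 * 1024;                    // read the list in 64 KB chunks
long reads = (totalBytes + chunkSize - 1) / chunkSize;  // ceil: ~153 buffered reads
```

So even a very long posting list is covered by a few hundred buffered reads, which is why the argument above expects the query to be CPU-bound once the data is cached.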

> I agree; this is desirable.
I will run the test with 10M documents tomorrow and, if the results are the
same, will open an issue. Is that agreed?

On Dec 10, 2007 9:59 PM, Mike Klaas <[hidden email]> wrote:

> On 10-Dec-07, at 11:31 AM, Shai Erera wrote:
>
> > As you can see, the actual allocation time is really negligible and
> > there isn't much difference in the avg. running times of the queries.
> > However, the *current* runs performed a lot worse at the beginning,
> > before the OS cache warmed up.
>
> This surprises me.  I would have thought that the query execution at
> this point would be IO-bound, hence _less_ susceptible to a little
> extra gc.
>
> > The only significant difference is the number of allocations - the
> > modified TDC and PQ allocate ~90% (!) fewer objects. This is
> > significant, especially in heavily loaded systems.
>
> I agree; this is desirable.
>
> nice work,
> -Mike
>
>
>
>
>


--
Regards,

Shai Erera

Re: Performance Improvement for Search using PriorityQueue

Shai Erera
In reply to this post by Michael McCandless-2
Well .. I suspect this behavior is due to the nature of the index - 100K
docs duplicated 10 times - so at some point it hits the same documents (and
scores). Like I said, tomorrow I'll re-run the test on an index of 10M
unique docs.
I agree that 80 allocations are not much, but that's per query. There were
160,000 allocations overall, which does create some work for the GC.
Why not save those allocations?
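The reuse pattern being argued for can be sketched with a stand-in collector. The classes below are simplified illustrations, not Lucene's TopDocCollector or PriorityQueue (a `java.util.PriorityQueue` stands in for Lucene's heap); the insert-with-overflow step mirrors the method proposed earlier in the thread:

```java
import java.util.PriorityQueue;

// Simplified stand-ins for ScoreDoc and a collector that reuses the
// object evicted by an insert-with-overflow operation.
class ScoreDoc {
    int doc;
    float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

class ReusingCollector {
    final PriorityQueue<ScoreDoc> hq;   // min-heap: smallest of the top-N on top
    final int numHits;
    ScoreDoc spare;                     // the single reusable instance
    long allocations = 0;

    ReusingCollector(int numHits) {
        this.numHits = numHits;
        this.hq = new PriorityQueue<>(numHits,
                (a, b) -> Float.compare(a.score, b.score));
    }

    void collect(int doc, float score) {
        if (spare == null) {
            spare = new ScoreDoc(doc, score);  // allocate only when no spare is held
            allocations++;
        } else {
            spare.doc = doc;                   // reuse the previously dropped object
            spare.score = score;
        }
        if (hq.size() < numHits) {             // queue not full: keep the hit
            hq.add(spare);
            spare = null;
        } else if (score > hq.peek().score) {  // beats the current minimum: swap
            ScoreDoc dropped = hq.poll();
            hq.add(spare);
            spare = dropped;                   // the evicted object is the next spare
        }
        // otherwise the rejected hit stays in 'spare' and is overwritten next call
    }
}
```

Collecting 1M hits with numHits=10 under this scheme allocates exactly 11 ScoreDoc objects (10 in the queue plus the one spare), matching the "11 instead of 1M" figure quoted at the top of the thread.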

On Dec 10, 2007 10:09 PM, Michael McCandless <[hidden email]>
wrote:

>
> Shai Erera wrote:
>
> > Hi
> >
> > Well, I have results from a 1M index for now (the index contains 100K
> > documents duplicated 10 times, so it's not the final test I'll run,
> > but it still shows something). I ran 2000 short queries (2.4 keywords
> > on average) on a 1M docs index, after 50 queries warm-up. Following
> > are the results:
> >
> > Current TopDocCollector:
> > ------------------------------------
> > num queries: 2000
> > numDocs=1000000
> > total time: 15910 ms
> > avg time: 7.955 ms
> > avg. allocations: 79.7445
> > total allocation time: 0
> > avg. num results: 54841.26
>
> Avg number of allocations per query is ~80?  Ie, there were only ~80
> inserts in to the PQ, meaning it had very quickly accumulated the top
> scoring docs and then doesn't change much after that?  This seems
> surprisingly low?  Am I mis-reading this or something?
>
> > Modified TopDocCollector:
> > -------------------------------------
> > num queries: 2000
> > numDocs=1000000
> > total time: 15909 ms
> > avg time: 7.9545 ms
> > avg. allocations: 9.8565
> > total allocation time: 0
> > avg. num results: 54841.26
> >
> > As you can see, the actual allocation time is really negligible and
> > there isn't much difference in the avg. running times of the queries.
> > However, the *current* runs performed a lot worse at the beginning,
> > before the OS cache warmed up.
>
> I'm also baffled by that difference, especially if we are only
> talking about 80 allocations per query to begin with.  Did you flush
> OS cache before running "Modified" to make sure it wasn't just using
> the cache?
>
> > The only significant difference is the number of allocations - the
> > modified TDC and PQ allocate ~90% (!) fewer objects. This is
> > significant, especially in heavily loaded systems.
>
> But, 80 allocations is really not very many to begin with?
>
> Mike
>
>
>


--
Regards,

Shai Erera

Re: Performance Improvement for Search using PriorityQueue

Robert Engels
In reply to this post by Shai Erera
I think you might be underestimating the IO cost a bit.

A modern drive does 40-70 MB per second, but that is for sequential
reads. The seek time (9 ms is good) can kill you.

Because of disk fragmentation, there is no guarantee that the posting
list is sequential on disk. Obviously the OS and the drive controller
read ahead, trying to improve things, but I think you would be
surprised how much IO affects things.

A lot of this is going to depend on the size of the index. If it all
fits in the disk cache, repeated reads (even by multiple threads) are
going to be very fast.
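A rough sketch of that trade-off, using the figures quoted above (50 MB/s as the middle of the 40-70 MB/s range, 9 ms per seek) and the 10 MB posting list read in 64 KB chunks from earlier in the thread; these are assumptions for illustration, not measurements:

```java
// Sequential vs. seek-dominated cost for reading a 10 MB posting list.
double bytes = 10e6;                       // ~10 MB posting list
double bytesPerMs = 50e6 / 1000.0;         // 50 MB/s sequential throughput
double seekMs = 9.0;                       // one seek per fragmented chunk
long chunks = (long) Math.ceil(bytes / (64 * 1024));  // 64 KB reads -> 153 chunks

double sequentialMs = bytes / bytesPerMs;             // ~200 ms with no seeks
double fragmentedMs = sequentialMs + chunks * seekMs; // ~1.6 s if every chunk seeks
```

So a fully fragmented read of the same data costs roughly 8x the sequential case, which is the point above: seek time, not raw throughput, is what dominates a cold read.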


On Dec 10, 2007, at 2:11 PM, Shai Erera wrote:

> Actually, queries on large indexes are not necessarily I/O bound. It
> depends on how much of the posting list is being read into memory at
> once. [...] In today's hardware, reading 10MB into memory is pretty
> fast. So I wouldn't be surprised here (unless I misunderstood you).
>
> [...]
>
> --
> Regards,
>
> Shai Erera




Re: Performance Improvement for Search using PriorityQueue

Mike Klaas
In reply to this post by Shai Erera
On 10-Dec-07, at 12:11 PM, Shai Erera wrote:

> Actually, queries on large indexes are not necessarily I/O bound. It
> depends on how much of the posting list is being read into memory at
> once. [...] In today's hardware, reading 10MB into memory is pretty
> fast. So I wouldn't be surprised here (unless I misunderstood you).

My experience is that queries against indices which haven't been warmed
into the OS disk cache are many times slower (this is especially true
if the prox file is used at all).

I initially assumed that you had cleared the OS disk cache between the
runs of the two algorithms and were seeing a difference in uncached
query performance. I gather from your comments, though, that this isn't
the case at all.

-Mike

