eliminating scoring for the sake of efficiency

eliminating scoring for the sake of efficiency

Boris Galitsky-2
Hello

    We don't need any scoring in our application domain, but
efficiency is key because we are getting tens of thousands of hits for
span queries, and all of these hits have to be collected.
    Is there a simple way to turn scoring off while indexing, while
searching, and while delivering document IDs, to save time?

Best regards
Boris

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: eliminating scoring for the sake of efficiency

Paul Elschot
On Thursday 11 May 2006 22:42, Boris Galitsky wrote:
> Hello
>
>     We don't need any scoring in our application domain, but
> efficiency is key because we are getting tens of thousands of hits for
> span queries, and all of these hits have to be collected.
>     Is there a simple way to turn scoring off while indexing, while
> searching, and while delivering document IDs, to save time?

You could use getSpans() on the top level SpanQuery, and use a loop
calling next() on the Spans, and ignore duplicate doc() values from the Spans
in that loop.
A counter in the loop would also give you the number of matching occurrences
of the SpanQuery.

This way of using the Spans directly should be slightly more efficient than
using a HitCollector, but don't hold your breath.
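
The loop described above can be sketched as follows. This is only an illustration, not Lucene code: `SpanCursor` is a hypothetical stand-in that exposes just the `next()` and `doc()` calls mentioned above, so the example runs without Lucene on the classpath. It also assumes that matches arrive grouped by document, so comparing against the previous doc value is enough to skip duplicates.

```java
import java.util.ArrayList;
import java.util.List;

class SpanDocCollector {

    /** Hypothetical stand-in for the two Spans methods used in the loop. */
    interface SpanCursor {
        boolean next(); // advance to the next match; false when exhausted
        int doc();      // document number of the current match
    }

    /** Unique document IDs in iteration order, plus the total match count. */
    static final class Result {
        final List<Integer> docIds = new ArrayList<>();
        int occurrences = 0;
    }

    /** One pass over the matches: count every match, keep each doc once. */
    static Result collect(SpanCursor spans) {
        Result r = new Result();
        int lastDoc = -1; // assumption: real document numbers are never negative
        while (spans.next()) {
            r.occurrences++;
            int doc = spans.doc();
            if (doc != lastDoc) { // assumption: matches are grouped by document
                r.docIds.add(doc);
                lastDoc = doc;
            }
        }
        return r;
    }

    /** Array-backed cursor, for demonstration only. */
    static SpanCursor fromArray(final int[] docs) {
        return new SpanCursor() {
            private int i = -1;
            public boolean next() { return ++i < docs.length; }
            public int doc() { return docs[i]; }
        };
    }

    public static void main(String[] args) {
        Result r = collect(fromArray(new int[] {3, 3, 7, 12, 12, 12}));
        System.out.println(r.docIds + " from " + r.occurrences + " matches");
    }
}
```

With a real index, the cursor would be replaced by the Spans obtained from getSpans() on the top-level query; no score is computed anywhere in this loop.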

In case you have ordered SpanQuery instances without overlaps, the
NearSpansOrdered here might be a bit faster than the NearSpans
currently in Lucene:
http://issues.apache.org/jira/browse/LUCENE-413
(you'll also need the patch to SpanNearQuery).

Regards,
Paul Elschot

accelerate hits.id(i) function: eliminating scoring for the sake of efficiency

Boris Galitsky-2
Yes, thanks Paul.

  We are already using
> getSpans() on the top level SpanQuery, and use a loop
> calling next() on the Spans, and ignore duplicate doc() values from the Spans
> in that loop.
> A counter in the loop would also give you the number of matching occurrences
> of the SpanQuery.

I will look into
> NearSpansOrdered here might be a bit faster than the NearSpans

However, what significantly slows us down is the hits.id(i) function.
Can we accelerate it somehow by "cleaning" the scoring out of the
Lucene code itself?

Best regards
Boris





Re: accelerate hits.id(i) function: eliminating scoring for the sake of efficiency

Chris Hostetter-3

: However, what significantly slows us down is the hits.id(i) function.
: Can we accelerate it somehow by "cleaning" the scoring out of the
: Lucene code itself?

you said in your last message...

:     We don't need any scoring in our application domain, but
: efficiency is key because we are getting tens of thousands of hits for
: span queries, and all of these hits have to be collected.

if you are iterating over all of the matching documents for each query,
and you are getting more than a few dozen matches for each query, then you
should not be using the Hits object at all.

Hits is designed for the "common case" of paginated searches with
10-20 items per page, that rarely care about going past page 5 or 6, and
don't mind if the high numbered pages take a little longer.

If you are iterating over all the matches, then you want to be using a
HitCollector.  If you use a Hits object, and you iterate past the first
100 results: it will do your search twice under the covers; if you go past
the 200th result, it will do your search three times; past 400, it will do
it 4 times, etc...
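
To put numbers on that for tens of thousands of hits, here is a small back-of-the-envelope model. It is not taken from the Lucene source: it simply assumes the pattern stated above, where the first search covers 100 results and every time the cached results run out the search is repeated for twice as many. The class and method names are made up for this illustration.

```java
class HitsResearchModel {

    /**
     * Number of times the query would run if all totalHits results were read
     * through a cache that starts at 100 entries and doubles when exhausted.
     * The 100 and the doubling are assumptions taken from the figures above.
     */
    static int searchesNeeded(int totalHits) {
        int searches = 1;  // the initial search
        long cached = 100; // results covered by the initial search
        while (cached < totalHits) {
            cached *= 2;   // repeat the search for twice as many results
            searches++;
        }
        return searches;
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 101, 201, 401, 50000}) {
            System.out.println(n + " hits -> " + searchesNeeded(n) + " searches");
        }
    }
}
```

Under this model, reading 50,000 hits runs the same query ten times, and each repeat redoes the work of the earlier ones. A HitCollector sees every match in a single pass instead.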



-Hoss

