[lucy-user] Hits offset and search performarce

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[lucy-user] Hits offset and search performarce

Thomas den Braber
Hallo,

I am migrating from Swish-e to Lucy, most of the features I have implemented without many
troubles but I have found some (big) differences in performance when using the offset
parameter in the hits object.

If I do:

my $hits = $searcher->hits(
        query => $queryobj,
        sort_spec => $sort_spec,
        num_wanted => 20,
        offset => 5000
);

The search time is much higher then with the offset set to 0. In Swish-e there is only a
small increase in search time.

I found out that when running with  num_wanted => 5000 and offset => 0 the search time is
almost the same as with  num_wanted => 20 and offset => 5000. It looks like there is no
optimization when using 'offset'.
I would expect that using the offset, performance should be higher because no processing
needs to be done to the hits before the offset (no score calculation).

Is there a way to improve the search performance when using the 'offset' ?
Could this be something for future versions ?

Also I am missing the seek command (that exists in swish-e) to fast forward through my
results or did I overlook something ?

But in general I am very satisfied with Lucy so far. Thanks for the work guys !

Thomas


Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Marvin Humphrey
On Sat, Nov 10, 2012 at 11:13 AM, Thomas den Braber <[hidden email]> wrote:

> I am migrating from Swish-e to Lucy, most of the features I have implemented
> without many troubles but I have found some (big) differences in performance
> when using the offset parameter in the hits object.
>
> If I do:
>
> my $hits = $searcher->hits(
>         query => $queryobj,
>         sort_spec => $sort_spec,
>         num_wanted => 20,
>         offset => 5000
> );
>
> The search time is much higher then with the offset set to 0. In Swish-e
> there is only a small increase in search time.

I don't know how Swish-e implements sorting of hits, but this is expected
behavior in Lucy.

While iterating through matches, the top hits are kept in a priority queue.
The size of the priority queue is `offset + num_wanted`.  The bigger the
priority queue, the more MatchDoc objects which have to be created to serve as
elements within the priority queue, the more comparison operations which have
to take place to keep the priority queue sorted, etc -- impacting both speed
and memory footprint.

> I found out that when running with  num_wanted => 5000 and offset => 0 the
> search time is almost the same as with num_wanted => 20 and offset => 5000.
> It looks like there is no optimization when using 'offset'.

In one case, the priority queue has a size of 5000, and in the other it has a
size of 5020 -- not much difference!

(Note: we don't retrieve any of the actual document content until later --
individual Doc objects are fetched on demand with each call to `$hits->next`.)

> I would expect that using the offset, performance should be higher because
> no processing needs to be done to the hits before the offset (no score
> calculation).

How do you know that the hit number 5000 actually ranks 5000th in sort order
unless you calculate scores for all documents and perform sorting?

There are certain times when Lucy can avoid calculating scores -- when
SortSpecs do not require scores, or when documents match pure negative clauses
(docs matching "bar" in the query `foo AND NOT bar`).  But when you are
ranking documents based on score, we have to calculate a score for **every**
document.

> Is there a way to improve the search performance when using the 'offset' ?

Well, you could say that we already do. :)  The shallower your search, the
smaller we can make the priority queue!

> Also I am missing the seek command (that exists in swish-e) to fast forward
> through my results or did I overlook something ?

I would assume that Swish-e and Lucy are implemented differently.  I don't
know what seek() does in the context of Swish-e.

> But in general I am very satisfied with Lucy so far. Thanks for the work guys !

:)

Marvin Humphrey
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Thomas den Braber
On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <[hidden email]> wrote:

> I don't know how Swish-e implements sorting of hits, but this is expected
> behavior in Lucy.

Swish-e can use presorting of attributes during indexing:
'By default Swish-e generates presorted tables while indexing for each property name. This
allows faster sorting when generating results. On large document collections this
presorting may add to the indexing time, and also adds to the total size of the index.
This directive can be used to customize exactly which properties will be presorted.'

Maybe this does the trick ?

>> I would expect that using the offset, performance should be higher because
>> no processing needs to be done to the hits before the offset (no score
>> calculation).

> How do you know that the hit number 5000 actually ranks 5000th in sort order
> unless you calculate scores for all documents and perform sorting?

> There are certain times when Lucy can avoid calculating scores -- when
> SortSpecs do not require scores, or when documents match pure negative clauses
> (docs matching "bar" in the query `foo AND NOT bar`).  But when you are
> ranking documents based on score, we have to calculate a score for **every**
> document.

Sorry I didn't mention this but I really meant sorting by attributes other the score, like
modification date or file size. Is calculating of the score also needed here?

> I would assume that Swish-e and Lucy are implemented differently.  I don't
> know what seek() does in the context of Swish-e.

Seek will fast forward through the search result without first specifying the total hits
you want to collect and not reading the results that exists before the seek pointer. In
swish you also do not have to say in advance how many hits you want.

I can overcome the absence of such a command in Lucy by tweaking my program and moving
some of my logic to an earlier stage.

I will continue my migration and will let you know if there are 'more bumps on the road'.

I can also make a more detailed performance comparison if you like.

Thomas


Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Peter Karman
Thomas den Braber wrote on 11/12/12 3:53 AM:

> On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <[hidden email]> wrote:
>
>> I don't know how Swish-e implements sorting of hits, but this is expected
>> behavior in Lucy.
>
> Swish-e can use presorting of attributes during indexing:
> 'By default Swish-e generates presorted tables while indexing for each property name. This
> allows faster sorting when generating results. On large document collections this
> presorting may add to the indexing time, and also adds to the total size of the index.
> This directive can be used to customize exactly which properties will be presorted.'
>
> Maybe this does the trick ?


Swish-e does presort attributes, but rank/score is not one of them. That is
always a per-search attribute.

ISTR an email exchange about this back when I was first using KinoSearch
(pre-Lucy), but I can't find it now.



>
>>> I would expect that using the offset, performance should be higher because
>>> no processing needs to be done to the hits before the offset (no score
>>> calculation).
>
>> How do you know that the hit number 5000 actually ranks 5000th in sort order
>> unless you calculate scores for all documents and perform sorting?


Swish-e calculates the score for all documents before sorting them. Just like Lucy.


>
>> There are certain times when Lucy can avoid calculating scores -- when
>> SortSpecs do not require scores, or when documents match pure negative clauses
>> (docs matching "bar" in the query `foo AND NOT bar`).  But when you are
>> ranking documents based on score, we have to calculate a score for **every**
>> document.
>
> Sorry I didn't mention this but I really meant sorting by attributes other the score, like
> modification date or file size. Is calculating of the score also needed here?


No. If you look at the source for SWISH::Prog::Lucy::Searcher->search() you will
see that I always add a SortRule for 'score' but that is only so that I can show
the result, not to sort by it.



>
>> I would assume that Swish-e and Lucy are implemented differently.  I don't
>> know what seek() does in the context of Swish-e.
>
> Seek will fast forward through the search result without first specifying the total hits
> you want to collect and not reading the results that exists before the seek pointer. In
> swish you also do not have to say in advance how many hits you want.


$hits->seek(10); # skip the first 9 hits

This is similar to the Lucy::Index::Lexicon->seek() method.

It would be useful to have it for Lucy::Search::Hits too, imo.


>
> I can overcome the absence of such a command in Lucy by tweaking my program and moving
> some of my logic to an earlier stage.
>
> I will continue my migration and will let you know if there are 'more bumps on the road'.
>
> I can also make a more detailed performance comparison if you like.


I, for one, would be interested in hearing your thoughts, Thomas. As you might
expect, I have some experience with both. :)


--
Peter Karman  .  http://peknet.com/  .  [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Thomas den Braber
Peter Karman wrote on 11/12/2012 5:46 PM

> Thomas den Braber wrote on 11/12/12 3:53 AM:
> > On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <[hidden email]> wrote:
> >
> >> I don't know how Swish-e implements sorting of hits, but this is expected
> >> behavior in Lucy.
> >
> > Swish-e can use presorting of attributes during indexing:
> > 'By default Swish-e generates presorted tables while indexing for each property name.
> This
> > allows faster sorting when generating results. On large document collections this
> > presorting may add to the indexing time, and also adds to the total size of the
> index.
> > This directive can be used to customize exactly which properties will be presorted.'
> >
> > Maybe this does the trick ?
>
>
> Swish-e does presort attributes, but rank/score is not one of them. That is
> always a per-search attribute.
>
> ISTR an email exchange about this back when I was first using KinoSearch
> (pre-Lucy), but I can't find it now.
>

I never understood what the presorting in Swish-e was actually doing. Maybe not that
interesting for Lucy users, but could you explain in a couple of lines?

>
> >
> >>> I would expect that using the offset, performance should be higher because
> >>> no processing needs to be done to the hits before the offset (no score
> >>> calculation).
> >
> >> How do you know that the hit number 5000 actually ranks 5000th in sort order
> >> unless you calculate scores for all documents and perform sorting?
>
>
> Swish-e calculates the score for all documents before sorting them. Just like Lucy.
>
>
> >
> >> There are certain times when Lucy can avoid calculating scores -- when
> >> SortSpecs do not require scores, or when documents match pure negative clauses
> >> (docs matching "bar" in the query `foo AND NOT bar`).  But when you are
> >> ranking documents based on score, we have to calculate a score for **every**
> >> document.
> >
> > Sorry I didn't mention this but I really meant sorting by attributes other the score,
> like
> > modification date or file size. Is calculating of the score also needed here?
>
>
> No. If you look at the source for SWISH::Prog::Lucy::Searcher->search() you will
> see that I always add a SortRule for 'score' but that is only so that I can show
> the result, not to sort by it.
>

Doesn't this cost performance ?

>
>
> >
> >> I would assume that Swish-e and Lucy are implemented differently.  I don't
> >> know what seek() does in the context of Swish-e.
> >
> > Seek will fast forward through the search result without first specifying the total
> hits
> > you want to collect and not reading the results that exists before the seek pointer.
> In
> > swish you also do not have to say in advance how many hits you want.
>
>
> $hits->seek(10); # skip the first 9 hits
>
> This is similar to the Lucy::Index::Lexicon->seek() method.
>
> It would be useful to have it for Lucy::Search::Hits too, imo.
>

That would be nice. It also would be useful to change the num_wanted and offset ofter the
'$searcher->hits()' call has been done, without performing the search again with a
different offset.


> >
> > I can overcome the absence of such a command in Lucy by tweaking my program and
> moving
> > some of my logic to an earlier stage.
> >
> > I will continue my migration and will let you know if there are 'more bumps on the
> road'.
> >
> > I can also make a more detailed performance comparison if you like.
>
>
> I, for one, would be interested in hearing your thoughts, Thomas. As you might
> expect, I have some experience with both. :)

I know, I'm glad that you joined the Lucy project.

//Thomas


Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Marvin Humphrey
On Tue, Nov 13, 2012 at 8:47 AM, Thomas den Braber <[hidden email]> wrote:
> Peter Karman wrote on 11/12/2012 5:46 PM
>> Thomas den Braber wrote on 11/12/12 3:53 AM:
>> > On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <[hidden email]> wrote:

>> >> I would assume that Swish-e and Lucy are implemented differently.  I don't
>> >> know what seek() does in the context of Swish-e.
>> >
>> > Seek will fast forward through the search result without first specifying
>> > the total hits you want to collect and not reading the results that
>> > exists before the seek pointer.  In swish you also do not have to say in
>> > advance how many hits you want.
>>
>> $hits->seek(10); # skip the first 9 hits
>>
>> This is similar to the Lucy::Index::Lexicon->seek() method.
>>
>> It would be useful to have it for Lucy::Search::Hits too, imo.
>
> That would be nice. It also would be useful to change the num_wanted and
> offset ofter the '$searcher->hits()' call has been done, without performing
> the search again with a different offset.

In Lucy, it would be impossible to change `num_wanted` and `offset` after the
fact arbitrarily without rerunning the search.  If the priority queue had a size
of 30 because `offset` was 20 and `num_wanted` was 10, we only have the top 30
hits.  You can't seek to 31, 50, 100 or whatever without rerunning the search
with a bigger priority queue.

I would oppose a seek() which runs implicit searches behind the scenes because
it would surprise users with a hidden performance cost.

If the lack of seek() is forcing you to re-architect your program while
transitioning to Lucy, then it is forcing you to re-architect it in ways which
make best use of Lucy.  Adding a deceptive interface on top of Lucy to mimic
Swish-e wouldn't do anybody any favors -- in fact, it would be actively
harmful, as it would encourage people to use Lucy inefficiently.  Then we'd

Marvin Humphrey
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Peter Karman
On 11/13/12 11:33 AM, Marvin Humphrey wrote:

>
> I would oppose a seek() which runs implicit searches behind the scenes because
> it would surprise users with a hidden performance cost.
>

I was imagining that $hits->seek() would just adjust the iterator
pointer, like a shortcut for this pseudo-code:

  sub seek {
      my ($hits, $new_pos) = @_;
      # some error check here to verify $new_pos
      # isn't beyond the offset+num_wanted
      while ($hits->current_pos < $new_pos) {
          $hits->next or last;
      }
  }


--
Peter Karman  .  http://peknet.com/  .  [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Marvin Humphrey
On Tue, Nov 13, 2012 at 9:47 AM, Peter Karman <[hidden email]> wrote:
> On 11/13/12 11:33 AM, Marvin Humphrey wrote:
>
>> I would oppose a seek() which runs implicit searches behind the scenes
>> because it would surprise users with a hidden performance cost.
>
> I was imagining that $hits->seek() would just adjust the iterator pointer,

Why not just use the `offset` parameter to Searcher#hits?  So long as
rerunning the search is ruled out, adding seek() offers little functionality
beyond what we already get from `offset`.

The motivation for adding seek() seems to be to make Lucy more Swish-like.
However, if we really wanted to offer Swish-like behavior, we'd have to run
the initial search with the size of priority queue set to the size of the
index every time in order to capture all possible hits.  Speed and memory
consumption would suck, but at least they would suck uniformly for all values
of `offset`! :)  And you could seek all over without having to rerun a search.

I assume nobody wants that -- not even the original poster.  So we're stuck
with the less satisfactory semantics of having seek() fail silently beyond
`offset + num_wanted`.

Well, if we're not going to offer full Swish-e semantics with seek(), and the
limited version offers hardly any new functionality, what do we accomplish by
adding it?  We'd make people like the original poster feel a little better at
first -- but we're just setting them up for disappointment when they discover
that Lucy's seek() and Swish-e's seek() aren't really the same after all.

There's another reason not to add seek(): it conflicts with adding
pre-fetching support to Hits.  What happens when you seek back and forth to
different points in the iterator?  Does it throw away the docs that were just
pre-fetched?  Does it keep them and risk exceeding the prefetch_count and
blowing up memory consumption?

Right this moment, it seems like adding seek() to Hits would be
straightforward -- but in my view, the conflict with pre-fetching serves to
illustrate why it pays to be prudent about expanding public APIs, both in this
specific case and in general.

Marvin Humphrey
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Peter Karman
Marvin Humphrey wrote on 11/13/12 6:38 PM:

> On Tue, Nov 13, 2012 at 9:47 AM, Peter Karman <[hidden email]> wrote:
>> On 11/13/12 11:33 AM, Marvin Humphrey wrote:
>>
>>> I would oppose a seek() which runs implicit searches behind the scenes
>>> because it would surprise users with a hidden performance cost.
>>
>> I was imagining that $hits->seek() would just adjust the iterator pointer,
>
> Why not just use the `offset` parameter to Searcher#hits?  So long as
> rerunning the search is ruled out, adding seek() offers little functionality
> beyond what we already get from `offset`.
>

point taken. I re-read the swish-e docs just now and seek_result() does just
what offset does.

/me retracts

--
Peter Karman  .  http://peknet.com/  .  [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Peter Karman
In reply to this post by Thomas den Braber
Thomas den Braber wrote on 11/13/12 10:47 AM:
> Peter Karman wrote on 11/12/2012 5:46 PM

>> Swish-e does presort attributes, but rank/score is not one of them. That is
>> always a per-search attribute.
>>
>> ISTR an email exchange about this back when I was first using KinoSearch
>> (pre-Lucy), but I can't find it now.
>>
>
> I never understood what the presorting in Swish-e was actually doing. Maybe not that
> interesting for Lucy users, but could you explain in a couple of lines?
>

Lucy pre-sorts field values at index time, using a different approach than
Swish-e. But the idea is the same: pre-sorted field values are faster to return
at search time. IIRC, the pre-sorting is done when you call optimize() on an
Indexer. (Marvin, please correct my poor memory if I'm wrong there.)


--
Peter Karman  .  http://peknet.com/  .  [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Hits offset and search performarce

Marvin Humphrey
On Tue, Nov 13, 2012 at 8:42 PM, Peter Karman <[hidden email]> wrote:

> Lucy pre-sorts field values at index time, using a different approach than
> Swish-e. But the idea is the same: pre-sorted field values are faster to return
> at search time. IIRC, the pre-sorting is done when you call optimize() on an
> Indexer. (Marvin, please correct my poor memory if I'm wrong there.)

Since you insist: to pick a nit it's prepare_commit() -- called implicitly by
commit() -- rather than optimize(). (The presorting happens with every segment
write.)

Marvin Humphrey