Performance guarantees and index format

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Performance guarantees and index format

Kyle Maxwell-5
I'd like to be able to guarantee that a search will finish in
(approximately?) N seconds.  This seems like a generally applicable
goal for the project.  It would be nice to not have to worry about
malicious or naive users DOSing a search instance.  In some cases,
precision can be sacrificed, to see results returned more speedily.

This can presently be accomplished by modifying Scorers, such that the
skipTo method is aware of how long the query has been underway.  One
can simply return the best results found after a predetermined amount
of time.

It is unfortunate, however, that iteration is performed from oldest to
newest.  In most cases, the newest content is the most relevant.  In
the scenario described above, the newest (and therefore best) content
would not be found.

As I understand it, documents are stored in singly linked lists.
Therefore, reverse iteration is impossible.  What do people think
about modifying Lucene, such that iteration is performed in reverse?

--
Kyle Maxwell
Software Engineer
CastTV, Inc
http://www.casttv.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance guarantees and index format

Mark Miller-3
https://issues.apache.org/jira/browse/LUCENE-997

- Mark



Kyle Maxwell wrote:

> I'd like to be able to guarantee that a search will finish in
> (approximately?) N seconds.  This seems like a generally applicable
> goal for the project.  It would be nice to not have to worry about
> malicious or naive users DOSing a search instance.  In some cases,
> precision can be sacrificed, to see results returned more speedily.
>
> This can presently be accomplished by modifying Scorers, such that the
> skipTo method is aware of how long the query has been underway.  One
> can simply return the best results found after a predetermined amount
> of time.
>
> It is unfortunate, however, that iteration is performed from oldest to
> newest.  In most cases, the newest content is the most relevant.  In
> the scenario described above, the newest (and therefore best) content
> would not be found.
>
> As I understand it, documents are stored in singly linked lists.
> Therefore, reverse iteration is impossible.  What do people think
> about modifying Lucene, such that iteration is performed in reverse?
>
>  

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance guarantees and index format

Andrzej Białecki-2
Mark Miller wrote:
> https://issues.apache.org/jira/browse/LUCENE-997

What this issue doesn't discuss is what to do with partial results
obtained when a timeout occurred. As the original poster points out,
document lists are traversed in the order they were added and not the
order of their importance, which introduces a bias to partial results in
that they reflect results from a random sample (which is likely not the
most relevant, i.e. there could have been more relevant results later in
the traversal order).

The answer to this issue is org.apache.nutch.indexer.IndexSorter, which
rearranges posting lists in an already existing index, according to an
arbitrary measure of document importance, so that documents with smaller
id are more "important". Then the partial hit lists obtained as a result
of early termination will have a better chance of being more relevant.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance guarantees and index format

hossman

: What this issue doesn't discuss is what to do with partial results obtained
: when a timeout occurred. As the original poster points out, document lists are
: traversed in the order they were added and not the order of their importance,
: which introduces a bias to partial results in that they reflect results from a
: random sample (which is likely not the most relevant, i.e. there could have
: been more relevant results later in the traversal order).
:
: The answer to this issue is org.apache.nutch.indexer.IndexSorter, which

skimming this it doesn't seem like a refactored version that was less
nutch specific cold make a handy contrib ... but it also seems like there
may be a simpler approach for the (i assume) common case of prefering docs
that were indexed later....

if we eliminate the requirement for *strict* preference of recent
documents and make that a more loose desire, then we coulnd't we do a
pretty good job if we just changed Segment merging to reorder reverse the
order of the segments before each merge?  it wouldn't be very useful to
start doing this on an index that's already a decent size, but if this was
happening on every merge right from the very begining, then the most
recent documents would percollate to the front of the index right?

The only downside i can think of would be that docids would frequently
(not not very predictably) change even if there were no deletions .. but
you'd pay that same penalty with something like the nutch's IndexSorter.

I'm not much of an expert on segment merging.. but with the exception of
docid order i can'tthink of many reasons why there couldn't be a merger
that revesed the order of hte segments.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance guarantees and index format

Andrzej Białecki-2
Chris Hostetter wrote:

> : What this issue doesn't discuss is what to do with partial results obtained
> : when a timeout occurred. As the original poster points out, document lists are
> : traversed in the order they were added and not the order of their importance,
> : which introduces a bias to partial results in that they reflect results from a
> : random sample (which is likely not the most relevant, i.e. there could have
> : been more relevant results later in the traversal order).
> :
> : The answer to this issue is org.apache.nutch.indexer.IndexSorter, which
>
> skimming this it doesn't seem like a refactored version that was less
> nutch specific cold make a handy contrib ... but it also seems like there
> may be a simpler approach for the (i assume) common case of prefering docs
> that were indexed later....
>
> if we eliminate the requirement for *strict* preference of recent
> documents and make that a more loose desire, then we coulnd't we do a
> pretty good job if we just changed Segment merging to reorder reverse the
> order of the segments before each merge?  it wouldn't be very useful to
> start doing this on an index that's already a decent size, but if this was
> happening on every merge right from the very begining, then the most
> recent documents would percollate to the front of the index right?
>
> The only downside i can think of would be that docids would frequently
> (not not very predictably) change even if there were no deletions .. but
> you'd pay that same penalty with something like the nutch's IndexSorter.
>
> I'm not much of an expert on segment merging.. but with the exception of
> docid order i can'tthink of many reasons why there couldn't be a merger
> that revesed the order of hte segments.

I think this would be too messy - currently we can be sure of the simple
rule that documents added to the index get incrementally higher docids,
i.e. the higher the docid the more recent is the document. I think it
would be much simpler to write a FilterIndexReader that simply reverses
the order of docids.

The issue with Nutch's IndexSorter is that it allows you to reorder
docids in an arbitrary manner, using a user-supplied mapping between old
and new docids, which can be based on values retrieved from the current
index or from any other source. So I think this would be the main value
of this class applicable to various scenarios.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance guarantees and index format

hossman

: I think this would be too messy - currently we can be sure of the simple rule
: that documents added to the index get incrementally higher docids, i.e. the
: higher the docid the more recent is the document. I think it would be much
: simpler to write a FilterIndexReader that simply reverses the order of docids.

First off: you only have that garuntee while indexing ... if you
frequently reorder docs using something like the IndexSorter then that
rule no longer applies (and you must not care or you wouldn't have
reordered everything)

Second: using IndexSorter after an index is completley built is definitely
a simpler, clearner, way of accomplishing something like this -- but it
only seems adequate for situations in which "index building" is seperate
and distinct from "index searching" ... I can't see how it would work very
easily in situations where you are continuously performing incremental
updates while searches are taking place.

: The issue with Nutch's IndexSorter is that it allows you to reorder docids in
: an arbitrary manner, using a user-supplied mapping between old and new docids,
: which can be based on values retrieved from the current index or from any
: other source. So I think this would be the main value of this class applicable
: to various scenarios.

No Argument what-so-ever.  IndexSorter seems like a sweet tool to have in
the Lucene toolbox for letting people reordering the docs in an index by
arbitrary criteria ... but for people with the specific case of
*prefering* that recently added docs be in front of older docs, automatic
segment reordering seems like it would also be a handy tool to have in the
toolbox so that documents could "bubble up" gradually.  (maybe as a new
MergePolicy? ... probably need some API changes to allow order to be
specified)

There would definitley be trade offs people would need to consdier before
using it -- but those tradeoffs would probably also apply if they wanted
to use IndexSorter.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance guarantees and index format

Doron Cohen-2
I was once involved in modified a search index
implementation (not Lucene) to write posting lists so that
they can be traversed (only) in reverse order. Docids
were preserved but you got higher IDs first. This was
a non-trivial code change.

Now the suggestion to (optionally) order merged
segments from new to old should be much simpler
to implement (I think) and would be an interesting add-on.

If in addition DocumentsWriter is modified to optionally
reverse the order of written docs, you get the docs
completely reversed.

Being optional, applications caring about docids
stability would not use this option.

On Fri, Feb 8, 2008 at 12:22 AM, Chris Hostetter <[hidden email]>
wrote:

>
> : I think this would be too messy - currently we can be sure of the simple
> rule
> : that documents added to the index get incrementally higher docids, i.e.
> the
> : higher the docid the more recent is the document. I think it would be
> much
> : simpler to write a FilterIndexReader that simply reverses the order of
> docids.
>
> First off: you only have that garuntee while indexing ... if you
> frequently reorder docs using something like the IndexSorter then that
> rule no longer applies (and you must not care or you wouldn't have
> reordered everything)
>
> Second: using IndexSorter after an index is completley built is definitely
> a simpler, clearner, way of accomplishing something like this -- but it
> only seems adequate for situations in which "index building" is seperate
> and distinct from "index searching" ... I can't see how it would work very
> easily in situations where you are continuously performing incremental
> updates while searches are taking place.
>
> : The issue with Nutch's IndexSorter is that it allows you to reorder
> docids in
> : an arbitrary manner, using a user-supplied mapping between old and new
> docids,
> : which can be based on values retrieved from the current index or from
> any
> : other source. So I think this would be the main value of this class
> applicable
> : to various scenarios.
>
> No Argument what-so-ever.  IndexSorter seems like a sweet tool to have in
> the Lucene toolbox for letting people reordering the docs in an index by
> arbitrary criteria ... but for people with the specific case of
> *prefering* that recently added docs be in front of older docs, automatic
> segment reordering seems like it would also be a handy tool to have in the
> toolbox so that documents could "bubble up" gradually.  (maybe as a new
> MergePolicy? ... probably need some API changes to allow order to be
> specified)
>
> There would definitley be trade offs people would need to consdier before
> using it -- but those tradeoffs would probably also apply if they wanted
> to use IndexSorter.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>