Filters and multiple, per-segment calls to getDocIdSet

Daniel Noll-3-2
Hi all.

I notice that Filter.getDocIdSet() is now documented as follows:

    Note: This method will be called once per segment in
    the index during searching.  The returned {@link DocIdSet}
    must refer to document IDs for that segment, not for
    the top-level reader.

If I look at Lucene's own DuplicateFilter, isn't it making the
assumption that it will only be called once?

And a related question: for those of us who want to implement
something *like* DuplicateFilter (as I have done before discovering
this new Javadoc), is there a good way to go about it?  It seems like
we now need to keep a hash of all terms previously seen so that when
we go over the new term enum we can check which ones have already been
seen.  This will dramatically increase memory usage compared to a
single BitSet/OpenBitSet.  Is there a better way?

Also, I presume this means that Filter is now explicitly not
threadsafe.  We weren't keeping any state in them anyway, but now we
will have to, so there is potential for a lot of new bugs if a filter
is somehow used by two queries running at the same time.

Daniel


--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Filters and multiple, per-segment calls to getDocIdSet

Michael McCandless-2
On Thu, Mar 25, 2010 at 12:55 AM, Daniel Noll <[hidden email]> wrote:
> Hi all.
>
> I notice that Filter.getDocIdSet() is now documented as follows:
>
>    Note: This method will be called once per segment in
>    the index during searching.  The returned {@link DocIdSet}
>    must refer to document IDs for that segment, not for
>    the top-level reader.

Right, this is from the cutover to per-segment searching, as of 2.9.

> If I look at Lucene's own DuplicateFilter, isn't it making the
> assumption that it will only be called once?

Hmm... yes, it seems so.  I.e., as it now stands, it only eliminates
duplicates within each segment, not across segments.  Can you open an
issue?  Thanks.

> And a related question: for those of us who want to implement
> something *like* DuplicateFilter (as I have done before discovering
> this new Javadoc), is there a good way to go about it?  It seems like
> we now need to keep a hash of all terms previously seen so that when
> we go over the new term enum we can check which ones have already been
> seen.  This will dramatically increase memory usage compared to a
> single BitSet/OpenBitSet.  Is there a better way?

This depends on the particulars of the filter... but in general you
shouldn't have to consume more RAM, I think?  I.e., you should be able
to do your computation against the top-level reader, and then store
the results of your computation per-sub-reader.

E.g., DuplicateFilter should probably iterate all terms/docs across
all segments up front (or lazily, the first time it's used), building
up a map of sub-reader -> bitset.  Then, when getDocIdSet is called
for a given reader, it just returns what it had already computed for
that reader.
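
A minimal sketch of that idea, using plain Java rather than Lucene types.
The class and method names are hypothetical, and the postings map stands in
for what a term enum over the top-level reader would yield. The point is
that one pass over the top-level view can fill one bitset per segment, so
no hash of previously seen terms is needed.

```java
import java.util.BitSet;
import java.util.Map;

/** Hypothetical stand-in for a de-duplicating filter; no Lucene types are used. */
final class PerSegmentDedup {

    /**
     * @param postings term -> ascending top-level doc IDs containing that term
     *                 (assumed to model a term enum over the top-level reader)
     * @param maxDocs  number of documents in each segment, in docBase order
     * @return one BitSet per segment, with bits indexed by segment-local doc ID
     */
    static BitSet[] firstOccurrences(Map<String, int[]> postings, int[] maxDocs) {
        // docBase of segment i is the sum of the sizes of the segments before it.
        int[] docBases = new int[maxDocs.length];
        BitSet[] perSegment = new BitSet[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            docBases[i] = base;
            perSegment[i] = new BitSet(maxDocs[i]);
            base += maxDocs[i];
        }
        for (int[] docs : postings.values()) {
            if (docs.length == 0) {
                continue;
            }
            int globalDoc = docs[0]; // keep only the first document per term
            int seg = segmentOf(globalDoc, docBases);
            perSegment[seg].set(globalDoc - docBases[seg]);
        }
        return perSegment;
    }

    /** Index of the segment whose doc ID range contains globalDoc. */
    private static int segmentOf(int globalDoc, int[] docBases) {
        int seg = docBases.length - 1;
        while (docBases[seg] > globalDoc) {
            seg--;
        }
        return seg;
    }
}
```

A real filter would then hand back the entry for whichever sub-reader it is
asked about, wrapped in whatever DocIdSet implementation it already uses.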

> Also, I presume this means that Filter is now explicitly not
> threadsafe.  We weren't keeping any state in them anyway, but now we
> will have to, so there is potential for a lot of new bugs if a filter
> is somehow used by two queries running at the same time.

This is dependent on the specific filter.  Many filters don't need the
top-level reader in order to generate the bitset for a sub-reader, so
they can remain "stateless".

For those that do need the top-level reader, like DuplicateFilter, I
agree you'll need some sync'ing so that only one thread does that lazy
init, and its results are safely visible to other threads.
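
One simple way to get both properties is a synchronized accessor, sketched
below under the assumption that the whole-index computation can be wrapped
in a Supplier. The class name and its method are hypothetical and nothing
here is Lucene API.

```java
import java.util.BitSet;
import java.util.function.Supplier;

/** Hypothetical holder that computes per-segment bitsets once and shares them safely. */
final class LazyPerSegmentBits {
    private final Supplier<BitSet[]> computeAll; // the expensive pass over all segments
    private BitSet[] cached;                     // guarded by "this"

    LazyPerSegmentBits(Supplier<BitSet[]> computeAll) {
        this.computeAll = computeAll;
    }

    /** Returns the bitset for one segment, running the full computation at most once. */
    synchronized BitSet forSegment(int segmentIndex) {
        if (cached == null) {
            cached = computeAll.get();
        }
        return cached[segmentIndex];
    }
}
```

The lock provides both the mutual exclusion and the visibility guarantee.
The returned BitSet is shared between callers, so they must treat it as
read-only.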

Mike

Re: Filters and multiple, per-segment calls to getDocIdSet

Daniel Noll-3-2
On Thu, Mar 25, 2010 at 21:41, Michael McCandless
<[hidden email]> wrote:
>
> This depends on the particulars of filter... but in general you
> shouldn't have to consume more RAM, I think?  Ie you should be able to
> do your computation against the top-level reader, and then store the
> results of your computation per-sub-reader.

I am having issues figuring out how to get a reference to the
top-level reader.  The API passes them in one by one and I can't see a
way to find the top-level reader for one which was passed in.  I can't
easily cheat and pass the top-level one into the Filter constructor,
because filters are serialisable and that kind of thing won't survive
serialisation.

To throw an additional spanner in the works, the behaviour I need is
that only the *last* document should be returned.  So even if a
certain document matches the filter after N readers have been passed
in, it might not match the filter after N+1 readers have been passed
in.  Essentially I need a method like...

    DocIdSet[] getDocIdSets(IndexReader[] readers);

And where the readers are guaranteed to be in order of docBase.
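
To make the "matches after N readers but not after N+1" problem concrete,
here is a small model in plain Java. All names are hypothetical, and each
segment is represented only by an array of de-duplication keys, one per
document. It deliberately uses a map of keys, which is exactly the extra
memory cost raised earlier in the thread.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a "keep only the last duplicate" filter; no Lucene types are used. */
final class KeepLastDuplicate {

    /**
     * @param segments one array per segment, in docBase order; element d is the
     *                 de-duplication key of segment-local document d
     * @return one BitSet per segment marking the last document seen for each key
     */
    static BitSet[] getDocIdSets(String[][] segments) {
        BitSet[] result = new BitSet[segments.length];
        // key -> {segment index, local doc ID} of the most recent document with that key
        Map<String, int[]> lastSeen = new HashMap<>();
        for (int seg = 0; seg < segments.length; seg++) {
            result[seg] = new BitSet(segments[seg].length);
            for (int doc = 0; doc < segments[seg].length; doc++) {
                int[] previous = lastSeen.put(segments[seg][doc], new int[] {seg, doc});
                if (previous != null) {
                    // A later duplicate un-matches a document accepted earlier,
                    // possibly in a segment that was already processed.
                    result[previous[0]].clear(previous[1]);
                }
                result[seg].set(doc);
            }
        }
        return result;
    }
}
```

Because a bit in an earlier segment can be cleared while a later segment is
processed, no per-segment result is final until every segment has been seen.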

By the way, I notice that the order the readers are passed to the
method is essentially undocumented.  The test code appears to be
assuming they will be passed in the natural order of the documents
(which is logical) but couldn't a future change parallelise segment
searches for performance reasons, thus reordering the calls?  It would
be nice if the API would explicitly pass the docBase for the
IndexReader - this would reduce the need to perform maths to determine
the docBase ourselves, and also make it possible to parallelise those
calls later.

Daniel

Re: Filters and multiple, per-segment calls to getDocIdSet

Michael McCandless-2
On Fri, Mar 26, 2010 at 1:17 AM, Daniel Noll <[hidden email]> wrote:

> On Thu, Mar 25, 2010 at 21:41, Michael McCandless
> <[hidden email]> wrote:
>>
>> This depends on the particulars of filter... but in general you
>> shouldn't have to consume more RAM, I think?  Ie you should be able to
>> do your computation against the top-level reader, and then store the
>> results of your computation per-sub-reader.
>
> I am having issues figuring out how to get a reference to the
> top-level reader.  The API passes them in one by one and I can't see a
> way to find the top-level reader for one which was passed in.  I can't
> easily cheat and pass the top-level one into the Filter constructor,
> because filters are serialisable and that kind of thing won't survive
> serialisation.

Probably, for now at least, you'll have to park a reference to the
current top-level IndexReader in some known, static place that your
filter can check?

> To throw an additional spanner in the works, the behaviour I need is
> that only the *last* document should be returned.  So even if a
> certain document matches the filter after N readers have been passed
> in, it might not match the filter after N+1 readers have been passed
> in.  Essentially I need a method like...
>
>    DocIdSet[] getDocIdSets(IndexReader[] readers);
>
> And where the readers are guaranteed to be in order of docBase.

This is a doozie of a filter :)

> By the way, I notice that the order the readers are passed to the
> method is essentially undocumented. The test code appears to be
> assuming they will be passed in the natural order of the documents
> (which is logical) but couldn't a future change parallelise segment
> searches for performance reasons, thus reordering the calls?

Yeah, the order is intentionally not defined, to allow for possible
future optimizations like this.  It is in-docBase-order today, but
won't necessarily always be that way... in fact, during 2.9 development
there was a brief time when it was in a different order.

> It would
> be nice if the API would explicitly pass the docBase for the
> IndexReader - this would reduce the need to perform maths to determine
> the docBase ourselves, and also make it possible to parallelise those
> calls later.

Maybe we should do that... or maybe make a new class, containing the
toplevel reader, sub reader, and docBase?  Something like that?  I
think there may be an issue already open for this...
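
Roughly what such a class could look like, as a sketch only. SegmentContext
and its members are hypothetical names, the readers are typed as Object so
the example compiles without Lucene, and docBase is assumed to be the
running sum of the sizes of the preceding segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical context object: which segment, where it sits, and what it belongs to. */
final class SegmentContext {
    final Object topLevelReader; // stand-in for the top-level IndexReader
    final Object subReader;      // stand-in for the per-segment IndexReader
    final int docBase;           // top-level doc ID of this segment's document 0

    SegmentContext(Object topLevelReader, Object subReader, int docBase) {
        this.topLevelReader = topLevelReader;
        this.subReader = subReader;
        this.docBase = docBase;
    }

    /** Builds one context per segment; docBase is the running sum of earlier segment sizes. */
    static List<SegmentContext> forSegments(Object topLevelReader, Object[] subReaders, int[] maxDocs) {
        List<SegmentContext> contexts = new ArrayList<>();
        int docBase = 0;
        for (int i = 0; i < subReaders.length; i++) {
            contexts.add(new SegmentContext(topLevelReader, subReaders[i], docBase));
            docBase += maxDocs[i];
        }
        return contexts;
    }

    /** Converts a segment-local doc ID to a top-level doc ID. */
    int toTopLevel(int localDoc) {
        return docBase + localDoc;
    }
}
```

With the docBase carried in the context, a filter would not depend on the
order in which segments are visited.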

I agree this API is problematic for context-sensitive filters (filters
that need to see all segments in the index in order to decide how to
compute the current segment's DISI).  Most of Lucene's filters are
context-free and so this API was created with that use-case in mind.
Yours is not the only context-sensitive filter out there...

Mike
