How can I search over all documents NOT in a certain subset?

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

How can I search over all documents NOT in a certain subset?

hiltonc
Hello all,

In my application I want to perform a search over all the documents that are
NOT in a certain subset, and I'm not sure how I should do it.

Specifically, the application performs a search and the top N results are
shown to the user.  The user may then opt to see the next top N results.  By
the time the user chooses to see the next N results, however, there may be
new, highly-relevant documents in the index (as indexing is occurring
concurrently).  So instead of just skipping to the next N, I need to perform
a new search and get the top N that haven't been seen yet.  Is anyone aware
of an efficient way to implement this?

I can think of at least one way: I can keep track of the documents that have
been seen and iterate through all the hits, skipping those that have already
been seen.  I just want to see if there isn't a better way that doesn't
iterate through potentially hundreds of already seen hits, or if anyone has
any pointers on an efficient implementation of this idea.

Thanks!
Hilton Campbell
Reply | Threaded
Open this post in threaded view
|

Re: How can I search over all documents NOT in a certain subset?

steve_rowe
Hi Hilton,

Hilton Campbell wrote:

> Hello all,
>
> In my application I want to perform a search over all the documents
> that are NOT in a certain subset, and I'm not sure how I should do
> it.
>
> Specifically, the application performs a search and the top N results
> are shown to the user. The user may then opt to see the next top N
> results. By the time the user chooses to see the next N results,
> however, there may be new, highly-relevant documents in the index (as
> indexing is occurring concurrently). So instead of just skipping to
> the next N, I need to perform a new search and get the top N that
> haven't been seen yet. Is anyone aware of an efficient way to
> implement this?
>
> I can think of at least one way: I can keep track of the documents
> that have been seen and iterate through all the hits, skipping those
> that have already been seen. I just want to see if there isn't a
> better way that doesn't iterate through potentially hundreds of
> already seen hits, or if anyone has any pointers on an efficient
> implementation of this idea.

Conceptually (caveat: untested), you could:

1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per
IndexReader.  The BitSet would hold one bit per doc[2], each initialized
to true.

2. Unset a DejaVuFilter instance's bit for each of your top N docs by
walking the TopDocs returned by Searcher.search(Query,Filter,int)[3].
Initially, you could pass in null for the Filter, and then for all
following calls, an instance of DejaVuFilter.

3. Repeat step #2 as many times as necessary.

Steve

[1]
<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Filter.html>
[2]
<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/index/IndexReader.html#maxDoc()>
[3]
<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Searcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int)>

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: How can I search over all documents NOT in a certain subset?

hiltonc
Steve,

Thanks for the great reply!  That worked like a charm.  I really appreciate
it.

Thanks,
Hilton Campbell

-----Original Message-----
From: Steven Rowe [mailto:[hidden email]]
Sent: Tuesday, June 05, 2007 2:50 PM
To: [hidden email]
Subject: Re: How can I search over all documents NOT in a certain subset?

Hi Hilton,

Hilton Campbell wrote:

> Hello all,
>
> In my application I want to perform a search over all the documents
> that are NOT in a certain subset, and I'm not sure how I should do
> it.
>
> Specifically, the application performs a search and the top N results
> are shown to the user. The user may then opt to see the next top N
> results. By the time the user chooses to see the next N results,
> however, there may be new, highly-relevant documents in the index (as
> indexing is occurring concurrently). So instead of just skipping to
> the next N, I need to perform a new search and get the top N that
> haven't been seen yet. Is anyone aware of an efficient way to
> implement this?
>
> I can think of at least one way: I can keep track of the documents
> that have been seen and iterate through all the hits, skipping those
> that have already been seen. I just want to see if there isn't a
> better way that doesn't iterate through potentially hundreds of
> already seen hits, or if anyone has any pointers on an efficient
> implementation of this idea.

Conceptually (caveat: untested), you could:

1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per
IndexReader.  The BitSet would hold one bit per doc[2], each initialized
to true.

2. Unset a DejaVuFilter instance's bit for each of your top N docs by
walking the TopDocs returned by Searcher.search(Query,Filter,int)[3].
Initially, you could pass in null for the Filter, and then for all
following calls, an instance of DejaVuFilter.

3. Repeat step #2 as many times as necessary.

Steve

[1]
<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Filter.htm
l>
[2]
<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/index/IndexReader
.html#maxDoc()>
[3]
<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Searcher.h
tml#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter
,%20int)>

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: How can I search over all documents NOT in a certain subset?

adb
In reply to this post by steve_rowe
Steven Rowe wrote:

> Conceptually (caveat: untested), you could:
>
> 1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per
> IndexReader.  The BitSet would hold one bit per doc[2], each initialized
> to true.
>
> 2. Unset a DejaVuFilter instance's bit for each of your top N docs by
> walking the TopDocs returned by Searcher.search(Query,Filter,int)[3].
> Initially, you could pass in null for the Filter, and then for all
> following calls, an instance of DejaVuFilter.

Just a thought...

If Hilton wants to be aware of new Documents in the index since the previous
search, this requires opening a new IndexReader.

If only Documents have been added to the index I expect, but am not sure, that
the bits from the old IndexReader are still valid for the document numbers in
the new Reader.  However, if there have been deletions or optimisation has
occurred between reader instances, then the document ids from the old reader may
not represent the same documents in the new reader, so the Filter for the old
reader will not be valid for the new search against the new reader and you may
get false matches.

I don't think there will be a problem if there are no deletions.

Antony






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: How can I search over all documents NOT in a certain subset?

hiltonc
Yes, that's actually come up.  The document ids are indeed changing which is
causing problems.  I'm still trying to work it out myself, but any help
would most definitely be appreciated.

Thanks,
Hilton Campbell

-----Original Message-----
From: Antony Bowesman [mailto:[hidden email]]
Sent: Wednesday, June 06, 2007 11:36 PM
To: [hidden email]
Subject: Re: How can I search over all documents NOT in a certain subset?

Steven Rowe wrote:

> Conceptually (caveat: untested), you could:
>
> 1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per
> IndexReader.  The BitSet would hold one bit per doc[2], each initialized
> to true.
>
> 2. Unset a DejaVuFilter instance's bit for each of your top N docs by
> walking the TopDocs returned by Searcher.search(Query,Filter,int)[3].
> Initially, you could pass in null for the Filter, and then for all
> following calls, an instance of DejaVuFilter.

Just a thought...

If Hilton wants to be aware of new Documents in the index since the previous

search, this requires opening a new IndexReader.

If only Documents have been added to the index I expect, but am not sure,
that
the bits from the old IndexReader are still valid for the document numbers
in
the new Reader.  However, if there have been deletions or optimisation has
occurred between reader instances, then the document ids from the old reader
may
not represent the same documents in the new reader, so the Filter for the
old
reader will not be valid for the new search against the new reader and you
may
get false matches.

I don't think there will be a problem if there are no deletions.

Antony






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: How can I search over all documents NOT in a certain subset?

adb
Hilton Campbell wrote:
> Yes, that's actually come up.  The document ids are indeed changing which is
> causing problems.  I'm still trying to work it out myself, but any help
> would most definitely be appreciated.

If you have an application Id per document, then you could cache that field for
each reader and when you open the new reader, create a new cache of the IDs for
that reader and then re-evaluate the bitmap according to the changed Ids.

You may be able to optimise the case for two readers by calculating the mapping
once and then use that for each bitmap that needs reevaluating.

Antony




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: How can I search over all documents NOT in a certain subset?

steve_rowe
In reply to this post by hiltonc
Hi Hilton,

Hilton Campbell wrote:

> Yes, that's actually come up.  The document ids are indeed changing which is
> causing problems.  I'm still trying to work it out myself, but any help
> would most definitely be appreciated.
>
> Thanks,
> Hilton Campbell
>
> -----Original Message-----
> From: Antony Bowesman [mailto:[hidden email]]
> Sent: Wednesday, June 06, 2007 11:36 PM
> To: [hidden email]
> Subject: Re: How can I search over all documents NOT in a certain subset?
>
> Steven Rowe wrote:
>> Conceptually (caveat: untested), you could:
>>
>> 1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per
>> IndexReader.  The BitSet would hold one bit per doc[2], each initialized
>> to true.
>>
>> 2. Unset a DejaVuFilter instance's bit for each of your top N docs by
>> walking the TopDocs returned by Searcher.search(Query,Filter,int)[3].
>> Initially, you could pass in null for the Filter, and then for all
>> following calls, an instance of DejaVuFilter.
>
> Just a thought...
>
> If Hilton wants to be aware of new Documents in the index since the previous
> search, this requires opening a new IndexReader.
>
> If only Documents have been added to the index I expect, but am not
> sure, that the bits from the old IndexReader are still valid for the
> document numbers in the new Reader. However, if there have been
> deletions or optimisation has occurred between reader instances, then
> the document ids from the old reader may not represent the same
> documents in the new reader, so the Filter for the old reader will
> not be valid for the new search against the new reader and you may
> get false matches.
>
> I don't think there will be a problem if there are no deletions.

My bad for not pointing out this shortcoming.

Karl Wettin's patch may be useful to you:

  <https://issues.apache.org/jira/browse/LUCENE-879>

Steve

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]