how do I get my own TopDocHitCollector?

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

how do I get my own TopDocHitCollector?

Beard, Brian

Question:

The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.

I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some of the
returned documents that have the same record_id. We're using the
ParallelMultiSearcher.

I read that I could use a HitCollector and throw an exception to get it
to stop, but is there a cleaner way?




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: how do I get my own TopDocHitCollector?

adb
Beard, Brian wrote:

> Question:
>
> The documents that I index have two id's - a unique document id and a
> record_id that can link multiple documents together that belong to a
> common record.
>
> I'd like to use something like TopDocs to return the first 1024 results
> that have unique record_id's, but I will want to skip some of the
> returned documents that have the same record_id. We're using the
> ParallelMultiSearcher.
>
> I read that I could use a HitCollector and throw an exception to get it
> to stop, but is there a cleaner way?

I'm doing a similar thing.  I have external Ids (equivalent to yout record_id),
which have one or more Lucene Documents associated with them.  I wrote a custom
HitCollector that uses a Map to hold the so far collected external ids along
with the collected document.

I had to write my own priority queue to know when an object was dropped of the
bottom of the score sorted queue, but the latest PriorityQueue on the trunk now
has insertWithOverflow(), which does the same thing.

Note that ResultDoc extends ScoreDoc, so that the external Id of the item
dropped off the queue can be used to remove it from my Map.

Code snippet is somewhat as below (I am caching my external Ids, hence the cache
usage)

    protected Map<OfficeId, ScoreDoc> results;

    public void collect(int doc, float score)
     {
         if (score > 0.0f)
         {
             totalHits++;
             if (pq.size() < numHits || score > minScore)
             {
                 OfficeId id = cache.get(doc);
                 ResultDoc rd = results.get(id);
                 //  No current result for this ID yet found
                 if (rd == null)
                 {
                     rd = new ResultDoc(id, doc, score);
                     ResultDoc added = pq.insert(rd);
                     if (added == null)
                     {
                         //  Nothing dropped of the bottom
                         results.put(id, rd);
                     }
                     else
                     {
                         //  Return value dropped of the bottom
                         results.remove(added.id);
                         results.put(id, rd);
                         remaining++;
                     }
                 }
                 //  Already found this ID, so replace high score if necessary
                 else
                 {
                     if (score > rd.score)
                     {
                         pq.remove(rd);
                         rd.score = score;
                         pq.insert(rd);
                     }
                 }
                 //  Readjust our minimum score again from the top entry
                 minScore = pq.peek().score;
             }
             else
                 remaining++;
         }
     }

HTH
Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: how do I get my own TopDocHitCollector?

Beard, Brian
In reply to this post by Beard, Brian
Thanks for the post. So you're using the doc id as the key into the
cache to retrieve the external id. Then what mechanism fetches the
external id's from the searcher and places them in the cache?


-----Original Message-----
From: Antony Bowesman [mailto:[hidden email]]
Sent: Wednesday, January 09, 2008 7:19 PM
To: [hidden email]
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:
> Question:
>
> The documents that I index have two id's - a unique document id and a
> record_id that can link multiple documents together that belong to a
> common record.
>
> I'd like to use something like TopDocs to return the first 1024
results
> that have unique record_id's, but I will want to skip some of the
> returned documents that have the same record_id. We're using the
> ParallelMultiSearcher.
>
> I read that I could use a HitCollector and throw an exception to get
it
> to stop, but is there a cleaner way?

I'm doing a similar thing.  I have external Ids (equivalent to yout
record_id),
which have one or more Lucene Documents associated with them.  I wrote a
custom
HitCollector that uses a Map to hold the so far collected external ids
along
with the collected document.

I had to write my own priority queue to know when an object was dropped
of the
bottom of the score sorted queue, but the latest PriorityQueue on the
trunk now
has insertWithOverflow(), which does the same thing.

Note that ResultDoc extends ScoreDoc, so that the external Id of the
item
dropped off the queue can be used to remove it from my Map.

Code snippet is somewhat as below (I am caching my external Ids, hence
the cache
usage)

    protected Map<OfficeId, ScoreDoc> results;

    public void collect(int doc, float score)
     {
         if (score > 0.0f)
         {
             totalHits++;
             if (pq.size() < numHits || score > minScore)
             {
                 OfficeId id = cache.get(doc);
                 ResultDoc rd = results.get(id);
                 //  No current result for this ID yet found
                 if (rd == null)
                 {
                     rd = new ResultDoc(id, doc, score);
                     ResultDoc added = pq.insert(rd);
                     if (added == null)
                     {
                         //  Nothing dropped of the bottom
                         results.put(id, rd);
                     }
                     else
                     {
                         //  Return value dropped of the bottom
                         results.remove(added.id);
                         results.put(id, rd);
                         remaining++;
                     }
                 }
                 //  Already found this ID, so replace high score if
necessary
                 else
                 {
                     if (score > rd.score)
                     {
                         pq.remove(rd);
                         rd.score = score;
                         pq.insert(rd);
                     }
                 }
                 //  Readjust our minimum score again from the top entry
                 minScore = pq.peek().score;
             }
             else
                 remaining++;
         }
     }

HTH
Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: how do I get my own TopDocHitCollector?

Beard, Brian
In reply to this post by Beard, Brian
Ok, I've been thinking about this some more. Is the cache mechanism
pulling from the cache if the external id already exists there and then
hitting the searcher if it's not already in the cache (maybe using a
FieldSelector for just retrieving the external id)?

-----Original Message-----
From: Beard, Brian [mailto:[hidden email]]
Sent: Thursday, January 10, 2008 10:08 AM
To: [hidden email]
Subject: RE: how do I get my own TopDocHitCollector?

Thanks for the post. So you're using the doc id as the key into the
cache to retrieve the external id. Then what mechanism fetches the
external id's from the searcher and places them in the cache?


-----Original Message-----
From: Antony Bowesman [mailto:[hidden email]]
Sent: Wednesday, January 09, 2008 7:19 PM
To: [hidden email]
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:
> Question:
>
> The documents that I index have two id's - a unique document id and a
> record_id that can link multiple documents together that belong to a
> common record.
>
> I'd like to use something like TopDocs to return the first 1024
results
> that have unique record_id's, but I will want to skip some of the
> returned documents that have the same record_id. We're using the
> ParallelMultiSearcher.
>
> I read that I could use a HitCollector and throw an exception to get
it
> to stop, but is there a cleaner way?

I'm doing a similar thing.  I have external Ids (equivalent to yout
record_id),
which have one or more Lucene Documents associated with them.  I wrote a
custom
HitCollector that uses a Map to hold the so far collected external ids
along
with the collected document.

I had to write my own priority queue to know when an object was dropped
of the
bottom of the score sorted queue, but the latest PriorityQueue on the
trunk now
has insertWithOverflow(), which does the same thing.

Note that ResultDoc extends ScoreDoc, so that the external Id of the
item
dropped off the queue can be used to remove it from my Map.

Code snippet is somewhat as below (I am caching my external Ids, hence
the cache
usage)

    protected Map<OfficeId, ScoreDoc> results;

    public void collect(int doc, float score)
     {
         if (score > 0.0f)
         {
             totalHits++;
             if (pq.size() < numHits || score > minScore)
             {
                 OfficeId id = cache.get(doc);
                 ResultDoc rd = results.get(id);
                 //  No current result for this ID yet found
                 if (rd == null)
                 {
                     rd = new ResultDoc(id, doc, score);
                     ResultDoc added = pq.insert(rd);
                     if (added == null)
                     {
                         //  Nothing dropped of the bottom
                         results.put(id, rd);
                     }
                     else
                     {
                         //  Return value dropped of the bottom
                         results.remove(added.id);
                         results.put(id, rd);
                         remaining++;
                     }
                 }
                 //  Already found this ID, so replace high score if
necessary
                 else
                 {
                     if (score > rd.score)
                     {
                         pq.remove(rd);
                         rd.score = score;
                         pq.insert(rd);
                     }
                 }
                 //  Readjust our minimum score again from the top entry
                 minScore = pq.peek().score;
             }
             else
                 remaining++;
         }
     }

HTH
Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: how do I get my own TopDocHitCollector?

adb
Beard, Brian wrote:
> Ok, I've been thinking about this some more. Is the cache mechanism
> pulling from the cache if the external id already exists there and then
> hitting the searcher if it's not already in the cache (maybe using a
> FieldSelector for just retrieving the external id)?

I am warming searchers in background and each search has one or more query
related caches.  The external Id cache is normally preloaded by simply iterating
terms, e.g.

         String field = fieldName.intern();
         final String[] retArray = new String[reader.maxDoc()];
         TermDocs termDocs = reader.termDocs();
         TermEnum termEnum = reader.terms (new Term (field, ""));
         try
         {
             do
             {
                 Term term = termEnum.term();
                 if (term == null || term.field() != field)
                     break;
                 String termval = term.text();
                 termDocs.seek(termEnum);
                 while (termDocs.next())
                 {
                     retArray[termDocs.doc()] = termval;
                 }
             }
             while (termEnum.next());
         }
         finally
         {
             termDocs.close();
             termEnum.close();
         }
         return retArray;

I do allow for a partial cache, in which case, as you suggest, the searcher uses
a FieldSelector to get the external Id from the document which then is stored to
cache.

Antony



>
> -----Original Message-----
> From: Beard, Brian [mailto:[hidden email]]
> Sent: Thursday, January 10, 2008 10:08 AM
> To: [hidden email]
> Subject: RE: how do I get my own TopDocHitCollector?
>
> Thanks for the post. So you're using the doc id as the key into the
> cache to retrieve the external id. Then what mechanism fetches the
> external id's from the searcher and places them in the cache?
>
>
> -----Original Message-----
> From: Antony Bowesman [mailto:[hidden email]]
> Sent: Wednesday, January 09, 2008 7:19 PM
> To: [hidden email]
> Subject: Re: how do I get my own TopDocHitCollector?
>
> Beard, Brian wrote:
>> Question:
>>
>> The documents that I index have two id's - a unique document id and a
>> record_id that can link multiple documents together that belong to a
>> common record.
>>
>> I'd like to use something like TopDocs to return the first 1024
> results
>> that have unique record_id's, but I will want to skip some of the
>> returned documents that have the same record_id. We're using the
>> ParallelMultiSearcher.
>>
>> I read that I could use a HitCollector and throw an exception to get
> it
>> to stop, but is there a cleaner way?
>
> I'm doing a similar thing.  I have external Ids (equivalent to yout
> record_id),
> which have one or more Lucene Documents associated with them.  I wrote a
> custom
> HitCollector that uses a Map to hold the so far collected external ids
> along
> with the collected document.
>
> I had to write my own priority queue to know when an object was dropped
> of the
> bottom of the score sorted queue, but the latest PriorityQueue on the
> trunk now
> has insertWithOverflow(), which does the same thing.
>
> Note that ResultDoc extends ScoreDoc, so that the external Id of the
> item
> dropped off the queue can be used to remove it from my Map.
>
> Code snippet is somewhat as below (I am caching my external Ids, hence
> the cache
> usage)
>
>     protected Map<OfficeId, ScoreDoc> results;
>
>     public void collect(int doc, float score)
>      {
>          if (score > 0.0f)
>          {
>              totalHits++;
>              if (pq.size() < numHits || score > minScore)
>              {
>                  OfficeId id = cache.get(doc);
>                  ResultDoc rd = results.get(id);
>                  //  No current result for this ID yet found
>                  if (rd == null)
>                  {
>                      rd = new ResultDoc(id, doc, score);
>                      ResultDoc added = pq.insert(rd);
>                      if (added == null)
>                      {
>                          //  Nothing dropped of the bottom
>                          results.put(id, rd);
>                      }
>                      else
>                      {
>                          //  Return value dropped of the bottom
>                          results.remove(added.id);
>                          results.put(id, rd);
>                          remaining++;
>                      }
>                  }
>                  //  Already found this ID, so replace high score if
> necessary
>                  else
>                  {
>                      if (score > rd.score)
>                      {
>                          pq.remove(rd);
>                          rd.score = score;
>                          pq.insert(rd);
>                      }
>                  }
>                  //  Readjust our minimum score again from the top entry
>                  minScore = pq.peek().score;
>              }
>              else
>                  remaining++;
>          }
>      }
>
> HTH
> Antony
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: how do I get my own TopDocHitCollector?

Beard, Brian
In reply to this post by Beard, Brian
Thanks for all this. We're doing warmup searching also, but just for
some common date searches. The warmup would be a good place to add some
pre-caching capability. I'll plan for this eventually and start with the
partial cache for now.

Thanks,

Brian Beard

-----Original Message-----
From: Antony Bowesman [mailto:[hidden email]]
Sent: Thursday, January 10, 2008 3:19 PM
To: [hidden email]
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:
> Ok, I've been thinking about this some more. Is the cache mechanism
> pulling from the cache if the external id already exists there and
then
> hitting the searcher if it's not already in the cache (maybe using a
> FieldSelector for just retrieving the external id)?

I am warming searchers in background and each search has one or more
query
related caches.  The external Id cache is normally preloaded by simply
iterating
terms, e.g.

         String field = fieldName.intern();
         final String[] retArray = new String[reader.maxDoc()];
         TermDocs termDocs = reader.termDocs();
         TermEnum termEnum = reader.terms (new Term (field, ""));
         try
         {
             do
             {
                 Term term = termEnum.term();
                 if (term == null || term.field() != field)
                     break;
                 String termval = term.text();
                 termDocs.seek(termEnum);
                 while (termDocs.next())
                 {
                     retArray[termDocs.doc()] = termval;
                 }
             }
             while (termEnum.next());
         }
         finally
         {
             termDocs.close();
             termEnum.close();
         }
         return retArray;

I do allow for a partial cache, in which case, as you suggest, the
searcher uses
a FieldSelector to get the external Id from the document which then is
stored to
cache.

Antony



>
> -----Original Message-----
> From: Beard, Brian [mailto:[hidden email]]
> Sent: Thursday, January 10, 2008 10:08 AM
> To: [hidden email]
> Subject: RE: how do I get my own TopDocHitCollector?
>
> Thanks for the post. So you're using the doc id as the key into the
> cache to retrieve the external id. Then what mechanism fetches the
> external id's from the searcher and places them in the cache?
>
>
> -----Original Message-----
> From: Antony Bowesman [mailto:[hidden email]]
> Sent: Wednesday, January 09, 2008 7:19 PM
> To: [hidden email]
> Subject: Re: how do I get my own TopDocHitCollector?
>
> Beard, Brian wrote:
>> Question:
>>
>> The documents that I index have two id's - a unique document id and a
>> record_id that can link multiple documents together that belong to a
>> common record.
>>
>> I'd like to use something like TopDocs to return the first 1024
> results
>> that have unique record_id's, but I will want to skip some of the
>> returned documents that have the same record_id. We're using the
>> ParallelMultiSearcher.
>>
>> I read that I could use a HitCollector and throw an exception to get
> it
>> to stop, but is there a cleaner way?
>
> I'm doing a similar thing.  I have external Ids (equivalent to yout
> record_id),
> which have one or more Lucene Documents associated with them.  I wrote
a
> custom
> HitCollector that uses a Map to hold the so far collected external ids
> along
> with the collected document.
>
> I had to write my own priority queue to know when an object was
dropped

> of the
> bottom of the score sorted queue, but the latest PriorityQueue on the
> trunk now
> has insertWithOverflow(), which does the same thing.
>
> Note that ResultDoc extends ScoreDoc, so that the external Id of the
> item
> dropped off the queue can be used to remove it from my Map.
>
> Code snippet is somewhat as below (I am caching my external Ids, hence
> the cache
> usage)
>
>     protected Map<OfficeId, ScoreDoc> results;
>
>     public void collect(int doc, float score)
>      {
>          if (score > 0.0f)
>          {
>              totalHits++;
>              if (pq.size() < numHits || score > minScore)
>              {
>                  OfficeId id = cache.get(doc);
>                  ResultDoc rd = results.get(id);
>                  //  No current result for this ID yet found
>                  if (rd == null)
>                  {
>                      rd = new ResultDoc(id, doc, score);
>                      ResultDoc added = pq.insert(rd);
>                      if (added == null)
>                      {
>                          //  Nothing dropped of the bottom
>                          results.put(id, rd);
>                      }
>                      else
>                      {
>                          //  Return value dropped of the bottom
>                          results.remove(added.id);
>                          results.put(id, rd);
>                          remaining++;
>                      }
>                  }
>                  //  Already found this ID, so replace high score if
> necessary
>                  else
>                  {
>                      if (score > rd.score)
>                      {
>                          pq.remove(rd);
>                          rd.score = score;
>                          pq.insert(rd);
>                      }
>                  }
>                  //  Readjust our minimum score again from the top
entry

>                  minScore = pq.peek().score;
>              }
>              else
>                  remaining++;
>          }
>      }
>
> HTH
> Antony
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]