Migrating from Hit/Hits to TopDocs/TopDocCollector

Migrating from Hit/Hits to TopDocs/TopDocCollector

Paul J. Lucas
I have existing code that's like:

        final Term t = /* ... */;
        final Iterator i = searcher.search( new TermQuery( t ) ).iterator();
        while ( i.hasNext() ) {
            final Hit hit = (Hit)i.next();
            // "FILE" is the field that recorded the original file indexed
            final File f = new File( hit.get( "FILE" ) );
            // ...
        }

It's not clear to me how to rewrite the code using TopDocs/TopDocCollector
and how to iterate over the results.

A little help?  Thanks.  :-)

- Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Ian Lea
Hi


The code below might do the job.  Based on the example at
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Hits.html

Completely uncompiled and untested of course.

// hitsPerPage is the maximum number of hits to collect; pick a value that suits you
TopDocCollector collector = new TopDocCollector(hitsPerPage);
final Term t = /* ... */;
Query query = new TermQuery( t );
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; i++) {
     int docId = hits[i].doc;
     Document d = searcher.doc(docId);
     // "FILE" is the field that recorded the original file indexed
     final File f = new File( d.get( "FILE" ) );
     // ...
}


--
Ian.


On Wed, Jun 10, 2009 at 2:04 AM, Paul J. Lucas<[hidden email]> wrote:

> I have existing code that's like:
>
>        final Term t = /* ... */;
>        final Iterator i = searcher.search( new TermQuery( t ) ).iterator();
>        while ( i.hasNext() ) {
>            final Hit hit = (Hit)i.next();
>            // "FILE" is the field that recorded the original file indexed
>            final File f = new File( hit.get( "FILE" ) );
>            // ...
>        }
>
> It's not clear to me how to rewrite the code using TopDocs/TopDocCollector
> and how to iterate over the results.
>
> A little help?  Thanks.  :-)
>
> - Paul

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Wouter Heijke
Will this do?

IndexReader indexReader = searcher.getIndexReader();
// query is the Query to run; n is the maximum number of top hits to return
TopDocs topDocs = searcher.search(query, n);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
  Document document = indexReader.document( topDocs.scoreDocs[i].doc);
  final File f = new File( document.get( "FILE" ) );
}


RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

Uwe Schindler
This code snippet only works if you want to iterate over, e.g., the first
20 documents (which is n in your code). If he wants to iterate over all
results, he should think about using a custom (Hit)Collector.

The code below will be very slow for large result sets, because retrieving
stored fields is inefficient for a large number of documents (see the
warning about the "inner search loop" in the Wiki). To retrieve just a
filename, it may really be better to use a FieldCache on the "FILE" field
and, inside the HitCollector, use the doc number to get the filename from
the cache. I think the speedup will be well over 10 times!

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]

> -----Original Message-----
> From: Wouter Heijke [mailto:[hidden email]]
> Sent: Wednesday, June 10, 2009 11:44 AM
> To: [hidden email]
> Subject: Re: Migrating from Hit/Hits to TopDocs/TopDocCollector
>
>
> Will this do?
>
> IndexReader indexReader = searcher.getIndexReader();
> TopDocs topDocs = searcher.search(query, n);
> for (int i = 0; i < topDocs.scoreDocs.length; i++) {
>   Document document = indexReader.document( topDocs.scoreDocs[i].doc);
>   final File f = new File( document.get( "FILE" ) );
> }

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

Wouter Heijke
You are wrong.
As the java doc reads: 'Finds the top n hits for query'
You can set n to whatever value you want, 'all' documents (not results!)
indexed in your index if you want, or 10 if you want the top 10.

Anyway, it's just an example to give a direction..

Wouter

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

Uwe Schindler
> You are wrong.
> As the java doc reads: 'Finds the top n hits for query'
> You can set n to whatever value you want, 'all' documents (not results!)
> indexed in your index if you want, or 10 if you want the top 10.

You are right, you can, but if you just want to retrieve all hits, this is
inefficient. A HitCollector is the correct way to do this (especially
because the order of hits is mostly not interesting when retrieving all
hits). Hits and TopDocs are intended for paged results lists.
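
One way to read the "paged results lists" remark: for page p of size s, ask the searcher for the top (p + 1) * s hits and display only the last s of them. The sketch below shows just that arithmetic in plain Java. It is an illustration under assumptions, not Lucene code: `rankedDocIds` is a hypothetical stand-in for the doc IDs found in a `TopDocs.scoreDocs` array, and the class and method names are made up for this example.

```java
import java.util.Arrays;

public class PagingSketch {

    /** Number of top hits to request so that the given zero-based page is covered. */
    public static int hitsNeeded(int pageIndex, int pageSize) {
        return (pageIndex + 1) * pageSize;
    }

    /**
     * Returns the slice of rankedDocIds that belongs to a zero-based page.
     * rankedDocIds models the doc IDs of the top hits, best hit first.
     * A page past the end of the results yields an empty array.
     */
    public static int[] page(int[] rankedDocIds, int pageIndex, int pageSize) {
        int from = Math.min(pageIndex * pageSize, rankedDocIds.length);
        int to = Math.min(from + pageSize, rankedDocIds.length);
        return Arrays.copyOfRange(rankedDocIds, from, to);
    }

    public static void main(String[] args) {
        int[] ranked = { 10, 11, 12, 13, 14 };
        System.out.println(Arrays.toString(page(ranked, 1, 2)));
    }
}
```

The cost grows with the page index, because every page re-collects all hits ranked above it, which is why this suits "first few pages" browsing rather than a walk over every hit.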

> Anyway, it's just an example to give a direction..

Same here,
I wanted to give Paul a hint on how to do it correctly and efficiently.

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Paul J. Lucas
On Jun 10, 2009, at 3:17 AM, Uwe Schindler wrote:

> A HitCollector is the correct way to do this (especially because the  
> order of hits is mostly not interesting when retrieving all hits).

OK, here's what I came up with:

     final Term t = /* ... */;
     final Collection<File> files = new LinkedList<File>();
     // Load only the "FILE" field and stop reading the document after it.
     final FieldSelector fieldSelector = new FieldSelector() {
         public FieldSelectorResult accept( String fieldName ) {
             if ( fieldName.equals( "FILE" ) )
                 return FieldSelectorResult.LOAD_AND_BREAK;
             return FieldSelectorResult.NO_LOAD;
         }
     };
     // searcher must also be declared final to be used inside the collector.
     HitCollector hitCollector = new HitCollector() {
         public void collect( int docID, float score ) {
             try {
                 Document doc = searcher.doc( docID, fieldSelector );
                 files.add( new File( doc.get( "FILE" ) ) );
             }
             catch ( Exception e ) {
                 // ignore
             }
         }
     };
     searcher.search( new TermQuery( t ), hitCollector );

How's that?

- Paul

RE: Migrating from Hit/Hits to TopDocs/TopDocCollector

Uwe Schindler
That looks good, but it contains the inner search loop (looking up the
stored fields from within the main search loop, which is the hit collector).
For a few results this is OK, but if you are collecting thousands of hits
from a very large index that does not fit into memory, collecting gets slow
because of a lot of disk seeking (even when you filter out some fields with
a FieldSelector, the blocks are still read from disk).

To optimize, store the filename not as a stored field but as a non-tokenized,
indexed term. You can then use

String[] arr = FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "FILE");

The returned array contains one entry per document id. Inside the search
loop, just use arr[docID] to get the file name. Please note, on large
indexes the initial field cache loading could take some time.
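
The lookup pattern described here can be modelled in plain Java, without Lucene on the classpath. This is only a sketch under assumptions: the `fileByDocId` array stands in for the array returned by the field cache, while `HitCallback` and `fakeSearch` are hypothetical stand-ins for `HitCollector` and the searcher. None of them are Lucene types.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FieldCacheLookupSketch {

    /** Hypothetical stand-in for a HitCollector-style callback; not a Lucene type. */
    interface HitCallback {
        void collect(int docId, float score);
    }

    /** Simulates a search that pushes each matching doc ID to the callback. */
    static void fakeSearch(int[] matchingDocIds, HitCallback callback) {
        for (int docId : matchingDocIds) {
            callback.collect(docId, 1.0f);
        }
    }

    /** Builds the file list with one in-memory array read per hit. */
    public static List<File> filesForHits(final String[] fileByDocId, int[] matchingDocIds) {
        final List<File> files = new ArrayList<File>();
        fakeSearch(matchingDocIds, new HitCallback() {
            public void collect(int docId, float score) {
                // No stored-field load here, only an array lookup by doc ID.
                files.add(new File(fileByDocId[docId]));
            }
        });
        return files;
    }

    public static void main(String[] args) {
        // Assumed sample data: one file name per document ID.
        String[] fileByDocId = { "a.txt", "b.txt", "c.txt" };
        for (File f : filesForHits(fileByDocId, new int[] { 0, 2 })) {
            System.out.println(f.getName());
        }
    }
}
```

The trade-off is memory: the array holds one entry for every document in the index, whether or not it matches the query.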

In Lucene 2.9 this gets better with the new Collectors, which work directly
on segments. If you want to use 2.9, just ask how the same can be achieved
there. The new collector can be optimized there to get the FieldCache for
each segment inside Collector.setNextReader().

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Paul J. Lucas
On Jun 10, 2009, at 10:49 AM, Uwe Schindler wrote:

> To optimize, store the filename not as stored field, but as a
> non-tokenized, indexed term.

How do you do that?

- Paul

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Daniel Noll-3-2
On Wed, Jun 10, 2009 at 20:17, Uwe Schindler<[hidden email]> wrote:
> You are right, you can, but if you just want to retrieve all hits, this is
> ineffective. A HitCollector is the correct way to do this (especially
> because the order of hits is mostly not interesting when retrieving all
> hits). Hits and TopDocs are intended for paged results lists.

As a relevant note, what I have noticed about using HitCollector alone
is that the code effectively loses control of the loop (you get the
same problem with any API where you hand it a callback and let it do
all the work, e.g. SAX.)  The callback is good if you have a
relatively small number of results and/or a relatively fast operation
to perform with each one, but if the process as a whole takes a long
time and the user wants to be able to cancel it, then it isn't great.
It also isn't great if you want to wrap an Iterator or some other
existing API around it.

Our workaround for this is a HitCollector which populates a BitSet
(relatively fast); we then do the slow operation when iterating over
the BitSet.  This also has drawbacks in terms of memory usage, but
that doesn't become a huge problem until you have a very large number
of documents in the index.
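
The two-phase workaround described above might look roughly like the following plain-Java sketch. It does not use Lucene: `HitCallback` and `fakeSearch` are hypothetical stand-ins for `HitCollector` and the searcher, and the cancel flag is an assumption about how the caller would signal "stop".

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class BitSetCollectSketch {

    /** Hypothetical stand-in for a HitCollector-style callback; not a Lucene type. */
    interface HitCallback {
        void collect(int docId, float score);
    }

    /** Simulates a search that pushes each matching doc ID to the callback. */
    static void fakeSearch(int[] matchingDocIds, HitCallback callback) {
        for (int docId : matchingDocIds) {
            callback.collect(docId, 1.0f);
        }
    }

    /** Phase 1: the callback does nothing but set one bit per hit. */
    public static BitSet collectDocIds(int[] matchingDocIds, int maxDoc) {
        final BitSet bits = new BitSet(maxDoc);
        fakeSearch(matchingDocIds, new HitCallback() {
            public void collect(int docId, float score) {
                bits.set(docId);
            }
        });
        return bits;
    }

    /**
     * Phase 2: the caller owns the loop, so it can check a cancel flag
     * between documents. Returns the doc IDs that were processed.
     */
    public static List<Integer> processUntilCancelled(BitSet bits, AtomicBoolean cancelled) {
        List<Integer> processed = new ArrayList<Integer>();
        for (int docId = bits.nextSetBit(0);
             docId >= 0 && !cancelled.get();
             docId = bits.nextSetBit(docId + 1)) {
            // The slow per-document work would go here.
            processed.add(docId);
        }
        return processed;
    }

    public static void main(String[] args) {
        BitSet bits = collectDocIds(new int[] { 5, 1, 3 }, 8);
        System.out.println(processUntilCancelled(bits, new AtomicBoolean(false)));
    }
}
```

Note that the BitSet yields doc IDs in increasing order rather than in collection order, and that scores are discarded, which matches the case where hit order does not matter.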

It's a shame we don't have an inverted kind of HitCollector where we
can say "give me the next hit", so that we can get the best of both
worlds (like what StAX gives us in the XML world.)

Daniel

--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Yonik Seeley-2-2
On Wed, Jun 10, 2009 at 7:58 PM, Daniel Noll <[hidden email]> wrote:
> It's a shame we don't have an inverted kind of HitCollector where we
> can say "give me the next hit", so that we can get the best of both
> worlds (like what StAX gives us in the XML world.)

You can get a scorer and call next() yourself.
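
To illustrate what pulling hits yourself buys you, here is a plain-Java model of the pattern. It is not Lucene code: `DocCursor` is a hypothetical interface that only mirrors the next()/doc() shape mentioned above, and how to obtain a real scorer for a query is not shown here because it depends on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

public class PullIterationSketch {

    /** Hypothetical pull-style cursor over matching doc IDs; not a Lucene type. */
    public interface DocCursor {
        /** Advances to the next match; returns false when there are no more. */
        boolean next();

        /** Current doc ID; only valid after next() has returned true. */
        int doc();
    }

    /** Builds a cursor over a fixed array of doc IDs, for demonstration only. */
    public static DocCursor cursorOver(final int[] docIds) {
        return new DocCursor() {
            private int pos = -1;

            public boolean next() {
                return ++pos < docIds.length;
            }

            public int doc() {
                return docIds[pos];
            }
        };
    }

    /** The caller drives the loop and may stop early, here after 'limit' hits. */
    public static List<Integer> takeAtMost(DocCursor cursor, int limit) {
        List<Integer> taken = new ArrayList<Integer>();
        while (taken.size() < limit && cursor.next()) {
            taken.add(cursor.doc());
        }
        return taken;
    }

    public static void main(String[] args) {
        DocCursor cursor = cursorOver(new int[] { 4, 7, 9 });
        System.out.println(takeAtMost(cursor, 2)); // first batch
        System.out.println(takeAtMost(cursor, 2)); // resumes where the first stopped
    }
}
```

Because the caller holds the cursor, it can pause, cancel, or wrap the loop in an Iterator, which is the control that a callback-only API takes away.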

-Yonik
http://www.lucidimagination.com

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Paul J. Lucas
On Jun 10, 2009, at 5:02 PM, Yonik Seeley wrote:

> On Wed, Jun 10, 2009 at 7:58 PM, Daniel Noll <[hidden email]> wrote:
>> It's a shame we don't have an inverted kind of HitCollector where we
>> can say "give me the next hit", so that we can get the best of both
>> worlds (like what StAX gives us in the XML world.)
>
> You can get a scorer and call next() yourself.

Example code, please?

- Paul

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Ian Lea
This thread seems to be veering well away from your original
straightforward question on how to convert your straightforward code.
If you want or need these advanced solutions, fine, but if your
existing code was fast enough the modified versions suggested earlier
are probably fast enough too.

--
Ian.

Re: Migrating from Hit/Hits to TopDocs/TopDocCollector

Paul J. Lucas
On Jun 11, 2009, at 1:49 AM, Ian Lea wrote:

> This thread seems to be veering well away from your original
> straightforward question on how to convert your straightforward code.

So what?  It's about Lucene and hence on-topic.  Why do you care?

> If you want or need these advanced solutions, fine, but if your
> existing code was fast enough the modified versions suggested earlier
> are probably fast enough too.

I never claimed the original code was fast enough.  I'd like to  
optimize it now to future-proof it rather than forget about it and  
never get around to it.

But, again, why do you care?

- Paul

P.S.: My "Why do you care?" questions are rhetorical.  I really don't  
care what your answer is.
