Limiting Result-Count

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Limiting Result-Count

Dominik Bruhn
Hy,
how can I limit the result-count of a query in order to save time? I searched
the web but didn't find a solution.

Thanks
Dominik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Limiting Result-Count

Otis Gospodnetic-2
Try using HitCollector and break out of it when you collect enough documents.  My guess is that if you are not doing anything crazy with Hits (like looping through the all) this won't be that much faster than using Hits.

Otis

----- Original Message ----
From: Dominik Bruhn <[hidden email]>
To: [hidden email]
Sent: Thursday, June 29, 2006 1:29:31 PM
Subject: Limiting Result-Count

Hy,
how can I limit the result-count of a query in order to save time? I searched
the web but didn't find a solution.

Thanks
Dominik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Limiting Result-Count

Andrzej Białecki-2
Otis Gospodnetic wrote:
> Try using HitCollector and break out of it when you collect enough documents.  My guess is that if you are not doing anything crazy with Hits (like looping through the all) this won't be that much faster than using Hits.
>  

Well, in practice it does help - see the way this is done in Nutch
(src/java/org/apache/nutch/searcher/LuceneQueryOptimizer$LimitedCollector).
Performance-wise, with large indexes this makes a big difference.

The problem that you need to address, though, is how usable are partial
results, i.e. if you are reasonably sure that by collecting only partial
results you are not missing important hits, which would have been found
had you let the search collect all results ... This facility in Nutch is
used only if posting lists are sorted by decreasing document importance
(see IndexSorter for details), so that we collect first the most highly
ranking hits, and skip low ranking ones.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

HitCollector and Sort Objects

James Pine
Hey,

I've looked at the documentation for:

org.apache.lucene.search.Searchable
org.apache.lucene.search.Searcher
org.apache.lucene.search.IndexSearcher

and it struck me that there are no search methods with
these signatures:

void search(Query query, Filter filter, HitCollector
results, Sort sort)

void search(Query query, HitCollector results, Sort
sort)

I have one type of search where I pass in a Query and
a Sort (built with a SortField and Decompresses) and
deal with the Hits object, and another which takes a
Query and a HitCollector, which I then run my own
sorting on the results.

I'd like to combine both into one call so that I can
get the results sorted and have the collected data
available. Is that crazy talk? Does the way document
ids stream through the HitCollector prevent sorting
them first?

Would it be best for me to convert my Decompresses
logic into a Comparator to pass into Arrays.sort() or
something like that and just forget about using the
built in Lucene functionality?

I currently use the HitCollector to count occurrences
of words in the result set, basically trying compute
the TermFrequencies in the result of a search vs.
across the entire index. I've tried using bitsets but
because I can't constrain the number of unique words
people use (currently in the millions in our data),
and the number of document (also currently in the
millions), keeping a cache of bitsets is not feasible
on our hardware (I can save about 10,000 but the cache
hit ratio is very low). If someone thinks my use of
HitCollector is flawed, I'm certainly open to
suggestions. Thanx.

JAMES


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: HitCollector and Sort Objects

Ramana Jelda
Yeah!! There are no methods that you mentioned. But there are some ways to
do this.

TopFieldDocs:search(Query query, Filter filter, int n, Sort sort)

If above method does not solve your purpose, then

My suggestion is to use method search(Query query, Filter filter,
HitCollector results)  and pass your HitCollector which extends
TopFieldDocCollector .

Jelda


> -----Original Message-----
> From: James Pine [mailto:[hidden email]]
> Sent: Friday, June 30, 2006 1:54 AM
> To: [hidden email]
> Subject: HitCollector and Sort Objects
>
> Hey,
>
> I've looked at the documentation for:
>
> org.apache.lucene.search.Searchable
> org.apache.lucene.search.Searcher
> org.apache.lucene.search.IndexSearcher
>
> and it struck me that there are no search methods with these
> signatures:
>
> void search(Query query, Filter filter, HitCollector results,
> Sort sort)
>
> void search(Query query, HitCollector results, Sort
> sort)
>
> I have one type of search where I pass in a Query and a Sort
> (built with a SortField and Decompresses) and deal with the
> Hits object, and another which takes a Query and a
> HitCollector, which I then run my own sorting on the results.
>
> I'd like to combine both into one call so that I can get the
> results sorted and have the collected data available. Is that
> crazy talk? Does the way document ids stream through the
> HitCollector prevent sorting them first?
>
> Would it be best for me to convert my Decompresses logic into
> a Comparator to pass into Arrays.sort() or something like
> that and just forget about using the built in Lucene functionality?
>
> I currently use the HitCollector to count occurrences of
> words in the result set, basically trying compute the
> TermFrequencies in the result of a search vs.
> across the entire index. I've tried using bitsets but because
> I can't constrain the number of unique words people use
> (currently in the millions in our data), and the number of
> document (also currently in the millions), keeping a cache of
> bitsets is not feasible on our hardware (I can save about
> 10,000 but the cache hit ratio is very low). If someone
> thinks my use of HitCollector is flawed, I'm certainly open
> to suggestions. Thanx.
>
> JAMES
>
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection
> around http://mail.yahoo.com 
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: HitCollector and Sort Objects

Nadav Har'El
In reply to this post by James Pine
On Thu, Jun 29, 2006, James Pine wrote about "HitCollector and Sort Objects":

> I have one type of search where I pass in a Query and
> a Sort (built with a SortField and Decompresses) and
> deal with the Hits object, and another which takes a
> Query and a HitCollector, which I then run my own
> sorting on the results.
>
> I'd like to combine both into one call so that I can
> get the results sorted and have the collected data
> available. Is that crazy talk? Does the way document
> ids stream through the HitCollector prevent sorting
> them first?

The best thing for you to do without changing Lucene is to use the

        search(Query query, HitCollector results)

method. The collect() in the HitCollector you'll write will do what you
want it to do, and finally call collect() of a TopFieldDocCollector
object that you'll creat. TopFieldDocCollector collects the top-ranking
documents using a given Sort (see its documentation).

This will give you a "TopDocs", not a "Hits" object. But when you know
in advance how many results you want (which is almost always the case),
this is enough for you.

I raised the idea of having a search() method which returns a Hits and
calls a HitCollector, but was convinced that TopDocs+HitCollector is
actually better. See:

http://www.gossamer-threads.com/lists/lucene/java-dev/37277

Maybe this should be in the FAQ.

--
Nadav Har'El                        |        Sunday, Jul 2 2006, 6 Tammuz 5766
IBM Haifa Research Lab              |-----------------------------------------
                                    |Recursive, adj.: See Recursive
http://nadav.harel.org.il           |

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

BitSet in a HitCollector

James Pine
Hey Everyone,

I'm using a HitCollector and would like to know the
total number of results that matched a given query.
Based on the JavaDoc, I this will do the trick:

Searcher searcher = new IndexSearcher(indexReader);
   final BitSet bits = new
BitSet(indexReader.maxDoc());
   searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
         bits.set(doc);
       }
     });
int numResults = bits.cardinality();

If I want to know the total number of results inside
of the HitCollector, i.e. before the collect method
has ever been called, I think I could pass the Query
and Searcher objects into the HitCollector and do this
in its constructor:

BitSet bits = (new
QueryFilter(query)).bits(searcher.getIndexReader());
int numResults = bits.cardinality();

Is there a performance penalty using the QueryFilter?
Is Lucene executing another pass over the index in
order to populate the BitSet and then doing another
pass while calling the collect method? Thanx.

JAMES



__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: BitSet in a HitCollector

Chris Hostetter-3

: I'm using a HitCollector and would like to know the
: total number of results that matched a given query.
: Based on the JavaDoc, I this will do the trick:

you don't need a BitSet in that case, you could find that out just using
an int...

    public CountingCollector extends HitCollector {
      public int count = 0;
      public void collect(int doc, float score) { count++ };
    }
    CountingCollector c = new CountingCollector();
    searcher.search(query, c)
    int numResults = c.count;

: If I want to know the total number of results inside
: of the HitCollector, i.e. before the collect method
: has ever been called, I think I could pass the Query
: and Searcher objects into the HitCollector and do this
: in its constructor:
:
: BitSet bits = (new
: QueryFilter(query)).bits(searcher.getIndexReader());
: int numResults = bits.cardinality();

This question doesn't make a lot of sense to me, why do you need to know
the total number ofresults before the collect method is called? .. what
you are suggesting here (using QueryFilter in this way) is perfectly
legal, but it's going to do just as much work as using a HitCollector will
(possibly more, i can't remember).

: Is Lucene executing another pass over the index in
: order to populate the BitSet and then doing another
: pass while calling the collect method? Thanx.

in your last example, you never us your HitCollector, so i'm not sure what
you mean, but assuming you aresking about combining those examples into
something like this....

  Searcher searcher = new IndexSearcher(indexReader);
  BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
  final int numResults = bits.cardinality();
  searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
          /* do something with numResults and doc and score */
       }
  });

...then yes, you are most definitely making two passes to do do that.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: BitSet in a HitCollector

James Pine
Hey,

Sorry, I will explain a bit more about my collect
method. Currently my collect method is executing
IndexSearcher.doc(id) and storing some stuff in a Map
which I can then retrieve from the HitCollector (much
like the example in the Lucene In Action book). Of
course that's somewhat expensive, so I'd like to do
some statistical sampling based on the result set size
to try and speed things up.

The way I was thinking about doing this was, during
the collect method only executing
IndexSearcher.doc(id) on every Nth document, where N
is calculated dynamically based on a minimum number X.
The rule would be:

N = Max(1,(numResults / X))

In order to do this in the collect method, I need to
know the total number of results before ever invoking
the collect method right? That seemed to make a case
for the BitSet/QueryFilter in the constructor.

In addition, someone else on the list mentioned that
one of the reasons calling IndexSearcher.doc(id) in
the collect method was that it caused the disk to do a
lot of seeking. Maybe that's a moot point if one is
using a RAMDirectory or an FSDirectory small enough
that it gets cached by the OS anyway, but if it's not,
then I thought it might be more performant to have the
hitcollector set the Bits in the collect method and
then do another pass to do the statistical sampling.

Either way it seems that to do the statistical
sampling that I envision I either need to calculate
the total result count/document id set in the
constructor, before calling the collect method, or
calculate the total result count/document id set in
the collect method and then execute some sort of
post-collect method, right? So I was just wondering
which method was better/faster. Thanx.

JAMES

--- Chris Hostetter <[hidden email]> wrote:

>
> : I'm using a HitCollector and would like to know
> the
> : total number of results that matched a given
> query.
> : Based on the JavaDoc, I this will do the trick:
>
> you don't need a BitSet in that case, you could find
> that out just using
> an int...
>
>     public CountingCollector extends HitCollector {
>       public int count = 0;
>       public void collect(int doc, float score) {
> count++ };
>     }
>     CountingCollector c = new CountingCollector();
>     searcher.search(query, c)
>     int numResults = c.count;
>
> : If I want to know the total number of results
> inside
> : of the HitCollector, i.e. before the collect
> method
> : has ever been called, I think I could pass the
> Query
> : and Searcher objects into the HitCollector and do
> this
> : in its constructor:
> :
> : BitSet bits = (new
> :
> QueryFilter(query)).bits(searcher.getIndexReader());
> : int numResults = bits.cardinality();
>
> This question doesn't make a lot of sense to me, why
> do you need to know
> the total number ofresults before the collect method
> is called? .. what
> you are suggesting here (using QueryFilter in this
> way) is perfectly
> legal, but it's going to do just as much work as
> using a HitCollector will
> (possibly more, i can't remember).
>
> : Is Lucene executing another pass over the index in
> : order to populate the BitSet and then doing
> another
> : pass while calling the collect method? Thanx.
>
> in your last example, you never us your
> HitCollector, so i'm not sure what
> you mean, but assuming you aresking about combining
> those examples into
> something like this....
>
>   Searcher searcher = new
> IndexSearcher(indexReader);
>   BitSet bits = (new
> QueryFilter(query)).bits(searcher.getIndexReader());
>   final int numResults = bits.cardinality();
>   searcher.search(query, new HitCollector() {
>        public void collect(int doc, float score) {
>           /* do something with numResults and doc
> and score */
>        }
>   });
>
> ...then yes, you are most definitely making two
> passes to do do that.
>
>
>
> -Hoss
>
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> [hidden email]
> For additional commands, e-mail:
> [hidden email]
>
>


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: BitSet in a HitCollector

pgwillia
Hi James,

    A paper was mentioned on this list in the last couple of months which
presents a solution to your sampling problem without having to know the
total results size in advance.  The paper
(http://www2005.org/cdrom/docs/p245.pdf) presents two solutions which
utilize a random variable.  One solution has you traverse the result set
and select each document with probability p.  P is determined in advance.
Alternately, the paper describes an algorithm (bottom of page 248) for
determining a skip value which, while similar to the traversal, allows you
to jump/skip over documents and save the probability computations for each
document required by the first solution.

    I hope this helps!

Tricia

On Thu, 6 Jul 2006, James Pine wrote:

> Hey,
>
> Sorry, I will explain a bit more about my collect
> method. Currently my collect method is executing
> IndexSearcher.doc(id) and storing some stuff in a Map
> which I can then retrieve from the HitCollector (much
> like the example in the Lucene In Action book). Of
> course that's somewhat expensive, so I'd like to do
> some statistical sampling based on the result set size
> to try and speed things up.
>
> The way I was thinking about doing this was, during
> the collect method only executing
> IndexSearcher.doc(id) on every Nth document, where N
> is calculated dynamically based on a minimum number X.
> The rule would be:
>
> N = Max(1,(numResults / X))
>
> In order to do this in the collect method, I need to
> know the total number of results before ever invoking
> the collect method right? That seemed to make a case
> for the BitSet/QueryFilter in the constructor.
>
> In addition, someone else on the list mentioned that
> one of the reasons calling IndexSearcher.doc(id) in
> the collect method was that it caused the disk to do a
> lot of seeking. Maybe that's a moot point if one is
> using a RAMDirectory or an FSDirectory small enough
> that it gets cached by the OS anyway, but if it's not,
> then I thought it might be more performant to have the
> hitcollector set the Bits in the collect method and
> then do another pass to do the statistical sampling.
>
> Either way it seems that to do the statistical
> sampling that I envision I either need to calculate
> the total result count/document id set in the
> constructor, before calling the collect method, or
> calculate the total result count/document id set in
> the collect method and then execute some sort of
> post-collect method, right? So I was just wondering
> which method was better/faster. Thanx.
>
> JAMES
>
> --- Chris Hostetter <[hidden email]> wrote:
>
>>
>> : I'm using a HitCollector and would like to know
>> the
>> : total number of results that matched a given
>> query.
>> : Based on the JavaDoc, I this will do the trick:
>>
>> you don't need a BitSet in that case, you could find
>> that out just using
>> an int...
>>
>>     public CountingCollector extends HitCollector {
>>       public int count = 0;
>>       public void collect(int doc, float score) {
>> count++ };
>>     }
>>     CountingCollector c = new CountingCollector();
>>     searcher.search(query, c)
>>     int numResults = c.count;
>>
>> : If I want to know the total number of results
>> inside
>> : of the HitCollector, i.e. before the collect
>> method
>> : has ever been called, I think I could pass the
>> Query
>> : and Searcher objects into the HitCollector and do
>> this
>> : in its constructor:
>> :
>> : BitSet bits = (new
>> :
>> QueryFilter(query)).bits(searcher.getIndexReader());
>> : int numResults = bits.cardinality();
>>
>> This question doesn't make a lot of sense to me, why
>> do you need to know
>> the total number ofresults before the collect method
>> is called? .. what
>> you are suggesting here (using QueryFilter in this
>> way) is perfectly
>> legal, but it's going to do just as much work as
>> using a HitCollector will
>> (possibly more, i can't remember).
>>
>> : Is Lucene executing another pass over the index in
>> : order to populate the BitSet and then doing
>> another
>> : pass while calling the collect method? Thanx.
>>
>> in your last example, you never us your
>> HitCollector, so i'm not sure what
>> you mean, but assuming you aresking about combining
>> those examples into
>> something like this....
>>
>>   Searcher searcher = new
>> IndexSearcher(indexReader);
>>   BitSet bits = (new
>> QueryFilter(query)).bits(searcher.getIndexReader());
>>   final int numResults = bits.cardinality();
>>   searcher.search(query, new HitCollector() {
>>        public void collect(int doc, float score) {
>>           /* do something with numResults and doc
>> and score */
>>        }
>>   });
>>
>> ...then yes, you are most definitely making two
>> passes to do do that.
>>
>>
>>
>> -Hoss
>>
>>
>>
> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> [hidden email]
>> For additional commands, e-mail:
>> [hidden email]
>>
>>
>
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]