Getting count of documents matching a query?

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Getting count of documents matching a query?

Tom-89
Hi -

Is there a fast way (not easy, but speedy) of getting the count of
documents that match a query?

I need the count, and don't need the docs at this point. If I had a
simple query, (e.g. "book") I can use docFreq(), and it's lightning
fast. If I just run it as a query it's much slower. I'm just
wondering if I did a custom scorer / similarity / hitcollector, how
much faster than a query could I get it? Or is there a better way?

And I'm not entirely sure how to implement the above idea, pointers
or samples or hints would be wonderful.

Thanks,


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Getting count of documents matching a query?

Chris Hostetter-3

: I need the count, and don't need the docs at this point. If I had a
: simple query, (e.g. "book") I can use docFreq(), and it's lightning
: fast. If I just run it as a query it's much slower. I'm just
: wondering if I did a custom scorer / similarity / hitcollector, how
: much faster than a query could I get it? Or is there a better way?

A custom HitCollector would be the first big win, something like this
would probably work...

   final int[] count = new int[1]
   searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
          count[0]++;
       }
    });
    return count[0]

otherways you might be able to shave time would be...

  * if your query can be represented as in simple set logic logic (you
    don't seem to be concerned with score) then implimenting it as a
    Filter may be faster becuase it won't do any score calculation, just a
    simple match/no-match (which is what you seem to want) ... but it will
    definitely take up more memory then a query

  * if you customize your similarity so that every function returns 0 or 1
    you might shave a little bit of time off by skipping some of the math
    equations ... but i really doubt it.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Getting count of documents matching a query?

Jason Calabrese
I just wrote some simple code to test this.

For my test I ran the test with 3 queries:
- A 3 term boolean
- A single term query with over 5000 hits
- A single term query with 0 hits

For each query I ran the ran 4 tests of 10,000 searches:
1) using hits.length to get the counts and the standard similarity
2) using hits.length to get the counts and a custom similarity
3) using HitCollector to get the counts and the standard similarity
4) using HitCollector to get the counts and a custom similarity

The custom similarity returns 0 for all methods.  
The results are kind of surprising. It doesn't look like the speed up is
enough to make the change to our application.

Here are the results, the test class is also attached:

time (mills) 14095, useHC=false, standardSimilarity=true, count=47,
query=abstract_recent:(genetically modified organism)
time (mills) 15406, useHC=false, standardSimilarity=false, count=0,
query=abstract_recent:(genetically modified organism)
time (mills) 13768, useHC=true, standardSimilarity=true, count=47,
query=abstract_recent:(genetically modified organism)
time (mills) 14404, useHC=true, standardSimilarity=false, count=47,
query=abstract_recent:(genetically modified organism)


time (mills) 6790, useHC=false, standardSimilarity=true, count=5776,
query=lname:smith
time (mills) 4901, useHC=false, standardSimilarity=false, count=0,
query=lname:smith
time (mills) 5209, useHC=true, standardSimilarity=true, count=5776,
query=lname:smith
time (mills) 5578, useHC=true, standardSimilarity=false, count=5776,
query=lname:smith


time (mills) 47, useHC=false, standardSimilarity=true, count=0,
query=lname:dfdsalkfjdsalkjflsa
time (mills) 37, useHC=false, standardSimilarity=false, count=0,
query=lname:dfdsalkfjdsalkjflsa
time (mills) 41, useHC=true, standardSimilarity=true, count=0,
query=lname:dfdsalkfjdsalkjflsa
time (mills) 198, useHC=true, standardSimilarity=false, count=0,
query=lname:dfdsalkfjdsalkjflsa




On Thursday 06 April 2006 15:19, Chris Hostetter wrote:

> : I need the count, and don't need the docs at this point. If I had a
> : simple query, (e.g. "book") I can use docFreq(), and it's lightning
> : fast. If I just run it as a query it's much slower. I'm just
> : wondering if I did a custom scorer / similarity / hitcollector, how
> : much faster than a query could I get it? Or is there a better way?
>
> A custom HitCollector would be the first big win, something like this
> would probably work...
>
>    final int[] count = new int[1]
>    searcher.search(query, new HitCollector() {
>        public void collect(int doc, float score) {
>           count[0]++;
>        }
>     });
>     return count[0]
>
> otherways you might be able to shave time would be...
>
>   * if your query can be represented as in simple set logic logic (you
>     don't seem to be concerned with score) then implimenting it as a
>     Filter may be faster becuase it won't do any score calculation, just a
>     simple match/no-match (which is what you seem to want) ... but it will
>     definitely take up more memory then a query
>
>   * if you customize your similarity so that every function returns 0 or 1
>     you might shave a little bit of time off by skipping some of the math
>     equations ... but i really doubt it.
>
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

LuceneCountTest.java (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Getting count of documents matching a query?

Chris Hostetter-3

first off: you should double check the correctness ofyour customized
similarity class.  I'm pretty sure it's resulting in a differnet set of
matches then the DefaultSimilarity because your tf function returns 0f
regardless of wether there is a match.  (when i said "every function
returns 0 or 1" i ment you acctually have to have the if (match) return 1
else 0 logic at a minimum)

second: these times really shocked me untill i realized you were
reporting the sum of the execution times for all 10000 iterations (whew!)

third: for queries this simple, i don't think you're going find much
differneces in speed between the differnet models, but if you elliminate
the Similarity as a variable (compare only #1 and #3 for each query), you
can see that HitCollector was fairly consistently "a little faster" then
using Hits (but i'll admit, i thought it would be "more faster")

I think if you tried bigger, more complex queries (nested booleans, with
mandatory/required clauses, span queries, booleans with 20 or more
optional clauses) you might see a bigger discrepency.

it's also not clear how big your test index is, as doug has pointed out
before, it can make a big difference in benchmarking.


: For each query I ran the ran 4 tests of 10,000 searches:
: 1) using hits.length to get the counts and the standard similarity
: 2) using hits.length to get the counts and a custom similarity
: 3) using HitCollector to get the counts and the standard similarity
: 4) using HitCollector to get the counts and a custom similarity
:
: The custom similarity returns 0 for all methods.
: The results are kind of surprising. It doesn't look like the speed up is
: enough to make the change to our application.
:
: Here are the results, the test class is also attached:
:
: time (mills) 14095, useHC=false, standardSimilarity=true, count=47,
: query=abstract_recent:(genetically modified organism)
: time (mills) 15406, useHC=false, standardSimilarity=false, count=0,
: query=abstract_recent:(genetically modified organism)
: time (mills) 13768, useHC=true, standardSimilarity=true, count=47,
: query=abstract_recent:(genetically modified organism)
: time (mills) 14404, useHC=true, standardSimilarity=false, count=47,
: query=abstract_recent:(genetically modified organism)
:
:
: time (mills) 6790, useHC=false, standardSimilarity=true, count=5776,
: query=lname:smith
: time (mills) 4901, useHC=false, standardSimilarity=false, count=0,
: query=lname:smith
: time (mills) 5209, useHC=true, standardSimilarity=true, count=5776,
: query=lname:smith
: time (mills) 5578, useHC=true, standardSimilarity=false, count=5776,
: query=lname:smith
:
:
: time (mills) 47, useHC=false, standardSimilarity=true, count=0,
: query=lname:dfdsalkfjdsalkjflsa
: time (mills) 37, useHC=false, standardSimilarity=false, count=0,
: query=lname:dfdsalkfjdsalkjflsa
: time (mills) 41, useHC=true, standardSimilarity=true, count=0,
: query=lname:dfdsalkfjdsalkjflsa
: time (mills) 198, useHC=true, standardSimilarity=false, count=0,
: query=lname:dfdsalkfjdsalkjflsa
:
:
:
:
: On Thursday 06 April 2006 15:19, Chris Hostetter wrote:
: > : I need the count, and don't need the docs at this point. If I had a
: > : simple query, (e.g. "book") I can use docFreq(), and it's lightning
: > : fast. If I just run it as a query it's much slower. I'm just
: > : wondering if I did a custom scorer / similarity / hitcollector, how
: > : much faster than a query could I get it? Or is there a better way?
: >
: > A custom HitCollector would be the first big win, something like this
: > would probably work...
: >
: >    final int[] count = new int[1]
: >    searcher.search(query, new HitCollector() {
: >        public void collect(int doc, float score) {
: >           count[0]++;
: >        }
: >     });
: >     return count[0]
: >
: > otherways you might be able to shave time would be...
: >
: >   * if your query can be represented as in simple set logic logic (you
: >     don't seem to be concerned with score) then implimenting it as a
: >     Filter may be faster becuase it won't do any score calculation, just a
: >     simple match/no-match (which is what you seem to want) ... but it will
: >     definitely take up more memory then a query
: >
: >   * if you customize your similarity so that every function returns 0 or 1
: >     you might shave a little bit of time off by skipping some of the math
: >     equations ... but i really doubt it.
: >
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: [hidden email]
: > For additional commands, e-mail: [hidden email]
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Indexing Problems

trupti mulajkar
hello,

i have modified the IndexFiles.java to read the document numbers from within the
TREC files, which are also being read correctly, however the index fails to
create the .cfs file.
thus the search query does not return the correct document number.
any suggestions how this can be sorted?

cheers,
trupti mulajkar
MSc Advanced Computer Science


Quoting Chris Hostetter <[hidden email]>:

>
> first off: you should double check the correctness ofyour customized
> similarity class.  I'm pretty sure it's resulting in a differnet set of
> matches then the DefaultSimilarity because your tf function returns 0f
> regardless of wether there is a match.  (when i said "every function
> returns 0 or 1" i ment you acctually have to have the if (match) return 1
> else 0 logic at a minimum)
>
> second: these times really shocked me untill i realized you were
> reporting the sum of the execution times for all 10000 iterations (whew!)
>
> third: for queries this simple, i don't think you're going find much
> differneces in speed between the differnet models, but if you elliminate
> the Similarity as a variable (compare only #1 and #3 for each query), you
> can see that HitCollector was fairly consistently "a little faster" then
> using Hits (but i'll admit, i thought it would be "more faster")
>
> I think if you tried bigger, more complex queries (nested booleans, with
> mandatory/required clauses, span queries, booleans with 20 or more
> optional clauses) you might see a bigger discrepency.
>
> it's also not clear how big your test index is, as doug has pointed out
> before, it can make a big difference in benchmarking.
>
>
> : For each query I ran the ran 4 tests of 10,000 searches:
> : 1) using hits.length to get the counts and the standard similarity
> : 2) using hits.length to get the counts and a custom similarity
> : 3) using HitCollector to get the counts and the standard similarity
> : 4) using HitCollector to get the counts and a custom similarity
> :
> : The custom similarity returns 0 for all methods.
> : The results are kind of surprising. It doesn't look like the speed up is
> : enough to make the change to our application.
> :
> : Here are the results, the test class is also attached:
> :
> : time (mills) 14095, useHC=false, standardSimilarity=true, count=47,
> : query=abstract_recent:(genetically modified organism)
> : time (mills) 15406, useHC=false, standardSimilarity=false, count=0,
> : query=abstract_recent:(genetically modified organism)
> : time (mills) 13768, useHC=true, standardSimilarity=true, count=47,
> : query=abstract_recent:(genetically modified organism)
> : time (mills) 14404, useHC=true, standardSimilarity=false, count=47,
> : query=abstract_recent:(genetically modified organism)
> :
> :
> : time (mills) 6790, useHC=false, standardSimilarity=true, count=5776,
> : query=lname:smith
> : time (mills) 4901, useHC=false, standardSimilarity=false, count=0,
> : query=lname:smith
> : time (mills) 5209, useHC=true, standardSimilarity=true, count=5776,
> : query=lname:smith
> : time (mills) 5578, useHC=true, standardSimilarity=false, count=5776,
> : query=lname:smith
> :
> :
> : time (mills) 47, useHC=false, standardSimilarity=true, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> : time (mills) 37, useHC=false, standardSimilarity=false, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> : time (mills) 41, useHC=true, standardSimilarity=true, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> : time (mills) 198, useHC=true, standardSimilarity=false, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> :
> :
> :
> :
> : On Thursday 06 April 2006 15:19, Chris Hostetter wrote:
> : > : I need the count, and don't need the docs at this point. If I had a
> : > : simple query, (e.g. "book") I can use docFreq(), and it's lightning
> : > : fast. If I just run it as a query it's much slower. I'm just
> : > : wondering if I did a custom scorer / similarity / hitcollector, how
> : > : much faster than a query could I get it? Or is there a better way?
> : >
> : > A custom HitCollector would be the first big win, something like this
> : > would probably work...
> : >
> : >    final int[] count = new int[1]
> : >    searcher.search(query, new HitCollector() {
> : >        public void collect(int doc, float score) {
> : >           count[0]++;
> : >        }
> : >     });
> : >     return count[0]
> : >
> : > otherways you might be able to shave time would be...
> : >
> : >   * if your query can be represented as in simple set logic logic (you
> : >     don't seem to be concerned with score) then implimenting it as a
> : >     Filter may be faster becuase it won't do any score calculation, just
> a
> : >     simple match/no-match (which is what you seem to want) ... but it
> will
> : >     definitely take up more memory then a query
> : >
> : >   * if you customize your similarity so that every function returns 0 or
> 1
> : >     you might shave a little bit of time off by skipping some of the
> math
> : >     equations ... but i really doubt it.
> : >
> : >
> : >
> : >
> : > -Hoss
> : >
> : >
> : > ---------------------------------------------------------------------
> : > To unsubscribe, e-mail: [hidden email]
> : > For additional commands, e-mail: [hidden email]
> :
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]