About counting term hits

About counting term hits

lbarcala
Hello:

I am new to Lucene and I am testing a few things with it. I can retrieve
the number of documents that satisfy a query, but I can't find how to
obtain the number of term occurrences that match it.

For example, if I search for the word "house", I want to obtain the
number of times the word occurs (not the number of documents).

Is it possible to do this in Lucene?

Thanks in advance,

  Mario Barcala


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: About counting term hits

saiyuwa
Yes, it's quite possible.

1. Create the Term you want to count, e.g.:

    Term term = new Term("yourfield", "yourword");

2. Then open a TermDocs enumeration for it. TermDocs provides an interface
for enumerating <document, frequency> pairs for a term.

    IndexReader reader = IndexReader.open("yourindex");
    TermDocs td = reader.termDocs(term);

3. Iterate through the documents that contain the term and sum the
frequencies:

    int count = 0;
    while (td.next()) {
        count += td.freq();  // occurrences of the term in the current document
    }
    td.close();
    reader.close();

Term, TermDocs and IndexReader are all in org.apache.lucene.index.

Hope it helped,
Regards,
Dipesh

On Thu, Nov 13, 2008 at 4:30 AM, Fco. Mario Barcala Rodríguez <
[hidden email]> wrote:

> For example, if I search for the word "house", I want to obtain the
> number of times the word occurs (not the number of documents).

--
----------------------------------------
"Help Ever Hurt Never"- Baba

Re: About counting term hits

lbarcala
This helps, but what about combining it with other search criteria? I mean
obtaining the number of times the term "house" occurs in documents dated
between 1999 and 2005 (the year being another field of the documents). I
can't find anything related in the classes you used.


Re: About counting term hits

Otis Gospodnetic-2
Mario,

Does this help:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/TermFreqVector.html

Plus:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/IndexReader.html#method_summary
(look for "getTerm.Freq...")

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: "[hidden email]" <[hidden email]>
To: [hidden email]
Sent: Thursday, November 13, 2008 3:35:24 AM
Subject: Re: About counting term hits

This helps, but what about combining it with other search criteria? I mean
obtaining the number of times the term "house" occurs in documents dated
between 1999 and 2005 (the year being another field of the documents). I
can't find anything related in the classes you used.


Re: About counting term hits

lbarcala

So, if I understand the solution, the main steps to do what I propose are:

1) Obtain the documents that match the query (documents that include
the word "house").
2) Loop through the matching documents, using the IndexReader to
obtain their term frequencies.
3) Obtain from each TermFreqVector the frequency of the term ("house") and
sum these to calculate the result.
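
The three steps above can be sketched in plain Java as follows. This is only a model of the logic, under the assumption that each matching document exposes a sorted term list with parallel frequencies. The class DocTermVector and the method totalOccurrences are hypothetical stand-ins written for this illustration and are not part of Lucene's API.

```java
import java.util.Arrays;
import java.util.List;

public class TermVectorCount {
    /** Hypothetical stand-in for one document's term vector (not a Lucene class). */
    static final class DocTermVector {
        final String[] terms; // sorted, unique terms of the document
        final int[] freqs;    // freqs[i] = occurrences of terms[i] in the document

        DocTermVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }

        /** Occurrences of the term in this document, or 0 if it is absent. */
        int freqOf(String term) {
            int i = Arrays.binarySearch(terms, term);
            return i >= 0 ? freqs[i] : 0;
        }
    }

    /** Steps 2 and 3: loop over the matching documents and sum the term's frequency. */
    static long totalOccurrences(List<DocTermVector> matchingDocs, String term) {
        long total = 0;
        for (DocTermVector vector : matchingDocs) {
            total += vector.freqOf(term);
        }
        return total;
    }

    public static void main(String[] args) {
        // Step 1 is assumed to have produced these two matching documents.
        List<DocTermVector> hits = Arrays.asList(
            new DocTermVector(new String[] {"garden", "house"}, new int[] {1, 3}),
            new DocTermVector(new String[] {"house", "roof"}, new int[] {2, 1}));
        System.out.println(totalOccurrences(hits, "house")); // prints 5
    }
}
```

The work grows with the number of matching documents, since one term vector is read per hit.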

And, if it is a very frequent query and there are many documents (> 10,000),
would Lucene solve it in a reasonable time? A query might match several
hundred documents.

Thank you,

  Mario Barcala


Re: About counting term hits

Otis Gospodnetic-2
The more Documents you have to look at the slower it will be, but it may still be fast enough - it's impossible to tell without considering index size, query volume, hardware, number of hits/Docs, etc.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: "[hidden email]" <[hidden email]>
To: [hidden email]
Sent: Thursday, November 13, 2008 11:35:13 AM
Subject: Re: About counting term hits

And, if it is a very frequent query and there are many documents (> 10,000),
would Lucene solve it in a reasonable time? A query might match several
hundred documents.


Re: About counting term hits

Michael McCandless-2
In reply to this post by lbarcala

I think to do this efficiently you'd need to modify Lucene's built-in
query classes (e.g. TermQuery) such that during the scoring process, in
addition to simply computing its contribution to the document's score,
it would also record further information like the total number of
occurrences of each term, which docs had which terms, etc.

I don't think there's a simple, efficient way to do this with Lucene
today, though if your result sets are small enough, term vectors might
be fine.

Mike


Re: About counting term hits

hossman

: I think to do this efficiently you'd need to modify Lucene's builtin query
: classes (eg TermQuery) such that during the scoring process, in addition to
: simply computing its contribution to the document's score, it would also
: record further information like total number of occurrences of each term,
: which docs had which terms, etc.

Unless you don't care about the score at all, just the total count. In
which case you can use a custom Similarity to make the score for a doc be
the count (ignore idf, norms, queryNorm, etc...). Then use a hit collector
that sums the counts for every doc matched.

That should be as efficient as possible (it's certainly only one pass), but
you might be able to optimize it by using your other criteria
(date range or whatever) in a Filter to generate a BitSet, then fetch a
TermDocs instance for your term, and iterate through the docs summing up
the frequencies (you can use skipTo(set.nextSetBit()) to optimize away
non-matching docs).

(At least I'm pretty sure that would work.)
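
To make the skip-ahead idea concrete, here is a minimal plain-Java sketch. It assumes the filter has already produced a java.util.BitSet of allowed doc ids. The Postings class is a hypothetical stand-in for the term's <document, frequency> enumeration (it is not Lucene's TermDocs), and its skipTo is assumed to always advance to the first later entry whose doc id is at least the target.

```java
import java.util.BitSet;

public class FilteredTermCount {
    /** Hypothetical stand-in for a term's postings: ascending doc ids with parallel frequencies. */
    static final class Postings {
        private final int[] docs;
        private final int[] freqs;
        private int pos = -1; // positioned before the first entry

        Postings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        /** Moves to the first entry beyond the current one with doc id >= target; false when exhausted. */
        boolean skipTo(int target) {
            do {
                pos++;
            } while (pos < docs.length && docs[pos] < target);
            return pos < docs.length;
        }

        int doc() { return docs[pos]; }
        int freq() { return freqs[pos]; }
    }

    /** Sums the term's frequency over the docs whose bit is set in the filter. */
    static long countInFilter(Postings td, BitSet allowed) {
        long total = 0;
        int target = allowed.nextSetBit(0);
        while (target >= 0 && td.skipTo(target)) {
            int doc = td.doc();
            if (allowed.get(doc)) {
                total += td.freq();
                target = allowed.nextSetBit(doc + 1); // next allowed doc after this one
            } else {
                target = allowed.nextSetBit(doc);     // doc is not allowed, so this is > doc
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Postings house = new Postings(new int[] {0, 2, 3, 7, 9}, new int[] {1, 4, 2, 5, 3});
        BitSet inDateRange = new BitSet();
        inDateRange.set(2);
        inDateRange.set(7);
        inDateRange.set(8);
        System.out.println(countInFilter(house, inDateRange)); // prints 9 (4 from doc 2, 5 from doc 7)
    }
}
```

Both the bit set and the postings are only ever moved forward, so this is a single pass that skips whichever side is behind.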

Another nuance to this question...

: > > > > I am new to LUCENE and I am testing some issues about it. I can
: > > > > retrieve
: > > > > the number of documents which satisfies a query, but I don't find how
: > > > > to
: > > > > obtain the number of terms which match it.

The words "term" and "query" mean very specific, and independent, things
in Lucene, but Mario seems to be using them interchangeably. If you
want to know how often a Term appears in all docs matching some filter
criteria, then all of the techniques described so far should work.

But if you want to count the occurrences of a more complicated Query (like:
how many times does the phrase "Mario Barcala" appear in docs from
1999-2003), the situation gets more complicated. For that you would want
to use something like a SpanQuery and iterate through the Spans (counting
them as you go).
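
The reason a phrase needs something position-aware can be shown with a small plain-Java illustration. It does not use Lucene's span classes. It assumes only that, for one document, the token positions of each word are available as sorted arrays, and the method name countAdjacent is made up for this sketch.

```java
import java.util.Arrays;

public class PhraseOccurrences {
    /**
     * Counts occurrences of a two-word phrase in one document, given the sorted
     * token positions of the first and second word in that document.
     */
    static int countAdjacent(int[] firstPositions, int[] secondPositions) {
        int count = 0;
        for (int p : firstPositions) {
            // The phrase occurs where the second word sits right after the first.
            if (Arrays.binarySearch(secondPositions, p + 1) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Tokens: mario(0) barcala(1) met(2) mario(3) rossi(4) and(5) ana(6) barcala(7)
        int[] mario = {0, 3};
        int[] barcala = {1, 7};
        System.out.println(countAdjacent(mario, barcala)); // prints 1
    }
}
```

Each word occurs twice in this document, yet the phrase occurs once, so per-term frequencies alone cannot give the phrase count.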



-Hoss

