Total of term frequencies

manjula wijewickrema
Hi,

Is there any way to get the total count of terms in the Term Frequency
Vector (tvf)? I need to calculate the normalized term frequency of each
term in my tvf. I know how to obtain the length of the tvf, but that doesn't
work since I need to count duplicate occurrences as well.

I would highly appreciate your response.
Re: Total of term frequencies

Michael McCandless-2
I think you want to use the TermsEnum.totalTermFreq method?

Mike McCandless

http://blog.mikemccandless.com

Re: Total of term frequencies

Michael McCandless-2
Ahh I see.

Term vectors are actually an inverted index for a single document, and they
also have the same postings API as the whole index (including
TermsEnum.totalTermFreq), but that method likely always returns -1 for term
vectors because it's not implemented?  Maybe Lucene's default codec should
be improved to store this; maybe open an issue?

In the meantime you could make your own codec that does store it.
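
As a possible workaround that does not need a custom codec, the arithmetic behind the normalized term frequency can be sketched as below. This is only a sketch: it assumes the per-term frequencies of one document have already been read from its term vector into a plain map, and the class and method names (NormalizedTf, totalTerms, normalize) are hypothetical, not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene. It assumes the caller has already
// collected term -> frequency pairs for a single document.
final class NormalizedTf {

    // Total number of term occurrences in the document: the sum of the
    // per-term frequencies, so a term that occurs three times counts three times.
    static long totalTerms(Map<String, Integer> termFreqs) {
        long total = 0;
        for (int freq : termFreqs.values()) {
            total += freq;
        }
        return total;
    }

    // Normalized term frequency: each term's frequency divided by the total.
    // An empty input yields an empty result, so no division by zero occurs.
    static Map<String, Double> normalize(Map<String, Integer> termFreqs) {
        long total = totalTerms(termFreqs);
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            result.put(e.getKey(), (double) e.getValue() / total);
        }
        return result;
    }
}
```

The length of the tvf corresponds to the number of map entries (unique terms), while totalTerms also counts the duplicate occurrences.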

Mike McCandless

http://blog.mikemccandless.com

On Tue, Apr 18, 2017 at 9:12 AM, Manjula Wijewickrema <[hidden email]>
wrote:

> Hi Mike,
>
> Thanks for the answer. I think this returns the total number of
> occurrences of a specified term across all the documents in the corpus,
> right?
>
> But I need the total number of terms (including multiple occurrences of
> the same term) in each document of the corpus. Any suggestions?
>
> Thanks!