DelimitedTermFrequencyTokenFilter

Edward Ribeiro
Hi,

Does anyone have an example of DelimitedTermFrequencyTokenFilter usage that they could share?

I have been banging my head against the wall trying to make it work ( https://gist.github.com/eribeiro/ebb24feb3fd84931b7c288b9b716ed49 ) and I don't know what I am doing wrong.

I am creating a custom analyzer that uses a WhitespaceTokenizer to split a string like "a|10 b|2 c|9" and passes the tokens to DelimitedTermFrequencyTokenFilter. The text is added to the document through a custom field whose type is set up so that it is indexed without positions and offsets.

The debugger shows that the string is being parsed correctly by DelimitedTermFrequencyTokenFilter (DTFTF) and that its char and term attributes are set up properly. But the term frequency of each term is 1 when I inspect the index via Luke. Curiously, the output of my snippet shows the correct total term frequency, as seen below:

field="text",maxDoc=1,docCount=1,sumTotalTermFreq=123,sumDocFreq=3
a|10 b|23 c|90
SumTotalTermFreq: 123
SumDocFreq: 3

Cheers,
Edward
PS: I am a Lucene newbie so it may be something quite stupid. 

Re: DelimitedTermFrequencyTokenFilter

Alan Woodward-3
I think it’s working fine - Luke is showing you the docFreq of the term, which will be 1 as it only appears in a single document.
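
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class `TermStatsSketch` and its methods are hypothetical names invented for illustration, and the parsing is a simplification of what a whitespace tokenizer followed by a "term|freq" filter does. It only shows why a term can have a document frequency of 1 while the summed term frequencies still reach 123 for the input from the original post.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermStatsSketch {

    /** Parses whitespace-separated "term|freq" tokens into term -> frequency within one document. */
    static Map<String, Integer> parseDoc(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int sep = token.indexOf('|');
            // Assumption of this sketch: a token without a delimiter counts once.
            String term = sep < 0 ? token : token.substring(0, sep);
            int freq = sep < 0 ? 1 : Integer.parseInt(token.substring(sep + 1));
            freqs.merge(term, freq, Integer::sum);
        }
        return freqs;
    }

    /** Number of documents that contain the term at least once. */
    static int docFreq(List<Map<String, Integer>> docs, String term) {
        int count = 0;
        for (Map<String, Integer> doc : docs) {
            if (doc.containsKey(term)) {
                count++;
            }
        }
        return count;
    }

    /** Sum of the term's frequency over all documents. */
    static long totalTermFreq(List<Map<String, Integer>> docs, String term) {
        long sum = 0;
        for (Map<String, Integer> doc : docs) {
            sum += doc.getOrDefault(term, 0);
        }
        return sum;
    }

    /** Sum of docFreq over all terms, i.e. distinct terms counted per document. */
    static long sumDocFreq(List<Map<String, Integer>> docs) {
        long sum = 0;
        for (Map<String, Integer> doc : docs) {
            sum += doc.size();
        }
        return sum;
    }

    /** Sum of totalTermFreq over all terms. */
    static long sumTotalTermFreq(List<Map<String, Integer>> docs) {
        long sum = 0;
        for (Map<String, Integer> doc : docs) {
            for (int freq : doc.values()) {
                sum += freq;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // A single document, as in the original post.
        List<Map<String, Integer>> docs = List.of(parseDoc("a|10 b|23 c|90"));
        for (String term : docs.get(0).keySet()) {
            System.out.println(term + ": docFreq=" + docFreq(docs, term)
                    + ", totalTermFreq=" + totalTermFreq(docs, term));
        }
        System.out.println("sumTotalTermFreq=" + sumTotalTermFreq(docs)
                + ", sumDocFreq=" + sumDocFreq(docs));
    }
}
```

With one document, every term has docFreq 1 no matter how large its per-document frequency is, which matches the value seen in Luke; the per-document frequencies only show up in the frequency-based statistics.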

On 28 Nov 2019, at 21:51, Edward Ribeiro <[hidden email]> wrote:
> [...]

Re: DelimitedTermFrequencyTokenFilter

Edward Ribeiro
Oh, silly of me. :)

Thanks,
Edward

On Fri, 29 Nov 2019 at 07:13, Alan Woodward <[hidden email]> wrote:
I think it’s working fine - Luke is showing you the docFreq of the term, which will be 1 as it only appears in a single document.
