Getting totalTermFreq and docFreq for terms

Shai Erera
Hi

I am currently using function queries to obtain these two statistics, as I didn't see a better or more explicit API and the Terms component only returns docFreq, but not totalTermFreq.

I use the API by submitting requests like this:

curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
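
For anyone building this request from code rather than curl, here is a minimal sketch of assembling the same URL with the parameter values URL-encoded. The class and method names (TermStatsUrl, build) are made up for illustration, and it assumes the term contains no single quote, since nothing here escapes one.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TermStatsUrl {
    // Builds a /select URL whose fl parameter asks for ttf() and docfreq() of one term.
    // Assumption: 'term' contains no single quote; this sketch does not escape it.
    static String build(String baseUrl, String field, String term) {
        String fl = "ttf(" + field + ",'" + term + "'),docfreq(" + field + ",'" + term + "')";
        return baseUrl + "/select"
                + "?q=" + enc("*:*")
                + "&rows=1"
                + "&fl=" + enc(fl);
    }

    // Percent-encodes a single query parameter value.
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/mycollection", "text", "t1"));
    }
}
```

Encoding only changes how the request travels over the wire; the term is still analyzed on the server side.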

Today I noticed that it sometimes returns 0 for these stats even for terms that exist in the index. After debugging and going through the code, I found that it analyzes the value it is given. So if I provide an already stemmed value, it analyzes the value further, which in some cases produces a term that does not exist (and in other cases returns stats for a term I didn't ask for).

I want the stats of the indexed form of the terms, which is why I send the already stemmed value. In my case I tried to get the stats for the term 'disguis', which is the stem of 'disguise' and 'disguised'. However, the value was analyzed further to 'disgui' (per the analysis chain), and that term does not exist in the index.
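
To make the double-analysis effect concrete, here is a small self-contained sketch. The stemmer is a toy rule (strip one trailing "ed", "e" or "s") chosen only because it reproduces the 'disguis' to 'disgui' behavior described above; it is not the stemmer in the actual analysis chain, and the docFreq value is made up.

```java
import java.util.Map;

public class DoubleAnalysisDemo {
    // Toy stemmer (an assumption, not Solr's analysis chain):
    // strips one trailing "ed", "e" or "s".
    static String stem(String token) {
        if (token.endsWith("ed")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.endsWith("e") || token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Analyzes the input before the lookup, which is what the function query was observed to do.
    static int docFreqAnalyzed(Map<String, Integer> index, String input) {
        return index.getOrDefault(stem(input), 0);
    }

    // Looks the input up exactly as given, with no analysis.
    static int docFreqRaw(Map<String, Integer> index, String input) {
        return index.getOrDefault(input, 0);
    }

    public static void main(String[] args) {
        // The index holds the stemmed term; the docFreq of 2 is a made-up value.
        Map<String, Integer> index = Map.of("disguis", 2);

        System.out.println(docFreqAnalyzed(index, "disguise")); // surface form, stemmed once
        System.out.println(docFreqAnalyzed(index, "disguis"));  // already stemmed, stemmed again
        System.out.println(docFreqRaw(index, "disguis"));       // already stemmed, looked up as-is
    }
}
```

In this sketch the surface form and the raw lookup both find the indexed term, while the already stemmed input is reduced to 'disgui' and misses it.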

So the first question is -- is this the right API to retrieve such statistics? I didn't find another one, but I may have missed it.

If it is, why does it analyze the value? I tried wrapping the value in single and double quotes, but of course that does not affect the analysis ... Is the analysis intended behavior or a bug?

Shai

Re: Getting totalTermFreq and docFreq for terms

Joel Bernstein
Hi Shai,

Do ttf and docfreq return global stats in distributed mode? I wasn't aware that there was a mechanism for aggregating values in the field list.




Re: Getting totalTermFreq and docFreq for terms

Shai Erera
No, they are not global distributed stats. I am willing to live with approximate stats, though (unless, again, there's an API that can give me both). I wonder why the Terms component doesn't return ttf in addition to docfreq. The API (at the Lucene level) is right there already.


Re: Getting totalTermFreq and docFreq for terms

Joel Bernstein
Yeah, I think expanding the functionality of the terms component looks like the right way to add these stats.

I plan on exposing these types of terms stats as Streaming Expression functions, but I would likely use the terms component under the covers.





Re: Getting totalTermFreq and docFreq for terms

Shai Erera
Looks like this could be a very easy addition to TermsComponent? From what I read in the code, it uses TermContext to compute/hold the stats, and the latter already has docFreq and totalTermFreq (!!). It's just that TermsComponent does not output TTF (only computes it...):

    for (int i = 0; i < terms.length; i++) {
      if (termContexts[i] != null) {
        String outTerm = fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
        int docFreq = termContexts[i].docFreq();
        // only docFreq is written to the response; totalTermFreq is never added
        termsMap.add(outTerm, docFreq);
      }
    }



Re: Getting totalTermFreq and docFreq for terms

Shai Erera
Hmm .. so if I add totalTermFreq to the response, it will break the current output format of TermsComponent, which returns only the docFreq for each term. What's our backward-compatibility (BWC) policy for such an API, and is there a way to handle it?

I can add a new terms.ttf parameter: if you set it to true, the response will look different (each term will have both docFreq and totalTermFreq elements), and if you don't, you will get the same response as today. Is that acceptable?
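
A rough sketch of the two response shapes, using plain Java maps rather than Solr's classes. The class TermsTtfSketch, its Stats holder, the render method, and the exact element names are assumptions for illustration, not the TermsComponent implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermsTtfSketch {
    // Hypothetical holder for the two per-term statistics.
    static final class Stats {
        final int docFreq;
        final long totalTermFreq;

        Stats(int docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // With includeTtf == false each term maps to its docFreq alone (the existing shape).
    // With includeTtf == true each term maps to a nested map holding both statistics.
    static Map<String, Object> render(Map<String, Stats> stats, boolean includeTtf) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Stats> e : stats.entrySet()) {
            Stats s = e.getValue();
            if (includeTtf) {
                Map<String, Object> both = new LinkedHashMap<>();
                both.put("docFreq", s.docFreq);
                both.put("totalTermFreq", s.totalTermFreq);
                out.put(e.getKey(), both);
            } else {
                out.put(e.getKey(), s.docFreq);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Stats> stats = new LinkedHashMap<>();
        stats.put("t1", new Stats(5, 12L)); // made-up numbers
        System.out.println(render(stats, false));
        System.out.println(render(stats, true));
    }
}
```

Keeping the default shape untouched means existing clients keep parsing a single number per term, and only callers that opt in see the nested form.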

Somewhat related, though it can be handled separately: I noticed that if you specify terms.list and multiple terms.fl parameters, you only receive stats for the first field (the rest are ignored), but if you don't specify terms.list, you get results for all fields. I don't see any reason not to support multiple fields with terms.list. What do you think?


Re: Getting totalTermFreq and docFreq for terms

Joel Bernstein
The idea of adding a terms.ttf parameter sounds fine to me. And it would be good to get terms.list better integrated into the TermsComponent. In general, I think it's time for more attention to be paid to the TermsComponent.

