Total term frequency in Solr includes deleted documents

Vijaymhaskar
I am currently working on getting the total term frequency (not the document frequency) of a term in a particular field across the whole index. For that I am using the function query ttf(field_name,'term'), which returns the total number of occurrences of the term in that field. However, it seems to count deleted documents as well. I verified this by optimizing the index: after optimization it shows the correct count.
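For reference, a minimal sketch of how such a request could be built, with ttf(...) requested as a pseudo-field in the fl parameter. The host, core name (collection1), field name (content) and term are placeholders, not values from this thread, and the helper build_ttf_url is hypothetical.

```python
from urllib.parse import urlencode


def build_ttf_url(base_url, field, term):
    """Build a Solr /select URL that requests ttf(field,'term') as a pseudo-field."""
    if "'" in term:
        # Keep the sketch simple: quoting rules for such terms are not handled here.
        raise ValueError("term must not contain a single quote")
    params = {
        "q": "*:*",                       # match all documents
        "rows": 1,                        # the value is the same for every document
        "fl": f"ttf({field},'{term}')",   # function query returned as a field
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host, core and field names (assumptions for illustration only).
url = build_ttf_url("http://localhost:8983/solr/collection1", "content", "solr")
print(url)
```

Because the value is an index-wide statistic, one returned row is enough to read it.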

How can we get the exact term frequency, excluding deleted documents, without optimizing the index? Optimization is expensive in our case.
Is there any other way to get the term frequency for the entire collection in Solr?

I have also explored the following options:
1. Term vector component - returns the per-document term frequency for the documents that matched the query.
2. Facet - returns document frequency.
3. Luke request handler - returns the top terms of a given field (document frequency).
4. Terms component - returns document frequency.


Re: Total term frequency in solr includes deleted documents

Shawn Heisey-2
On 10/28/2014 7:16 AM, nutchsolruser wrote:
> How can we get exact term frequency with excluding deleted documents term
> frequency, and that is without optimization because optimization is
> expensive in our case ?
> Is there any other way we can get term frequency for entire collection in
> solr?


This is not possible except through index optimization.  Lucene is
amazingly efficient at computing information across the entire index.
If it were possible to keep that efficiency while also excluding info
from deleted documents, I'm sure it would have already been implemented.
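To illustrate the trade-off, here is a toy model in Python. It is a sketch under simplifying assumptions, not Lucene's actual data structures: each segment keeps a per-term total written once, and a delete only marks a document. All class and function names are made up for the illustration.

```python
class Segment:
    """Toy stand-in for an index segment (not Lucene's real data structures)."""

    def __init__(self, docs):
        self.docs = docs        # {doc_id: [tokens in the field]}
        self.deleted = set()    # ids marked as deleted; their data stays in place
        self.stored_ttf = {}    # per-term totals, computed once when the segment is written
        for tokens in docs.values():
            for token in tokens:
                self.stored_ttf[token] = self.stored_ttf.get(token, 0) + 1

    def delete(self, doc_id):
        # Deleting only sets a marker; stored_ttf is left untouched.
        self.deleted.add(doc_id)


def reported_ttf(segments, term):
    """Cheap: add up one stored number per segment."""
    return sum(seg.stored_ttf.get(term, 0) for seg in segments)


def live_ttf(segments, term):
    """Exact but costly: visit every live document and count the term."""
    return sum(
        tokens.count(term)
        for seg in segments
        for doc_id, tokens in seg.docs.items()
        if doc_id not in seg.deleted
    )


def optimize(segments):
    """Rewrite all live documents into one new segment, dropping deleted ones."""
    live = {
        doc_id: tokens
        for seg in segments
        for doc_id, tokens in seg.docs.items()
        if doc_id not in seg.deleted
    }
    return [Segment(live)]


segments = [
    Segment({1: ["solr", "lucene", "solr"], 2: ["solr"]}),
    Segment({3: ["solr", "index"]}),
]
before_delete = reported_ttf(segments, "solr")
segments[0].delete(1)
after_delete = reported_ttf(segments, "solr")     # unchanged by the delete
exact = live_ttf(segments, "solr")                # skips the deleted document
after_optimize = reported_ttf(optimize(segments), "solr")
print(before_delete, after_delete, exact, after_optimize)
```

In this model the cheap lookup only agrees with the exact count once the segments have been rewritten, which matches the behaviour described in the question.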

Thanks,
Shawn

Re: Total term frequency in solr includes deleted documents

Alexandre Rafalovitch
The merge policy would probably affect how often _some_ of the deleted
documents are purged, at a cost lower than a full optimization.
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments

But it is still not a 100% solution.
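A rough Python sketch of why a partial merge helps without being exact. It is a toy model: the selection rule below (merge the segments with the highest share of deleted documents) is an assumption made for illustration and is much simpler than a real merge policy, and all names are hypothetical.

```python
def make_segment(docs):
    """docs: {doc_id: [tokens]}. Term totals are computed once, when the segment is written."""
    ttf = {}
    for tokens in docs.values():
        for token in tokens:
            ttf[token] = ttf.get(token, 0) + 1
    return {"docs": docs, "deleted": set(), "ttf": ttf}


def deleted_ratio(seg):
    """Share of documents in the segment that are marked as deleted."""
    return len(seg["deleted"]) / len(seg["docs"]) if seg["docs"] else 0.0


def merge(segs):
    """Rewrite the live documents of the given segments into one new segment."""
    live = {
        doc_id: tokens
        for seg in segs
        for doc_id, tokens in seg["docs"].items()
        if doc_id not in seg["deleted"]
    }
    return make_segment(live)


def partial_merge(segments, how_many=2):
    """Merge only the segments with the most deletes; leave the rest as they are."""
    ranked = sorted(segments, key=deleted_ratio, reverse=True)
    return ranked[how_many:] + [merge(ranked[:how_many])]


def reported_ttf(segments, term):
    """Sum of the stored per-segment totals, deleted documents included."""
    return sum(seg["ttf"].get(term, 0) for seg in segments)


a = make_segment({1: ["solr", "solr"], 2: ["solr"]})
b = make_segment({3: ["solr"], 4: ["lucene"]})
c = make_segment({5: ["solr"], 6: ["solr"], 7: ["solr"], 8: ["index"]})
a["deleted"].add(1)
b["deleted"].add(3)
c["deleted"].add(5)

before = reported_ttf([a, b, c], "solr")
after = reported_ttf(partial_merge([a, b, c]), "solr")
print(before, after)
```

Here the true count over live documents is 3. The partial merge brings the reported value closer to it, but the deleted document left in the untouched segment is still counted.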

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

