[jira] [Commented] (SOLR-11733) json.facet refinement fails to bubble up some long tail (overrequested) terms?


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282264#comment-16282264 ]

Hoss Man commented on SOLR-11733:
---------------------------------



bq. I mentioned in SOLR-11729 the refinement algorithm being different (and for a single-level facet field, simpler).

FWIW, here's yonik's comment from SOLR-11729 which seems to specifically be on point for this issue (emphasis mine)...

bq. It seems like there are many logical ways to refine results - I originally thought about using refine:simple because I imagined we would have other implementations in the future.  Anyway, this one is the simplest one to think about and implement: *the top buckets to return for all facets are determined in the first phase.* The second phase only gets contributions from other shards for those buckets.

bq. i.e. simple refinement doesn't change the buckets you get back.

Ah ... ok.  I didn't realize the refinement approach in {{json.facet}} wasn't as sophisticated as the one in {{facet.field}}.

To summarize again (in my own words to ensure I'm understanding you correctly):

# do a first pass, requesting "#limit + #overrequest" buckets from each shard
#* use the accumulated results of the first pass to determine the "top #limit buckets"
# do a second pass, in which we back-fill the "top #limit buckets" with data from any shards that have not yet contributed.
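
The two passes summarized above can be sketched roughly as follows. This is only a toy simulation of my summary, not Solr's actual code: {{simple_refine}}, the in-memory shard dicts, and the tie-break on term are all assumptions made for illustration.

```python
from collections import Counter

def simple_refine(shards, limit, overrequest):
    """Toy model of 'simple' refinement. `shards` is a list of
    dicts mapping term -> count on that shard (hypothetical data)."""
    n = limit + overrequest
    by_count_then_term = lambda kv: (-kv[1], kv[0])

    # Pass 1: every shard returns only its top (limit + overrequest) buckets.
    first_pass = [dict(sorted(s.items(), key=by_count_then_term)[:n])
                  for s in shards]

    # Merge the partial results and fix the "top #limit buckets" right here.
    merged = Counter()
    for partial in first_pass:
        merged.update(partial)
    chosen = [t for t, _ in sorted(merged.items(), key=by_count_then_term)[:limit]]

    # Pass 2: back-fill the chosen buckets from shards that did not return them.
    refined = {}
    for term in chosen:
        total = merged[term]
        for shard, partial in zip(shards, first_pass):
            if term not in partial:
                total += shard.get(term, 0)
        refined[term] = total
    return sorted(refined.items(), key=by_count_then_term)

shards = [{"a": 10, "b": 5, "c": 4},
          {"a": 10, "d": 6, "c": 4}]
print(simple_refine(shards, limit=2, overrequest=0))  # [('a', 20), ('d', 6)]
print(simple_refine(shards, limit=2, overrequest=1))  # [('a', 20), ('c', 8)]
```

In this toy data {{c}} has a true total of 8, but with no overrequest it never reaches the merged first-pass results, so it cannot be chosen and the second pass never asks about it.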

In which case, in my example above, the reason {{yyy}} isn't refined, even though it has the same "first pass" total as {{x1}}, is because during the first pass {{x1}} sorts higher (due to a secondary tie breaker sort on the terms) pushing {{yyy}} out of the "top 6".  (likewise {{x2}} and {{tail}} are never considered because they were never part of the "top 6" even w/o a tie breaker sort)

Do I have that correct?

----

The bottom line, even if i don't fully grasp the current refinement mechanism you've described: you're saying the behavior i described with the above sample documents is *not* a bug, it's the intended/expected behavior of {{refine:true}} (aka {{refine:simple}}).

If so i'll edit this jira into an "Improvement" & update the summary/description to clarify how {{facet.pivot}} refinement differs from {{json.facet}} + {{refine:simple}} & leave open for future improvement

----
----

As far as discussion on potential improvements....


bq. From a correctness POV, smarter faceting is equivalent to increasing the overrequest amount... we still can't make guarantees.

Hmmm... I'm not sure that i agree with that assessment.  I guess "mathematically" speaking it's true that compared to a "smarter" refinement method, this "simple" refine method can produce equally "correct" top terms solely by increasing the overrequest amount -- but that's like saying we don't even need any refinement method at all as long as we specify an infinite amount of overrequest.

With the refinement approach used by {{facet.field}} (and {{facet.pivot}}) we *can* make guarantees about the correctness of the top terms -- regardless of if/how-much overrequesting is used -- _for any term that is in the "top buckets" of at least one shard_.

IIUC the current {{json.facet}} refinement method can't make _any_ similar guarantees at all, regardless of what (finite) overrequest value is specified ... but {{facet.field}} certainly can:

In {{facet.field}} today, if:
* A term is in the "top buckets" (limit + overrequest) returned by at least one shard
* And the sort value (ie: count) returned by that shard (along with the lowest sort-value/count returned by all other shards) indicates that the term _might_ be competitive relative to the other terms returned by other shards
...then that term is refined. That's a guarantee we can make.
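
A rough sketch of that "might be competitive" test, as i understand it. This is not the actual {{facet.field}} implementation: {{refinement_candidates}} and its inputs are hypothetical, and it assumes a sort on count, where a shard that did not return a term can hold at most its own lowest returned count for that term.

```python
def refinement_candidates(first_pass, limit):
    """`first_pass` is a list of dicts (one per shard) mapping
    term -> count for the top buckets that shard returned."""
    # Lowest count each shard returned: an upper bound on the count of
    # any term that shard did NOT return. (If a shard returned fewer
    # buckets than requested this overestimates, which only adds candidates.)
    floors = [min(p.values()) if p else 0 for p in first_pass]

    # Counts known so far, summed across the shards that returned the term.
    known = {}
    for partial in first_pass:
        for term, count in partial.items():
            known[term] = known.get(term, 0) + count

    # Count a term must be able to reach to enter the current top `limit`.
    ranked = sorted(known.values(), reverse=True)
    threshold = ranked[limit - 1] if len(ranked) >= limit else 0

    candidates = set()
    for term, count in known.items():
        # Best case: every shard that omitted the term holds it at that
        # shard's lowest returned count.
        best_case = count + sum(f for p, f in zip(first_pass, floors)
                                if term not in p)
        if best_case >= threshold:
            candidates.add(term)
    return candidates

first_pass = [{"a": 50, "b": 40, "x": 2},
              {"a": 50, "c": 30, "y": 1}]
print(sorted(refinement_candidates(first_pass, limit=2)))  # ['a', 'b']
print(sorted(refinement_candidates(first_pass, limit=3)))  # ['a', 'b', 'c']
```

Every term in the returned set would then be refined against the shards that have not reported it, which is what lets a term seen on only one shard still end up in the final top N.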

Meaning that even if you have shards with widely diff term stats (ie: time partitioned shards, or docs co-located due to multi-level compositeId, or block join, etc..) we can/will refine the top terms from each shard.

In {{facet.field}} the overrequest helps to:
* increase the scope of how deep we look to find the "top (candidate) terms" from each shard
* decrease the amount of data we have to request when refining

...but the *distribution* of terms across shards has very little (none? ... not certain) impact on the "correctness" of the "top N" in the aggregate.  Even if the first pass "top terms" from each shard are 100% unique, the *relative* "bottom" counts from each shard are considered before assuming that the "higher" counts should win -- meaning that if the shards have very different sizes, "top terms" from the smaller shards still have a chance of being considered as an "aggregated top term" as long as the "bottom count" from the (larger) shards is high enough to indicate that those (missing) terms might still be competitive.

But in the {{json.facet}} approach to refinement, IIUC: A term returned by only one shard won't be considered unless the count from _just that one shard_ is high enough to help it dominate over the *cumulative* counts from each of the top terms of the other shards.

Which seems to not only make the amount of overrequesting _much_ more important to consider when requesting refinement, but also requires you to consider the comparative *sizes* of the shards, and the potential term distribution variances between them.  
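
To make the shard-size concern concrete, here is a small made-up example: two large shards and one small shard, each returning its top 2 buckets. The term names and counts are invented, and the "best case" bound assumes a count sort as in the {{facet.field}} logic described above.

```python
# Hypothetical first-pass results (limit + overrequest = 2 per shard).
first_pass = [
    {"big1": 60, "big2": 50},  # large shard A
    {"big1": 60, "big2": 50},  # large shard B
    {"rare": 30, "big1": 5},   # small shard C, the only one returning "rare"
]
limit = 2

# Cumulative first-pass counts, which are all the "simple" approach looks at.
merged = {}
for partial in first_pass:
    for term, count in partial.items():
        merged[term] = merged.get(term, 0) + count

simple_top = sorted(merged, key=lambda t: (-merged[t], t))[:limit]
threshold = merged[simple_top[-1]]

# Lowest count returned by each shard bounds what an omitted term could have there.
floors = [min(p.values()) for p in first_pass]

def best_case(term):
    # Known count plus the most each omitting shard could still contribute.
    return merged[term] + sum(f for p, f in zip(first_pass, floors)
                              if term not in p)

print(simple_top)                        # ['big1', 'big2']
print(merged["rare"], threshold)         # 30 100
print(best_case("rare") >= threshold)    # True (30 + 50 + 50 = 130)
```

If {{rare}} actually had, say, 45 docs on each large shard (just under their cutoff of 50), its true total of 120 would beat {{big2}}, yet the cumulative first-pass comparison never gives it a chance, while the "best case" comparison flags it for refinement.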


Or to put it another way...

*TL;DR: IIUC, the amount of overrequest is _much_ more important to consider when requesting refinement on {{json.facet}} than it has ever been with {{facet.field}}, but when picking an overrequest amount for {{json.facet}}, people also need to consider the relative differences in _sizes_ of their shards, and the potential term distribution variances that may exist between them.*


(correct?)

----

bq. We could easily implement a mode for some field facets that does the "could this possibly be in the top N" logic to consider more buckets in the first phase... but only if it's not a sub-facet of another partial facet (a facet with something like a limit). If we're sorting by something other than count (like stddev for instance) then I guess we'd have to discard smart pruning and just try to get all buckets we saw in the first phase.

You lost me there.... If the sort is on some criteria other than count (ex: stddev), why can't we compute a hypothetical "best case" sort value for the candidates based on the pre-aggregation values returned by the "bottom" of the other shards (ex: the sum, sumsq, and num_values already needed from each shard for the aggregated stddev) in combination with the values from the one shard that *does* have that term?

bq. If a partial facet is a sub-facet of another partial-facet, the logic of what one can exclude seems to get harder, ...

You _completely_ lost me there ... I *think* maybe you're alluding to the need for multi-stage refinement depending on how deep the nested facets go?  which FWIW is exactly what {{facet.pivot}} does today.




> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11733
>                 URL: https://issues.apache.org/jira/browse/SOLR-11733
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Facet Module
>            Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in the "top n + overrequest" for at least 1 shard aren't getting refined and included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to reproduce that i'll post in a comment shortly



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
