[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541000#comment-16541000 ]

Lucene/Solr QA commented on SOLR-12343:
---------------------------------------

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 4m 7s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 2m 41s | the patch passed |
| +1 | javac | 2m 41s | the patch passed |
| +1 | Release audit (RAT) | 2m 55s | the patch passed |
| +1 | Check forbidden APIs | 2m 41s | the patch passed |
| -1 | Validate source patterns | 2m 41s | Validate source patterns validate-source-patterns failed |
| -1 | Validate ref guide | 2m 41s | Validate source patterns validate-source-patterns failed |
|| || || || Other Tests ||
| -1 | unit | 94m 19s | core in the patch failed. |
| | | 104m 44s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
|   | solr.cloud.api.collections.ShardSplitTest |
|   | solr.search.facet.TestJsonFacetRefinement |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931226/SOLR-12343.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  validatesourcepatterns  validaterefguide  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / fe180bb |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 8 2015 |
| Default Java | 1.8.0_172 |
| Validate source patterns | https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-validate-source-patterns-root.txt |
| Validate ref guide | https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-validate-source-patterns-root.txt |
| unit | https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-unit-solr_core.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/143/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/143/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Assignee: Yonik Seeley
>            Priority: Major
>         Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, __incomplete_processEmpty_microfix.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause _refined_ buckets to be "bumped out" of the topN based on the refined counts/stats, depending on the sort. This causes _unrefined_ buckets originally discounted in phase#2 to bubble up into the topN and be returned to clients *with inaccurate counts/stats*.
> The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 'count asc'}} facet:
>  * Assume shard1 returns termX & termY in phase#1 because they have very low shard1 counts
>  ** but they are *not* returned at all by shard2, because these terms both have very high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut-off" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the client have counts/stats that are the accumulation of all shards, but termY only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. Additional overrequest just increases the number of "extra" terms needed in the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) asc|desc}} , etc...
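
The scenario quoted above can be reproduced with a small toy model. The sketch below is not Solr code: the function name `simple_refine_count_asc`, the shard dictionaries, and the extra terms `termP` and `termQ` are all hypothetical, and the two-phase logic is a simplified assumption about how a coordinator merges, refines, and then re-sorts buckets for {{sort: 'count asc'}}.

```python
from collections import defaultdict


def simple_refine_count_asc(shards, limit, overrequest=0):
    """Toy model (an assumption, not Solr's implementation) of two-phase
    facet refinement for sort 'count asc'."""
    # Phase 1: each shard reports its lowest-count terms.
    known = defaultdict(dict)  # term -> {shard index: count}
    for i, shard in enumerate(shards):
        ranked = sorted(shard.items(), key=lambda kv: (kv[1], kv[0]))
        for term, count in ranked[: limit + overrequest]:
            known[term][i] = count

    def total(term):
        # Sum of the counts the coordinator currently knows about.
        return sum(known[term].values())

    # The coordinator picks the topN candidates from the phase 1 data.
    candidates = sorted(known, key=lambda t: (total(t), t))[:limit]

    # Phase 2: refine only those candidates against the shards
    # that did not report them in phase 1.
    for term in candidates:
        for i, shard in enumerate(shards):
            if i not in known[term]:
                known[term][i] = shard.get(term, 0)

    # The problematic step: re-sort ALL known buckets, refined or not.
    final = sorted(known, key=lambda t: (total(t), t))[:limit]
    # Each result is (term, known count, fully refined?).
    return [(t, total(t), len(known[t]) == len(shards)) for t in final]


# Hypothetical data: termX/termY are rare on shard1 but common on shard2.
shard1 = {"termX": 1, "termY": 2, "termP": 70, "termQ": 80}
shard2 = {"termX": 100, "termY": 200, "termP": 5, "termQ": 6}

result = simple_refine_count_asc([shard1, shard2], limit=1, overrequest=1)
print(result)  # [('termY', 2, False)]
```

In this toy data set termX is refined to a total of 101 and drops out of the top 1, so termY is returned with its shard1-only count of 2 even though its count across both shards is 202.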


