[jira] [Commented] (SOLR-11733) json.facet refinement fails to bubble up some long tail (overrequested) terms?

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281188#comment-16281188 ]

Hoss Man commented on SOLR-11733:
---------------------------------


Steps to reproduce:

*Build Collection & Index Some Data*

{noformat}
# start up a small solr cluster
$ bin/solr -e cloud -noprompt
...

# NOTE: we're ignoring the getting started collection that was created
# we'll make our own using the implicit router with one shard per node

$ curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=test&router.name=implicit&numShards=2&shards=shardX,shardY'
...

# Index 5 docs to *each* shard with:
# - the same "top 5" terms in all 5 docs on both shards
# - a common "tail" term in 2 docs on *both* shards
#   - for a total of 4 docs
# - some shard specific "distracting" terms that each appear in only 3 docs, and always on a single shard
#   - On the 1st shard: there are 5 of these terms, such that 'tail' will be the #11 ranked term (on this shard)
#   - On the 2nd shard: 'tail' will be the #7 ranked term (on this shard)

$ curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/test/update?commit=true' --data-binary '[
{ "id": "1_1", "foo_t": "a1 a2 a3 a4 a5   x1 x2 x3 x4 x5" },
{ "id": "1_2", "foo_t": "a1 a2 a3 a4 a5   x1 x2 x3 x4 x5" },
{ "id": "1_3", "foo_t": "a1 a2 a3 a4 a5   x1 x2 x3 x4 x5" },
{ "id": "1_4", "foo_t": "a1 a2 a3 a4 a5                   tail" },
{ "id": "1_5", "foo_t": "a1 a2 a3 a4 a5                   tail" }
]'
...
$ curl -H 'Content-Type: application/json' 'http://localhost:7574/solr/test/update?commit=true' --data-binary '[
{ "id": "2_1", "foo_t": "a1 a2 a3 a4 a5   yyy" },
{ "id": "2_2", "foo_t": "a1 a2 a3 a4 a5   yyy" },
{ "id": "2_3", "foo_t": "a1 a2 a3 a4 a5   yyy" },
{ "id": "2_4", "foo_t": "a1 a2 a3 a4 a5        tail" },
{ "id": "2_5", "foo_t": "a1 a2 a3 a4 a5        tail" }
]'
...
{noformat}
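
The rankings claimed in the comments above can be double checked offline with a small Python sketch. This is not part of the original repro: it assumes plain whitespace tokenization of {{foo_t}} and that ties on count are broken by term order (both are assumptions, chosen because they match the responses shown below). The helper names are made up for illustration.

```python
from collections import Counter

# Documents copied from the two update requests above (ids omitted).
SHARD_1 = ["a1 a2 a3 a4 a5 x1 x2 x3 x4 x5"] * 3 + ["a1 a2 a3 a4 a5 tail"] * 2
SHARD_2 = ["a1 a2 a3 a4 a5 yyy"] * 3 + ["a1 a2 a3 a4 a5 tail"] * 2


def doc_counts(docs):
    """Count how many docs contain each term (assumes whitespace tokenization)."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.split()))
    return counts


def rank_of(term, counts):
    """1-based rank of a term, sorting by count desc, then term asc (assumed tie-break)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ordered].index(term) + 1


shard1 = doc_counts(SHARD_1)
shard2 = doc_counts(SHARD_2)
overall = shard1 + shard2

print("tail rank on 1st shard:", rank_of("tail", shard1))
print("tail rank on 2nd shard:", rank_of("tail", shard2))
print("tail rank overall:", rank_of("tail", overall), "with count", overall["tail"])
```

Under those assumptions 'tail' should come out as #11 on the 1st shard, #7 on the 2nd shard, and #6 overall with a count of 4, which is what the sanity check queries below are meant to confirm against a real cluster.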


*Sanity Check Queries*

With an excessive 'limit' or 'overrequest', we can verify that 'tail' is the #6 ranked term overall (even with refinement explicitly disabled).

{noformat}
$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:7,overrequest:100,refine:false}}'
...
  "response":{"numFound":10,"start":0,"maxScore":1.0,"docs":[]
  },
  "facets":{
    "count":10,
    "foo":{
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"tail",
          "count":4},
        {
          "val":"x1",
          "count":3}]}}}

$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:100,overrequest:0,refine:false}}'
...
  "facets":{
    "count":10,
    "foo":{
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"tail",
          "count":4},
        {
          "val":"x1",
          "count":3},
        ...
{noformat}

Likewise, if we query each shard individually (w/ {{distrib=false}} ) we confirm that the "tail" term shows up at its expected ranking...

{noformat}
$ curl http://localhost:8983/solr/test/select -d 'distrib=false&q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:11}}'
...
      "buckets":[{
          "val":"a1",
          "count":5},
        {
          "val":"a2",
          "count":5},
        {
          "val":"a3",
          "count":5},
        {
          "val":"a4",
          "count":5},
        {
          "val":"a5",
          "count":5},
        {
          "val":"x1",
          "count":3},
        {
          "val":"x2",
          "count":3},
        {
          "val":"x3",
          "count":3},
        {
          "val":"x4",
          "count":3},
        {
          "val":"x5",
          "count":3},
        {
          "val":"tail",
          "count":2}]}}}

$ curl http://localhost:7574/solr/test/select -d 'distrib=false&q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:7}}'
...
      "buckets":[{
          "val":"a1",
          "count":5},
        {
          "val":"a2",
          "count":5},
        {
          "val":"a3",
          "count":5},
        {
          "val":"a4",
          "count":5},
        {
          "val":"a5",
          "count":5},
        {
          "val":"yyy",
          "count":3},
        {
          "val":"tail",
          "count":2}]}}}
{noformat}



*Queries that Fail*

With refinement, a limit of 6 (plus the implicit default overrequest) should be enough to find 'tail' -- but it's not included in the response from this query...

{noformat}
$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:6,refine:true}}'
...
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"x1",
          "count":3}]}}}
{noformat}

Even if we assume the implicit overrequest calculation may be broken, a "limit" of 6 + an explicit overrequest of "1" should be enough to discover 'tail' on the 2nd shard, and (w/refinement) it should bubble up into the top 6 -- but again, this {{limit:6,overrequest:1}} query doesn't find tail...

{noformat}
$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}'
...
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"x1",
          "count":3}]}}}
{noformat}

Here are the log messages from each node when the last request ( {{limit:6,overrequest:1,refine:true}} ) was executed...

{noformat}
INFO  - 2017-12-07 00:27:37.821; [c:test s:shardY r:core_node4 x:test_shardY_replica_n2] org.apache.solr.core.SolrCore; [test_shardY_replica_n2]  webapp=/solr path=/select params={df=_text_&distrib=false&_facet_={}&fl=id&fl=score&shards.purpose=1048580&start=0&fsv=true&shard.url=http://127.0.1.1:8983/solr/test_shardY_replica_n2/&rows=0&version=2&q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&NOW=1512606457819&isShard=true&wt=javabin} hits=5 status=0 QTime=0

==> example/cloud/node2/logs/solr.log <==
INFO  - 2017-12-07 00:27:37.821; [c:test s:shardX r:core_node3 x:test_shardX_replica_n1] org.apache.solr.core.SolrCore; [test_shardX_replica_n1]  webapp=/solr path=/select params={df=_text_&distrib=false&_facet_={}&fl=id&fl=score&shards.purpose=1048580&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/test_shardX_replica_n1/&rows=0&version=2&q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&NOW=1512606457819&isShard=true&wt=javabin} hits=5 status=0 QTime=0
INFO  - 2017-12-07 00:27:37.823; [c:test s:shardX r:core_node3 x:test_shardX_replica_n1] org.apache.solr.core.SolrCore; [test_shardX_replica_n1]  webapp=/solr path=/select params={df=_text_&distrib=false&_facet_={"refine":{"foo":{"_l":["x1"]}}}&shards.purpose=2097152&shard.url=http://127.0.1.1:7574/solr/test_shardX_replica_n1/&rows=0&version=2&q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&NOW=1512606457819&isShard=true&facet=false&wt=javabin} hits=5 status=0 QTime=0
INFO  - 2017-12-07 00:27:37.824; [c:test s:shardX r:core_node3 x:test_shardX_replica_n1] org.apache.solr.core.SolrCore; [test_shardX_replica_n1]  webapp=/solr path=/select params={q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&rows=0&wt=json} hits=10 status=0 QTime=5
{noformat}


...note that this appears to show:

* an explicit {{"refine"}} request for " {{"_l":\["x1"]}} " logged by port 7574
** port 7574 doesn't have the term "x1" at all so would not have returned it in its initial results
* *NO* indication of attempting to refine "x2", "yyy", or "tail"
** this in spite of the fact that they should have all been in the "top 6+1" from one shard, with counts making them competitive in the final results

What strikes me as most odd is that even if there was some sort of "off by one" error preventing "x2" & "tail" (which should have been the "last" bucket from each of their respective shards) from being refined, "yyy" would have had the exact same count, and been in the exact same (shard specific) bucket as "x1" -- so why isn't there a request to port #8983 to refine it?!  How is it different from "x1" ???
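
To make the expectation behind those questions concrete, here is a naive two-phase model in Python. It is only a sketch of what this comment expects to happen, NOT a description of Solr's actual refinement code: it assumes every shard returns its top {{limit + overrequest}} buckets, that every term seen on any shard becomes a refinement candidate, and that ties are broken by term order. The function names are hypothetical.

```python
from collections import Counter

LIMIT, OVERREQUEST = 6, 1

# Per-shard doc counts implied by the indexed data above.
SHARD_COUNTS = {
    "shard1": {"a1": 5, "a2": 5, "a3": 5, "a4": 5, "a5": 5,
               "x1": 3, "x2": 3, "x3": 3, "x4": 3, "x5": 3, "tail": 2},
    "shard2": {"a1": 5, "a2": 5, "a3": 5, "a4": 5, "a5": 5,
               "yyy": 3, "tail": 2},
}


def top_n(counts, n):
    """Top n terms by count desc, then term asc (assumed tie-break)."""
    return dict(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n])


def naive_refined_facet(shard_counts, limit, overrequest):
    # Phase 1: each shard reports only its own top (limit + overrequest) buckets.
    phase1 = {s: top_n(c, limit + overrequest) for s, c in shard_counts.items()}
    candidates = set().union(*phase1.values())

    # Phase 2 (naive assumption): ask each shard about every candidate it did not report.
    to_refine = {s: sorted(candidates - set(p)) for s, p in phase1.items()}

    # After refinement every candidate has an exact count from every shard.
    merged = Counter()
    for counts in shard_counts.values():
        for term in candidates:
            merged[term] += counts.get(term, 0)
    return to_refine, list(top_n(merged, limit))


to_refine, buckets = naive_refined_facet(SHARD_COUNTS, LIMIT, OVERREQUEST)
print("refinement requests per shard:", to_refine)
print("final buckets:", buckets)
```

In this model the 1st shard would be asked to refine "tail" and "yyy", the 2nd shard would be asked about "x1" and "x2", and 'tail' would end up as the 6th bucket. The logs above show only a single refinement request for "x1", so the real candidate selection is evidently doing something narrower than this.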



> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11733
>                 URL: https://issues.apache.org/jira/browse/SOLR-11733
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in the "top n + overrequest" for at least 1 shard, aren't getting refined and included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to reproduce that I'll post in a comment shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
