Nested facet complete wrong counts

kennyk
Hi all,

We are running some tests in Solr 6.6 with the JSON Facet API, and we get
completely wrong counts for some combinations of facets.

Setting: We have a set of fields for the 376k documents matched by our query
(120M documents in total). We work with 2 shards. We first facet over
the first field and keep these counts; we subsequently do a nested
faceting over both fields.

Then we add up the sub-facet counts for each parent bucket and expect to get
(approximately) the same numbers back. Sometimes we get rounding errors
of about 1% difference. But on other occasions the sums seem to be way off.

for example

Gender (3 values) Country (211 values)
16226 - 18424 = -2198 (-13.5461604832%)
282854 - 464387 = -181533 (-64.1790464338%)
40489 - 47902 = -7413 (-18.3086764306%)
36672 - 49749 = -13077 (-35.6593586387%)

Gender (3 values)  Status (17 Values)
16226 - 16273 = -47 (-0.289658572661%)
282854 - 435974 = -153120 (-54.1339348215%)
40489 - 49925 = -9436 (-23.305095211%)
36672 - 54019 = -17347 (-47.3031195462%)

...
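
For reference, each line above reads as: parent count - sum of sub-facet counts = difference (difference as a percentage of the parent count). A minimal Python sketch of that arithmetic, using the first Gender x Country row; the function name is only illustrative:

```python
def discrepancy(parent_count, subfacet_sum):
    """Return (difference, difference as a percentage of the parent count)."""
    diff = parent_count - subfacet_sum
    return diff, 100.0 * diff / parent_count


diff, pct = discrepancy(16226, 18424)
print(diff, round(pct, 2))  # -2198 -13.55
```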

These are the typical requests we submit. Note that we use refine
and an overrequest, but in the case of Gender vs Request we should be
querying all the buckets anyway.

{"wt":"json","rows":0,"json.facet":"{\"Status_sfhll\":\"hll(Status_sf)\",\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}","q":"*:*","fq":["type:\"something\""]}

{"wt":"json","rows":0,"json.facet":"{\"Gender_sf\":{\"type\":\"terms\",\"field\":\"Gender_sf\",\"missing\":true,\"refine\":true,\"overrequest\":10,\"limit\":10,\"offset\":0,\"facet\":{\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}},\"Gender_sfhll\":\"hll(Gender_sf)\"}","q":"*:*","fq":["type:\"something\""]}
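
The escaped json.facet strings are hard to read. As a sketch, the nested request above can be built as a plain Python dictionary and serialized with the standard json module. The helper name terms_facet is an assumption made for illustration; the parameter values are copied from the request above.

```python
import json


def terms_facet(field, limit, overrequest, sub_facets=None):
    """Build a terms facet definition with the options used in this thread."""
    facet = {
        "type": "terms",
        "field": field,
        "missing": True,
        "refine": True,
        "overrequest": overrequest,
        "limit": limit,
        "offset": 0,
    }
    if sub_facets:
        # Nested facets go under the "facet" key of the parent facet.
        facet["facet"] = sub_facets
    return facet


json_facet = {
    "Gender_sf": terms_facet(
        "Gender_sf",
        limit=10,
        overrequest=10,
        sub_facets={"Status_sf": terms_facet("Status_sf", limit=50, overrequest=50)},
    ),
    "Gender_sfhll": "hll(Gender_sf)",
}

params = {
    "wt": "json",
    "rows": 0,
    "json.facet": json.dumps(json_facet),  # sent as an escaped JSON string
    "q": "*:*",
    "fq": ['type:"something"'],
}
print(json.dumps(params, indent=2))
```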

Is this a known bug? Would switching to the old facet API resolve this? Are
there other parameters we are missing?


Thanks


kenny



Re: Nested facet complete wrong counts

Amrit Sarkar
Kenny,

This is known behavior in multi-shard collections where the field values
belonging to the same facet don't reside on the same shard. Yonik Seeley has
improved the JSON Facet feature by introducing the "overrequest" and "refine"
parameters.

Kindly check out these Jira issues:
https://issues.apache.org/jira/browse/SOLR-7452
https://issues.apache.org/jira/browse/SOLR-9432

Relevant blog: https://medium.com/@abb67cbb46b/1acfa77cd90c


Re: Nested facet complete wrong counts

Yonik Seeley
In reply to this post by kennyk
I do notice you are using hll (hyper-log-log) which is a distributed
cardinality *estimate* : https://en.wikipedia.org/wiki/HyperLogLog

-Yonik



Re: Nested facet complete wrong counts

kennyk
In reply to this post by Amrit Sarkar
Thank you. But as I showed in my example, we used refine, and overrequest is
not strictly needed because we need all buckets anyway. But that can hardly
explain an error of 60%, right?


Re: Nested facet complete wrong counts

kennyk
In reply to this post by Yonik Seeley
Hi Yonik,

I am aware that hll is an estimate, but we don't use the hll as a
baseline for comparison. We ask for the values of one facet (for example
Gender) and store the count for each bucket. Next we do another request,
this time for a facet and a sub-facet (for example Gender x Type). We sum
all the values of Type with the same Gender and compare these sums with the
numbers from the previous request. These numbers differ by 60%, which is quite
worrying. It doesn't always happen; it depends on the facet, but still.
Have you had any reports like this?

Thanks

Kenny
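
The comparison described above can be sketched as follows. The response layout is an assumption based on the usual JSON Facet API output (a "buckets" list of objects with "val" and "count", plus an optional "missing" object); the function name and the sample numbers are made up for illustration.

```python
def subfacet_sums(parent_facet, sub_name):
    """Map each parent bucket value to the sum of its sub-facet counts.

    The sub-facet's "missing" bucket is included when present.
    """
    sums = {}
    for bucket in parent_facet.get("buckets", []):
        sub = bucket.get(sub_name, {})
        total = sum(b["count"] for b in sub.get("buckets", []))
        total += sub.get("missing", {}).get("count", 0)
        sums[bucket["val"]] = total
    return sums


# Made-up response fragment for a Gender x Type request.
nested = {
    "buckets": [
        {"val": "female", "count": 10,
         "Type": {"buckets": [{"val": "a", "count": 6}, {"val": "b", "count": 3}],
                  "missing": {"count": 1}}},
        {"val": "male", "count": 8,
         "Type": {"buckets": [{"val": "a", "count": 7}, {"val": "b", "count": 4}],
                  "missing": {"count": 0}}},
    ]
}

sums = subfacet_sums(nested, "Type")
for bucket in nested["buckets"]:
    # Prints the parent value, the parent count, and the summed sub-facet counts.
    print(bucket["val"], bucket["count"], sums[bucket["val"]])
```

In this made-up data the "female" bucket adds up exactly, while the "male" bucket shows a sum larger than its parent count.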


Re: Nested facet complete wrong counts

Yonik Seeley
On Sat, Nov 11, 2017 at 9:18 AM, Kenny Knecht <[hidden email]> wrote:

> Hi Yonik,
>
> I am aware of the estimate on the hll. But we don't use the hll as a
> baseline for comparison. We ask the values for one facet (for example
> Gender). We store these counts for each bucket. Next we do another request.
> This time for a facet and a subfacet (for example Gender x Type). We sum
> all the values of Type with the same Gender and compare these sums with the
> numbers of previous request. These numbers differ by 60% which is quite
> worrying. Not always it depends on the facet, but still.
> Did you get any reports like this?

Nope.  The counts for the scenario you describe should add up exactly
for single-valued fields.  Are you sure you're adding in the "missing"
bucket?

When you sum up the sub-facets on Type, do you get a value under or
over the counts on the parent facet?
Verify that Type is single-valued.  One would not expect facets on a
multi-valued field to add up in the same way.
Verify that you're getting all of the Type constraints by using a
limit of -1 on that sub-facet.

-Yonik
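
A small, Solr-free Python sketch of why these checks matter: with a single-valued sub-facet field (and the missing bucket counted) the sub-facet counts add up to the parent count, while a multi-valued field counts a document once per value. The documents, field names, and helper function are made up for illustration.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count documents per value of `field`; None collects docs without a value."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field) or [None]
        for value in set(values):  # a document counts once per distinct value
            counts[value] += 1
    return counts


docs = [
    {"gender": ["f"], "status": ["active"]},
    {"gender": ["f"], "status": ["active", "retired"]},  # multi-valued
    {"gender": ["f"]},                                   # no status: "missing"
    {"gender": ["m"], "status": ["retired"]},
]

females = [d for d in docs if "f" in d.get("gender", [])]
parent_count = len(females)
sub_counts = facet_counts(females, "status")

print(parent_count)              # 3
print(sum(sub_counts.values()))  # 4
```

Here the sub-facet sum exceeds the parent count because the second document is counted under both status values; leaving out the None ("missing") bucket would instead lower the sum.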

Re: Nested facet complete wrong counts

Yonik Seeley
Also, if you're looking at all constraints, you shouldn't need refine:true.
But if you do need it, note that it was only added in Solr 7.0 (and I see you're
using 6.6).

-Yonik



Re: Nested facet complete wrong counts

kennyk
AAAARRGG - [banging my head against the wall]
Of course. You are absolutely right about the multi-valuedness.
Thanks for the 7.0 hint. That gives us a reason to upgrade.
Do we need to re-index when upgrading?

Kenny



Kenny Knecht, PhD
CTO and technical lead
+32 486 75 66 16
[hidden email]
www.ontoforce.com
Meetdistrict, Ottergemsesteenweg-Zuid 808, 9000 Gent, Belgium
CIC, One Broadway, MA 02142 Cambridge, United States
