[jira] [Comment Edited] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281837#comment-16281837 ]

Yonik Seeley edited comment on SOLR-11725 at 12/7/17 2:03 PM:
--------------------------------------------------------------

+1 for changing... "N-1" is the more standard form.

bq. Attaching a trivial patch containing the change Hoss spelled out above.
Note that the accumulator needs to be changed as well for non-distributed stddev calculation.  The Merger is not used in that case.

This does bring up the question of what to do when N=1 (or N=0 for that matter).
Standard deviation of a population of N=1 is 0, but the sample standard deviation of N=1 is undefined, since the corrected formula's (N-1) denominator is zero.

When N=0, the current code produces 0, but I don't think that's the best choice.
In general we've been moving toward omitting undefined results.  Stats like min() and max() already do this.

TestJsonFacets has this:
{code}
    // stats at top level, matching documents, but no values in the field
    // NOTE: this represents the current state of what is returned, not the ultimate desired state.
    client.testJQ(params(p, "q", "id:3"
        , "json.facet", "{ sum1:'sum(${num_d})', sumsq1:'sumsq(${num_d})', avg1:'avg(${num_d})', min1:'min(${num_d})', max1:'max(${num_d})'" +
            ", numwhere:'unique(${where_s})', unique_num_i:'unique(${num_i})', unique_num_d:'unique(${num_d})', unique_date:'unique(${date})'" +
            ", where_hll:'hll(${where_s})', hll_num_i:'hll(${num_i})', hll_num_d:'hll(${num_d})', hll_date:'hll(${date})'" +
            ", med:'percentile(${num_d},50)', perc:'percentile(${num_d},0,50.0,100)', variance:'variance(${num_d})', stddev:'stddev(${num_d})' }"
        )
        , "facets=={count:1 " +
            ",sum1:0.0," +
            " sumsq1:0.0," +
            " avg1:0.0," +   // TODO: undesirable. omit?
            // " min1:'NaN'," +
            // " max1:'NaN'," +
            " numwhere:0," +
            " unique_num_i:0," +
            " unique_num_d:0," +
            " unique_date:0," +
            " where_hll:0," +
            " hll_num_i:0," +
            " hll_num_d:0," +
            " hll_date:0," +
            " variance:0.0," +
            " stddev:0.0" +
            " }"
    );
{code}

I'd be tempted to treat N=0 and N=1 as undefined, and omit them.  Note that we need to be careful to have the N=1 case still contribute to a distributed bucket, even if it's undefined locally!
In the distributed case, N=0 is normally handled generically for anything that doesn't produce a result (they are "missing"/null and are sorted after anything that has a value).  Things may work if we make getDouble() return 0 (for sorting), but then override getMergedResult() to return null when N <= 1.
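A minimal sketch of that idea (class and method names here are illustrative only, not Solr's actual AggValueSource/FacetMerger API): accumulate sum/sumSq/count so an N=1 shard still contributes to the distributed bucket, return 0.0 from the sort accessor so sorting stays well-defined, but return null from the merged result when N <= 1 so the undefined stat is omitted.

```java
// Hypothetical sketch of the merge-then-omit approach described above.
// Names are illustrative; this is not Solr's real merger API.
class StddevMerger {
    double sum, sumSq;
    long count;

    // Each shard (even one with N=1) contributes its partial sums.
    void merge(double shardSum, double shardSumSq, long shardCount) {
        sum += shardSum;
        sumSq += shardSumSq;
        count += shardCount;
    }

    // Used for sorting: always a defined number, 0 when the stat is undefined.
    double getDouble() {
        return count > 1 ? stddev() : 0.0;
    }

    // Final merged result: null (i.e. omitted from the response) when N <= 1.
    Double getMergedResult() {
        return count > 1 ? stddev() : null;
    }

    // Corrected sample stddev: (N - 1) denominator.
    private double stddev() {
        return Math.sqrt((count * sumSq - sum * sum) / (count * (count - 1.0)));
    }

    public static void main(String[] args) {
        StddevMerger m = new StddevMerger();
        m.merge(3.0, 9.0, 1);           // one shard, single value 3.0
        System.out.println(m.getMergedResult()); // null: undefined for N=1
        m.merge(5.0, 25.0, 1);          // second shard, single value 5.0
        System.out.println(m.getMergedResult()); // defined now that N=2
    }
}
```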

Oh, and whatever treatment we give stddev(), we should presumably give to variance()?

When thinking about the sorting, it occurs to me that there is probably a minor sorting bug.
N=0 is treated the same as a true 0 standard deviation, but they shouldn't sort equivalently.
But we still have a sorting bug locally anyway with respect to sort-missing-last: SOLR-10618



> json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>         Attachments: SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} calculations done between the two code paths can be measurably different, and realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat nerds I know online to help me sanity check whether these equations (somehow) reduced to each other (in which case the discrepancies I was seeing in my results might have just been due to the order of intermediate operation execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, and explained that the code JSON Faceting is using is equivalent to the "Uncorrected sample stddev" formula, while StatsComponent's code is equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and pressed them to explain which one was the "most canonical" (or "most generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a Solr result set (or against a sub-set of results defined by a facet constraint) is probably to compare that distribution to a different Solr result set (or to compare N sub-sets of results defined by N facet constraints)
> * the size of the sets of documents (values) can be relatively small when computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected sample stddev" equation.
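The two formulas quoted in the description can be compared directly with a small standalone sketch (not Solr code; the helper names here are illustrative). At small N the results differ measurably, which is exactly the discrepancy described above:

```java
// Compare the two stddev formulas quoted in the issue description,
// given the same precomputed sum, sum-of-squares, and count.
public class StddevCompare {
    // json.facet's current formula: uncorrected sample stddev (N denominator)
    static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // StatsComponent's formula: corrected sample stddev (N - 1 denominator)
    static double corrected(double sum, double sumSq, long count) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0)));
    }

    public static void main(String[] args) {
        double[] vals = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        double sum = 0, sumSq = 0;
        for (double v : vals) { sum += v; sumSq += v * v; }
        // For N=8 the two results already differ noticeably:
        System.out.println(uncorrected(sum, sumSq, vals.length)); // 2.0
        System.out.println(corrected(sum, sumSq, vals.length));   // ~2.138
    }
}
```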



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
