Similarity plugins which are normalized

Similarity plugins which are normalized

Tanya Bompi
Hi,
  As I am tuning the relevancy of my query parser, I see that two different
queries with phrase matches get very different scores, primarily influenced
by the term frequency component. Since I use a threshold on the Solr score
to filter the results for a matched record, I need a somewhat normalized
score.
Are there any similarity classes that are more suitable for my needs?

Thanks,
Tanu

Re: Similarity plugins which are normalized

Doug Turnbull
The usual advice is that relevance scores don’t exist on a scale where a
threshold is useful, as they are just heuristics used for ranking, not a
confidence level.

I would instead focus on which attributes of a document make it relevant
or not (e.g. a strong match in certain fields).

A couple of things prevent field scores from being comparable:
- document frequency differs per field
- field length / average field length differs per field
- the typical term frequency of a term differs per field
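
To make the list above concrete, here is a minimal sketch of a textbook BM25 term score in Python. It is not Solr's exact implementation, and the field names, corpus statistics, and the `bm25_term_score` helper are all hypothetical values chosen for illustration. It shows the same term with the same term frequency scoring very differently in two fields purely because the per-field statistics differ.

```python
import math

def bm25_term_score(tf, doc_freq, num_docs, field_len, avg_field_len,
                    k1=1.2, b=0.75):
    """Textbook BM25 score for one term in one field (illustrative only)."""
    # Rarer terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Fields longer than average are penalized, shorter ones are rewarded.
    length_norm = 1 - b + b * field_len / avg_field_len
    # Term frequency saturates instead of growing linearly.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical statistics: the term is rare in "title" and common in "body".
title_score = bm25_term_score(tf=2, doc_freq=50, num_docs=10_000,
                              field_len=8, avg_field_len=6)
body_score = bm25_term_score(tf=2, doc_freq=2_000, num_docs=10_000,
                             field_len=400, avg_field_len=300)

print(f"title: {title_score:.3f}  body: {body_score:.3f}")
```

Both calls use the same term frequency and the same length ratio, yet the scores differ by a wide margin because of document frequency alone, which is why a single absolute threshold rarely transfers across fields or queries.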

You might find this article useful:

https://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/

Doug

On Thu, Nov 29, 2018 at 4:44 PM Tanya Bompi <[hidden email]> wrote:

> Are there any similarity classes that are more suitable to my needs?

--
Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC
<http://opensourceconnections.com> | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>

Re: Similarity plugins which are normalized

Tanya Bompi
Thanks a lot, Doug. Maybe giving more weight to certain fields, in
conjunction with the overall match, is the way to go.

Tanu
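
As a rough sketch of what weighting certain fields could look like, the Python snippet below builds request parameters for Solr's eDisMax query parser. `defType`, `qf`, `pf`, and `fl` are standard eDisMax/common parameters, but the field names (`title`, `body`), the boost values, and the `build_solr_params` helper are assumptions made for illustration and would need to match the actual schema.

```python
from urllib.parse import urlencode

def build_solr_params(user_query):
    """Build eDisMax parameters that favor matches in a hypothetical title field."""
    return {
        "defType": "edismax",   # use the Extended DisMax query parser
        "q": user_query,
        "qf": "title^5 body",   # assumed fields: title matches count 5x body matches
        "pf": "title^10",       # extra boost when the whole phrase appears in title
        "fl": "id,score",       # return the score only for inspection and ranking
    }

params = build_solr_params("similarity plugins")
# The path is the conventional /select handler; host and collection are omitted.
print("/select?" + urlencode(params))
```

The boosts change ranking only; they do not turn the score into a confidence value, so any accept/reject decision would still be better based on which fields matched than on the raw number.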