Solr sort by score not working properly

Solr sort by score not working properly

Prathyusha Kondeti
Hi,
I am using *Solr v6.2.1*. We are not getting the results we expect when
using "sort=score desc".

Let's assume we have the following documents in our index:

[{ "id": "1", "content": ["*java* developer"] },
 { "id": "2", "content": ["*Java* is object oriented.*Java* robust language.Core *java* "] },
 { "id": "3", "content": ["*java* is platform independent. *Java* language."] }]

The content field is defined as a multivalued field in the schema:

<field name="content" type="text_general" *multiValued*="true"
indexed="true" stored="true"/>

When I search for java using the query below:

curl "http://localhost:8983/solr/test/select?fl=score,id&q=(java)&wt=json&sort=score%20desc"

I expect the document with *id 2* to come first, as it contains more
matches for java. But Solr is giving inconsistent results.

Please suggest why I am not getting the desired results.



--

Thanks & Regards,

Prathyusha Kondeti  | Software Engineer

Software Development


CEIPAL Solutions Pvt Ltd

Prashanthi Towers, 4th Floor, Road No: 92, Jubilee Hills, Hyderabad -
500033, INDIA

[O] +91-40-43515100  [M] +91 9848143513  [E]  [hidden email]  [W]
www.ceipal.com

Re: Solr sort by score not working properly

Alessandro Benedetti
Hi,
if you add the param debugQuery=on to the request, you will see what
happens under the hood and understand how the score is assigned.

If you are new to the Lucene Similarity that this Solr version uses (BM25[1]),
you can paste the debug score response here and we can briefly explain it to
you the first time.

First of all, we are not even sure that the content field is actually used
for scoring in your case. If it is, and it is the only field used, the result
may be related to the field length (though that would be suspicious, as the
fields are quite similar in length in your example).
Are you sorting by score for any particular reason?
It's been a while since I last checked, but I doubt you get any benefit over
the default (which ranks by score).

So I recommend you send the debug response here, and then possibly your
select request handler config.
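
As a rough sketch of the suggestion above, the snippet below builds the same select request with debugQuery=on added. The host, core name ("test"), fl and wt values are taken from the original request; the function name build_debug_url is made up for illustration. It only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Host and core name ("test") come from the request in the original post.
SELECT_URL = "http://localhost:8983/solr/test/select"


def build_debug_url(query: str) -> str:
    """Return a select URL that asks Solr to include scoring debug output."""
    params = {
        "q": query,
        "fl": "score,id",
        "wt": "json",
        # Ask Solr to add a debug section explaining each document's score.
        "debugQuery": "on",
    }
    # urlencode percent-encodes the parentheses and the comma.
    return SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_debug_url("(java)"))
```

The sort parameter is left out here on purpose, since ranking by score is already the default.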

Cheers



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

Re: Solr sort by score not working properly

Shawn Heisey-2
In reply to this post by Prathyusha Kondeti
On 6/22/2018 9:29 AM, Prathyusha Kondeti wrote:

> when I search for java using below query
>
> curl
> http://localhost:8983/solr/test/select?fl=score,id&q=(java)&wt=json&sort=score
>  desc
>
> I am expecting the content with *Id :2* should come first as it contains
> more matches related to java.But solr is giving inconsistent results.
>
> Please suggest why I am not able to get desired results.

Solr relies on Lucene for score calculations.

Years of effort have gone into tuning the Lucene code that calculates
scores.  It is almost certain that the score is working as designed, but
the design does not fit your expectations.

Lucene's score calculation (which defaults to the BM25 similarity in
Solr 6.x and later) takes term frequency (TF) into account, but that is
not the whole story.  Another part of the calculation is inverse
document frequency (IDF).  BM25 is more complicated than just those two
factors, but they are large influences on the final score.

One thing the calculation does, through length normalization, is reduce
the score when the size of the document is large, because the term showing
up in a short document probably means that it's more relevant there.
The actual calculation is certainly a lot more complex than what I'm
going to describe, but the simple idea below illustrates what is
probably happening:

For the doc with id 1, there are two terms, and the search for java
matches one of them - it's half of the document, which makes it pretty
important for that document.  For the doc with id 2, the search term
appears three times, but there are nine terms total, so the term only
contributes a third of that document.  For id 3, the importance is also
about one third.  This means that id 1 probably outscores both id 2 and
id 3 for a search term of "java".
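
The simple idea above can be checked with a few lines of Python. This is only an illustration of the "share of the document" intuition, not Lucene's scoring. The tokenizer here (lowercase, runs of letters) is an assumption; the text_general analyzer in Solr may split the text differently, for example where a period is not followed by a space.

```python
import re

# Document contents copied from the original post, without the emphasis asterisks.
DOCS = {
    "1": "java developer",
    "2": "Java is object oriented.Java robust language.Core java ",
    "3": "java is platform independent. Java language.",
}


def term_share(text: str, term: str) -> float:
    """Fraction of the document's terms that equal the search term."""
    # Simplified tokenizer (an assumption): lowercase, keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens.count(term) / len(tokens)


SHARES = {doc_id: term_share(text, "java") for doc_id, text in DOCS.items()}

if __name__ == "__main__":
    for doc_id, share in SHARES.items():
        print(f"id {doc_id}: {share:.2f}")
```

Under this tokenization, id 1 has a share of one half, while id 2 and id 3 each come out at one third.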

Here's a detailed article about TF and IDF.  Older versions of Solr
(before 6.x) used this kind of calculation:

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Here's an article about BM25, default in 6.0 and later.  This relevance
calculation does work a lot like TF-IDF, but aims to produce even better
ranking with a more complex mathematical model:

https://en.wikipedia.org/wiki/Okapi_BM25
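
For a feel of how BM25 combines these factors, here is a minimal sketch of the textbook formula with the usual defaults k1 = 1.2 and b = 0.75. The term counts and document lengths (1 of 2, 3 of 9, 2 of 6) are the idealized ones from the walkthrough above and are assumptions about how the text is tokenized. Lucene also stores field lengths only approximately, so real Solr scores will not match these numbers exactly; debugQuery output is the authority.

```python
import math

K1 = 1.2   # term-frequency saturation
B = 0.75   # strength of length normalization

# doc id -> (occurrences of "java", total terms); idealized counts, an assumption.
DOCS = {"1": (1, 2), "2": (3, 9), "3": (2, 6)}


def bm25(tf: int, doc_len: int, avg_len: float, doc_freq: int, num_docs: int) -> float:
    """Score of one term in one document under the classic BM25 formula."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents get a larger penalty in the denominator.
    length_norm = K1 * (1 - B + B * doc_len / avg_len)
    return idf * tf * (K1 + 1) / (tf + length_norm)


def score_all(docs: dict) -> dict:
    """Score every document for a term that occurs in all of them."""
    num_docs = len(docs)
    avg_len = sum(length for _, length in docs.values()) / num_docs
    doc_freq = sum(1 for tf, _ in docs.values() if tf > 0)
    return {
        doc_id: bm25(tf, length, avg_len, doc_freq, num_docs)
        for doc_id, (tf, length) in docs.items()
    }


if __name__ == "__main__":
    for doc_id, score in score_all(DOCS).items():
        print(f"id {doc_id}: {score:.4f}")
```

With these idealized inputs the three scores land within a few percent of one another, so small differences in tokenization or in the stored field length could plausibly change the order.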

Thanks,
Shawn