relevance ranking and scoring

Andrew Nagy-2
I have 2 questions about the SOLR relevancy system.

1. Why is it that when I search for the exact title of a record I
have, it generally does not come up as the 1st record in the results?

ex: title:(gone with the wind), the record comes up 3rd.  A record with
the term "wind" as the first word in the title comes up 1st.
ex: title:"gone with the wind", the record comes up 1st.

Is this because the word "wind" is the only noun?

2. The "score" that is associated with each record is quite odd. What
does it represent?  I generally get results with the top record being
somewhere around 3.0 or 2.0, and most records are below 1.


Thanks!
Andrew



Re: relevance ranking and scoring

Yonik Seeley-2
On 1/23/07, Andrew Nagy <[hidden email]> wrote:
> I have 2 questions about the SOLR relevancy system.

As far as scoring goes, it's pretty much stock Lucene with some other
stuff added on (like function queries).
http://lucene.apache.org/java/docs/scoring.html

> 1. Why is it that when I search for the exact title of a record I
> have, it generally does not come up as the 1st record in the results?
>
> ex: title:(gone with the wind), the record comes up 3rd.  A record with
> the term "wind" as the first word in the title comes up 1st.
> ex: title:"gone with the wind", the record comes up 1st.

Well, you could do an exact or sloppy phrase match
title:"gone with the wind"
But I get your point... if you want to also match records with just "wind".

> Is this because the word "wind" is the only noun?

Not because it's a noun as such; this probably came about because of
Lucene's length normalization in the default similarity.  It's
1/sqrt(num_terms_in_field)

So a document with a title of "wind" has a "norm" of 1.0, while a
document with 4 terms has a "norm" of 0.5
Still, it seems like the coord factor (number of terms matching)
should have been more than enough to overcome the length
normalization.  What were the exact titles?  I assume you were not
using any type of index-time boosting?
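
To put numbers on the length normalization formula above, here is a minimal sketch. It only evaluates 1/sqrt(num_terms_in_field); the function name is made up for illustration and this is not Lucene's own code. Norms actually stored in an index may differ somewhat from these exact values.

```python
import math


def length_norm(num_terms):
    """Return 1/sqrt(num_terms), the length normalization formula quoted above."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return 1.0 / math.sqrt(num_terms)


# Shorter fields get a larger norm, so a one-term title outranks a
# four-term title when everything else is equal.
for n in (1, 2, 4):
    print(n, round(length_norm(n), 3))
```

A one-term title gets 1.0 and a four-term title gets 0.5, which is why a very short title containing "wind" can outscore a longer one.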

Things you can try:
- post the debugging output (including score explain) for the query
- try disabling length normalization for the title field, then remove
the entire index and re-index.
- try the dismax handler, which can generate sloppy phrase queries to
boost results containing all terms.
- try a different similarity implementation
(org.apache.lucene.misc.SweetSpotSimilarity from lucene)


> 2. The "score" that is associated with each record is quite odd. What
> does it represent?  I generally get results with the top record being
> somewhere around 3.0 or 2.0, and most records are below 1.

Scores aren't too comparable across different queries... the scores
are only meant to rank documents with respect to a single query.

-Yonik

Re: relevance ranking and scoring

Yonik Seeley-2
On 1/23/07, Yonik Seeley <[hidden email]> wrote:
> - try disabling length normalization for the title field, then remove
> the entire index and re-index.

Forgot to tell you how to disable length normalization:
set omitNorms="true" on the field in schema.xml

-Yonik

Re: relevance ranking and scoring

Andrew Nagy-2
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote:
> Things you can try:
> - post the debugging output (including score explain) for the query
I have attached the output.
> - try disabling length normalization for the title field, then remove
> the entire index and re-index.
> - try the dismax handler, which can generate sloppy phrase queries to
> boost results containing all terms.
> - try a different similarity implementation
> (org.apache.lucene.misc.SweetSpotSimilarity from lucene)
Can you explain what these 3 options mean?  I would like to get a better
understanding of the guts of SOLR/Lucene but I am too busy working on my
application that uses it to spend time with the internals.

Thanks
Andrew

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">102</int>
</lst>
<result name="response" numFound="324" start="0" maxScore="2.7548285">
 <doc>
  <float name="score">2.7548285</float>
  <arr name="author"><str>Farnol, Jeffery,</str></arr>
  <str name="callnumber">PR6011.A75.W56 1939</str>
  <str name="format">Book</str>
  <str name="id">97525</str>
  <str name="language">eng</str>
  <str name="publishDate">1939, c1934.</str>
  <str name="publisher">Triangle Books,</str>
  <str name="title">Winds of chance /</str>
  <str name="title2">Winds of change [sic]</str>
 </doc>
 <doc>
  <float name="score">2.5437002</float>
  <arr name="author"><str>Simpson, John E.,</str></arr>
  <str name="callnumber">QC939.L37S56 1994</str>
  <str name="format">Book</str>
  <str name="id">433518</str>
  <str name="isbn">0521452112</str>
  <str name="language">eng</str>
  <str name="physical">ill., maps ;</str>
  <str name="publishDate">1994.</str>
  <str name="publisher">Cambridge University Press,</str>
  <arr name="subject4a"><str>Sea breeze.</str></arr>
  <arr name="subject4x"><str/></arr>
  <str name="title">Sea breeze and local winds /</str>
  <str name="title2">Sea breeze and local wind.</str>
 </doc>
 <doc>
  <float name="score">2.438136</float>
  <arr name="author"><str>Hobbs, William Herbert,</str></arr>
  <str name="callnumber">G743.H6 1968</str>
  <str name="format">Book</str>
  <str name="id">192408</str>
  <str name="language">eng</str>
  <str name="physical">illus., maps, ports. ;</str>
  <str name="publishDate">[1968, c1930]</str>
  <str name="publisher">Greenwood Press,</str>
  <arr name="subject4a"><str>Meteorology</str></arr>
  <arr name="subject4x"><str/></arr>
  <arr name="subject5"><str>Arctic regions.</str></arr>
  <str name="title">Exploring about the North Pole of the winds.</str>
  <str name="title2">North Pole of the winds.</str>
 </doc>
 <doc>
  <float name="score">2.4319565</float>
  <arr name="author"><str>Mitchell, Margaret,</str></arr>
  <str name="callnumber">PS3525.I972G6 1996</str>
  <str name="format">Book</str>
  <str name="id">426657</str>
  <str name="isbn">0684826259 (alk. paper)</str>
  <str name="language">eng</str>
  <str name="physical">ill. ;</str>
  <str name="publishDate">c1996.</str>
  <str name="publisher">Scribner,</str>
  <arr name="subject4a"><str>Women</str></arr>
  <arr name="subject4x"><str>History</str></arr>
  <arr name="subject5"><str>Georgia</str></arr>
  <str name="title">Gone with the wind /</str>
 </doc>
 <doc>
  <float name="score">2.4319565</float>
  <arr name="author"><str>Gable, Clark,</str><str>Leigh, Vivien,</str><str>Howard, Leslie,</str>
        <str>De Havilland, Olivia.</str><str>Mitchell, Thomas,</str><str>McDaniel, Hattie,</str><str>McQueen, Butterfly.</str>
        <str>Fleming, Victor,</str><str>Mitchell, Margaret,</str></arr>
  <str name="callnumber">VT3188 VHS</str>
  <str name="format">Video</str>
  <str name="id">529954</str>
  <str name="language">eng</str>
  <str name="physical">sd., col. ;</str>
  <str name="publishDate">c1999.</str>
  <str name="publisher">Time Warner Co.,</str>
  <arr name="subject4a"><str>War films.</str><str>Feature films.</str></arr>
  <arr name="subject4x"><str/><str/></arr>
  <arr name="subject5"><str>United States</str></arr>
  <str name="title">Gone with the wind</str>
 </doc>
 <doc>
  <float name="score">2.4319565</float>
  <arr name="author"><str>Mitchell, Margaret,</str></arr>
  <str name="callnumber">PS3525.I972G6 1993</str>
  <str name="format">Book</str>
  <str name="id">534773</str>
  <str name="isbn">0446365386</str>
  <str name="language">eng</str>
  <str name="publishDate">[1993], c1936.</str>
  <str name="publisher">Warner Books,</str>
  <arr name="subject5"><str>United States</str><str>Georgia</str></arr>
  <str name="title">Gone with the wind /</str>
 </doc>
 <doc>
  <float name="score">1.7023697</float>
  <arr name="author"><str>Pyron, Darden Asbury.</str></arr>
  <str name="callnumber">PS3525.I972G687 1983</str>
  <str name="format">Book</str>
  <str name="id">27783</str>
  <str name="isbn">081300747X (pbk. : alk. paper)</str>
  <str name="language">eng</str>
  <str name="publishDate">c1983.</str>
  <str name="publisher">University Presses of Florida,</str>
  <arr name="subject1"><str>Mitchell, Margaret,</str></arr>
  <arr name="subject3"><str>Gone with the wind (Motion picture)</str></arr>
  <arr name="subject5"><str>Southern States</str></arr>
  <str name="title">Recasting :"Gone with the wind" in American culture /</str>
 </doc>
 <doc>
  <float name="score">1.6493776</float>
  <arr name="author"><str>Stuttgarter Bläserquintett.</str><str>Haydn, Joseph,</str><str>Reicha, Anton,</str>
        <str>Danzi, Franz,</str><str>Lickl, Johann Georg,</str></arr>
  <str name="callnumber">CD257</str>
  <str name="contents">Divertimento, Nr. 1, B-Dur : Chorale St. Antoni / Joseph Haydn (10:24) -- Bläserquintett Es-Dur, op. 88, 2 / Anton Reicha (14:06) -- Bläserquintett B-Dur, op. 56, 1 / Franz Danzi (13:31) -- Quintetto concertante, F-Dur / Johann Georg Lickl (20:41).</str>
  <str name="format">Audio</str>
  <str name="id">555810</str>
  <str name="language">   </str>
  <str name="physical">digital, stereo. ;</str>
  <str name="publishDate">1989.</str>
  <str name="publisher">Pilz,</str>
  <arr name="series"><str>Vienna master series</str><str>Vienna master series</str></arr>
  <arr name="subject4a"><str>Wind quintets (Bassoon, clarinet, flute, horn, oboe)</str><str>Suites (Bassoon, clarinet, flute, horn, oboe)</str></arr>
  <arr name="subject4x"><str/><str/></arr>
  <str name="title">Bläserquintett San Antoni und andere BläserquintetteWind-player quintet San Antoni and other wind-player quintets /</str>
  <str name="title2">Wind-player quintet San Antoni and other wind-player quintets</str>
 </doc>
 <doc>
  <float name="score">1.4591739</float>
  <arr name="author"><str>Taylor, Helen,</str></arr>
  <str name="callnumber">PS3525.I972G688 1989</str>
  <str name="format">Book</str>
  <str name="id">312906</str>
  <str name="isbn">0813514800 :</str>
  <str name="language">eng</str>
  <str name="publishDate">1989.</str>
  <str name="publisher">Rutgers University Press,</str>
  <arr name="subject1"><str>Mitchell, Margaret,</str><str>Mitchell, Margaret,</str><str>Mitchell, Margaret,</str></arr>
  <arr name="subject3"><str>Gone with the wind (Motion picture)</str></arr>
  <arr name="subject4a"><str>Women</str><str>Historical fiction, American</str><str>Motion picture audiences</str></arr>
  <arr name="subject4x"><str>Books and reading</str><str>Film and video adaptations.</str><str/></arr>
  <str name="title">Scarlett's women :Gone with the wind and its female fans /</str>
 </doc>
 <doc>
  <float name="score">1.4591739</float>
  <arr name="author"><str>Vertrees, Alan David,</str></arr>
  <str name="callnumber">PN1997.G59V47 1997</str>
  <str name="format">Book</str>
  <str name="id">508240</str>
  <str name="isbn">0292787294 (pbk. : alk. paper)</str>
  <str name="language">eng</str>
  <str name="physical">ill. ;</str>
  <str name="publishDate">c1997.</str>
  <str name="publisher">University of Texas Press,</str>
  <arr name="series"><str>Texas film studies series</str><str>Texas film studies series</str></arr>
  <arr name="subject1"><str>Selznick, David O.,</str></arr>
  <arr name="subject3"><str>Gone with the wind (Motion picture)</str></arr>
  <str name="title">Selznick's vision :Gone with the wind and Hollywood filmmaking /</str>
 </doc>
</result>
<lst name="debug">
 <str name="rawquerystring">title:(gone with the wind) OR title2:(gone with the wind)</str>
 <str name="querystring">title:(gone with the wind) OR title2:(gone with the wind)</str>
 <str name="parsedquery">(title:gone title:wind) (title2:gone title2:wind)</str>
 <str name="parsedquery_toString">(title:gone title:wind) (title2:gone title2:wind)</str>
 <lst name="explain">
  <str name="id=97525,internal_docid=490046">
2.7548285 = (MATCH) sum of:
  1.0556406 = (MATCH) product of:
    2.1112812 = (MATCH) sum of:
      2.1112812 = (MATCH) weight(title:wind in 490046), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        5.3581324 = (MATCH) fieldWeight(title:wind in 490046), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.625 = fieldNorm(field=title, doc=490046)
    0.5 = coord(1/2)
  1.6991879 = (MATCH) product of:
    3.3983757 = (MATCH) sum of:
      3.3983757 = (MATCH) weight(title2:wind in 490046), product of:
        0.55892086 = queryWeight(title2:wind), product of:
          12.16049 = idf(docFreq=6)
          0.045962032 = queryNorm
        6.080245 = (MATCH) fieldWeight(title2:wind in 490046), product of:
          1.0 = tf(termFreq(title2:wind)=1)
          12.16049 = idf(docFreq=6)
          0.5 = fieldNorm(field=title2, doc=490046)
    0.5 = coord(1/2)
</str>
  <str name="id=433518,internal_docid=326785">
2.5437002 = (MATCH) sum of:
  0.8445124 = (MATCH) product of:
    1.6890248 = (MATCH) sum of:
      1.6890248 = (MATCH) weight(title:wind in 326785), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        4.2865057 = (MATCH) fieldWeight(title:wind in 326785), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.5 = fieldNorm(field=title, doc=326785)
    0.5 = coord(1/2)
  1.6991879 = (MATCH) product of:
    3.3983757 = (MATCH) sum of:
      3.3983757 = (MATCH) weight(title2:wind in 326785), product of:
        0.55892086 = queryWeight(title2:wind), product of:
          12.16049 = idf(docFreq=6)
          0.045962032 = queryNorm
        6.080245 = (MATCH) fieldWeight(title2:wind in 326785), product of:
          1.0 = tf(termFreq(title2:wind)=1)
          12.16049 = idf(docFreq=6)
          0.5 = fieldNorm(field=title2, doc=326785)
    0.5 = coord(1/2)
</str>
  <str name="id=192408,internal_docid=83772">
2.438136 = (MATCH) sum of:
  0.7389483 = (MATCH) product of:
    1.4778966 = (MATCH) sum of:
      1.4778966 = (MATCH) weight(title:wind in 83772), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        3.7506924 = (MATCH) fieldWeight(title:wind in 83772), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.4375 = fieldNorm(field=title, doc=83772)
    0.5 = coord(1/2)
  1.6991879 = (MATCH) product of:
    3.3983757 = (MATCH) sum of:
      3.3983757 = (MATCH) weight(title2:wind in 83772), product of:
        0.55892086 = queryWeight(title2:wind), product of:
          12.16049 = idf(docFreq=6)
          0.045962032 = queryNorm
        6.080245 = (MATCH) fieldWeight(title2:wind in 83772), product of:
          1.0 = tf(termFreq(title2:wind)=1)
          12.16049 = idf(docFreq=6)
          0.5 = fieldNorm(field=title2, doc=83772)
    0.5 = coord(1/2)
</str>
  <str name="id=426657,internal_docid=319418">
2.4319568 = (MATCH) product of:
  4.8639135 = (MATCH) sum of:
    4.8639135 = (MATCH) sum of:
      2.7526321 = (MATCH) weight(title:gone in 319418), product of:
        0.44991833 = queryWeight(title:gone), product of:
          9.788913 = idf(docFreq=74)
          0.045962032 = queryNorm
        6.1180706 = (MATCH) fieldWeight(title:gone in 319418), product of:
          1.0 = tf(termFreq(title:gone)=1)
          9.788913 = idf(docFreq=74)
          0.625 = fieldNorm(field=title, doc=319418)
      2.1112812 = (MATCH) weight(title:wind in 319418), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        5.3581324 = (MATCH) fieldWeight(title:wind in 319418), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.625 = fieldNorm(field=title, doc=319418)
  0.5 = coord(1/2)
</str>
  <str name="id=529954,internal_docid=416311">
2.4319568 = (MATCH) product of:
  4.8639135 = (MATCH) sum of:
    4.8639135 = (MATCH) sum of:
      2.7526321 = (MATCH) weight(title:gone in 416311), product of:
        0.44991833 = queryWeight(title:gone), product of:
          9.788913 = idf(docFreq=74)
          0.045962032 = queryNorm
        6.1180706 = (MATCH) fieldWeight(title:gone in 416311), product of:
          1.0 = tf(termFreq(title:gone)=1)
          9.788913 = idf(docFreq=74)
          0.625 = fieldNorm(field=title, doc=416311)
      2.1112812 = (MATCH) weight(title:wind in 416311), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        5.3581324 = (MATCH) fieldWeight(title:wind in 416311), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.625 = fieldNorm(field=title, doc=416311)
  0.5 = coord(1/2)
</str>
  <str name="id=534773,internal_docid=420440">
2.4319568 = (MATCH) product of:
  4.8639135 = (MATCH) sum of:
    4.8639135 = (MATCH) sum of:
      2.7526321 = (MATCH) weight(title:gone in 420440), product of:
        0.44991833 = queryWeight(title:gone), product of:
          9.788913 = idf(docFreq=74)
          0.045962032 = queryNorm
        6.1180706 = (MATCH) fieldWeight(title:gone in 420440), product of:
          1.0 = tf(termFreq(title:gone)=1)
          9.788913 = idf(docFreq=74)
          0.625 = fieldNorm(field=title, doc=420440)
      2.1112812 = (MATCH) weight(title:wind in 420440), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        5.3581324 = (MATCH) fieldWeight(title:wind in 420440), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.625 = fieldNorm(field=title, doc=420440)
  0.5 = coord(1/2)
</str>
  <str name="id=27783,internal_docid=161556">
1.7023696 = (MATCH) product of:
  3.4047391 = (MATCH) sum of:
    3.4047391 = (MATCH) sum of:
      1.9268426 = (MATCH) weight(title:gone in 161556), product of:
        0.44991833 = queryWeight(title:gone), product of:
          9.788913 = idf(docFreq=74)
          0.045962032 = queryNorm
        4.2826495 = (MATCH) fieldWeight(title:gone in 161556), product of:
          1.0 = tf(termFreq(title:gone)=1)
          9.788913 = idf(docFreq=74)
          0.4375 = fieldNorm(field=title, doc=161556)
      1.4778966 = (MATCH) weight(title:wind in 161556), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        3.7506924 = (MATCH) fieldWeight(title:wind in 161556), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.4375 = fieldNorm(field=title, doc=161556)
  0.5 = coord(1/2)
</str>
  <str name="id=555810,internal_docid=440217">
1.6493776 = (MATCH) sum of:
  0.44787034 = (MATCH) product of:
    0.8957407 = (MATCH) sum of:
      0.8957407 = (MATCH) weight(title:wind in 440217), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        2.273263 = (MATCH) fieldWeight(title:wind in 440217), product of:
          1.4142135 = tf(termFreq(title:wind)=2)
          8.573011 = idf(docFreq=252)
          0.1875 = fieldNorm(field=title, doc=440217)
    0.5 = coord(1/2)
  1.2015072 = (MATCH) product of:
    2.4030144 = (MATCH) sum of:
      2.4030144 = (MATCH) weight(title2:wind in 440217), product of:
        0.55892086 = queryWeight(title2:wind), product of:
          12.16049 = idf(docFreq=6)
          0.045962032 = queryNorm
        4.299382 = (MATCH) fieldWeight(title2:wind in 440217), product of:
          1.4142135 = tf(termFreq(title2:wind)=2)
          12.16049 = idf(docFreq=6)
          0.25 = fieldNorm(field=title2, doc=440217)
    0.5 = coord(1/2)
</str>
  <str name="id=312906,internal_docid=196258">
1.4591739 = (MATCH) product of:
  2.9183478 = (MATCH) sum of:
    2.9183478 = (MATCH) sum of:
      1.6515791 = (MATCH) weight(title:gone in 196258), product of:
        0.44991833 = queryWeight(title:gone), product of:
          9.788913 = idf(docFreq=74)
          0.045962032 = queryNorm
        3.6708422 = (MATCH) fieldWeight(title:gone in 196258), product of:
          1.0 = tf(termFreq(title:gone)=1)
          9.788913 = idf(docFreq=74)
          0.375 = fieldNorm(field=title, doc=196258)
      1.2667686 = (MATCH) weight(title:wind in 196258), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        3.2148793 = (MATCH) fieldWeight(title:wind in 196258), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.375 = fieldNorm(field=title, doc=196258)
  0.5 = coord(1/2)
</str>
  <str name="id=508240,internal_docid=396218">
1.4591739 = (MATCH) product of:
  2.9183478 = (MATCH) sum of:
    2.9183478 = (MATCH) sum of:
      1.6515791 = (MATCH) weight(title:gone in 396218), product of:
        0.44991833 = queryWeight(title:gone), product of:
          9.788913 = idf(docFreq=74)
          0.045962032 = queryNorm
        3.6708422 = (MATCH) fieldWeight(title:gone in 396218), product of:
          1.0 = tf(termFreq(title:gone)=1)
          9.788913 = idf(docFreq=74)
          0.375 = fieldNorm(field=title, doc=396218)
      1.2667686 = (MATCH) weight(title:wind in 396218), product of:
        0.394033 = queryWeight(title:wind), product of:
          8.573011 = idf(docFreq=252)
          0.045962032 = queryNorm
        3.2148793 = (MATCH) fieldWeight(title:wind in 396218), product of:
          1.0 = tf(termFreq(title:wind)=1)
          8.573011 = idf(docFreq=252)
          0.375 = fieldNorm(field=title, doc=396218)
  0.5 = coord(1/2)
</str>
 </lst>
</lst>
</response>

Re: relevance ranking and scoring

Yonik Seeley-2
On 1/23/07, Andrew Nagy <[hidden email]> wrote:

> Yonik Seeley wrote:
> > Things you can try:
> > - post the debugging output (including score explain) for the query
> I have attached the output.
> > - try disabling length normalization for the title field, then remove
> > the entire index and re-index.
> > - try the dismax handler, which can generate sloppy phrase queries to
> > boost results containing all terms.
> > - try a different similarity implementation
> > (org.apache.lucene.misc.SweetSpotSimilarity from lucene)
> Can you explain what these 3 options mean?  I would like to get a better
> understanding of the guts of SOLR/Lucene but I am too busy working on my
> application that uses it to spend time with the internals.

Let's start with the first... add a debugQuery=on
parameter to your request and post the full result here.
You can get the same effect through the
query form on the solr admin pages by checking the "Debug: explain" checkbox.
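
For anyone scripting this, a request with debugging switched on might be built roughly as follows. This is a sketch only: the base URL is an assumption (adjust host, port and path for your install), and the helper name is made up for illustration. The debugQuery=on parameter is the one mentioned above; fl=*,score asks for the score alongside the stored fields.

```python
from urllib.parse import urlencode

# Assumed base URL, not taken from this thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_query_url(query, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks Solr to include score explanations."""
    params = {
        "q": query,
        "fl": "*,score",      # return stored fields plus the score
        "debugQuery": "on",   # include parsed query and explain output
    }
    return base_url + "?" + urlencode(params)


url = debug_query_url("title:(gone with the wind) OR title2:(gone with the wind)")
print(url)
```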

-Yonik

Re: relevance ranking and scoring

Andrew Nagy-2
Yonik Seeley wrote:

> Let's start with the first... add a debugQuery=on
> parameter to your request and post the full result here.
> You can get the same effect through the
> query form on the solr admin pages by checking the "Debug: explain"
> checkbox.
I attached the results to my last email, are you not able to see them?

Andrew

Re: relevance ranking and scoring

Yonik Seeley-2
On 1/24/07, Andrew Nagy <[hidden email]> wrote:
> > Let's start with the first... add a debugQuery=on
> > parameter to your request and post the full result here.
> > You can get the same effect through the
> > query form on the solr admin pages by checking the "Debug: explain"
> > checkbox.
> I attached the results to my last email, are you not able to see them?

Ahh, I missed it.

Ok, here is your query:
 <str name="rawquerystring">title:(gone with the wind) OR title2:(gone
with the wind)</str>
And here it is parsed:
 <str name="parsedquery">(title:gone title:wind) (title2:gone title2:wind)</str>

First, notice how stopwords were removed, so "with" and "the" will not
count in the results.

You are querying across two different fields.
Notice how the first three documents have "wind" in both title and title2,
while the fourth document, "Gone with the wind", has no title2 field (and
hence can't match on it).

In the first three documents, the scores for the matches on title and
title2 both contribute to the score.  The "Gone with the wind" record is
penalized for not matching in both the title and title2 fields.
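
The arithmetic in the explain output can be replayed by hand. The sketch below plugs in the idf, queryNorm, tf and fieldNorm values from the attached debug output for "Winds of chance" (id=97525) and one "Gone with the wind" record (id=426657); the helper name is made up for illustration and the code just repeats the products and sums shown in the explain tree.

```python
QUERY_NORM = 0.045962032  # queryNorm from the explain output

# idf values from the explain output
IDF = {"title:gone": 9.788913, "title:wind": 8.573011, "title2:wind": 12.16049}


def term_weight(idf, tf, field_norm, query_norm=QUERY_NORM):
    """weight = queryWeight * fieldWeight, as laid out in the explain tree."""
    query_weight = idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


# "Winds of chance": only "wind" matches in each field, so each field
# clause gets coord(1/2), and the two field scores are added together.
winds_of_chance = (
    0.5 * term_weight(IDF["title:wind"], tf=1.0, field_norm=0.625)
    + 0.5 * term_weight(IDF["title2:wind"], tf=1.0, field_norm=0.5)
)

# "Gone with the wind": both terms match in title, but there is no title2,
# so only one of the two field clauses matches: coord(1/2) on the total.
gone_with_the_wind = 0.5 * (
    term_weight(IDF["title:gone"], tf=1.0, field_norm=0.625)
    + term_weight(IDF["title:wind"], tf=1.0, field_norm=0.625)
)

print(round(winds_of_chance, 4), round(gone_with_the_wind, 4))
```

The rare title2 match (docFreq=6, so a high idf) is what lifts "Winds of chance" above the full title match.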

You could look at the dismax handler... it helps construct queries, a
component of which is the DisjunctionMaxQuery (it doesn't add together
scores from different fields, but just takes the highest score from any
matching field for a term).

You could also see how changing or removing the stopword list affects relevance.

-Yonik

Re: relevance ranking and scoring

Andrew Nagy-2
Wow, thanks for the verbose response.  This gives me a lot to go on!

What about term ranking, could I rank the phrases searched in title
higher than title2?

Thanks!
Andrew

Re: relevance ranking and scoring

Yonik Seeley-2
On 1/24/07, Andrew Nagy <[hidden email]> wrote:

> What about term ranking, could I rank the phrases searched in title
> higher than title2?

Absolutely... standard lucene syntax for boosting will give you that
in the standard query handler.

title:(gone with the wind)^3.0 OR title2:(gone with the wind)

For dismax, you give the query separate from the fields, and you can
express different weights on the fields via qf=title^3.0 title2
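
As a rough sketch of the two styles side by side, the request parameters could be assembled like this. The function names are made up for illustration, and selecting the handler with qt=dismax is an assumption about how the dismax handler is registered in solrconfig.xml; the q and qf values follow the examples above.

```python
from urllib.parse import urlencode


def standard_params(phrase, title_boost=3.0):
    """Standard handler: the boost is written into the query string itself."""
    q = f"title:({phrase})^{title_boost} OR title2:({phrase})"
    return {"q": q}


def dismax_params(phrase, title_boost=3.0):
    """Dismax: the query is given separately from the weighted field list."""
    return {
        "qt": "dismax",  # assumed handler name
        "q": phrase,
        "qf": f"title^{title_boost} title2",
    }


print(urlencode(standard_params("gone with the wind")))
print(urlencode(dismax_params("gone with the wind")))
```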

-Yonik

Re: relevance ranking and scoring

Andrew Nagy-2
Yonik Seeley wrote:
>>
>> What about term ranking, could I rank the phrases searched in title
>> higher than title2?
>
> Absolutely... standard lucene syntax for boosting will give you that
> in the standard query handler.
>
> title:(gone with the wind)^3.0 OR title2:(gone with the wind)
That did it!  Thanks for the help!
What value do the numbers carry in the ranking?  I arbitrarily chose
the number 5 because it's an easy number :)

I am a bit nervous about the dismax query system as I have quite a bit
of other content that could skew the results.
What's the difference between the dismax query handler and listing all of
the fields in my search and separating them with an OR?

Thanks!
Andrew



Re: relevance ranking and scoring

Chris Hostetter-3

: > title:(gone with the wind)^3.0 OR title2:(gone with the wind)
: That did it!  Thanks for the help!
: What value do the numbers carry in the ranking?  I arbitrarily chose
: the number 5 because it's an easy number :)

query boosts are in fact pretty arbitrary ... what you should pick really
depends on what boosts you put on other clauses, and what kinds of values
the tf, idf, and coord functions of your Similarity are going to return.

: I am a bit nervous about the dismax query system as I have quite a bit
: of other content that could skew the results.

i'm really not sure what you mean by that ... dismax will only look at the
fields you tell it to, and the factors that contribute to the score of each
term/document pair in a dismax query will be the same as those from the
standard request handler -- the only difference is how those individual
TermQuery scores are combined.

: What's the difference between the dismax query handler and listing all of
: the fields in my search and separating them with an OR?

the best way to understand this is to look at the debug output you get
from each query, and read the "Explanation" section ... some of the deep
details may not make much sense, but the overall structure of the score
calculation should be helpful

in a nutshell, when you ask the StandardRequestHandler for docs
matching...
     q = title:(foo bar) other:(foo bar)

if a document matches all of title:foo, other:foo, and other:bar then the
score for that document is (essentially) the sum of the scores from
matching the individual terms

with dismax, if you ask for

     q = foo bar  & qf = title other

then the score for the same document is different: the matches on
the word "foo" are considered together regardless of field, and only the
field that resulted in the highest score is used (with a small portion of
the matches on the other fields being included to help break ties).  the
score contribution from matching on other:bar is basically the same as
before.
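
a small sketch of that difference, with made-up per-field scores (the numbers and function names are hypothetical, and coord and query normalization are ignored to keep it short). the dismax combination here is "best field plus a tie-breaker fraction of the rest", which is how DisjunctionMaxQuery combines its clauses:

```python
def sum_score(field_scores):
    """Standard boolean OR: per-field scores for a term are added up."""
    return sum(field_scores.values())


def dismax_score(field_scores, tie=0.0):
    """Best field score, plus a `tie` fraction of the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical scores for one document: "foo" matches in both fields,
# "bar" matches only in the "other" field.
foo = {"title": 2.0, "other": 1.0}
bar = {"other": 1.5}

standard_total = sum_score(foo) + sum_score(bar)
dismax_total = dismax_score(foo, tie=0.1) + dismax_score(bar, tie=0.1)
print(standard_total, dismax_total)
```

the contribution from "bar" is the same either way; only the handling of "foo", which matched in more than one field, changes.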

The driving motivation for the DisjunctionMaxQuery was so that if you
wanted to search for the words "Java" or "Lucene" in 3 different fields
(title, description, and body), a document that matched Lucene once in the
body field, but matched Java dozens of times and at least once in each
field, wouldn't overshadow a document that matched both Lucene and Java
just once in each field.
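
to make that scenario concrete, here is a toy comparison. every score below is invented for illustration (real scores depend on tf, idf, norms and coord), and the function names are hypothetical; the point is only that adding field scores and taking the per-term maximum can rank the same two documents differently.

```python
def summed(doc):
    """Add every per-field score for every term (plain OR across fields)."""
    return sum(sum(fields.values()) for fields in doc.values())


def dismaxed(doc, tie=0.0):
    """Per term: best field score plus a `tie` fraction of the other fields."""
    total = 0.0
    for fields in doc.values():
        best = max(fields.values())
        total += best + tie * (sum(fields.values()) - best)
    return total


# Document A: "java" scores well in every field, "lucene" barely matches.
doc_a = {
    "java": {"title": 2.2, "description": 2.2, "body": 2.2},
    "lucene": {"body": 0.3},
}
# Document B: both words match once in each field.
doc_b = {
    "java": {"title": 1.5, "description": 1.0, "body": 0.5},
    "lucene": {"title": 1.5, "description": 1.0, "body": 0.5},
}

print("summed:", summed(doc_a), summed(doc_b))
print("dismax:", dismaxed(doc_a, tie=0.1), dismaxed(doc_b, tie=0.1))
```

with these invented numbers, document A wins when field scores are added, while document B wins under the dismax combination.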


-Hoss