Multiple languages, boosting and, stemming and KeywordRepeat



Markus Jelsma-2
Hello,

First, apologies for the weird subject line, and apologies for cross-posting, but last week it got no replies on the Solr user mailing list.

We index many languages and search over all of them at once, but boost the language of the user's preference. To differentiate between stemmed and unstemmed tokens we use KeywordRepeat and RemoveDuplicates. This works very well.

However, we just stumbled over the following example: q=australia is not stemmed in English, but the Romanian stemmer removes its suffix. This causes the Romanian results to be returned on top of the English results, despite language boosting.

This is because the Romanian part of the query consists of both the stemmed and the unstemmed version of the word, whereas the English part of the query is just one clause per field (title, content, etc.). Thus the Romanian results score roughly twice as high as the English results.
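The clause counts described above can be illustrated with a minimal Python sketch. It is a toy model of a KeywordRepeat, stemmer, RemoveDuplicates chain, not Lucene code. The functions `analyze`, `toy_english_stemmer` and `toy_romanian_stemmer` are hypothetical, and the suffix rule in the Romanian stand-in is an assumption made only so that the word changes.

```python
def analyze(word, stemmer, dedupe=True):
    """Toy chain: keep the original token, add a stemmed copy, optionally dedupe."""
    # KeywordRepeat stand-in: the original token plus a copy the stemmer may change.
    tokens = [word, stemmer(word)]
    if dedupe:
        # RemoveDuplicates stand-in: drop the copy if stemming left it unchanged.
        tokens = list(dict.fromkeys(tokens))
    return tokens


def toy_english_stemmer(word):
    # Assumption: leaves "australia" untouched, as reported in the post.
    return word


def toy_romanian_stemmer(word):
    # Assumption: strips a trailing "ia"; the real stemmer's output is not given here.
    return word[:-2] if word.endswith("ia") else word


english = analyze("australia", toy_english_stemmer)
romanian = analyze("australia", toy_romanian_stemmer)
english_no_dedupe = analyze("australia", toy_english_stemmer, dedupe=False)

print(len(english), "English clause(s) per field")
print(len(romanian), "Romanian clause(s) per field")
print(len(english_no_dedupe), "English clause(s) per field without deduplication")
```

With deduplication the English side ends up with one token per field and the Romanian side with two, which is where the roughly twofold score difference comes from.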

Now, this is of course really obvious, but the 'solution' is not. To work around the problem, I removed the RemoveDuplicates filter so that English gets two clauses as well. It is really ugly, but it works. What I don't understand is the debug output: it doesn't list two identical clauses. Instead, the boost on the field is doubled, so instead of:

    27.048403 = PayloadSpanQuery, product of:
      27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
        27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
), product of:
          7.4 = boost
          3.084852 = idf(docFreq=14539, docCount=317894)
          1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            4.0 = phraseFreq=4.0
            0.3 = parameter k1
            0.5 = parameter b
            15.08689 = avgFieldLength
            24.0 = fieldLength
      1.0 = AveragePayloadFunction.docScore()

I now get

    54.096806 = PayloadSpanQuery, product of:
      54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
        54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
), product of:
          14.8 = boost
          3.084852 = idf(docFreq=14539, docCount=317894)
          1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            4.0 = phraseFreq=4.0
            0.3 = parameter k1
            0.5 = parameter b
            15.08689 = avgFieldLength
            24.0 = fieldLength
      1.0 = AveragePayloadFunction.docScore()

So where I expected two clauses in the debug output, I get a single clause with a doubled boost.
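As a sanity check on the two debug outputs above, the arithmetic can be reproduced with a small Python sketch. It only recomputes the numbers from the values printed in the explain output (boost, idf and the tfNorm formula). The helper names `tf_norm` and `clause_score` are hypothetical, and the sketch makes no claim about how the query is rewritten internally.

```python
import math


def tf_norm(freq, k1, b, field_length, avg_field_length):
    # Formula copied from the tfNorm line of the explain output.
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def clause_score(boost, idf, tfn):
    # The explain output reports the score as a product of these three factors.
    return boost * idf * tfn


idf = 3.084852
tfn = tf_norm(freq=4.0, k1=0.3, b=0.5, field_length=24.0, avg_field_length=15.08689)

single = clause_score(7.4, idf, tfn)          # one clause, original boost
two_clauses = single + single                 # two identical clauses summed
doubled_boost = clause_score(14.8, idf, tfn)  # one clause, doubled boost

print(single, two_clauses, doubled_boost)
print(math.isclose(two_clauses, doubled_boost))
```

Numerically, summing two identical clauses and scoring one clause with a doubled boost give the same result, so the second debug output is at least consistent with two identical clauses having been merged into one.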

The question is: is this the expected behavior?

Also, are there any real solutions to this problem? Removing the RemoveDuplicates filter looks really silly.

Many thanks!
Markus

