Fieldnorm solr 4 -> specialchars(worddelimiter)


roySolr
Hello,

I am seeing some differences in score between Solr 3.1 and Solr 4.
I have a search field with the following type:

<fieldType name="text_delimiter" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
  </analyzer>
</fieldType>


An example of fieldnorms:

SearchTerm = barcelona

solr 3.1:
fc barcelona soccer club -> 0.5
fc-barcelona soccer club -> 0.5

solr 4:
fc barcelona soccer club -> 0.5
fc-barcelona soccer club -> 0.4375

It could be the catenateWords setting in the field type configuration: fc, barcelona, fcbarcelona, soccer, club (5 terms = 0.4375).
It is strange that Solr 3.1 counted only 4 terms with the same filter.
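
For reference, the numbers above are consistent with a length norm of 1/sqrt(number of terms) stored in a single byte, which is how Lucene's default similarity of that era is generally described. The sketch below is a Python approximation modelled on Lucene's `SmallFloat.floatToByte315` encoding; the function names are invented here for illustration, and it assumes no index-time boost.

```python
import math
import struct

def float_to_byte315(f):
    """Squeeze a float into one byte: 3 mantissa bits, 5 exponent bits.
    Modelled on Lucene's SmallFloat.floatToByte315 (an approximation)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21                      # keep sign, exponent, top 3 mantissa bits
    if small <= (63 - 15) << 3:             # too small to represent
        return 0 if bits <= 0 else 1
    if small >= ((63 - 15) << 3) + 0x100:   # too large to represent
        return 255
    return small - ((63 - 15) << 3)

def byte315_to_float(b):
    """Decode the byte produced by float_to_byte315 back to a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def field_norm(num_terms):
    """Length norm 1/sqrt(num_terms) after the lossy one-byte round trip."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))

if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, field_norm(n))
```

Under these assumptions, 4 terms give 0.5 and 5 terms give 0.4375, matching the two Solr 4 values in the post. The one-byte encoding is coarse, so 3 and 4 terms both decode to 0.5.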

Why is the fieldnorm different? I need some help with this :)

Thanks
Roy




Re: Fieldnorm solr 4 -> specialchars(worddelimiter)

roySolr
I have done some more testing with different examples.

It's really the word delimiter filter that influences the fieldnorm. When I search for "Barcelona", the doc with "FC Barcelona" scores higher than the doc with "FC-Barcelona".

The fieldnorm for "FC Barcelona" is 0.625 and the fieldnorm for "FC-Barcelona" is 0.5.

Analysis output:

fc | barcelona
fc | barcelona | fcbarcelona

So it's 2 terms against 3, which explains the difference in score. In Solr 3.1 the score is the same: the fieldnorm is 0.625 for both docs. It looks like catenateWords has no influence on the fieldnorm in Solr 3.1.
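
The token counts in the analysis output can be reproduced with a very rough stand-in for the index-time chain (whitespace tokenizer, lowercasing, then word-part splitting with optional catenation). This is a sketch only: `word_delimiter` and `analyze` are hypothetical helpers, not Solr APIs, and they ignore most of what the real WordDelimiterFilter does (case changes, numerics, possessives, position increments).

```python
import re

def word_delimiter(token, catenate_words):
    """Split a token on non-alphanumeric characters (generateWordParts=1)
    and optionally append the joined parts (catenateWords=1)."""
    parts = [p for p in re.split(r"[^a-z0-9]+", token) if p]
    out = list(parts)
    if catenate_words and len(parts) > 1:
        out.append("".join(parts))  # e.g. fc + barcelona -> fcbarcelona
    return out

def analyze(text, catenate_words=True):
    """Whitespace-tokenize, lowercase, then apply the simplified delimiter."""
    tokens = []
    for raw in text.lower().split():
        tokens.extend(word_delimiter(raw, catenate_words))
    return tokens

if __name__ == "__main__":
    for doc in ("FC Barcelona", "FC-Barcelona"):
        toks = analyze(doc)
        print(doc, "->", " | ".join(toks), f"({len(toks)} terms)")
```

With catenation on, "FC-Barcelona" yields one more token than "FC Barcelona"; with `catenate_words=False` both yield the same two tokens. If every emitted token counts toward the field length, as the Solr 4 numbers suggest, the extra token is enough to lower the norm.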

I want the score to be the same with or without catenateWords, as it is in Solr 3.1. Is this possible?

Thanks
Roy




Re: Fieldnorm solr 4 -> specialchars(worddelimiter)

Jack Krupansky-2
What field type were you using in 3.1 vs 4.x? If you were using some default
field type, maybe it changed. If you do need to achieve exactly the same
results as in 3.1, maybe you need to use the same field-type/analyzer. In
some cases there may have been bugs that got fixed.

-- Jack Krupansky

-----Original Message-----
From: roySolr
Sent: Monday, January 28, 2013 4:11 AM
To: [hidden email]
Subject: Re: Fieldnorm solr 4 -> specialchars(worddelimiter)



Re: Fieldnorm solr 4 -> specialchars(worddelimiter)

roySolr
Hello Jack,

I'm using exactly the same fieldtype:

<fieldType name="text_delimiter" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
  </analyzer>
</fieldType>

It looks like catenateWords has a different effect in Solr 4.1 than in the previous version (3.1).
The analysis output is the same in both versions. I want exactly the same results but can't get them.





Re: Fieldnorm solr 4 -> specialchars(worddelimiter)

Jack Krupansky-2
As I said, there may have been bugs fixed since 3.1. WDF has changed
over time. Expecting it to give identical results across releases is a
classic fool's errand. Ditto for scoring in general: it is subject to change
across major releases.

I mean, sure, we could track down what specific change caused the
discrepancy, but what good would that do you? If it does happen to be a bug,
then of course it can be fixed in a future release, but as of this moment,
there is no evidence to suggest that it is the result of a bug, especially
considering WDF's evolution over time.

Compare the analyzer output between the two releases again. Maybe toggling
one of the attributes will cause the 4.x output to more closely match the
3.1 output.

Rereading your previous message - maybe catenateWords was indeed broken in
3.1. If that was the case, then that explains the difference and that is a
GOOD difference, nothing to fret over. Or, maybe you need to turn that
attribute off if that is your own preference.

-- Jack Krupansky

-----Original Message-----
From: roySolr
Sent: Monday, January 28, 2013 9:16 AM
To: [hidden email]
Subject: Re: Fieldnorm solr 4 -> specialchars(worddelimiter)



Re: Fieldnorm solr 4 -> specialchars(worddelimiter)

roySolr
Hello Jack,

Thanks for your answer. It's clear now; I think it was a bug in 3.1. The difference in fieldnorm was just not what I expected. I will tweak the schema to get closer to the expected results.

Thanks Jack,
Roy