[jira] Created: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect


[jira] Created: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
Similarity javadocs for scoring function to relate more tightly to scoring models in effect
-------------------------------------------------------------------------------------------

                 Key: LUCENE-1908
                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
            Reporter: Doron Cohen
            Assignee: Doron Cohen
            Priority: Minor
             Fix For: 2.9


See discussion in the related issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1908:
--------------------------------

    Attachment: LUCENE-1908.patch

Attached patch adds explanations of how the current scoring function is derived from the VSM.
I hope it is not too "colorful", and that it indeed clarifies rather than confuses.
I marked it for 2.9 because, if others like it, I think it would be good to include; but if there is no consensus on this modification, let's move it to a future release.



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754373#action_12754373 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

Suggested Similarity javadoc can be reviewed here: http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754471#action_12754471 ]

Shai Erera commented on LUCENE-1908:
------------------------------------

This looks great! Really helpful. A few comments:

* typo: "perscoective" should be "perspective"
* Where you describe the different components, some bullets are missing their "component" name, e.g. query-boost (you do write it for doc-boost, for example).
* I think if you make those "component" names slanted or something, they'll stand out better. This applies to the page in general, not just this section.



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754470#action_12754470 ]

Mark Miller commented on LUCENE-1908:
-------------------------------------

Looks great!

bq. Document Euclidean norm |V(d)| is excluded by Lucene, for indexing performance considerstions (?).

Hmm - I'm not sure that is right either. Aren't we replacing the |V(d)| normalization factor with our document length factor?

That's how it appears to me anyway -

for |V(d)| you have many options, right?

the cosine normalization - your standard Euclidean length - |V(d)|
or none (e.g. 1)
or pivoted normalized doc length
or SweetSpotSimilarity's formula
or the quick, dirty, easy, not-great default doc length normalization that Lucene uses by default
or Okapi's formula,
or ...

So we are replacing (which everyone generally does), not dropping, right?

And I don't think we are replacing for performance reasons (though it is complicated to calculate) - we are replacing because it's not a great normalization factor.
Using |V(d)| eliminates info on the length of the original document - but longer documents will still have higher tf's and more distinct terms - so it unnaturally gives them
an advantage (most long docs will be repeated pieces or cover multiple topics). So it's not generally a good normalization factor, and we have chosen another?
(the one we have chosen isn't great either - long docs are punished too much and short ones preferred too much)

Again, I'm not an IR guy, but that's my modest take.



[jira] Issue Comment Edited: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754470#action_12754470 ]

Mark Miller edited comment on LUCENE-1908 at 9/11/09 10:37 PM:
---------------------------------------------------------------


* edit: I suppose you could argue that you could do cosine normalization and then a further normalization approach on top of that, and in that sense we are dropping the cosine normalization because it's too expensive. But from what I can see, it appears more the case that you try to use a normalization factor that can just replace cosine normalization - like the pivoted normalization, which rotates the cosine normalization curve. I think pivoted is something like 1/root(stuff, i.e. unique terms), so our norm of 1/root(L) is of a similar, simpler, vein. So that's why I think we are not really dropping it - we are choosing one of the variety of replacements.




[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754571#action_12754571 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

Mark and Shai, thanks for reviewing!

Mark, I think you have a point here (and I am definitely no more an IR guy than you are :)).

Truth is, I was surprised to find out (through your comments in LUCENE-1896) that this component of the score is "missing", and I indeed thought that the "right thing to do" (if there is such a thing as "right") really is to do both: normalize to the unit vector, and then normalize by length to compensate for the "unfair" advantage of long documents.

But you're right: the way I presented V(d) normalization and doc-length normalization is incorrect, as if the right thing is to do both, and that presentation does not do justice to Lucene. I will change the writing.

Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if you double that doc to have two occurrences of each of the N distinct terms, its length becomes 2N; the 1/Euclidean-norm would be 1/sqrt(4N), but Lucene's default similarity's length norm would be 1/sqrt(2N). So we punish documents with duplicate terms less than the Euclidean norm would...
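A quick numeric check of those two edge cases (a sketch with hypothetical helper names, not the Lucene API; it assumes raw term frequency as the term weight):

```java
// Hypothetical sketch, not Lucene API: compare 1/|V(d)| (inverse Euclidean
// norm of the raw-tf vector) with a DefaultSimilarity-style length norm
// 1/sqrt(length) for the two edge-case documents described above.
public class NormCheck {

    // 1/|V(d)| for a doc with n distinct terms, each occurring tf times:
    // |V(d)| = sqrt(n * tf^2)
    static double inverseEuclideanNorm(int n, int tf) {
        return 1.0 / Math.sqrt((double) n * tf * tf);
    }

    // Lucene's default length norm uses the total number of term
    // occurrences in the document: 1/sqrt(length)
    static double defaultLengthNorm(int length) {
        return 1.0 / Math.sqrt(length);
    }

    public static void main(String[] args) {
        int n = 100;
        // doc1: n distinct terms, each once -> both factors equal 1/sqrt(n)
        System.out.println(inverseEuclideanNorm(n, 1)); // 0.1
        System.out.println(defaultLengthNorm(n));       // 0.1
        // doc2: same n terms, each twice -> length 2n;
        // the Euclidean norm punishes the duplication more
        System.out.println(inverseEuclideanNorm(n, 2)); // 1/sqrt(4n) = 0.05
        System.out.println(defaultLengthNorm(2 * n));   // 1/sqrt(2n) ~ 0.0707
    }
}
```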

I am not aware of an evaluation or discussion of why this approach was selected, and so I assumed (with a question mark) that it was merely for performance considerations. You said in LUCENE-1896:
bq. not just similar properties - but many times better properties - the standard normalization would not factor in document length at all - it essentially removes it.
Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really desired behavior? My intuition says no, but I am not sure.

Anyhow, this issue is more about describing what Lucene does today than about what Lucene should do, so I think I have the correct picture now (except for the historical justification, which is interesting but not a show stopper).

Shai, thanks for the fixes.

(updated patch to follow).



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754577#action_12754577 ]

Mark Miller commented on LUCENE-1908:
-------------------------------------

bq. Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really desired behavior? My intuition says no, but I am not sure.

It's only desired behavior if you have a corpus that favors it, but most do. I think you can think of |V(d)| as taking out information about the document length - you start with an m-space vector, with each term given a weight depending on how many times it occurs - in other words, there is information about the length of that document there, and when you normalize by |V(d)| you take that information out - but a doc will appear more similar the more unique terms it started with and the higher the tf's. So that method favors long docs, which will naturally have more of both, though of course they are not always more similar.

All of the normalizations I have seen are in the vein of |V(d)| - e.g. 1/root(something). All of the others also try to make up for this doc-length problem - by messing with the curve so that ultra-long docs are not favored too highly. Our default method is a known not-very-good one - but it's also very fast and efficient compared to the better ones. SweetSpot is much better and I think still efficient - but you need to tune the right params, I believe.
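The pivoted idea mentioned here can be sketched roughly as follows (an illustration in the spirit of Singhal's pivoted length normalization; names and parameter values are mine, not Lucene code):

```java
// Rough illustration, not Lucene code: pivoted length normalization
// rotates the normalization line around a "pivot" (typically the average
// document length), so the divisor grows more slowly than raw length and
// ultra-long docs are penalized less steeply.
public class PivotedNorm {

    // divisor = (1 - slope) * pivot + slope * docLen
    // At docLen == pivot this equals pivot, so average-length docs are
    // normalized exactly as they would be without pivoting.
    static double pivotedDivisor(double docLen, double pivot, double slope) {
        return (1.0 - slope) * pivot + slope * docLen;
    }

    public static void main(String[] args) {
        double pivot = 100.0, slope = 0.25; // illustrative values
        System.out.println(pivotedDivisor(100.0, pivot, slope)); // 100.0 (unchanged at the pivot)
        System.out.println(pivotedDivisor(400.0, pivot, slope)); // 175.0 (instead of 400)
        System.out.println(pivotedDivisor(10.0, pivot, slope));  // 77.5 (instead of 10)
    }
}
```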



[jira] Issue Comment Edited: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754577#action_12754577 ]

Mark Miller edited comment on LUCENE-1908 at 9/12/09 12:12 PM:
---------------------------------------------------------------


* edit *

I'm still a little confused, I guess :) What I meant is that longer docs will naturally have larger weights, but larger weights actually hurt in the cosine normalization - so it actually over-punishes, I guess? I dunno - all of this over-punish/under-punish is in comparison to a relevancy curve they figure out (a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb PDFs from Google. Apparently our pivot also over-punishes large docs and over-favors small ones, the same as this one, but perhaps not as badly? I'm seeing that in a Lucene/Juru research PDF. This stuff is hard to grok on first pass.




[jira] Updated: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1908:
--------------------------------

    Attachment: LUCENE-1908.patch

Attached fixes according to comments by Shai and Mark:
* spelling/typos
* slanted definitions and in-text formulas.
* fixed the description of doc-vector/doc-length normalization.
* more "eye friendly" colors in the two large formulas.
* added a link to Euclidean norm in the last paragraph, on query-norm computation.

For convenient review of the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html 



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754610#action_12754610 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
I'm still a little confused I guess
{quote}

That makes two of us... :)

{quote}
The longer docs will have larger weights naturally is what I meant, but larger weights actually hurts in the cosine normalization - so it actually over punishes I guess? I dunno - all of this over punish/ under punish is in comparison to a relevancy curve they figure out ( a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb pdfs from google. Apparently our pivot also over punishes large docs and over favors small, the same as this one, but perhaps not as bad ? I'm seeing that in a Lucene/Juru research pdf. This stuff is hard to grok on first pass.
{quote}

In that work we got the best results from Lucene (for TREC) with SweetSpot similarity and normalizing tf by the average term frequency in the doc.

For me it was mainly experimental and intuitive, but I was not able to convince Hoss (or even convince myself once I read Hoss's comments) that it was justified theoretically.

At that time I was not aware of the V(d) normalization delicacy we are discussing now. I think I understand things better now, and still I am not sure. I need to read http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html and sleep on it...



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754615#action_12754615 ]

Mark Miller commented on LUCENE-1908:
-------------------------------------

bq. In that work we got best results from Lucene

That's funny - I didn't notice you were an author on that one! Small world.

The original idea of why cosine normalization of the doc vector is bad I got from the free intro-to-information-retrieval book that's out there - but what it says doesn't fully jibe with the info I am finding elsewhere, or with my own common sense. That's what has me most confused at the moment - the intro-to-IR book appears to break it down so that you can explain it with the math (why going into unit-vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and it's just comparison to the computed relevancy curve that tells you it's not great.

I dunno :) - though at least I know a lot more than I did a few days ago - it never even occurred to me before this how our scoring equated to any kind of dot product - I used to read Lucene's scoring algorithm and then look at the code and it was like ... okay - sure ... - so I've come a long way.



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754716#action_12754716 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
The intro to IR book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and it's only comparing it to the computed relevancy curve that tells you it's not great.
{quote}

To my (current) understanding it goes like this: normalizing all V(d)'s to unit vectors loses all information about document lengths. For a large document created by duplicating a smaller one this is probably ok. For a large document which simply contains lots of "unique" text this is probably wrong. To solve this, a different normalization is sometimes preferred, one that does not normalize V(d) to the unit vector. (Very much in line with what you (Mark) wrote above - finally I understand this...)
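A tiny sketch of why unit-vector normalization discards length: a document and its concatenated duplicate normalize to the identical vector (illustrative Python, not Lucene code):

```python
import math

def unit_normalize(vec):
    # Scale a term-weight vector to Euclidean (L2) length 1.  After
    # this step, all information about document length is gone.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

doc = {'lucene': 3.0, 'search': 1.0}
doubled = {t: 2.0 * w for t, w in doc.items()}  # the doc concatenated to itself
# Both map to the identical unit vector, so cosine scoring cannot
# tell a document from its duplicated self.
```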

The pivoted length normalization which you mentioned is one such alternative. Juru in fact uses this document length normalization. In our TREC experiments with Lucene we tried this approach (we modified Lucene indexing so that all required components were indexed as stored/cached fields, and at search time we could try various scoring algorithms). Interestingly, in our experiments pivoted length normalization did not work well with Lucene for TREC.
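For reference, a sketch of the pivoted scheme (illustrative Python; the 0.25 slope is a commonly cited default from the pivoted-normalization literature, not a Lucene or Juru constant):

```python
def pivoted_norm(doc_len, avg_len, slope=0.25):
    # Pivoted document length normalization: instead of dividing by
    # the document's own length, divide by a value "pivoted" around
    # the average length, so longer-than-average documents are
    # penalized less aggressively than under pure cosine.
    return (1.0 - slope) + slope * (doc_len / avg_len)
```

A document of exactly average length gets a normalizer of 1.0; a document twice the average is divided by 1.25 rather than 2.0.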

The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me - intuitively - not so good, at least for the previously mentioned two edge cases, where doc1 is made of N distinct terms, and doc2 is made of the same N distinct terms but its length is 2N because each term appears twice. For doc1, DS normalizes to the unit vector, same as EN, while for doc2, DS normalizes to a vector larger than the unit vector. However I think the desired behavior is the other way around: doc2 should be the same as EN, and doc1 should be normalized to a vector larger than the unit vector.
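The two edge cases in numbers (illustrative Python; assumes raw term counts as vector weights, and takes DefaultSimilarity's lengthNorm as 1/sqrt(numTerms)):

```python
import math

N = 16  # hypothetical number of distinct terms in the edge-case docs

# doc1: N distinct terms, each occurring once (length N)
# doc2: the same N terms, each occurring twice (length 2N)
euclid1 = math.sqrt(N * 1 ** 2)      # ||V(doc1)|| with raw counts as weights
euclid2 = math.sqrt(N * 2 ** 2)      # ||V(doc2)||

ds_norm1 = 1 / math.sqrt(N)          # lengthNorm = 1/sqrt(numTerms)
ds_norm2 = 1 / math.sqrt(2 * N)

# For doc1 the DS factor equals 1/||V(doc1)||, i.e. exact Euclidean
# normalization; for doc2 the DS factor exceeds 1/||V(doc2)||, so
# doc2 ends up "larger" than the unit vector, as described above.
```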

Back to the documentation patch: again it seems wrong to present it as if both EN and some additional doc length normalization are required - fixed patch to follow...


[jira] Updated: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1908:
--------------------------------

    Attachment: LUCENE-1908.patch

New patch is more accurate about doc-length-normalization choices.  

For easy reviewing I updated http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html accordingly.

I plan to commit this in a day or two if there are no objections.


[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754731#action_12754731 ]

Ted Dunning commented on LUCENE-1908:
-------------------------------------


I apologize for complaining here without actually editing this text to illustrate what I would rather see, but the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>".

There are also many small errors such as claiming that tf is *proportional* to term frequency and idf is *proportional* to inverse of document frequency.  Proportional means that there is a linear relationship which is definitely not the case here.  It would be better to say tf usually increases with increasing term frequency, although occasionally a constant might be used.  IDF, on the other hand, decreases with increasing document frequency.



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754743#action_12754743 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

Thanks for reviewing this Ted.

{quote}
the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>".
{quote}

I see what you mean.

I tried to take the reader from VSM to the actual elements computed and aggregated in Lucene's scoring code. This would also answer a question asked several times on the lists - "but what is the scoring model of Lucene?" - it is not that straightforward to tell why a certain method is called during scoring.

But I think you have a good point - the reader is told "this is the scoring formula" only to discover 20 lines ahead that in fact "that is the formula", and then the same thing again in another paragraph.

I think all 3 formulas are required, just the gluing text should improve. It might have helped to have better English than mine for this, but I'll give it a try - I think I know how to write it better in this sense.

{quote}
There are also many small errors such as claiming that tf is proportional to term frequency and idf is proportional to inverse of document frequency. Proportional means that there is a linear relationship which is definitely not the case here. It would be better to say tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency.
{quote}

I agree, "proportional" is wrong - thanks for catching this! In fact it appears wrongly in two other places in Similarity - idf() and idfExplain(). In those two places I think replacing it with "related" would be correct, i.e. like this:

{noformat}
Note that Searcher.maxDoc() is used instead of
org.apache.lucene.index.IndexReader.numDocs()
because it is related to Searcher.docFreq(Term),
i.e., when one is inaccurate, so is the other, and
in the same direction.
{noformat}

For tf and idf I think this will do: (?)

{noformat}
Tf and Idf are described in more detail below,
but for now, for completeness, let's just say that
for given term t and document (or query) x,
Tf(t,x) is related to the number of occurrences of
term t in x - when one increases so does
the other - and idf(t) is similarly related to the
inverse of the number of index documents
containing term t.
{noformat}
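To make the "varies with, not proportional" point concrete, a sketch of the kind of monotonic-but-sublinear relationship meant (illustrative Python; sqrt-of-frequency matches DefaultSimilarity's tf(), and the 1 + log(N/(df+1)) form mirrors its idf(), but treat the exact formulas here as an assumption, not a spec):

```python
import math

def tf_weight(freq):
    # tf increases with term frequency but is NOT proportional to it:
    # tripling the sqrt weight requires nine times the occurrences.
    return math.sqrt(freq)

def idf_weight(num_docs, doc_freq):
    # idf decreases as document frequency grows - again a monotonic
    # relationship, not a linear one.
    return 1.0 + math.log(num_docs / (doc_freq + 1))
```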



[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754745#action_12754745 ]

Ted Dunning commented on LUCENE-1908:
-------------------------------------


bq. Thanks for reviewing this Ted.  ...  Might have helped to have better English than mine for this

No problem reviewing this.  Thanks to you for doing the hard part of actually writing it.  Being a critic is easy in comparison.

And your English is doing fine.  If you get the ideas down, a hundred people can chime in with grammatical and spelling fixes.  And frankly, your English is so much better than any of my other languages that I would never be brave enough to complain.

bq. I think all 3 formulas are required, just the gluing text should improve. ... I think I know how to write it better in this sense.

Good point.  This is the essence of my grumpiness.

bq. Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is related to Searcher.docFreq(Term), i.e., when one is inaccurate, so is the other, and in the same direction.

Actually, in this case, I think that the motivation is that maxDoc is very commonly exactly correct and typically vastly more efficient.  As you say, when it is wrong, docFreq can also be wrong the same way.  My suggestion would be this:
{quote}
Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is more efficient to compute and is pretty much correct except for when many documents have been deleted.  In any case Searcher.docFreq(Term) is likely to have a similar problem.
{quote}

Regarding the proportional/related issue, I think that your language is fine.  At most, I would suggest "varies with" instead of "related" because it is slightly stronger, but you make the relationship abundantly clear in your text.




[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754759#action_12754759 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

Thanks Ted (for both your comments and kindness).

I changed it to "varies" as you suggested, and added the note about the efficient computation of maxDoc(), but omitted the "pretty much correct except" part because it might not be correct for certain applications.

Patch to follow...


[jira] Updated: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1908:
--------------------------------

    Attachment: LUCENE-1908.patch

Updated patch according to comments by Ted.

To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html 


[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754852#action_12754852 ]

Jiri Kuhn commented on LUCENE-1908:
-----------------------------------

The last paragraph starts with:

bq. However the resulting norm value is encoded as a single byte before being stored.

Maybe this is an ideal time to document reasons why norms are encoded as bytes.
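For readers wondering what the single-byte encoding costs: the byte buys a tiny per-document-per-field memory footprint at the price of precision. A simplified sketch of a lossy one-byte quantizer (illustrative Python; the range constants are made up, and Lucene's actual encoder, SmallFloat.floatToByte315, packs a small mantissa and exponent into the byte rather than using a log scale like this):

```python
import math

LOW_EXP, SPAN = -12.0, 16.0  # hypothetical representable range [2^-12, 2^4)

def encode_norm(f):
    # Quantize a positive float onto one byte (1..255) on a log2
    # scale; byte 0 is reserved for 0.0.  The precision loss is the
    # point: 255 levels over 16 octaves means up to ~4-5% relative
    # error on decode.
    if f <= 0:
        return 0
    b = int((math.log2(f) - LOW_EXP) * 254 / SPAN) + 1
    return max(1, min(255, b))

def decode_norm(b):
    # Inverse mapping; only approximately recovers the original.
    if b == 0:
        return 0.0
    return 2.0 ** ((b - 1) * SPAN / 254 + LOW_EXP)
```

The round trip is close but not exact, which is exactly the trade-off worth documenting.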

