[jira] Created: (LUCENE-1896) Modify confusing javadoc for queryNorm

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
Modify confusing javadoc for queryNorm
--------------------------------------

                 Key: LUCENE-1896
                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Javadocs
            Reporter: Jiri Kuhn
            Priority: Minor
             Fix For: 2.9


See http://markmail.org/message/arai6silfiktwcer

The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752270#action_12752270 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

Suggestions? I am no IR guy.

It would appear to me that this component is simply the part where you convert the vectors to unit vectors. Or are you are just dividing by the product of the euclidean lengths - it appears to be the same in either case to me.

So it looks like we kind of factor that out as a constant (I barley understand myself too - don't worry) -

In which case, why do we need it? Is it just there as a reminent of the math? Yes, it will give us the sim measure as the cosine - by why do we care if its a scaled cosine (eg not the cosine, but the same *relative* scores) for less work? What do we need 1/(Sum(W^2))^1/2 for?  Seems like wasted calculations. Who is the academic that kept this in! ... :)

Actually, I don't have a clue if it makes sense to keep it or not. Or even whether or not I am talking out my ...

Wherefore art thou Doug Cutting.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Issue Comment Edited: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752271#action_12752271 ]

Mark Miller edited comment on LUCENE-1896 at 9/7/09 2:55 PM:
-------------------------------------------------------------

Hmm .. so perhaps the idea that the query scores are more comparable comes from:

if we take it out, that means the score will be the cosine times the euclidean distance - the distance will be rather large for a small query vector and a large doc vector (in magnitude). So a query that matched many large docs would could scale a lot harder than a query that hit small docs?

I guess that makes sense - scores not being very comparable anyway though, it hardly seems worth the extra compute cost ...

      was (Author: [hidden email]):
    Hmm .. so perhaps the idea that they query scores are more comparable comes from:

if we take it out, that means the score will be the cosine X the euclidean distance - the distance will be rather large for a small query vector and a large doc vector (in magnitude). So a query that matched many large docs would could scale a lot harder than a query that hit small docs?

I guess that makes sense - scores not being very comparable anyway though, it hardly seems worth the extra compute cost ...
 

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752271#action_12752271 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

Hmm .. so perhaps the idea that they query scores are more comparable comes from:

if we take it out, that means the score will be the cosine X the euclidean distance - the distance will be rather large for a small query vector and a large doc vector (in magnitude). So a query that matched many large docs would could scale a lot harder than a query that hit small docs?

I guess that makes sense - scores not being very comparable anyway though, it hardly seems worth the extra compute cost ...

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752273#action_12752273 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

heh - the javadoc actually makes sense to me now. It is correct, and its even not confusing - perhaps cryptic, but not confusing :)

It makes scores from different queries more comparable :)

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752302#action_12752302 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

bq.  it hardly seems worth the extra compute cost

Marvin mentioned the compute cost is essentially 0, so that statement is likely bs ...

At essentially no cost, I can see why the IR dudes leave it in.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752311#action_12752311 ]

Marvin Humphrey commented on LUCENE-1896:
-----------------------------------------

FWIW, after all that [fuss|http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200802.mbox/%3C9396E8E7-46FF-4B78-9427-13E9A7E584E4@...%3E],
I would up leaving it in.

From the standpoint of ordinary users, queryNorm() is harmless or mildly
beneficial. Scores are never going to be comparable across multiple queries
without what _I_ normally think of as "normalization" (given my background
in audio): setting the top score to 1.0, and multiplying all other scores by the
same factor. Nevertheless, it's better for them to be closer together than
farther apart.

From the standpoint of users trying to write Query subclasses, it's a wash.
On the one hand, it's not the most important method, since it doesn't affect
ranking within a single query -- and zapping it would mean one less thing to
think about.  On the other hand, it's nice to have it in there for the sake of
completeness in the implementation of cosine similarity.

I eventually would up messing with *how* the query norm gets applied to
achieve my de-voodoo-fication goals.  Essentially, I hid away queryNorm() so
that you don't need to think about it unless you really need it.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752530#action_12752530 ]

Grant Ingersoll commented on LUCENE-1896:
-----------------------------------------

I know the queryNorm was based on some long ago research on it, but I don't know that it has proven to be all that useful.  There are certainly others who have done work on this, too.  In the end, I don't think the current implementation is all that useful and we could likely save a few ops by removing it, or giving people more powerful ways of changing it.

The bottom line, I think, is you shouldn't compare scores across queries.  Often times, you can't even compare scores for the same query if the underlying index changed.  I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity" nor the comment about scores being "closer together than farther apart".

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752533#action_12752533 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

bq . The bottom line, I think, is you shouldn't compare scores across queries. Often times, you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity" nor the comment about scores being "closer together than farther apart".

Its not about being able to compare scores across queries per say.

The cosine similarity is calc'd using the definition

queryNorm corresponds to the denominator of the right side of that equation. Suppose we take it out - that like multiplying the cos(a) by the product of the magnitude of the vectors:

 |V(q)||V(d)| * cos(a) = V(q) dot V(d)

So rather than getting the cos as a factor, we get cos * a number that depends on the euclidean length of the query/doc vectors.

Thats what he means by completeness of the imp of the cosine sim. Rather than getting a cosine score, you get the cosine scaled by a somewhat arbitrary amount (that depends on the doc vector mostly).

To get rid of that skew at no essentially no cost makes a lot of sense to me - which jives with why IR lit and Doug keep it around. Its not there to make scores between queries comparable - but it makes them way more comparable than they would be, and it adds to the "completeness in the implementation of cosine similarity" - at what I am trusting is essentially no cost. Its a keeper from my point of view. Its not based on research - its just the math of the formula - and if it had any real expense, it would likely have been tossed long ago (in the IR world).

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Issue Comment Edited: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752533#action_12752533 ]

Mark Miller edited comment on LUCENE-1896 at 9/8/09 7:21 AM:
-------------------------------------------------------------

bq . The bottom line, I think, is you shouldn't compare scores across queries. Often times, you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity" nor the comment about scores being "closer together than farther apart".

Its not about being able to compare scores across queries per say.

The cosine similarity is calc'd using the definition

{code}
 cos(a) = V(q) dot V(d) /  |V(q)||V(d)|
{code}

queryNorm corresponds to the denominator of the right side of that equation. Suppose we take it out - that like multiplying the cos(a) by the product of the magnitude of the vectors:
{code}
 |V(q)||V(d)| * cos(a) = V(q) dot V(d)
{code}
So rather than getting the cos as a factor, we get cos * a number that depends on the euclidean length of the query/doc vectors.

Thats what he means by completeness of the imp of the cosine sim. Rather than getting a cosine score, you get the cosine scaled by a somewhat arbitrary amount (that depends on the doc vector mostly).

To get rid of that skew at no essentially no cost makes a lot of sense to me - which jives with why IR lit and Doug keep it around. Its not there to make scores between queries comparable - but it makes them way more comparable than they would be, and it adds to the "completeness in the implementation of cosine similarity" - at what I am trusting is essentially no cost. Its a keeper from my point of view. Its not based on research - its just the math of the formula - and if it had any real expense, it would likely have been tossed long ago (in the IR world).

      was (Author: [hidden email]):
    bq . The bottom line, I think, is you shouldn't compare scores across queries. Often times, you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity" nor the comment about scores being "closer together than farther apart".

Its not about being able to compare scores across queries per say.

The cosine similarity is calc'd using the definition

queryNorm corresponds to the denominator of the right side of that equation. Suppose we take it out - that like multiplying the cos(a) by the product of the magnitude of the vectors:

 |V(q)||V(d)| * cos(a) = V(q) dot V(d)

So rather than getting the cos as a factor, we get cos * a number that depends on the euclidean length of the query/doc vectors.

Thats what he means by completeness of the imp of the cosine sim. Rather than getting a cosine score, you get the cosine scaled by a somewhat arbitrary amount (that depends on the doc vector mostly).

To get rid of that skew at no essentially no cost makes a lot of sense to me - which jives with why IR lit and Doug keep it around. Its not there to make scores between queries comparable - but it makes them way more comparable than they would be, and it adds to the "completeness in the implementation of cosine similarity" - at what I am trusting is essentially no cost. Its a keeper from my point of view. Its not based on research - its just the math of the formula - and if it had any real expense, it would likely have been tossed long ago (in the IR world).
 

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752544#action_12752544 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

In other words - we get a measure that relates to the angle between vectors, not a measure that relates to the angle as well as the magnitude of the vectors (which doesn't help as a sim measurement and skews the scores across queries). Another way to think of it is that we are converting the vectors to unit vectors, and just considering the angles between them - not entirely necessary, but nice if its free.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752591#action_12752591 ]

Marvin Humphrey commented on LUCENE-1896:
-----------------------------------------

> at what I am trusting is essentially no cost.

Here's the snippet from TermQuery.score() where queryNorm() actually
gets applied to each document's score:

{code}
float raw =                                   // compute tf(f)*weight
  f < SCORE_CACHE_SIZE                        // check cache
  ? scoreCache[f]                             // cache hit
  : getSimilarity().tf(f)*weightValue;        // cache miss
{code}

At this point, queryNorm() has already been factored into weightValue (and
scoreCache).  It happens during setup. You can either scale weightValue by
queryNorm() during setup or not -- the per-document computational cost is
unaffected.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752763#action_12752763 ]

Hoss Man commented on LUCENE-1896:
----------------------------------

as i mentioned in the linked thread, the biggest advantage i know of for queryNorm is that it is a reduction factor applied to the constituent parts of a complex score prior to multiplication -- so it helps prevent loss of information due to floating point accuracy that could arrise otherwise.

but then again: that'swhat the *default* queryNorm does ... an alternate queryNorm could do something (like be a no-op)

Since the target audience of the Similarity javadocs are mainly people who are interested in customizing the scoring, perhaps it should be something like...

{quote}
The queryNorm is a uniform normalization factor computed from the sumOfSquareWeights for the query which is then applied to each of the clauses of the query.  In some cases this can be useful for attempting to keep scores from simple queries semi-comparable.  The Default implementation is...
{quote}

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752775#action_12752775 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

We can't be that worried about decimal accuracy - we use float which is good for 6 digits - if we were, wouldn't we use double (good for over double that) -

something about this argument hasn't clicked in my head yet.

If that was helpful (to scale by a some what arbitrary dampening factor) we would just start by multiplying every score by .01 or whatever. It doesn't make sense to me.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752780#action_12752780 ]

Hoss Man commented on LUCENE-1896:
----------------------------------

you could have a queryNorm implementation that always returned 0.01f ... but if you were dealing with weights that were all really large, it might not be enough ... and ify ou were dealing with weights that were *extremely* small low, it might actually be counter productive.  that's why the default isn't arbitrary -- it's a function of the weight.

i never said it was a *reason* why queryNorm was there ... i just said it was an advatnge i've observed in having it.

I also didn't argue in favor of adding anything about that to hte javadocs -- i mentioned it only to explain one type of benefit that can arise from have "a uniform normalization factor computed from the sumOfSquareWeights for the query which is then applied to each of the clauses of the query"

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752786#action_12752786 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

I'm not arguing about it, I'm just trying to understand it, and whether its actually helpful or not. Since it doesn't click in my head yet, I wouldn't have a leg to stand on in terms of an argument.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752787#action_12752787 ]

Hoss Man commented on LUCENE-1896:
----------------------------------

argue was a poor choice of word, it implies someone is arguing back.  what i should have said was that i don't *advocate* adding a mention of it to the docs.

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753377#action_12753377 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

Okay - think I was a tad off base -

Here is the cosine def used:

{code}
cos(a) = V(q) dot V(d) /  |V(q)||V(d)|
{code}

So the cosine is the query vector dot the document vector divided by the magnitude of the vectors. Classically, |V(q)||V(d)| is a normalization factor that takes the vectors to unit vectors (so you get the real cosine)

{code}
cos(a) = v(q) dot v(d)
{code}

This is because the magnitude of a unit vector is 1 be definition.

But we don't care about absolute numbers, just relative numbers (as has been often pointed out) - so the IR guys already fudge this stuff.

While I thought that the queryNorm correlates to |V(q)||V(d)| before, I was off - its just |V(q)|.  |V(d)| is replaced with the document length normalization, a much faster calculation with similar properties - a longer doc would have a larger magnitude most likely.

So one strategy is just to not normalize query - though the lit i see doing this is very inefficiently calculating the query norm in the inner loop - we are not doing that, and so its not much of an optimization for us.

{code}
cos(a) = V(q) dot V(d) /  |V(d)| == cos(a) * |V(q)| = v(q) dot v(d)
{code}

And it does make queries less comparable (an odd goal I know, but for free?) ;)

Sorry I was a little off earlier - just tried to learn all this myself - and linear alg was years ago - and open book tests lured my younger, more irresponsible self to not go to the classes ...

Anyhow, thats my current understanding - please point out if you know I have something wrong.


> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Issue Comment Edited: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753377#action_12753377 ]

Mark Miller edited comment on LUCENE-1896 at 9/9/09 7:09 PM:
-------------------------------------------------------------

Okay - think I was a tad off base -

Here is the cosine def used:

{code}
cos(a) = V(q) dot V(d) /  |V(q)||V(d)|
{code}

So the cosine is the query vector dot the document vector divided by the magnitude of the vectors. Classically, |V(q)||V(d)| is a normalization factor that takes the vectors to unit vectors (so you get the real cosine)

{code}
cos(a) = v(q) dot v(d)
{code}

This is because the magnitude of a unit vector is 1 be definition.

But we don't care about absolute numbers, just relative numbers (as has been often pointed out) - so the IR guys already fudge this stuff.

While I thought that the queryNorm correlates to |V(q)||V(d)| before, I was off - its just |V(q)|.  |V(d)| is replaced with the document length normalization, a much faster calculation with similar properties - a longer doc would have a larger magnitude most likely. *edit* not just similar properties - but many times better properties - the standard normalization would not factor in document length at all - it essentially removes it.

So one strategy is just to not normalize query - though the lit i see doing this is very inefficiently calculating the query norm in the inner loop - we are not doing that, and so its not much of an optimization for us.

{code}
cos(a) = V(q) dot V(d) /  |V(d)| == cos(a) * |V(q)| = v(q) dot v(d)
{code}

And it does make queries less comparable (an odd goal I know, but for free?) ;)

Sorry I was a little off earlier - just tried to learn all this myself - and linear alg was years ago - and open book tests lured my younger, more irresponsible self to not go to the classes ...

Anyhow, thats my current understanding - please point out if you know I have something wrong.


      was (Author: [hidden email]):
    Okay - think I was a tad off base -

Here is the cosine def used:

{code}
cos(a) = V(q) dot V(d) /  |V(q)||V(d)|
{code}

So the cosine is the query vector dot the document vector divided by the magnitude of the vectors. Classically, |V(q)||V(d)| is a normalization factor that takes the vectors to unit vectors (so you get the real cosine)

{code}
cos(a) = v(q) dot v(d)
{code}

This is because the magnitude of a unit vector is 1 be definition.

But we don't care about absolute numbers, just relative numbers (as has been often pointed out) - so the IR guys already fudge this stuff.

While I thought that the queryNorm correlates to |V(q)||V(d)| before, I was off - its just |V(q)|.  |V(d)| is replaced with the document length normalization, a much faster calculation with similar properties - a longer doc would have a larger magnitude most likely.

So one strategy is just to not normalize query - though the lit i see doing this is very inefficiently calculating the query norm in the inner loop - we are not doing that, and so its not much of an optimization for us.

{code}
cos(a) = V(q) dot V(d) /  |V(d)| == cos(a) * |V(q)| = v(q) dot v(d)
{code}

And it does make queries less comparable (an odd goal I know, but for free?) ;)

Sorry I was a little off earlier - just tried to learn all this myself - and linear alg was years ago - and open book tests lured my younger, more irresponsible self to not go to the classes ...

Anyhow, thats my current understanding - please point out if you know I have something wrong.

 

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753383#action_12753383 ]

Mark Miller commented on LUCENE-1896:
-------------------------------------

The real problem with this, is that while it wants to pluggable, because the sumOfSquared scores is calculated in the Weight, its really not so pluggable.

The alg is not properly contained - this queryNorm is one divided by the square root of the sumOfSquared weights - but thats the whole norm. If you wanted to
replace it with one, you shouldnt still be doing sumOfSquaredWeights.

Perhaps its a compromise on a difficult piece (have not considered) - but technically its kind of off..

> Modify confusing javadoc for queryNorm
> --------------------------------------
>
>                 Key: LUCENE-1896
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1896
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>            Reporter: Jiri Kuhn
>            Priority: Minor
>             Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12