[jira] Created: (LUCENE-460) hashCode improvements

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
hashCode improvements
---------------------

         Key: LUCENE-460
         URL: http://issues.apache.org/jira/browse/LUCENE-460
     Project: Lucene - Java
        Type: Improvement
  Components: Search  
    Versions: CVS Nightly - Specify date in submission    
    Reporter: Yonik Seeley
    Priority: Minor
     Fix For: CVS Nightly - Specify date in submission


It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356325 ]

Yonik Seeley commented on LUCENE-460:
-------------------------------------

A couple of guidelines off the top of my head...
 - hash codes should strive to be unique across the Query hierarchy, not just unique within one specific subclass.  For example, TermQuery(t) and SpanTermQuery(t) will generate the exact same hash codes.
- mix bits between different components that have any hashCode parts in common...
   for example RangeQuery will produce the same hashCode whenever lowerTerm==upperTerm.
   Also, field[x TO y] will produce the same hashCode for *any* field since the fieldname parts of the
  terms will always cancel eachother out.  This will also cause the hashCode of field{x TO x} to equal field:x
  The hashCode of FilteredQuery will also cause many collisions because the bits aren't mixed inbetween
   the query and the filter.
  Remember that every query as a boost component... never just xor two query hashCodes together.
- make things position dependent.
  Currently, field[x TO y] will produce the same hasCode as field[y TO x]... not particularly important for RangeQuery, but
   you get the idea.
- don't be afraid of using "+" instead of "^".  They both take a single CPU cycle, but "+" is not quite so easily (accidentally) reversed.
- flipping more than a single bit when hashing a boolean might be a good idea - it will make collisions harder.

http://www.concentric.net/~Ttwang/tech/inthash.htm is an interesting link on integer hash codes (what we are in effect doing when we combine multiple hash codes).  Esp interesting is the section "Parallel Operations"

> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356326 ]

Yonik Seeley commented on LUCENE-460:
-------------------------------------

Oh, and preserve entropy by using reversible integer hash functions (see the previous link).
 -  key ^= (key << a) | (key >>> b); for a,b in (17,16) (16,17) (14,19) (19,14) (13,20) (20,13) (10,23) (23,10) (8,25) (25,8)
 - multiply by an odd
 - addition
 - xor
 - rotates

> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356327 ]

Daniel Naber commented on LUCENE-460:
-------------------------------------

"Currently, field[x TO y] will produce the same hasCode as field[y TO x]... not particularly important for RangeQuery, but  you get the idea. "

Actually it is kind of important, because while [a TO z] will give hits [z TO a] won't.


> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356384 ]

paul.elschot commented on LUCENE-460:
-------------------------------------

>  - hash codes should strive to be unique across the Query hierarchy, not just unique within one specific
> subclass. For example, TermQuery(t) and SpanTermQuery(t) will generate the exact same hash codes.

This has the disadvantage that the structure information added by the query parser is lost in the hash.
In case a hash is needed over the query text before parsing, one can use precisely that.
What's the point of hash code uniqueness over subclasses?

Regards,
Paul Elschot


> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356414 ]

Yonik Seeley commented on LUCENE-460:
-------------------------------------

Paul, I'm not sure I understand your point about "structure information added by the query parser is lost in the hash".
Let me rephrase my statement in case you misunderstood me:

   - hash codes should strive to avoid collisions across Query hierarchy, not just within one specific
     subclass. An example of hashCode implementations that don't do this are TermQuery(t) and SpanTermQuery(t) which will generate the exact same hash codes.

> What's the point of hash code uniqueness over subclasses?

The same point as hash code uniqueness (avoiding collisions) within a particular class - the more collisions you have, the more it will slow down hash based lookups.  Things should still work if all hashCodes mapped to the same number, but it would be dog slow.

> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356432 ]

paul.elschot commented on LUCENE-460:
-------------------------------------

>  For example, TermQuery(t) and SpanTermQuery(t) will generate the exact same hash codes.

I'm sorry, I misread this as an example of what a hash code should be. You meant this
to be an example of what is wrong with current hashcodes...
 

> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12361200 ]

Yonik Seeley commented on LUCENE-460:
-------------------------------------

Some people have asked where some of the magic constants come from in the hashCodes:

> python -c "import random;print hex(random.getrandbits(32))[:-1]"
> Just a way of making some things unique.... let me know if you have a
> better idea on that.
>
> So those constants are just random numbers.  I thought about trying to pick magic numbers to maximize hamming distances, etc,
> but it's more work and more likely to mess things up if you get it wrong.  getClass().hashCode() would also work, but it would be
> slower.


> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Closed: (LUCENE-460) hashCode improvements

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/LUCENE-460?page=all ]
     
Yonik Seeley closed LUCENE-460:
-------------------------------

    Resolution: Fixed
     Assign To: Yonik Seeley

closing... I think I got most of these.

> hashCode improvements
> ---------------------
>
>          Key: LUCENE-460
>          URL: http://issues.apache.org/jira/browse/LUCENE-460
>      Project: Lucene - Java
>         Type: Improvement

>   Components: Search
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Assignee: Yonik Seeley
>     Priority: Minor
>      Fix For: CVS Nightly - Specify date in submission

>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]