[jira] Created: (LUCENE-977) internal hashing improvements

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-977) internal hashing improvements

Soren Daugaard (Jira)
internal hashing improvements
-----------------------------

                 Key: LUCENE-977
                 URL: https://issues.apache.org/jira/browse/LUCENE-977
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Yonik Seeley


Internal power-of-two closed hashtable traversal in DocumentsWriter and CharArraySet could be better.

Here is the current method of resolving collisions:
    if (text2 != null && !equals(text, len, text2)) {
      final int inc = code*1347|1;
      do {
        code += inc;
        pos = code & mask;
        text2 = entries[pos];
      } while (text2 != null && !equals(text, len, text2));

The problem is that two different hashCodes with the same lower bits will keep picking the same slots (the upper bits will be ignored).
This is because multiplication (*1347) only really shifts bits to the left... so given that the two codes already matched on the right, they will both pick the same increment, and this will keep them on the same path through the table (even though it's being added to numbers that differ on the left).  To resolve this, some bits need to be moved to the right when calculating the increment.



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-977) internal hashing improvements

Soren Daugaard (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-977:
--------------------------------

    Attachment: hash.patch

Here is a patch that adds in 7 new bits (the rightmost bit is destroyed to make the number odd) when calculating the incrementor via
              final int inc = ((code>>8)+code)|1;
And thus gives 128 different possible paths to follow *per slot* on the first collision on that slot.
Ideally we would shift log2(size_of_hashtable), but it's probably not worth calculating that and I chose a small shift so it would work well for small hashCodes (say from a very short string).

Given that equals() in these cases is probably pretty fast, the average speedup is probably relatively minimal.

Comments?


> internal hashing improvements
> -----------------------------
>
>                 Key: LUCENE-977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-977
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>         Attachments: hash.patch
>
>
> Internal power-of-two closed hashtable traversal in DocumentsWriter and CharArraySet could be better.
> Here is the current method of resolving collisions:
>     if (text2 != null && !equals(text, len, text2)) {
>       final int inc = code*1347|1;
>       do {
>         code += inc;
>         pos = code & mask;
>         text2 = entries[pos];
>       } while (text2 != null && !equals(text, len, text2));
> The problem is that two different hashCodes with the same lower bits will keep picking the same slots (the upper bits will be ignored).
> This is because multiplication (*1347) only really shifts bits to the left... so given that the two codes already matched on the right, they will both pick the same increment, and this will keep them on the same path through the table (even though it's being added to numbers that differ on the left).  To resolve this, some bits need to be moved to the right when calculating the increment.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-977) internal hashing improvements

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519233 ]

Michael McCandless commented on LUCENE-977:
-------------------------------------------

This is an excellent change and makes complete sense that it will resolve conflicts faster than the first way.  I say commit it!

> internal hashing improvements
> -----------------------------
>
>                 Key: LUCENE-977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-977
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>         Attachments: hash.patch
>
>
> Internal power-of-two closed hashtable traversal in DocumentsWriter and CharArraySet could be better.
> Here is the current method of resolving collisions:
>     if (text2 != null && !equals(text, len, text2)) {
>       final int inc = code*1347|1;
>       do {
>         code += inc;
>         pos = code & mask;
>         text2 = entries[pos];
>       } while (text2 != null && !equals(text, len, text2));
> The problem is that two different hashCodes with the same lower bits will keep picking the same slots (the upper bits will be ignored).
> This is because multiplication (*1347) only really shifts bits to the left... so given that the two codes already matched on the right, they will both pick the same increment, and this will keep them on the same path through the table (even though it's being added to numbers that differ on the left).  To resolve this, some bits need to be moved to the right when calculating the increment.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Resolved: (LUCENE-977) internal hashing improvements

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved LUCENE-977.
---------------------------------

    Resolution: Fixed

committed.

> internal hashing improvements
> -----------------------------
>
>                 Key: LUCENE-977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-977
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>         Attachments: hash.patch
>
>
> Internal power-of-two closed hashtable traversal in DocumentsWriter and CharArraySet could be better.
> Here is the current method of resolving collisions:
>     if (text2 != null && !equals(text, len, text2)) {
>       final int inc = code*1347|1;
>       do {
>         code += inc;
>         pos = code & mask;
>         text2 = entries[pos];
>       } while (text2 != null && !equals(text, len, text2));
> The problem is that two different hashCodes with the same lower bits will keep picking the same slots (the upper bits will be ignored).
> This is because multiplication (*1347) only really shifts bits to the left... so given that the two codes already matched on the right, they will both pick the same increment, and this will keep them on the same path through the table (even though it's being added to numbers that differ on the left).  To resolve this, some bits need to be moved to the right when calculating the increment.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]