[jira] Created: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)
Features incorrectly hashed in Minhash
--------------------------------------

                 Key: MAHOUT-565
                 URL: https://issues.apache.org/jira/browse/MAHOUT-565
             Project: Mahout
          Issue Type: Bug
    Affects Versions: 0.4
            Reporter: Ankur


Given a feature vector for which a minhash signature is desired, each feature id (an integer) is converted to a byte array through a series of bit-shift operations. The current implementation of these operations doesn't mask the bits being shifted, resulting in the sign bit being shifted.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur reassigned MAHOUT-565:
----------------------------

    Assignee: Ankur

[jira] Updated: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated MAHOUT-565:
-------------------------

    Attachment: jira-565.v1.patch

Patch that fixes the issue, along with minor changes to the test case.

[jira] Commented: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972003#action_12972003 ]

Ankur commented on MAHOUT-565:
------------------------------

The formatting is a bit off even though I am using the Eclipse code template mentioned here: https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute


[jira] Commented: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972037#action_12972037 ]

Sean Owen commented on MAHOUT-565:
----------------------------------

I looked at the patch and might be missing something, but I don't see how it changes the behavior. After the shift, the cast to byte retains only the bottom 8 bits anyway. The shifted-in bits don't matter right?

The formatting changes are fine IMHO.

There are several other changes in this patch, is that intended?
And might they be affecting or even fixing whatever you observe?
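
The point about the cast can be checked with a minimal sketch. This is not Mahout's actual code; the class and method names (`ShiftCastDemo`, `toBytesUnmasked`, `toBytesMasked`) are hypothetical. It converts an int to four big-endian bytes with and without a `& 0xFF` mask and compares the results, including for negative ids where the signed shift brings in copies of the sign bit.

```java
import java.util.Arrays;

class ShiftCastDemo {

  // Hypothetical helper: signed shift followed by a narrowing cast, no mask.
  static byte[] toBytesUnmasked(int value) {
    return new byte[] {
      (byte) (value >> 24),
      (byte) (value >> 16),
      (byte) (value >> 8),
      (byte) value
    };
  }

  // Same conversion, with an explicit mask before the cast.
  static byte[] toBytesMasked(int value) {
    return new byte[] {
      (byte) ((value >> 24) & 0xFF),
      (byte) ((value >> 16) & 0xFF),
      (byte) ((value >> 8) & 0xFF),
      (byte) (value & 0xFF)
    };
  }

  public static void main(String[] args) {
    int[] samples = {0, 1, 255, 256, -1, -2, Integer.MIN_VALUE, Integer.MAX_VALUE};
    for (int v : samples) {
      // The cast to byte keeps only the low 8 bits, so the mask changes nothing.
      boolean same = Arrays.equals(toBytesUnmasked(v), toBytesMasked(v));
      System.out.println(v + " -> " + Arrays.toString(toBytesUnmasked(v)) + " same=" + same);
    }
  }
}
```

Both helpers should agree for every int, since the narrowing conversion to `byte` discards all bits above the lowest eight regardless of what the shift brought in.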

[jira] Commented: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972073#action_12972073 ]

Ankur commented on MAHOUT-565:
------------------------------

> ...The shifted-in bits don't matter right?
You are right. This change is NOT needed. The masking is only needed when we are getting back an integer from the relevant bytes. Somewhere else (not in Mahout's code) I was messing the bytes up when converting them back to an integer. So out of caution I put this one in. This particular change can be discarded.
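
The reverse direction, where the mask does matter, can be illustrated with a small sketch. It is only an illustration under assumed names (`ByteRoundTripDemo`, `fromBytesUnmasked`, `fromBytesMasked` are hypothetical), not the code from the patch or from the other project mentioned. In Java a `byte` is signed and is sign-extended when promoted to `int`, so a byte with its high bit set turns into a negative int that overwrites the higher-order bytes when OR-ed in.

```java
class ByteRoundTripDemo {

  // Big-endian int-to-bytes conversion; no mask is needed in this direction.
  static byte[] toBytes(int value) {
    return new byte[] {
      (byte) (value >> 24), (byte) (value >> 16), (byte) (value >> 8), (byte) value
    };
  }

  // Buggy reassembly: each byte is sign-extended to int before the OR.
  static int fromBytesUnmasked(byte[] b) {
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
  }

  // Correct reassembly: mask each byte to its unsigned value first.
  static int fromBytesMasked(byte[] b) {
    return ((b[0] & 0xFF) << 24)
        | ((b[1] & 0xFF) << 16)
        | ((b[2] & 0xFF) << 8)
        | (b[3] & 0xFF);
  }

  public static void main(String[] args) {
    byte[] bytes = toBytes(200);                   // lowest byte is (byte) 200 == -56
    System.out.println(fromBytesUnmasked(bytes));  // prints -56
    System.out.println(fromBytesMasked(bytes));    // prints 200
  }
}
```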

> The formatting changes are fine IMHO
Thanks. I set up the code template mentioned on "How to Contribute".

> There are several other changes in this patch, is that intended?
There are 2 noteworthy changes:
1. Concatenating hash signatures in a sliding-window fashion. This makes sure that an item falls into as many buckets as the number of hash signatures selected, giving it more room to collide with similar items.
2. Fixing the test case in TestMinHashClustering, which was missing evaluation of the last cluster.

I haven't had the time to write up the Mahout documentation for this. Also, I need to think about using the results in a recommendations context. Any suggestions?
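
One way to read the sliding-window concatenation in point 1 is sketched below. This is an illustration, not the patch itself: the names `SlidingWindowKeysDemo`, `bucketKeys` and `keyGroups`, the `-` separator, and the wrap-around at the end of the signature array are all assumptions, chosen so that n hash values yield exactly n bucket keys as the comment describes.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowKeysDemo {

  // Builds one bucket key per starting position by concatenating `keyGroups`
  // consecutive min-hash values, wrapping around at the end of the array
  // (the wrap-around is an assumption made for this sketch).
  static List<String> bucketKeys(int[] minHashValues, int keyGroups) {
    int n = minHashValues.length;
    List<String> keys = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      StringBuilder key = new StringBuilder();
      for (int j = 0; j < keyGroups; j++) {
        if (j > 0) {
          key.append('-');
        }
        key.append(minHashValues[(i + j) % n]);
      }
      keys.add(key.toString());
    }
    return keys;
  }

  public static void main(String[] args) {
    // Four hash values with a window of two give four keys.
    System.out.println(bucketKeys(new int[] {11, 22, 33, 44}, 2));
    // prints [11-22, 22-33, 33-44, 44-11]
  }
}
```

Compared with splitting the signature into disjoint groups, which would give only n / keyGroups keys, the overlapping windows place each item in more buckets and so give similar items more chances to share one.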

[jira] Resolved: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-565.
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5

OK, committed without that one bit.

[jira] Commented: (MAHOUT-565) Features incorrectly hashed in Minhash

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972133#action_12972133 ]

Hudson commented on MAHOUT-565:
-------------------------------

Integrated in Mahout-Quality #509 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/509/])
    MAHOUT-565

