[jira] Created: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow


Markus Jelsma (Jira)
OrderedIntDoubleMapping / SparseVector is unnecessarily slow
------------------------------------------------------------

                 Key: MAHOUT-201
                 URL: https://issues.apache.org/jira/browse/MAHOUT-201
             Project: Mahout
          Issue Type: Improvement
          Components: Matrix
    Affects Versions: 0.2
            Reporter: Jake Mannix
             Fix For: 0.3


In the work on MAHOUT-165, I find that while their sparse vector implementation is great from a hashing standpoint (it's memory efficient and fast for random-access), they don't provide anything like the OrderedIntDoublePair - i.e. a vector implementation which is *not* fast for random access, or out-of-order modification, but is minimally sized memory-wise and blazingly fast for doing read-only dot-products and vector sums (where the latter is read-only on inputs, and is creating new output) with each other, and with DenseVectors.

This line of thinking got me looking back at the current SparseVector implementation we have in Mahout, because it *is* based on an int[] and a double[].  Unfortunately, it's not at all optimized for the cases where it can outperform all other sparse impls:

* it should override dot(Vector) and plus(Vector) to check whether the input is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap implementation of SparseVector, that case as well), and do specialized operations here.
* even when those particular methods aren't being used, the AllIterator and NonZeroIterator inner classes are very inefficient:
** minor things like caching the values.numMappings() and values.getIndices in final instance variables in the Iterators
** the huge performance drain of Element.get() : {code} public double get() {  return values.get(ind);  } {code}, which is implemented as a binary search on index values array (the index of which was already known!) followed by the array lookup

This last point is probably the entire reason why we've seen performance problems with the SparseVector, as it's in both the NonZeroIterator and the AllIterator, and so turned any O(numNonZeroElements) operations into O(numNonZeroElements * log(numNonZeroElements)) (with some additional constant factors for too much indirection thrown in for good measure).
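To make concrete why the ordered int[]/double[] layout wins for read-only dot products, here is a minimal, hypothetical sketch (class and method names are illustrative, not Mahout's actual API): a merge-style walk over two index-ordered sparse vectors that touches each nonzero exactly once - O(n1 + n2), with no binary search anywhere.

```java
// Hypothetical sketch, not Mahout's API: merge-style dot product over
// two sparse vectors stored as index-ordered int[]/double[] pairs.
final class OrderedSparseDot {
  static double dot(int[] aIdx, double[] aVal, int[] bIdx, double[] bVal) {
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < aIdx.length && j < bIdx.length) {
      if (aIdx[i] == bIdx[j]) {
        sum += aVal[i++] * bVal[j++];   // matching index: multiply, advance both
      } else if (aIdx[i] < bIdx[j]) {
        i++;                            // advance whichever side is behind
      } else {
        j++;
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    // a = {1: 2.0, 4: 3.0, 7: 1.0}, b = {4: 5.0, 7: 2.0, 9: 4.0}
    double d = dot(new int[] {1, 4, 7}, new double[] {2.0, 3.0, 1.0},
                   new int[] {4, 7, 9}, new double[] {5.0, 2.0, 4.0});
    System.out.println(d);  // overlap at indices 4 and 7: 3*5 + 1*2 = 17.0
  }
}
```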

Unless there is another JIRA ticket which has a patch fixing this which I didn't notice, I can whip up a patch (I've got a similar implementation over in decomposer I can pull stuff out of, although mine is simpler because it is immutable, so it's not just a copy and paste).

We don't have any benchmarking code anywhere yet, do we?  Is there a JIRA ticket open for that already?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-201:
-------------------------------

    Description: updated to read "Colt's sparse vector implementation" where the original said "their sparse vector implementation"; otherwise identical to the description in the first message.


[jira] Updated: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-201:
-------------------------------

    Attachment: MAHOUT-201.patch

First stab at this.  Iteration should not require binary searches - that much I know - but I don't know what to benchmark against quite yet.  Unit tests still pass.
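A sketch of the kind of fix the patch is after (illustrative names, not the actual patch code): the element view carries its own offset into the parallel arrays, so get() is a plain array read rather than a binary search on an index the iterator already knows.

```java
// Illustrative sketch only: an element view that remembers its offset
// into the parallel arrays, making index() and get() O(1) reads
// instead of a binary search per access.
final class OffsetElement {
  private final int[] indices;
  private final double[] values;
  private int offset;  // current position within the parallel arrays

  OffsetElement(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  void moveTo(int offset) { this.offset = offset; }  // advanced by the iterator

  int index() { return indices[offset]; }   // direct read, no search
  double get() { return values[offset]; }   // direct read, no search

  public static void main(String[] args) {
    OffsetElement e = new OffsetElement(new int[] {3, 8, 12},
                                        new double[] {1.5, -2.0, 4.0});
    e.moveTo(1);
    System.out.println(e.index() + " -> " + e.get());  // 8 -> -2.0
  }
}
```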

[jira] Commented: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779403#action_12779403 ]

Shashikant Kore commented on MAHOUT-201:
----------------------------------------

Jake,

Colt also provides fast iteration over key-value pairs.

I think the IntDoublePairIterator in this patch is the same as Colt's IntDoubleProcedure. Check the private class AddToVector, which implements IntDoubleProcedure, in my patch on MAHOUT-165. Is that something you want?
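For readers unfamiliar with the procedure style being referenced: as I understand Colt, its maps expose a forEachPair-style method that takes a callback and invokes it once per entry. The sketch below mimics that shape with a hypothetical interface and class - it is illustrative, not Colt's actual code.

```java
// Hypothetical mimic of Colt's IntDoubleProcedure callback style.
interface IntDoubleProcedureSketch {
  boolean apply(int key, double value);  // return false to stop iteration early
}

final class ProcedureDemo {
  // Visit each (index, value) pair without allocating per-element objects --
  // the main appeal of the procedure style over an Iterator of Elements.
  static void forEachPair(int[] idx, double[] val, IntDoubleProcedureSketch p) {
    for (int i = 0; i < idx.length; i++) {
      if (!p.apply(idx[i], val[i])) {
        return;
      }
    }
  }

  public static void main(String[] args) {
    double[] sum = {0.0};  // one-element array so the lambda can accumulate
    forEachPair(new int[] {0, 5, 9}, new double[] {1.0, 2.5, 3.5},
                (k, v) -> { sum[0] += v; return true; });
    System.out.println(sum[0]);  // 1.0 + 2.5 + 3.5 = 7.0
  }
}
```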


[jira] Commented: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779412#action_12779412 ]

Sean Owen commented on MAHOUT-201:
----------------------------------

Taking a step back -- OrderedIntDoubleMapping was a simple band-aid and placeholder. I had thought we'd replace it, as the implementation behind SparseVector, with Colt or something. In fact, it's my understanding that we're keeping our interfaces and reimplementing them entirely in terms of Colt.

So shouldn't the question rather be: how can we juice up the Colt implementation? There will be no implementation of our own going forward.

[jira] Commented: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779588#action_12779588 ]

Jake Mannix commented on MAHOUT-201:
------------------------------------

The reason I started looking at Mahout's old OrderedIntDoubleMapping was that I didn't see any implementation in Colt which was actually just a wrapper around an ordered int[] and its corresponding double[] values.  Colt has *unordered* int/double mappings (which are hash map impls), which (a) don't keep order information, and (b) use a hash, so are less efficient for both iteration and storage.

Once mahout-colt is committed in MAHOUT-165, then yes, we can juice up a Colt implementation (I haven't pulled down your patch yet, Shashi - I'll check it out).  But from what I can see, they have lots of fast map implementations, and as far as extending their DoubleMatrix1D class (i.e. Vector) goes, SparseDoubleMatrix1D is map-based as well, so there is no equivalent of just an int[]/double[] pair implementing Vector.
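The int[]/double[] pair structure being described can be sketched roughly as follows (hypothetical names, not a real Mahout or Colt class): random access pays a binary-search cost, which is the deliberate tradeoff for minimal memory and fast ordered iteration.

```java
// Rough, illustrative sketch of a vector backed by nothing but an
// ordered int[] of indices and a parallel double[] of values.
final class OrderedPairVector {
  final int[] indices;   // strictly increasing
  final double[] values; // values[i] belongs to index indices[i]

  OrderedPairVector(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  // Random access must binary-search -- the deliberate tradeoff:
  // slow get(), but minimal memory and fast in-order traversal.
  double get(int index) {
    int pos = java.util.Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;  // absent indices are implicit zeros
  }

  public static void main(String[] args) {
    OrderedPairVector v =
        new OrderedPairVector(new int[] {2, 6, 11}, new double[] {0.5, 1.5, 2.5});
    System.out.println(v.get(6));   // 1.5
    System.out.println(v.get(3));   // 0.0 (index not stored)
  }
}
```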

[jira] Commented: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779644#action_12779644 ]

Jake Mannix commented on MAHOUT-201:
------------------------------------

OK, so looking at your patch over on MAHOUT-165 (which could make this issue obsolete), I see that you've implemented ordered iteration as a sorting call in the iterator constructor, and also that the Element.get() call still forwards into the hash-backed collection - which is exactly what I was trying to avoid.

Once MAHOUT-165 is in, we should open a few more tickets to decide on the tradeoffs the various SparseVector implementations can make, and to dig through which parts of Colt are most useful for this.
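The cost being pointed out can be illustrated with plain JDK collections (a hypothetical stand-in for the hash-backed patch, not its actual code): ordered iteration over an unordered map has to sort the keys up front, putting an O(n log n) step in every iterator construction, plus a hash lookup per element.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SortedKeyIteration {
  // Ordered iteration over an unordered int -> double map: copy the keys,
  // sort them (the O(n log n) step), then hash-look-up each one in turn.
  static double[] valuesInIndexOrder(Map<Integer, Double> map) {
    int[] keys = map.keySet().stream().mapToInt(Integer::intValue).toArray();
    Arrays.sort(keys);                       // cost paid per iterator construction
    double[] out = new double[keys.length];
    for (int i = 0; i < keys.length; i++) {
      out[i] = map.get(keys[i]);             // one hash lookup per element
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, Double> m = new HashMap<>();
    m.put(9, 3.0);
    m.put(1, 1.0);
    m.put(4, 2.0);
    System.out.println(Arrays.toString(valuesInIndexOrder(m)));  // [1.0, 2.0, 3.0]
  }
}
```

An index-ordered int[]/double[] pair sidesteps both costs: the keys are already sorted, and each value is one array read away.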

[jira] Resolved: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix resolved MAHOUT-201.
--------------------------------

    Resolution: Duplicate

The patch for this is now included in the patch for MAHOUT-206 (which would have badly broken this patch anyway).
