Sequential access to VectorWritable content proposal.


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
I don't think sequentiality is a requirement in the case I am working on.
However, let me peek at the code first. I am guessing it is some form of a
near-perfect hash, in which case it may not be possible to read it in parts
at all, which would be bad indeed. I would then need to find a completely
different input format to cover my case.

On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <[hidden email]> wrote:

> I don't think that sequentiality is part of the contract.
> RandomAccessSparseVectors are likely to
> produce disordered values when serialized, I think.
>
> On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <[hidden email]>
> wrote:
>
> > I will have to look at details of VectorWritable to make sure all cases
> > are covered (I only took a very brief look so far). But as long as it is
> > able to produce elements in order of index increase, the push technique
> > will certainly work for most algorithms (and in some cases, notably with
> > SSVD, it would work even if it produces the data in a non-sequential way).

Re: Sequential access to VectorWritable content proposal.

Jake Mannix
Dmitriy,

  You should be able to specify that your matrices be stored in
SequentialAccessSparseVector format if you need to.  This is
almost always the right thing for HDFS-backed matrices, because
HDFS is write-once, and SASVectors are optimized for read-only
sequential access, which is your exact use case, right?
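
Something like this, for example -- a quick sketch against the Mahout vector
API (not tested; numCols, rowKey and writer stand in for your own setup):

    // Build the row with cheap random writes, then convert it to the
    // sequential-access representation before writing it out.
    Vector row = new RandomAccessSparseVector(numCols);
    row.setQuick(3, 2.0);
    row.setQuick(7, 1.0);
    // the copy constructor stores the same elements sorted by index
    VectorWritable vw = new VectorWritable(new SequentialAccessSparseVector(row));
    writer.append(new IntWritable(rowKey), vw);  // writer is a SequenceFile.Writer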

  -jake

>

Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
Yes, it should be. I thought Ted implied VectorWritable does it only this
way and no other.

If we can differentiate, I'd rather do it. The implication is that if you
save in the non-sequential format, we'd still support it, with the caveat
that it's subpar in certain cases, whereas if you format the input
sequentially, we'd eliminate the vector prebuffering stage. Yes, that will
work. Thank you, Jake.

-d



Re: Sequential access to VectorWritable content proposal.

Jake Mannix
Check the source for VectorWritable; I'm pretty sure it serializes in the
order of the nonDefaultIterator(), which for SASVectors is index order. So
while these are indeed non-optimal for random access and mutating
operations, that is the tradeoff you have to make when picking your vector
impl.
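
Roughly, on the read side (a sketch; iterateNonZero() is the public face of
that iterator, and process(...) stands in for whatever your algorithm does):

    Vector v = vectorWritable.get();
    Iterator<Vector.Element> it = v.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      // for a SequentialAccessSparseVector, e.index() is non-decreasing
      // across this loop, so downstream code can stream over the elements
      process(e.index(), e.get());
    }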

  -jake


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
Thank you, Jake.

Yes, I will do that tonight. I did do a brief assessment of that code
before, but not very thoroughly, which was why I thought it must be
low-hanging fruit; but you and Ted certainly know better, so I need to look
at it a second time to be sure. Thank you, sir.

-Dima


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
We should talk about whether it isn't better to just serialize chunks of
vectors instead.

This would not require fancy footwork and perversion of the contracts that
VectorWritable thinks it has now.  It would also be bound to be a better
match for the problem domain.

So what about just using VW as it stands for vectors that are at most 100K
elements and are keyed by row and left-most column?  Why isn't that the
better option?
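
As a sketch of what I mean (the key class and the chunk size are made up on
the spot):

    // One logical row of width n becomes ceil(n / CHUNK) records, each a
    // vector of at most CHUNK elements, keyed by (row, firstColumn).
    static final int CHUNK = 100000;

    for (int start = 0; start < n; start += CHUNK) {
      Vector slice = row.viewPart(start, Math.min(CHUNK, n - start));
      writer.append(new RowOffsetWritable(rowIndex, start),   // hypothetical key
                    new VectorWritable(new SequentialAccessSparseVector(slice)));
    }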


Re: Sequential access to VectorWritable content proposal.

Sean Owen
This may be a late or ill-informed comment, but --

I don't believe there's any issue with VectorWritable per se, no. Hadoop
most certainly assumes that one Writable can fit into RAM. A 1GB Writable is
just completely incompatible with how Hadoop works; the algorithm would have
to be parallelized further instead.

Yes, that may mean rewriting the code to deal with small Vectors and such.
That probably doesn't imply a change to VectorWritable, but to the jobs
using it.


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
> We should talk about whether it isn't better to just serialize chunks of
> vectors instead.

That actually is the better option; I mentioned it as the preferable
solution venue. Just splice the rows. But that brings up the issue of data
prep utils, so it would require more coding than adding a capability to
the VW.




Re: Sequential access to VectorWritable content proposal.

Ted Dunning
I am not sure about that since the algorithms will have to be massively
changed either way.

On Mon, Dec 13, 2010 at 5:03 PM, Dmitriy Lyubimov <[hidden email]> wrote:

> But that brings up an issue of data
> prep utils, so that would require more coding than adding a capability to
> the VW.
>

Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
I disagree.

It's simple binary logic: am I equipped with a push handler? Yes --> push
values to the handler. No --> same as before. Nothing changes for anything
that does not _opt in_ to the capability.
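
Inside VectorWritable.readFields() the branch would look roughly like this
(a sketch; the handler interface and its onElement() signature are the ones
I have in mind, not committed code):

    // per-element dispatch while deserializing
    if (preprocessor == null) {
      vector.setQuick(index, value);         // no handler: prebuffer, as today
    } else {
      preprocessor.onElement(index, value);  // handler set: push, never buffer
    }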


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
I meant that algorithms will need to change to use the per-element
capability. Getting algorithms to work with a chunk is the same kind of
work. Chunks are probably going to be more efficient due to blocking
effects, although network overhead may dominate.


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Sean Owen
Sean,

Absolutely agree. There's no such thing as a 1GB HDFS block. The max I ever
saw is 128MB, but most commonly 64MB.

Imagine for a second loading 1GB records into a mapper. Assuming a
sufficiently large number of nodes, the expected amount of collocated data
in this case is approximately 32MB (about half of one 64MB block, i.e. ~3%
of the record); everything else comes from some other node.

So we start our job by moving ~97% of the data around just to make it
available to the mappers.

There are, however, two considerations that somewhat alleviate that concern,
at least in my particular case:

1) Not every vector is 1GB. In fact, only 0.1% of vectors are perhaps
100-150MB at most. But I still have to yield 50% of mapper RAM to the VW
just to make sure that 0.1% of cases goes through. So the case I am making
is not for 1GB vectors; I am saying that the problem is already quite
detrimental to algorithm running time at much smaller sizes.

2) Stochastic SVD is CPU-bound. I did not actually run it with 1GB vectors,
nor do I plan to. But if I did, I strongly suspect it is not I/O that would
be the bottleneck.


Re: Sequential access to VectorWritable content proposal.

Sean Owen
Yah, I have the same sort of problem with recommenders. Once in a while you
have a very, very popular item. In those cases I use some smart sampling to
simply throw out some data. For my use cases that has virtually no effect
on the result, so it's valid. Not sure if you can use the same approach.
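
(Roughly what that sampling looks like, as a sketch -- prefsForItem() and
maxPrefsPerItem are placeholders for your own accessor and cutoff:)

    // Cap the number of preferences kept per item; drop the excess at random.
    List<Preference> prefs = prefsForItem(itemID);
    if (prefs.size() > maxPrefsPerItem) {
      Collections.shuffle(prefs, random);
      prefs = prefs.subList(0, maxPrefsPerItem);
    }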



Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Jake Mannix
Jake, yes, I just looked at the VW code again. I don't see any problem
equipping it with an element handler, even when it's a sparse random-access
vector.

So, driven by practical delivery goals, I'd have to push on with hacking
VectorWritable for now. It's not clear whether this will result in a
significant decrease in SSVD running time, but I think it will help me get
through the current RAM-starvation nuisances. I remain fairly unconvinced
that there's been a really strong argument so far against providing a vector
preprocessing capability of some sort, or that the push-preprocessor
technique is somehow not-so-elegant (or should I say ugly?) in itself. It's
a pattern that is well understood and widely used in the community,
especially as long as the javadoc is very explicit about it. I gather
there's little interest in this sort of memory allocation improvement, so
I'll keep this hack private, np, until some sort of spliced-record or
blocked matrix format matures. The existing patch is certainly very usable
as is, especially if one can afford some extra memory in child processes.

I fully support the notion that eventually some sort of blocked format is
the most promising route for 1GB vectors, but I think for my purposes I can
get away with just this plus some MR optimizations for split sizes and
associated I/O. The more I think about it, though, the more convinced I
become that for SSVD wide matrices are preferable over tall matrices, as
that would reduce the amount of data circulating through the
shuffle-and-sort circuitry as well as the number of blocks in the SSVD
computations. So it kind of makes wider matrices more scalable. Although I
don't think a billion-row tall matrix is a problem for anything but CPU in
the context of SSVD either.


thanks.
-Dmitriy


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Sean Owen
I see. Thank you, Sean.


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
Sometimes sampling works; sometimes it doesn't. Sampling helps in
recommendations even more with time than with space, because cooccurrence
counting requires time quadratic in the number of occurrences of an item.

In fact, it is possible to do the block decomposition we were talking about
using interleaved sampling. That would have an interesting and often
positive result in many algorithms.

But sampling is, sadly, not an option for many algorithms.


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Dmitriy Lyubimov
Initial work on vector preprocessing is done in my git, on the
"ssvd-vw-hack" branch of my SSVD work (doc is here:
https://github.com/dlyubimov/ssvd-doc). The heavy lifting is done through
the VectorPreprocessor interface, and it seems to work like a charm. I have
not tested it extensively, but when complete, it should be able to cope with
occasional spikes in data density without crippling the SSVD mapper's
memory.
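
(The consumer side amounts to something like this sketch; setPreprocessor()
and onElement() are the names in my branch, or close to them:)

    // In the SSVD mapper: consume elements as they are deserialized,
    // instead of waiting for the whole vector to be prebuffered in RAM.
    vectorWritable.setPreprocessor(new VectorPreprocessor() {
      @Override
      public void onElement(int index, double value) {
        accumulate(index, value);  // e.g. fold into the Y-block row being built
      }
    });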

thanks. -d

On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <[hidden email]> wrote:

> Hi all,
>
> I would like to submit a patch to VectorWritable that allows streaming
> access to vector elements without having to prebuffer all of them first
> (the current code allows for the latter only).
>
> That patch would strike down one of the memory usage issues in the current
> Stochastic SVD implementation and effectively open the memory bound for n
> of the SVD work. (The value I see is not to open up the bound, though, but
> just to be more efficient in memory use, thus essentially speeding up the
> computation.)
>
> If it's ok, I would like to create a JIRA issue and provide a patch for it.
>
> Another issue is to provide an SSVD patch that depends on that patch for
> VectorWritable.
>
> Thank you.
> -Dmitriy
>