Sequential access to VectorWritable content proposal.


Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
Hi all,

I would like to submit a patch to VectorWritable that allows for streaming
access to vector elements without having to prebuffer all of them first
(the current code allows only the latter).

That patch would strike down one of the memory usage issues in the current
Stochastic SVD implementation and effectively remove the memory bound on n
for the SVD work. (The value I see is not so much removing the bound as
being more efficient with memory, thus essentially speeding up the
computation.)

If it's OK, I would like to create a JIRA issue and provide a patch for it.

Another issue would be an SSVD patch that depends on that patch for
VectorWritable.

Thank you.
-Dmitriy

Re: Sequential access to VectorWritable content proposal.

Ted Dunning
Interesting idea.

Would this introduce a new vector type that only allows iterating through
the elements once?

On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Jake Mannix
Hey Dmitriy,

  I've also been playing around with a VectorWritable format which is backed
by a SequenceFile, but I've been focused on the case where it's essentially
the entire matrix, and the rows don't fit into memory.  This seems different
from your current use case, however - you just want (relatively) small
vectors to load faster, right?

  -jake

On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
No, no new vector types. All that is needed is a push-type element handler
that the algorithm can set, something like

void onNextVectorElement ( int index, double value ) throws IOException

or something like that.

If the handler is set, then the writable uses it during the read process
(and subsequently get() does not produce anything),

and if the handler is not set, then it works as it does today.
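The handler idea above can be sketched in plain Java. This is an illustrative mock, not the actual Mahout patch: the interface name, the reader class, and the on-disk layout (an int index followed by a double value) are assumptions for the sake of the example; only the onNextVectorElement signature comes from the message above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical push-style handler; only the method signature is taken from
// the proposal above, everything else here is an illustrative sketch.
interface VectorElementHandler {
    void onNextVectorElement(int index, double value) throws IOException;
}

public class StreamingVectorReaderDemo {

    // Pushes each (index, value) pair to the handler as it is decoded, so the
    // full element array is never buffered in memory.
    static void read(DataInput in, int numNonZero, VectorElementHandler handler)
            throws IOException {
        for (int i = 0; i < numNonZero; i++) {
            handler.onNextVectorElement(in.readInt(), in.readDouble());
        }
    }

    public static void main(String[] args) throws IOException {
        // Serialize a tiny sparse vector as (index, value) pairs...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int[] indices = {0, 3, 7};
        double[] values = {1.5, -2.0, 4.25};
        for (int i = 0; i < indices.length; i++) {
            out.writeInt(indices[i]);
            out.writeDouble(values[i]);
        }
        // ...then re-read it element by element, accumulating a sum without
        // ever materializing a Vector object.
        DataInput in = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        double[] sum = {0.0};
        read(in, indices.length, (index, value) -> sum[0] += value);
        System.out.println(sum[0]); // 1.5 - 2.0 + 4.25 = 3.75
    }
}
```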

On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
Why not use the Iterable interface then?

On Mon, Dec 13, 2010 at 10:57 AM, Dmitriy Lyubimov <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
In reply to this post by Jake Mannix
How common is it that a row won't fit in memory?  My experience is that
essentially all rows that I am interested in will fit in very modest amounts
of memory, but that row-by-row handling is imperative.

Is this just gilding the lily?

On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Jake Mannix
Jake,
No, I was trying exactly what you were proposing some time ago on the list.
I am trying to keep long vectors from occupying a lot of memory.

E.g. a 1m-long dense vector would require 8Mb just to load it. And I am
saying, hey, there are a lot of sequential techniques that could use a
handler to inspect the vector element-by-element without having to
preallocate 8Mb.

For 1 million-long vectors it isn't too scary, but it starts being so at
default Hadoop memory settings in the area of 50-100Mb (or 5-10 million
non-zero elements). Stochastic SVD will survive that, but it means less
memory for blocking, and the more blocks you have, the more CPU it requires
(although CPU demand is only linear in the number of blocks and only in a
significantly smaller part of the computation, so only an insignificant
share of the total CPU flops depends on the # of blocks; but there is a
part that does, still).

Like I said, it would also address the case when rows don't fit in memory
(hence no memory bound on n of A), but the most immediate benefit is to the
speed, scalability, and memory requirements of SSVD in most practical LSI
cases.

-Dmitriy
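A back-of-envelope check of the sizes in the message above: a dense vector of doubles costs 8 bytes per element. The helper below is hypothetical, not Mahout code, and ignores JVM object overhead (which only makes things worse).

```java
// Hypothetical helper reproducing the memory figures discussed in this
// thread: 8 bytes per dense double element, reported in decimal megabytes.
public class DenseVectorMemory {
    static long denseBytes(long numElements) {
        return numElements * Double.BYTES; // Double.BYTES == 8
    }

    public static void main(String[] args) {
        // 1 million elements -> the "8Mb just to load it" figure above.
        System.out.println(denseBytes(1_000_000) / 1_000_000);
        // 100 million elements -> roughly the "~800M" figure cited later
        // in the thread.
        System.out.println(denseBytes(100_000_000) / 1_000_000);
    }
}
```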

On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
I also think it is not very common, but it would be a collateral benefit.

On Mon, Dec 13, 2010 at 11:05 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
Ted,

What Iterable are you talking about? On a Vector? I don't think it helps,
because you have to have the Vector loaded (prebuffered) already -- so you
have already paid the memory for it. I don't think VectorWritable has an
Iterable, though.

An Iterator is a pull technique, and in MR it would be possible to do
something like this, but Iterable suggests you could do multiple passes over
the data, which is not possible in the context of MR jobs without
prebuffering.

A push technique explicitly suggests that one would not be able to do
multiple passes (as with DocumentHandler in the SAX standard). So even if we
implemented an iterator without prebuffering vector data, in the context of
an MR job it means you can't have more than one iterator, which I think is
conceptually incoherent (i.e. misleading) with respect to the Iterable
contract.


Thanks.
-Dmitriy
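The contract objection above can be illustrated with a toy one-shot Iterable (illustrative only, not Mahout code): once the underlying one-pass source is consumed, a second iterator() call has nothing left to read, which is exactly the surprise the usual Iterable contract is taken to preclude.

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative toy, not Mahout code: an Iterable backed by a one-pass stream
// can honor only a single iterator() call; callers expecting the usual
// multi-pass Iterable contract would be misled.
public class OneShotIterable implements Iterable<Double> {
    private Iterator<Double> source;

    OneShotIterable(Iterator<Double> source) { this.source = source; }

    @Override
    public Iterator<Double> iterator() {
        if (source == null) {
            throw new IllegalStateException("stream already consumed; no second pass");
        }
        Iterator<Double> it = source;
        source = null; // the underlying data can be read only once
        return it;
    }

    public static void main(String[] args) {
        OneShotIterable v = new OneShotIterable(Arrays.asList(1.0, 2.0).iterator());
        for (double d : v) {           // first pass works
            System.out.println(d);
        }
        try {
            v.iterator();              // a second pass cannot
        } catch (IllegalStateException e) {
            System.out.println("second pass refused");
        }
    }
}
```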

On Mon, Dec 13, 2010 at 11:05 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
In reply to this post by Dmitriy Lyubimov
I really don't see this as a big deal even with crazy big vectors.

Looking at web scale, for instance, the most-linked Wikipedia article only
has 10 million in-links or so.  On the web, the most massive web site is
unlikely to have >100 million in-links.  Both of these fit in very modest
amounts of memory.

Where's the rub?

On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Ted Dunning
In reply to this post by Dmitriy Lyubimov
http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize

On Mon, Dec 13, 2010 at 11:06 AM, Dmitriy Lyubimov <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
Ted,

Like I commented in the issue, 100m dense data elements means ~800M to load
a vector. If (suppose) I have 1G in a mapper (which is not terribly common;
in our case we only run jobs with ~500M), then I'd require 80% of RAM for
prebuffering (i.e. just being able to get access to the data) and leave 20%
for the algorithm? Setting aside other considerations, doesn't this strike
you as off-balance in requirements? It does to me, and unfortunately, in my
practical application, with the hardware constraints I've got, I won't be
able to afford that luxury. So it is a problem for my task at the moment,
believe it or not. At this moment even 100Mb for prebuffering the data is
prohibitively costly in my app.

-d

On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
Which is exactly why I am saying I don't see this as the primary problem I
am trying to solve.

On Mon, Dec 13, 2010 at 11:13 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Jake Mannix
In reply to this post by Ted Dunning
Ted, there are some vectors where it certainly matters: the eigenvectors of
a really big matrix are dense, and thus take O(8*num_vertices) bytes to hold
just one of them in memory.  Doing something sequential with these can
certainly make sense, and in some cases is actually necessary, esp. if done
in the mappers or reducers, where there is less memory than you usually
have...

  -jake

On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[hidden email]> wrote:


Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Dmitriy Lyubimov
PS. Just another detail.

Most of the time I actually don't run into long vectors. But occasionally,
among millions of them, I may run into a 10-20 or even 50 million-long one.
Which means not only do I have to adjust algorithm settings to allow 400M
for prebuffering (which is all I have), I actually have to allow for the
worst case, i.e. what happens in only 0.1% of the cases.

I can't accept the answer "use -Xmx1G" in map processes, for reasons that I
can't change.

Absent this solution, I realistically don't see how I can go without a push
technique for accessing the vectors. You are welcome to suggest an
alternative solution to this conundrum, but I see nothing in the current
code that would help fix it without introducing some (very small amount of)
new code.



On Mon, Dec 13, 2010 at 11:19 AM, Dmitriy Lyubimov <[hidden email]> wrote:

> Ted,
>
> like i commented in the issue, 100m dense data elements means ~800M to load
> a vector. If (suppose) i have 1G in a mapper (which is not terribly common,
> not in our case, we only run jobs with -500M), then require 80% of RAM for
> prebuffering (i.e. just being able to get access to the data) and leave 20%
> for algorithm? Setting aside other considerations, doesn't this strike you
> as being off-balance in requirements ? it does to me and unfortunately  in
> my practical application with the hardware constraints i got i won't be able
> to afford that luxury. So it is a problem for my task for the moment,
> believe it or not. At this moment even 100Mb for prebuffering the data is
> prohibitively costly in my app.
>
> -d
>
>
> On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[hidden email]>wrote:
>
>> I really don't see this as a big deal even with crazy big vectors.
>>
>> Looking at web scale, for instance, the most linked wikipedia article only
>> has 10 million in-links or so.  On the web, the most massive web site is
>> unlikely to have >100 million in-links.  Both of these fit in very modest
>> amounts of memory.
>>
>> Where's the rub?
>>
>> On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <[hidden email]
>> >wrote:
>>
>> > Jake,
>> > No, I was trying exactly what you were proposing some time ago on the
>> > list. I am trying to make long vectors not occupy a lot of memory.
>> >
>> > E.g. a 1m-long dense vector would require 8MB just to load it. And I am
>> > saying, hey, there are a lot of sequential techniques that can provide
>> > a handler that would inspect the vector element-by-element without
>> > having to preallocate 8MB.
>> >
>> > For 1-million-long vectors it isn't too scary, but it starts being so
>> > for default Hadoop memory settings in the area of 50-100MB (or 5-10
>> > million non-zero elements). Stochastic SVD will survive that, but it
>> > means less memory for blocking, and the more blocks you have, the more
>> > CPU it requires (although CPU demand is only linear in the number of
>> > blocks and only in a significantly smaller part of the computation, so
>> > that only an insignificant part of the total CPU flops depends on the
>> > # of blocks, but there is a part that does, still).
>> >
>> > Like I said, it also would address the case when rows don't fit in
>> > memory (hence no memory bound for n of A), but the most immediate
>> > benefit is to the speed/scalability/memory requirements of SSVD in
>> > most practical LSI cases.
>> >
>> > -Dmitriy
>> >
>> > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[hidden email]>
>> > wrote:
>> >
>> > > Hey Dmitriy,
>> > >
>> > >  I've also been playing around with a VectorWritable format which is
>> > > backed by a SequenceFile, but I've been focussed on the case where
>> > > it's essentially the entire matrix, and the rows don't fit into
>> > > memory.  This seems different than your current use case, however -
>> > > you just want (relatively) small vectors to load faster, right?
>> > >
>> > >  -jake
>> > >
>
>

Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Jake Mannix
Jake,

Yes, that's the problem here. That's also why I am trying to make this a
proposal instead of making a private patch and being done with it: since I
saw your previous post, it would make sense to do something in a more
coordinated fashion through the open-source process.

Yes, it's the same problem you are having, except in my case it's more
about the memory balance between buffering and the algorithm (which also
needs some of that juice) than about fitting one single vector into all
the memory I have.

-d



On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[hidden email]> wrote:

> Hey Dmitriy,
>
>  I've also been playing around with a VectorWritable format which is
> backed by a SequenceFile, but I've been focussed on the case where it's
> essentially the entire matrix, and the rows don't fit into memory.  This
> seems different than your current use case, however - you just want
> (relatively) small vectors to load faster, right?
>
>  -jake
>

Re: Sequential access to VectorWritable content proposal.

Ted Dunning
In reply to this post by Dmitriy Lyubimov
OK.

Let's assume that this is needed.

I think that an iterable interface on VectorWritable that throws
UnsupportedOperationException or similar if you try to get the iterator
twice is much more transparent than a watcher structure, and much easier
for a user to discover/re-invent.

Another (evil) thought is a parallel class to VectorWritable, essentially a
SequentialAccessVectorWritable, that supports reading and writing.  It
seems to me that the Writable contract isn't really compatible with this
interface in any case.  How will that be resolved?


On Mon, Dec 13, 2010 at 11:36 AM, Dmitriy Lyubimov <[hidden email]>wrote:

> Absent this solution, I realistically don't see how I can do without a
> push technique for accessing the vectors.
>
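A minimal sketch of the single-use iterable described above (the class name and exception choice are assumptions for illustration; this is not existing Mahout code):

```java
import java.util.Iterator;

// Hands out its iterator exactly once; a second request fails loudly,
// which makes the single-pass contract visible to the caller.
final class SingleUseIterable<T> implements Iterable<T> {

    private Iterator<T> iterator;

    SingleUseIterable(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    @Override
    public Iterator<T> iterator() {
        if (iterator == null) {
            throw new UnsupportedOperationException(
                "this vector's elements may only be iterated once");
        }
        Iterator<T> result = iterator;
        iterator = null; // surrender the one-and-only iterator
        return result;
    }
}
```

A streaming VectorWritable could hand back such an Iterable backed directly by the input stream; the failure on the second iterator() call is what makes the one-pass contract easy to discover.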

Re: Sequential access to VectorWritable content proposal.

Ted Dunning
In reply to this post by Jake Mannix
OK.  Let's take that as read, then.

On Mon, Dec 13, 2010 at 11:27 AM, Jake Mannix <[hidden email]> wrote:

> Ted, there are some vectors where it certainly matters: the eigenvectors
> of a really big matrix are dense, and thus take O(8*num_vertices) bytes
> to hold just one of them in memory.  Doing something sequential with
> these can certainly make sense, and in some cases is actually necessary,
> esp. if done in the mappers or reducers where there is less memory than
> you usually have...
>
>  -jake
>
> On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[hidden email]>
> wrote:
>
> > I really don't see this as a big deal even with crazy big vectors.
> >
> > Looking at web scale, for instance, the most linked wikipedia article
> > only has 10 million in-links or so.  On the web, the most massive web
> > site is unlikely to have >100 million in-links.  Both of these fit in
> > very modest amounts of memory.
> >
> > Where's the rub?
> >

Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov
In reply to this post by Ted Dunning
Jake,

Are you saying vector element = sequence file record?

This is also a common problem, and there are huge benefits to be had in
terms of I/O in the stochastic SVD implementation (I reviewed that in the
issues section there). I was thinking in that direction too, but what I
thought was that this problem might be better served by a block-oriented
record format rather than an element-oriented one. I think block-wise
formats are quite common in scientific libraries like ScaLAPACK.

But it's quite clear that some algorithms would be averse to pre-blocking,
but not to row splicing (i.e. just splitting a long row into subfragments).
That's also an interesting initiative, although it's not going to work out
of the box with the stochastic SVD code. The fundamental issue with the
existing approach is that it has to create MR splits on vector boundaries,
not block boundaries (although a 100% block-wise algorithm seems quite
plausible and promising there too).
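Row splicing as described above could look roughly like this (the RowFragment record layout is hypothetical, purely illustrative): each long row is cut into fixed-size fragments, so record boundaries, and hence MR splits, no longer have to fall on whole-vector boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical fragment record: (row index, column offset, values).
final class RowFragment {
    final int row;
    final int offset;       // column index of values[0] within the full row
    final double[] values;

    RowFragment(int row, int offset, double[] values) {
        this.row = row;
        this.offset = offset;
        this.values = values;
    }
}

final class RowSplicer {

    /** Splits one dense row into fragments of at most blockSize elements. */
    static List<RowFragment> splice(int row, double[] denseRow, int blockSize) {
        List<RowFragment> fragments = new ArrayList<>();
        for (int offset = 0; offset < denseRow.length; offset += blockSize) {
            int end = Math.min(offset + blockSize, denseRow.length);
            fragments.add(new RowFragment(
                row, offset, Arrays.copyOfRange(denseRow, offset, end)));
        }
        return fragments;
    }
}
```

Each fragment is small enough to fit comfortably in a mapper's heap; a downstream consumer reassembles a row from its (row, offset) keys.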


On Mon, Dec 13, 2010 at 1:06 PM, Jake Mannix <[hidden email]> wrote:

> I'm not sure that Dmitriy's use-case has an easy solution.  As you
> say, Writable loads into memory the whole thing, independently of
> whether you try / not try to do buffering on iteration.
>
> My situation (monstrous vectors) is easier, in some respects: if
> the matrices are essentially
> SequenceFile<IntWritable,Pair<IntWritable,DoubleWritable>>, then
> there are a lot bigger vectors which can be handled in MR jobs, but
> they no longer really look like "vectors" in the interface sense.
>
>  -jake
>
> On Mon, Dec 13, 2010 at 12:52 PM, Ted Dunning <[hidden email]>
> wrote:
>
> > OK.
> >
> > Let's assume that this is needed.
> >
> > I think that an iterable interface on VectorWritable that throws
> > UnsupportedOperationException or similar if
> > you try to get the iterator twice is much more transparent than a watcher
> > structure and much easier for a user
> > to discover/re-invent.
> >
> > Another (evil) thought is a parallel class to VectorWritable which is
> > essentially SequentialAccessVectorWritable that supports reading and
> > writing.  It seems to me that the Writable isn't real compatible with
> this
> > interface in any case.  How will that be resolved?
> >
> >
> > On Mon, Dec 13, 2010 at 11:36 AM, Dmitriy Lyubimov <[hidden email]
> > >wrote:
> >
> > > Absent of this solution, i realistically don't see how i can go without
> a
> > > push technique in accessing the vectors.
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Ted Dunning
In reply to this post by Ted Dunning
I don't think that sequentiality is part of the contract.
RandomAccessSparseVectors are likely to produce disordered values when
serialized, I think.

On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <[hidden email]> wrote:

> I will have to look at the details of VectorWritable to make sure all
> cases are covered (I have only taken a very brief look so far). But as
> long as it is able to produce elements in order of index increase, the
> push technique will certainly work for most algorithms (and in some
> cases, notably with SSVD, even if it produces the data in a
> non-sequential way, it would work too).
>
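If serialized elements may indeed arrive out of index order, a consumer that needs ascending indices can buffer and sort just the non-zeros, which is still far cheaper than a dense array. A sketch (illustrative helper, not Mahout code):

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class OrderedElements {

    /**
     * Reorders possibly index-disordered (index, value) pairs into ascending
     * index order. Only the non-zeros are buffered, never a dense array of
     * length n, so a high-dimensional vector with few non-zeros stays cheap.
     */
    static SortedMap<Integer, Double> sortByIndex(int[] indices, double[] values) {
        TreeMap<Integer, Double> ordered = new TreeMap<>();
        for (int i = 0; i < indices.length; i++) {
            ordered.put(indices[i], values[i]);
        }
        return ordered;
    }
}
```

Iterating the returned map's entries then feeds a sequential consumer elements in strictly increasing index order.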