Fast way to read thousands of double values in Hadoop jobs


Madhav Sharan
Hi , can someone please recommend a fast way in hadoop to store and retrieve matrix of double values?

As of now we store the values in text files and read them in Java using an HDFS InputStream and a Scanner [0]. These files are actually vectors representing a video file. Each vector is 883 x 200, and for one map job we read 4 such vectors, so the job has to convert 706,400 values to doubles.

With this approach it takes ~1.5 seconds to convert all these values. I could use an external cache server to avoid the repeated conversion, but I am looking for a better solution.
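For reference, the reading code looks roughly like this (class name, path handling, and loop bounds are illustrative, not our exact code):

import java.io.IOException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TextVectorReader {
    // Sketch of the current approach: parse an 883 x 200 vector of
    // text-formatted doubles from a file on HDFS with a Scanner.
    public static double[][] read(String file) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        double[][] vector = new double[883][200];
        try (FSDataInputStream in = fs.open(new Path(file));
             Scanner scanner = new Scanner(in)) {
            for (int i = 0; i < 883; i++) {
                for (int j = 0; j < 200; j++) {
                    vector[i][j] = scanner.nextDouble(); // text -> double, the slow part
                }
            }
        }
        return vector;
    }
}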

Re: Fast way to read thousands of double values in Hadoop jobs

Daniel Haviv
Store them within a SequenceFile.
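Something along these lines, for example (an untested sketch; the row-per-record layout and the names are just one option):

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class VectorSeqFileWriter {
    // Sketch: one record per matrix row, keyed by row index, with the
    // row's doubles packed as raw bytes so readers skip text parsing.
    public static void write(String file, double[][] matrix) throws IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(file)),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (int row = 0; row < matrix.length; row++) {
                ByteBuffer buf = ByteBuffer.allocate(matrix[row].length * 8); // 8 bytes per double
                for (double d : matrix[row]) {
                    buf.putDouble(d);
                }
                writer.append(new IntWritable(row), new BytesWritable(buf.array()));
            }
        }
    }
}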


Re: Fast way to read thousands of double values in Hadoop jobs

Madhav Sharan
Thanks for your suggestion Daniel. I was already using SequenceFile but my format was poor. I was storing file contents as Text in my SeqFile,

So all my map jobs did repeated conversion from Text to double. I resolved this by correcting SequenceFile format. Now I store serialised java object in SeqFile and my map jobs are faster.
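In case it helps someone else, the reading side now looks roughly like this (a sketch, not our exact code; each value is a BytesWritable holding a serialized Java object, here a double[][]):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class VectorSeqFileReader {
    // Sketch: read the first record and deserialize its value back into
    // a double[][] -- no per-value text parsing in the map job.
    public static double[][] read(String file) throws IOException, ClassNotFoundException {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(new Path(file)))) {
            IntWritable key = new IntWritable();
            BytesWritable value = new BytesWritable();
            if (reader.next(key, value)) {
                try (ObjectInputStream in = new ObjectInputStream(
                        new ByteArrayInputStream(value.getBytes(), 0, value.getLength()))) {
                    return (double[][]) in.readObject();
                }
            }
            return null;
        }
    }
}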

--
Madhav Sharan




Re: Fast way to read thousands of double values in Hadoop jobs

Daniel Haviv
That was the idea :)
Thanks for the update
