Hadoop archives (.har) are really really slow

Aaron Turner
Basically I want to list all the files in a .har file and compare the
file list/sizes to an existing directory in HDFS.  The problem is that
running commands like: hdfs dfs -ls -R <path to har file> is orders of
magnitude slower than running the same command against a live HDFS
file system.

How much slower?  I've calculated it will take ~19 days to list all
the files in 250TB worth of content spread between 2 .har files.

Is this normal?  Can I do this faster (write a map/reduce job, etc.)?
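
One way to do the comparison described above is to capture each recursive listing to a text file once and compare the two offline. The following is only a sketch under stated assumptions, not something confirmed in this thread: it assumes the usual eight-column output of hdfs dfs -ls -R (permissions, replication, owner, group, size, date, time, path), and the sample paths and the helper names parse_ls_output and compare_listings are hypothetical. It does not make the har listing itself any faster.

```python
from typing import Dict, Iterable, List, Tuple


def parse_ls_output(lines: Iterable[str], prefix: str = "") -> Dict[str, int]:
    """Map relative file path -> size in bytes from `hdfs dfs -ls -R` output."""
    sizes: Dict[str, int] = {}
    for line in lines:
        # Assumed columns: perms, replication, owner, group, size, date, time, path.
        fields = line.rstrip("\n").split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip "Found N items" headers, blank lines, and directories
        path = fields[7]
        if prefix and path.startswith(prefix):
            path = path[len(prefix):]  # make paths comparable across roots
        sizes[path] = int(fields[4])
    return sizes


def compare_listings(
    live: Dict[str, int], archived: Dict[str, int]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (missing from archive, extra in archive, size mismatches)."""
    missing = sorted(set(live) - set(archived))
    extra = sorted(set(archived) - set(live))
    mismatched = sorted(
        p for p in live.keys() & archived.keys() if live[p] != archived[p]
    )
    return missing, extra, mismatched


# Hypothetical sample listings; real ones would be read from saved files.
live_listing = """Found 1 items
drwxr-xr-x   - hdfs hadoop          0 2016-08-15 10:01 /data/logs
-rw-r--r--   3 hdfs hadoop       1024 2016-08-15 10:01 /data/logs/a.log
-rw-r--r--   3 hdfs hadoop       2048 2016-08-15 10:01 /data/logs/b.log
-rw-r--r--   3 hdfs hadoop        512 2016-08-15 10:01 /data/logs/d.log
""".splitlines()

har_listing = """drwxr-xr-x   - hdfs hadoop          0 2016-08-15 10:01 har:///archive/data.har/logs
-rw-r--r--   3 hdfs hadoop       1024 2016-08-15 10:01 har:///archive/data.har/logs/a.log
-rw-r--r--   3 hdfs hadoop       2000 2016-08-15 10:01 har:///archive/data.har/logs/b.log
-rw-r--r--   3 hdfs hadoop        256 2016-08-15 10:01 har:///archive/data.har/logs/c.log
""".splitlines()

live = parse_ls_output(live_listing, prefix="/data")
archived = parse_ls_output(har_listing, prefix="har:///archive/data.har")
missing, extra, mismatched = compare_listings(live, archived)

print("missing from archive:", missing)
print("extra in archive:", extra)
print("size mismatches:", mismatched)
```

Stripping a per-listing prefix is what lets paths under the live directory and paths under the har:// root be compared key for key.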

--
Aaron Turner
https://synfin.net/         Twitter: @synfinatic
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Hadoop archives (.har) are really really slow

Aaron Turner
I can list all the files out of HDFS in a few hours, not a day. Listing the files in a single directory in the har takes ~50 min.  Honestly, I'd be happy with only a 10x performance hit; I'm seeing closer to 100-150x.
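
As a rough sanity check on those figures, the sketch below divides the ~19 day estimate for the har listing by the 100-150x slowdown to see what it implies for a live HDFS listing. Only the numbers quoted in this thread are used; the variable names are made up for illustration.

```python
HAR_LISTING_DAYS = 19  # estimate quoted in the original post
har_hours = HAR_LISTING_DAYS * 24

# Implied live-HDFS listing time for each end of the reported slowdown range.
implied = {slowdown: har_hours / slowdown for slowdown in (100, 150)}

for slowdown, hours in implied.items():
    print(f"{slowdown}x slower implies {hours:.1f} hours on live HDFS")
```

That works out to roughly 3 to 4.6 hours, which is consistent with "a few hours".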

-Aaron


On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze <[hidden email]> wrote:

ls over files in har:// may be 10 times slower than ls over regular files.  It does not sound normal unless it would take ~1 day to list out all the 250TB files when they are stored as regular files.
Tsz-Wo


On Monday, August 15, 2016 10:01 AM, Aaron Turner <[hidden email]> wrote:


Basically I want to list all the files in a .har file and compare the
file list/sizes to an existing directory in HDFS.  The problem is that
running commands like: hdfs dfs -ls -R <path to har file> is orders of
magnitude slower than running the same command against a live HDFS
file system.

How much slower?  I've calculated it will take ~19 days to list all
the files in 250TB worth of content spread between 2 .har files.

Is this normal?  Can I do this faster (write a map/reduce job, etc.)?

Re: Hadoop archives (.har) are really really slow

Aaron Turner
Oh, I should mention that creating the archive took only a few hours, but copying the files out of the archive back to HDFS ran at 80MB/min. At that rate it would take years to copy everything back, which seems really surprising.
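
For reference, the "years" estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 1,000,000 MB) and a constant 80MB/min rate; the variable names are illustrative only.

```python
TOTAL_TB = 250           # total content size quoted in the thread
RATE_MB_PER_MIN = 80     # observed copy-out rate

total_mb = TOTAL_TB * 1000 * 1000   # assumption: decimal TB -> MB
minutes = total_mb / RATE_MB_PER_MIN
days = minutes / (60 * 24)
years = days / 365

print(f"{minutes:,.0f} minutes = {days:,.0f} days = about {years:.1f} years")
```

Under those assumptions the copy would take close to six years.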

-Aaron

