How to copy data between two HDFS clusters quickly?


ch huang
hi, maillist:
I am using distcp to migrate data from CDH 4.4 to CDH 5.1. It works well for small files, but transferring large volumes of data is very slow. Can anyone recommend a faster approach? Thanks.
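For reference, a minimal distcp invocation of the kind described above might look like the sketch below; the namenode hostnames and paths are made up, and with all defaults distcp assigns whole files to map tasks:

# Basic copy from the old cluster to the new one (hypothetical hosts and paths).
$ hadoop distcp hdfs://cdh44-nn:8020/data/warehouse hdfs://cdh51-nn:8020/data/warehouse

# If the two clusters' RPC versions are not wire-compatible, the source side is
# commonly read over webhdfs instead:
$ hadoop distcp webhdfs://cdh44-nn:50070/data/warehouse hdfs://cdh51-nn:8020/data/warehouse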

Re: How to copy data between two HDFS clusters quickly?

Azuryy Yu
Did you specify how many map tasks?




Re: How to copy data between two HDFS clusters quickly?

ch huang
No, all defaults.




Spark vs Tez

Adaryl "Bob" Wakefield, MBA
Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
B.

Re: Spark vs Tez

Alexander Pivovarov
AMPLab, where Spark was created, published some benchmarks.



Re: Spark vs Tez

Shahab Yunus
In reply to this post by Adaryl "Bob" Wakefield, MBA
What aspects of Tez and Spark are you comparing? They have different purposes and thus are not directly comparable, as far as I understand.

Regards,
Shahab



Re: Spark vs Tez

kartik saxena
In reply to this post by Adaryl "Bob" Wakefield, MBA
I did a performance benchmark during my summer internship (I am currently a grad student). I can't reveal much about the specific project, but Spark is still faster than roughly the 4th or 5th iteration of Tez on the same query and dataset. By iteration I mean making use of the "hot container" feature of Apache Tez; see the latest Tez release and some of the Hortonworks tutorials on their website.

The only problem with Spark adoption is the steep learning curve of Scala and of understanding the API properly.

Thanks



Re: Spark vs Tez

Adaryl "Bob" Wakefield, MBA
In reply to this post by Shahab Yunus
It was my understanding that Spark is about faster batch processing, and that Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct?
B.
 
 
 
 

Dynamically set map / reducer memory

peterm_second
In reply to this post by Shahab Yunus
Hi guys,
I am trying to run a few MR jobs in succession; some of the jobs don't need much memory and others do. I want to be able to tell Hadoop how much memory should be allocated to the mappers of each job.
I know how to increase the memory for a mapper JVM globally, through mapred-site.xml.
I tried manually setting mapreduce.reduce.java.opts = -Xmx<someNumber>m, but it wasn't picked up by the mapper JVM; the global setting was always picked up instead.

In summary:
Job 1 - mappers need only 250 MB of RAM
Job 2 - mappers and reducers each need around 2 GB

Ideally I would like to set those limits per job at submission time, rather than having to change the cluster-wide configuration beforehand.

Re: Spark vs Tez

Alexander Pivovarov
In reply to this post by Adaryl "Bob" Wakefield, MBA
There is going to be a Spark engine for Hive (in addition to the MR and Tez engines).

The Spark API is available for Java and Python as well.

The Tez engine is available now and it's quite stable. As for speed: for complex queries it shows a 10x-20x improvement compared to the MR engine.
For example, one of my queries runs for 30 minutes using MR (about 100 MR jobs); if I switch to Tez it finishes in 100 seconds.

I'm using HDP 2.1.5 (Hive 0.13.1, Tez 0.4.1).
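For illustration, a sketch of switching the Hive execution engine per invocation or per session; the query file and table name are hypothetical:

# Run one script with the Tez engine for this invocation only.
$ hive --hiveconf hive.execution.engine=tez -f big_join_query.hql

# Or switch engines inside an ad-hoc session.
$ hive -e "set hive.execution.engine=tez; select count(*) from web_logs;"

Either way the same HiveQL runs unchanged; only the underlying execution engine differs.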



Re: Spark vs Tez

Adaryl "Bob" Wakefield, MBA
In reply to this post by kartik saxena
“The only problem with Spark adoption is the steep learning curve of Scala, and understanding the API properly.”

This is why I'm looking for reasons to avoid Spark. In my mind, it's one more thing to have to master, and it doesn't really offer anything that can't be done with other tools already in my skillset. I spoke with some software engineers recently, and the discussion basically boiled down to: if you have to choose between mastering Java or Scala, go with Java. Three months into Java, I don't want to stop that and start learning Scala.
 
B.

Re: How to copy data between two HDFS clusters quickly?

Shivram Mani
In reply to this post by ch huang
What is your approximate input size?
Do you have multiple files, or is this one large file?
What is your block size (on the source and destination clusters)?
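For anyone unsure how to answer these questions, a quick sketch using standard HDFS tooling (the path is hypothetical):

# Total size of the data to be copied, plus per-file sizes.
$ hdfs dfs -du -s -h /data/warehouse
$ hdfs dfs -du -h /data/warehouse

# File count, block count and per-file block sizes for the same path.
$ hdfs fsck /data/warehouse -files -blocks | head -n 50

# Default block size configured on the cluster (in bytes).
$ hdfs getconf -confKey dfs.blocksize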

--
Thanks
Shivram

Re: Spark vs Tez

Gavin Yue
In reply to this post by Adaryl "Bob" Wakefield, MBA
Spark and Tez both make MR faster; there is no doubt about that.

They also provide new features like DAG execution, which is quite important for interactive query processing. From this perspective, you could view them as wrappers around MR that try to handle the intermediate buffers (files) more efficiently, which is a big pain point in MR.

They also both try to use memory as the buffer instead of only the filesystem. Spark has the concept of an RDD, which is quite interesting but also limited.





Re: How to copy data between two HDFS clusters quickly?

Alexander Pivovarov
In reply to this post by Shivram Mani
Try running this on a datanode of the destination cluster:
$ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....





Re: How to copy data between two HDFS clusters quickly?

Jakub Stransky

Distcp?



Re: Dynamically set map / reducer memory

Girish Lingappa
In reply to this post by peterm_second
Peter

If you are using Oozie to launch the MR jobs, you can specify the memory requirements in the workflow action for each job, in the workflow XML you use to launch it. If you are writing your own driver program to launch the jobs, you can still set these parameters in the job configuration you use to launch each job.
In the case where you modified mapred-site.xml to set your memory requirements, did you change it on the client machine from which you launch the jobs?
Please share more details about your setup and how you are launching the jobs so we can better understand the problem you are facing.

Girish

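As a concrete sketch of the "set it in the job configuration at launch time" suggestion: if the driver uses ToolRunner/GenericOptionsParser, the memory properties can be passed per job as -D options. The jar name, main classes, paths and numbers below are made up:

# Hypothetical job 1: small mappers (container size and JVM heap are per-job overrides).
$ hadoop jar my-jobs.jar com.example.JobOne \
    -D mapreduce.map.memory.mb=512 \
    -D mapreduce.map.java.opts=-Xmx400m \
    /input/one /output/one

# Hypothetical job 2: larger mappers and reducers.
$ hadoop jar my-jobs.jar com.example.JobTwo \
    -D mapreduce.map.memory.mb=2560 \
    -D mapreduce.map.java.opts=-Xmx2048m \
    -D mapreduce.reduce.memory.mb=2560 \
    -D mapreduce.reduce.java.opts=-Xmx2048m \
    /input/two /output/two

# Note: mapreduce.map.java.opts controls the mapper JVM heap; mapreduce.reduce.java.opts
# only affects reducers, which may be why the earlier override was never picked up by the mappers.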


Re: How to copy data between two HDFS clusters quickly?

ch huang
In reply to this post by Shivram Mani
Several files; the total size is about 2 TB, and the block size is 128 MB.



Re: How to copy data between two HDFS clusters quickly?

ch huang
In reply to this post by Jakub Stransky
yes




Re: How to copy data between two HDFS clusters quickly?

Shivram Mani

Distcp is pretty restrictive with respect to parallelizing the data copy. If all you are copying is one large file, distcp won't make it any faster.

In distcp, files are the lowest level of granularity, so increasing the number of maps does not necessarily increase overall throughput.

If I'm not mistaken, the default number of mappers for distcp is 20. If all you were doing was copying a single large file, only one map task would effectively be used.







--
Thanks
Shivram

Re: How to copy data between two HDFS clusters quickly?

Shivram Mani

If you still do want to use distcp:

1. Break the file into smaller files (only if you have the luxury of doing this).

2. Use the "-m" option to set the number of mappers.
(Each map task aims to copy roughly (total bytes across all files) / numSplits; distcp uses UniformSizeInputFormat by default.)

3. distcp by default uses a throttled input stream, limited to 100 MB/s per map. You can tune this to your network bandwidth using the "-bandwidth" option (see the sketch below).
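Putting options 2 and 3 together, a minimal sketch assuming made-up cluster addresses, a multi-file source directory and spare network capacity:

# 50 map tasks, each allowed up to 200 MB/s (hypothetical hosts, paths and numbers).
$ hadoop distcp \
    -m 50 \
    -bandwidth 200 \
    hdfs://cdh44-nn:8020/data/warehouse \
    hdfs://cdh51-nn:8020/data/warehouse

With a single huge file this still uses only one map, which is why splitting the data (option 1) matters most.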





--
Thanks
Shivram