How to distcp data between two clusters which are not in the same local network?


Shady Xu
Hi all,

Recently I tried to use distcp to copy data between two clusters that are not on the same local network. Fortunately, each node of the source cluster has an extra interface and IP address that can be reached from the destination cluster. But while distcp runs, the map tasks always use the local IPs of the source cluster nodes, which they cannot reach.

I tried setting the property 'dfs.datanode.dns.interface' to the interface I want, and I also tried setting the property 'dfs.datanode.use.datanode.hostname' to true. Neither worked.

Does Hadoop support this now, or am I missing something?

Re: How to distcp data between two clusters which are not in the same local network?

Wei-Chiu Chuang-2
Hello,
if I understand your question correctly, you are actually building a multihomed Hadoop cluster, correct?
A multihomed Hadoop cluster can be tricky to set up, to the extent that Cloudera does not recommend it. I have not set one up myself, but I think you have to make sure that reverse resolution works for the IP addresses.

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
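
One way to check the reverse-resolution point above is a small script run on a destination-cluster node. This is only a minimal sketch, not anything taken from the Hadoop docs: the function name `check_resolution` and the example hostname are hypothetical, and the resolver functions are passed in as parameters so the logic can be exercised without real DNS.

```python
import socket


def check_resolution(hostname, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Check that a hostname resolves to an IP and that the IP maps back to it.

    Returns a dict with the resolved IP, the names returned by the reverse
    lookup, and an 'ok' flag that is True only when the round trip matches.
    """
    result = {"host": hostname, "ip": None, "reverse": [], "ok": False}
    try:
        # Forward lookup: hostname -> IP (DNS or /etc/hosts).
        result["ip"] = forward(hostname)
        # Reverse lookup: IP -> (primary name, alias list, address list).
        primary, aliases, _addresses = reverse(result["ip"])
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        result["error"] = str(exc)
        return result
    result["reverse"] = [primary, *aliases]
    # Hostnames are case-insensitive, so compare in lower case.
    wanted = hostname.lower()
    result["ok"] = any(name.lower() == wanted for name in result["reverse"])
    return result
```

Calling `check_resolution("dn1.source.example.com")` (a made-up DataNode hostname) on a destination node would show which IP that node actually sees for the DataNode, and whether the reverse lookup agrees with it.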


In reply to Shady Xu's message of Mon, Aug 15, 2016 at 1:06 AM.


Re: How to distcp data between two clusters which are not in the same local network?

Sunil Govind
Hi

I think you can also refer to the link below.

Thanks 
Sunil

In reply to Wei-Chiu Chuang's message of Mon, Aug 15, 2016 at 7:26 PM.


Re: How to distcp data between two clusters which are not in the same local network?

Shady Xu
Thanks Wei-Chiu and Sunil. I had read the docs you mentioned before starting. The specific problem now is that the DataNodes of the source cluster report their local IP instead of the public one, and the local IP cannot be reached from the NodeManagers of the destination cluster. The solution seems to be setting the `dfs.datanode.dns.interface` property, but unfortunately that doesn't work.

In reply to Sunil Govind's message of 2016-08-15 22:06 GMT+08:00.



Re: How to distcp data between two clusters which are not in the same local network?

Shady Xu
Does anyone have any ideas?

In reply to Shady Xu's message of 2016-08-16 10:27 GMT+08:00.




Re: How to distcp data between two clusters which are not in the same local network?

iain wright

--
Iain Wright

This email message is confidential, intended only for the recipient(s) named above and may contain information that is privileged, exempt from disclosure under applicable law. If you are not the intended recipient, do not disclose or disseminate the message to anyone except the intended recipient. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender by return email, and delete all copies of this message.

In reply to Shady Xu's message of Wed, Aug 24, 2016 at 2:17 AM.





Re: How to distcp data between two clusters which are not in the same local network?

Shady Xu
Thanks iain, it works now. I had read the doc you mentioned, but forgot to set the `dfs.client.use.datanode.hostname` property in the destination cluster.

I still don't know why the `dfs.datanode.dns.interface` property does not work, though. I read through the related source code but didn't find anything wrong.
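
For anyone who finds this thread later, here is a rough sketch of what the client-side setting looks like and how one might verify it. It makes some assumptions: the XML is an illustrative `hdfs-site.xml` fragment rather than a copy of my real file, and the helper names `get_property` and `client_uses_datanode_hostname` are made up for this example. As far as I know the property defaults to false, so it has to be set explicitly on the side where the distcp map tasks run.

```python
import xml.etree.ElementTree as ET

# Illustrative hdfs-site.xml fragment for the destination (client) side.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the value of a Hadoop-style <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


def client_uses_datanode_hostname(xml_text):
    """True if the config tells HDFS clients to connect to DataNodes by hostname."""
    value = get_property(xml_text, "dfs.client.use.datanode.hostname")
    return value is not None and value.lower() == "true"


print(client_uses_datanode_hostname(SAMPLE_HDFS_SITE))
```

With this set, the clients connect to DataNodes by hostname instead of by the reported IP, so the DataNode hostnames also need to resolve to the reachable addresses on the destination nodes (through DNS or the hosts file).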

In reply to iain wright's message of 2016-08-25 1:48 GMT+08:00.