How to restart an HDFS standby namenode dead for a very long time

Zach Cox
Hi - we have an HDFS (version 2.0.0-cdh4.4.0) cluster set up in HA with 2 namenodes and 5 journal nodes. This cluster has been somewhat neglected (long story) and the standby namenode process has been dead for several months.

Recently we tried simply starting the standby namenode process again, but a few hours after the start the entire HDFS cluster (and HBase on top of it) became unavailable and stayed that way for several hours. As soon as we stopped the standby namenode process, HDFS (and HBase) started working fine again. I don't know for sure, but I'm guessing the standby namenode was trying to catch up on several months of edits after being down for so long, and just couldn't do it.

We really need to get this standby namenode process running again, so I'm trying to find the right way to do it. I've tried starting it with the -bootstrapStandby option, but that appears to be broken in our HDFS version. As an alternative, we could manually rsync the files in dfs.name.dir from the active namenode.

I guess my question is: is there a recommended way to resurrect this standby namenode successfully? And would we need to do anything other than rsync dfs.name.dir from the active namenode before starting the standby namenode again?

Thanks,
Zach
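
For readers following along, the manual copy described above could be sketched roughly as below. This is only an illustration, not a confirmed recovery procedure: the host name `nn1.example.com` and the path `/data/dfs/nn` are hypothetical placeholders for the active namenode and the configured dfs.name.dir, and the function only builds the rsync command (in dry-run mode by default) rather than running it. Excluding `in_use.lock` is an assumption made so that the active namenode's lock file is not carried over.

```python
def build_rsync_command(active_host, name_dir, dry_run=True):
    """Build the rsync argv that mirrors name_dir from the active namenode.

    active_host and name_dir are placeholders; substitute the real
    active namenode host and the directory configured as dfs.name.dir.
    """
    base = name_dir.rstrip("/")
    src = f"{active_host}:{base}/"   # trailing slash: copy directory contents
    dst = f"{base}/"
    cmd = [
        "rsync",
        "-a",                        # archive mode: keep permissions and timestamps
        "--delete",                  # drop stale files left from months ago
        "--exclude=in_use.lock",     # assumption: skip the active node's lock file
    ]
    if dry_run:
        cmd.append("--dry-run")      # only report what would be transferred
    return cmd + [src, dst]


if __name__ == "__main__":
    # Hypothetical host and path, printed for inspection rather than executed.
    print(" ".join(build_rsync_command("nn1.example.com", "/data/dfs/nn")))
```

The command would be run on the standby host while the standby namenode process is stopped; reviewing the dry-run output first is a cheap way to confirm the paths before anything is overwritten.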

RE: How to restart an HDFS standby namenode dead for a very long time

Brahma Reddy

It seems you are hitting the following JIRA. Please refer to:

https://issues.apache.org/jira/browse/HDFS-9917

--Brahma Reddy Battula

From: Zach Cox [mailto:[hidden email]]
Sent: 14 July 2016 03:34
To: [hidden email]
Subject: How to restart an HDFS standby namenode dead for a very long time

 


Re: How to restart an HDFS standby namenode dead for a very long time

Zach Cox
Yes, it's definitely possible we are hitting that JIRA. Do we need to do anything other than rsync dfs.name.dir from the active namenode before starting the standby namenode again?

Thanks,
Zach


On Fri, Jul 15, 2016 at 2:21 AM Brahma Reddy Battula <[hidden email]> wrote:

It seems you are hitting the following JIRA. Please refer to:

https://issues.apache.org/jira/browse/HDFS-9917

--Brahma Reddy Battula

 


RE: How to restart an HDFS standby namenode dead for a very long time

Brahma Reddy

Sorry for the late reply.

To recover from this, you can either restart the DataNodes (DNs) one by one, or apply the patch in HDFS-9917 and then restart the standby NameNode.

--Brahma Reddy Battula
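
The "restart the DNs one by one" option above can be sketched as a small rolling-restart loop. This is a minimal sketch under stated assumptions, not an official procedure: `restart` and `is_healthy` are hypothetical callables that the operator supplies (for example, wrappers around whatever service command and health check the deployment uses), and the timeout values are arbitrary defaults. The point it illustrates is that each DataNode is confirmed healthy before the next one is touched, so only one node is ever down at a time.

```python
import time


def rolling_restart(hosts, restart, is_healthy,
                    timeout_s=300, poll_s=5, sleep=time.sleep):
    """Restart hosts one at a time, waiting for each to become healthy.

    restart(host) and is_healthy(host) are caller-supplied (hypothetical)
    hooks; this function only enforces the one-by-one ordering.
    Returns the list of hosts restarted successfully.
    """
    done = []
    for host in hosts:
        restart(host)
        waited = 0
        # Poll until the node reports healthy or the timeout is reached.
        while not is_healthy(host):
            if waited >= timeout_s:
                raise RuntimeError(
                    f"{host} not healthy after {timeout_s}s; "
                    f"stopping (already restarted: {done})"
                )
            sleep(poll_s)
            waited += poll_s
        done.append(host)
    return done
```

Stopping at the first node that fails to come back, rather than continuing, keeps a bad restart from spreading across the cluster.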

 

From: Zach Cox [mailto:[hidden email]]
Sent: 15 July 2016 19:59
To: Brahma Reddy Battula; [hidden email]
Subject: Re: How to restart an HDFS standby namenode dead for a very long time

 
