HDFS HA Namenodes crash all the time


HDFS HA Namenodes crash all the time

Marcin Tustin
Hi All,

We have just switched over to HA NameNodes with ZooKeeper failover, using HDP-2.3.0.0-2557 (HDFS 2.7.1.2.3). I'm looking for suggestions as to what to investigate to make this more stable.

Before we went to HA, our NameNode was reasonably stable. Now the NameNodes are crashing multiple times a day and frequently failing to fail over correctly, to the point where I can't even use hdfs haadmin -transitionToActive to force a failover. Instead, I find I have to restart the NameNodes.
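
For concreteness, the commands I'm trying look roughly like this (nn1 and nn2 stand in for our actual NameNode IDs from dfs.ha.namenodes.<nameservice>):

    # check which NameNode believes it is active/standby
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # ask the ZKFCs for a graceful failover
    hdfs haadmin -failover nn1 nn2

    # with automatic failover enabled, a manual transition requires --forcemanual
    hdfs haadmin -transitionToActive --forcemanual nn2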

We're running them on AWS instances with 31.01 GB of RAM and 8 cores. In addition to the NameNode, we host a JournalNode, a ZKFailoverController, and the Ambari Metrics Collector on the same machine. (The third JournalNode lives with the YARN ResourceManager.)

Right now the NameNodes are configured with a maximum heap of 25 GB.
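
The heap is set through HADOOP_NAMENODE_OPTS in hadoop-env.sh; ours is along these lines (only the 25 GB heap is the real setting here; the GC flags and log path are illustrative, not a recommendation):

    # hadoop-env.sh excerpt
    export HADOOP_NAMENODE_OPTS="-Xms25g -Xmx25g \
      -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
      -XX:+UseCMSInitiatingOccupancyOnly \
      -verbose:gc -Xloggc:/var/log/hadoop/hdfs/gc.log-namenode \
      ${HADOOP_NAMENODE_OPTS}"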

Does that sound reasonable? What else should we be paying attention to in order to make HDFS stable again?

With thanks,
Marcin




Re: HDFS HA Namenodes crash all the time

Sandeep Nemuri
What do the logs say?
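
On an HDP install they are usually under /var/log/hadoop/hdfs/ (exact file names vary by host):

    # NameNode, ZKFC and JournalNode logs on the NameNode host
    tail -n 500 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-$(hostname).log
    tail -n 500 /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-$(hostname).log
    tail -n 500 /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-$(hostname).log

    # look for fatal errors and timeouts around the time of a crash
    grep -iE 'FATAL|timed out|timeout' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | tail -n 50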

--
  Regards
  Sandeep Nemuri

Re: HDFS HA Namenodes crash all the time

Nikhil-2
Check the ZKFC logs first, and try checking the HDFS HA and ZooKeeper timeouts.
It's better to have a dedicated disk for the JournalNode service (similar to ZooKeeper).
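
The settings I mean are along these lines; the values shown are the stock defaults as far as I remember, so check yours before raising them:

    <!-- core-site.xml: the ZKFC's ZooKeeper session timeout -->
    <property>
      <name>ha.zookeeper.session-timeout.ms</name>
      <value>5000</value>
    </property>

    <!-- hdfs-site.xml: how long the active NameNode waits for a
         quorum of JournalNodes before aborting -->
    <property>
      <name>dfs.qjournal.write-txns.timeout.ms</name>
      <value>20000</value>
    </property>

    <!-- hdfs-site.xml: point this at a dedicated disk/volume
         (the path below is just an example) -->
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/grid/journal/hdfs/journal</value>
    </property>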
