Standby Namenode getting RPC latency alerts

Standby Namenode getting RPC latency alerts

sandeep vura
Hi Team,

We are getting RPC latency alerts from the standby NameNode. What does this mean, and where should we check the logs for the root cause?


I have already checked the standby NameNode logs but didn't find any specific error.


Regards,
Sandeep.v


Re: Standby Namenode getting RPC latency alerts

Rakesh Radhakrishnan-2
Hi Sandeep,

This alert can be triggered when NameNode (NN) operations exceed a certain threshold value. Sometimes an increase in RPC processing time increases the length of the call queue and results in this situation. Could you please provide more details about the client operations you are performing that might be causing too many NameNode operations? You could check your client applications and their logs for any hints. Also, do you see heavy CPU utilization? Could you please share the logs from both NameNodes and from the clients?
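To watch RPC processing time and call queue length directly, one option is to read the RPC metrics from the NameNode's `/jmx` servlet. The following is only a minimal sketch: it assumes you have saved the JSON from the standby NameNode's web port (50070 is the usual Hadoop 2.x default; 8020 in the bean name is just an example RPC port), and the sample values are made up for illustration.

```python
import json

# Metric names published by Hadoop's RPC metrics source.
RPC_METRICS = ("RpcQueueTimeAvgTime", "RpcProcessingTimeAvgTime", "CallQueueLength")


def rpc_metrics(jmx_text):
    """Return {bean name: {metric: value}} for every RpcActivity bean in a /jmx JSON dump."""
    result = {}
    for bean in json.loads(jmx_text).get("beans", []):
        name = bean.get("name", "")
        if "name=RpcActivityForPort" in name:
            result[name] = {m: bean[m] for m in RPC_METRICS if m in bean}
    return result


# Hypothetical sample, shaped like the output of http://<standby-nn>:50070/jmx
sample = json.dumps({"beans": [
    {"name": "Hadoop:service=NameNode,name=RpcActivityForPort8020",
     "RpcQueueTimeAvgTime": 1200.5,
     "RpcProcessingTimeAvgTime": 35.0,
     "CallQueueLength": 87},
    {"name": "Hadoop:service=NameNode,name=FSNamesystem", "FilesTotal": 1},
]})

for bean, values in rpc_metrics(sample).items():
    print(bean, values)
```

A high queue time together with a modest processing time would suggest that calls are waiting for handlers or a lock rather than being slow in themselves.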

Regards,
Rakesh

On Mon, Jul 18, 2016 at 8:35 AM, sandeep vura <[hidden email]> wrote:
Hi Team,

We are getting rpc latency alerts from the standby namenode. What does it means? Where to check the logs for the root cause?


I have already checked standby namenode logs but didn't find any specific error.


Regards,
Sandeep.v



Re: Standby Namenode getting RPC latency alerts

Chackravarthy Esakkimuthu
In reply to this post by sandeep vura
Sandeep,

Can you please share which Hadoop version you are using and the size of the cluster in terms of fsimage size or file/block count? Also, what threshold is set for RPC latency?
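If it helps, the fsimage size can be read from the NameNode's metadata directory, where images are stored as `fsimage_<txid>` files next to `.md5` checksums under `current/`. Below is a rough sketch that picks the image with the highest transaction ID; the helper name `latest_fsimage` and the demo file names are made up, and on a real cluster you would pass the `current` directory under your `dfs.namenode.name.dir` instead of a temporary directory.

```python
import os
import re
import tempfile

# Matches fsimage_<txid> but not the fsimage_<txid>.md5 checksum files.
FSIMAGE_RE = re.compile(r"^fsimage_(\d+)$")


def latest_fsimage(current_dir):
    """Return (file name, size in bytes) of the fsimage with the highest txid, or None."""
    best = None
    for entry in os.listdir(current_dir):
        match = FSIMAGE_RE.match(entry)
        if match:
            txid = int(match.group(1))
            if best is None or txid > best[0]:
                best = (txid, entry)
    if best is None:
        return None
    return best[1], os.path.getsize(os.path.join(current_dir, best[1]))


# Demo on a throwaway directory with tiny stand-in files (hypothetical names and sizes).
with tempfile.TemporaryDirectory() as demo_dir:
    for name, size in (("fsimage_0000000000000000100", 10),
                       ("fsimage_0000000000000000250", 25),
                       ("fsimage_0000000000000000250.md5", 5)):
        with open(os.path.join(demo_dir, name), "wb") as handle:
            handle.write(b"x" * size)
    print(latest_fsimage(demo_dir))
```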

It is very unlikely for the standby NN to see RPC latency unless a checkpoint is in progress. Checkpointing is done by the standby NN, which acquires the FSNamesystem write lock during the process. Hence other NN operations from the DataNodes (DN), such as heartbeat processing, incremental block reports, or full block reports, are blocked during this time. This might be what you are facing in your cluster.

If the fsimage is large enough (on the order of a few GB), checkpointing might take more than a minute. If you are using Hadoop 2.6.0, you might be encountering this situation. It was fixed in Hadoop 2.7.0.
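One way to test the checkpoint theory is to compare the alert times with the checkpoint start and end times taken from the standby NN log. The sketch below assumes you have already extracted both sets of timestamps by hand; the function name and all the times shown are hypothetical.

```python
from datetime import datetime


def alerts_during_checkpoints(alerts, checkpoints):
    """Split alert timestamps into those inside a checkpoint window and those outside."""
    inside, outside = [], []
    for alert in alerts:
        hit = any(start <= alert <= end for start, end in checkpoints)
        (inside if hit else outside).append(alert)
    return inside, outside


# Hypothetical data: two checkpoints of about 90 seconds each, three alerts.
checkpoints = [
    (datetime(2016, 7, 18, 2, 0, 0), datetime(2016, 7, 18, 2, 1, 30)),
    (datetime(2016, 7, 18, 3, 0, 0), datetime(2016, 7, 18, 3, 1, 35)),
]
alerts = [
    datetime(2016, 7, 18, 2, 0, 45),
    datetime(2016, 7, 18, 2, 30, 0),
    datetime(2016, 7, 18, 3, 1, 10),
]

inside, outside = alerts_during_checkpoints(alerts, checkpoints)
print(len(inside), "alerts during checkpoints,", len(outside), "outside")
```

If nearly all alerts fall inside checkpoint windows, the write lock held during checkpointing is the likely cause; alerts outside those windows would point to something else.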

Thanks,
Chackra

