Long-running YARN applications on a secured HA cluster?
I'm working on a project that uses Apache Flink (stream processing) on top of a secured HA YARN cluster.
The test application I've been using is trivial: it just writes the current time into an HBase column once a minute.
The problem is that after exactly 173.5 hours my application dies.
The best assessment we have right now is that the Hadoop Delegation Tokens are expiring.
I know for sure that the Kerberos tickets are correctly renewed/recreated in the cluster from my keytab file: to verify this, I had our IT ops team drop the maximum ticket lifetime to 5 minutes and the maximum renew time to 10 minutes.
Following what we found on these two websites, we set those settings (the current settings that seem relevant are posted below).
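For context, these are the standard Hadoop/HBase token-lifetime properties that govern this behavior. The values shown are the stock defaults, included for illustration only, not our cluster's actual values:

```xml
<!-- hdfs-site.xml: HDFS delegation token lifetimes (defaults, in milliseconds) -->
<property>
  <name>dfs.namenode.delegation.token.renew-interval</name>
  <value>86400000</value>   <!-- token must be renewed every 24 hours -->
</property>
<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>604800000</value>  <!-- token cannot live longer than 7 days -->
</property>

<!-- hbase-site.xml: HBase authentication token lifetimes (defaults) -->
<property>
  <name>hbase.auth.key.update.interval</name>
  <value>86400000</value>   <!-- master rolls the secret key every 24 hours -->
</property>
<property>
  <name>hbase.auth.token.max.lifetime</name>
  <value>604800000</value>  <!-- HBase token expires after 7 days -->
</property>
```

It may be worth noting that the 7-day (168-hour) default maximum token lifetime is in the same ballpark as the 173.5-hour failure point we observe, which is part of why delegation token expiry is our leading theory.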
Yet this has not changed the situation; the job still dies after 173.5 hours with this exception:
15:47:55,283 INFO org.apache.flink.yarn.YarnJobManager - Status of job 2e4a3516d8e4876b705eaff4a52fc272 (Long running Flink application) changed to FAILING.
java.lang.Exception: Serialized representation of org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: FailedServerException: 1 time,
My colleagues and I did some searching, and these two reports seem to describe a problem similar to ours (except they concern HDFS instead of HBase):
Failed to Update HDFS Delegation Token for long running application in HA mode