NullPointerException on namenode startup

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

NullPointerException on namenode startup

Robert Bart
Hi all,

I have a cluster of 9 nodes running Hadoop 1.0.2. I am unable to bring our
the cluster back up after an unsuccessful attempt to restart it. The
problem appears to be caused by a NullPointerException being thrown during
startup of the namenode. The exception seems to be thrown while reading one
of the binary files "edits" or "edits.new" in dfs.name.dir (stack trace
below)

Googling turns up reports of similar problems, and the most common solution
seems to require manually repairing one or both of these files.
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200912.mbox/%3Ca03ecfdb0912160654g4d2f6bd0i73bd690bda55b753@...%3E
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201205.mbox/%3C4FA3DFFA.9080507@...%3E
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201010.mbox/%3C83E09645-A671-4DCE-89A6-D0E1952190A9@...%3E

Does this indeed appear to be corruption of the files in dfs.name.dir, or
is it possible that something else is going on? If the former, is manually
repairing the files using a hex editor really the best way to proceed? Any
ideas on how to fix this without losing data would be greatly appreciated!

Here is the relevant part of the namenode log:

************************************************************/
2012-05-14 21:29:05,459 INFO hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = rv-n11.cs.washington.edu/128.95.1.25
STARTUP_MSG:   args = []
STARTUP_MSG:   vers ion = 1.0.3-SNAPSHOT
STARTUP_MSG:   build =  -r ; compiled by 'knowall' on Tue Apr 17 10:43:14
PDT 2012
************************************************************/
2012-05-14 21:29:05,608 INFO metrics2.impl.MetricsConfig: loaded properties
from hadoop-metrics2.properties
2012-05-14 21:29:05,625 INFO metrics2.impl.MetricsSourceAdapter: MBean for
source MetricsSystem,sub=Stats registered.
2012-05-14 21:29:05,626 INFO metrics2.impl.MetricsSystemImpl: Scheduled
snapshot period at 10 second(s).
2012-05-14 21:29:05,626 INFO metrics2.impl.MetricsSystemImpl: NameNode
metrics system started
2012-05-14 21:29:05,754 INFO metrics2.impl.MetricsSourceAdapter: MBean for
source ugi registered.
2012-05-14 21:29:05,766 INFO metrics2.impl.MetricsSourceAdapter: MBean for
source jvm registered.
2012-05-14 21:29:05,768 INFO metrics2.impl.MetricsSourceAdapter: MBean for
source NameNode registered.
2012-05-14 21:29:05,802 INFO hdfs.util.GSet: VM type       = 64-bit
2012-05-14 21:29:05,802 INFO hdfs.util.GSet: 2% max memory = 17.77875 MB
2012-05-14 21:29:05,803 INFO hdfs.util.GSet: capacity      = 2^21 = 2097152
entries
2012-05-14 21:29:05,803 INFO hdfs.util.GSet: recommended=2097152,
actual=2097152
2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
fsOwner=knowall
2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
supergroup=supergroup
2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=true
2012-05-14 21:29:05,833 INFO hdfs.server.namenode.FSNamesystem:
dfs.block.invalidate.limit=100
2012-05-14 21:29:05,833 INFO hdfs.server.namenode.FSNamesystem:
isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
accessTokenLifetime=0 min(s)
2012-05-14 21:29:06,005 INFO hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStateMBean and NameNodeMXBean
2012-05-14 21:29:06,032 INFO hdfs.server.namenode.NameNode: Caching file
names occuring more than 10 times
2012-05-14 21:29:06,044 INFO hdfs.server.common.Storage: Number of files =
52617
2012-05-14 21:29:07,319 INFO hdfs.server.common.Storage: Number of files
under construction = 34
2012-05-14 21:29:07,328 INFO hdfs.server.common.Storage: Image file of size
15201146 loaded in 1 seconds.
2012-05-14 21:29:07,334 ERROR hdfs.server.namenode.NameNode:
java.lang.NullPointerException
        at hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1094)
        at hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1106)
        at hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1009)
        at
hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
        at hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:626)
        at hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1015)
        at hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:833)
        at
hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
        at
hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
        at
hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
        at hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
        at hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
        at hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
        at hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
        at hdfs.server.namenode.NameNode.main(NameNode.java:1288)

2012-05-14 21:29:07,334 INFO hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at rv-n11.cs.washington.edu/128.95.1.25
************************************************************/


--
Rob
Reply | Threaded
Open this post in threaded view
|

Re: NullPointerException on namenode startup

Robert Bart
I think we've fixed this issue by temporarily modifying the Hadoop code to
squelch the NPE - more details below.

This issue seems to happen often.  I find it particularly interesting
that one of these threads was reported on the same day that our
secondary namenode crashed with this error.

http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201205.mbox/%3C4FA3DFFA.9080507@...%3E

 After reading the internet our options seemed to be:

1. Use the checkpoint and revert to May 4th
2. Manually repair the "edits" or "edits.new" file
3. Temporarily edit the Hadoop code, probably in or around
org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits:626 to
catch the NullPointerException, possibly replacing it with some kind
of no-op, or treating it as an EOF

Since the secondary namenode crashed on May 4th, we didn't want to go
back to our last checkpoint since that meant losing 10 days of data.
I didn't know how to begin with #2 so I tried #3 as described by Terry
Healy.  I added a try-catch around line 696.  It originally was:

   INodeFile node = (INodeFile)fsDir.unprotectedAddFile(path,
permissions, blocks, replication, mtime, atime, blockSize);

And it became:

   INodeFile node = null;
   try {
       node = (INodeFile)fsDir.unprotectedAddFile(path, permissions,
blocks, replication, mtime, atime, blockSize);
   }
   catch (NullPointerException e) {
   }

I started up just the namenode and it came up just fine.  After that,
I reverted the code, recompiled, and booted up the system.  Worked
like a charm.

Peace.

On Mon, May 14, 2012 at 11:13 PM, Robert Bart <[hidden email]>wrote:

> Hi all,
>
> I have a cluster of 9 nodes running Hadoop 1.0.2. I am unable to bring our
> the cluster back up after an unsuccessful attempt to restart it. The
> problem appears to be caused by a NullPointerException being thrown during
> startup of the namenode. The exception seems to be thrown while reading one
> of the binary files "edits" or "edits.new" in dfs.name.dir (stack trace
> below)
>
> Googling turns up reports of similar problems, and the most common
> solution seems to require manually repairing one or both of these files.
>
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200912.mbox/%3Ca03ecfdb0912160654g4d2f6bd0i73bd690bda55b753@...%3E
>
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201205.mbox/%3C4FA3DFFA.9080507@...%3E
>
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201010.mbox/%3C83E09645-A671-4DCE-89A6-D0E1952190A9@...%3E
>
> Does this indeed appear to be corruption of the files in dfs.name.dir, or
> is it possible that something else is going on? If the former, is manually
> repairing the files using a hex editor really the best way to proceed? Any
> ideas on how to fix this without losing data would be greatly appreciated!
>
> Here is the relevant part of the namenode log:
>
> ************************************************************/
> 2012-05-14 21:29:05,459 INFO hdfs.server.namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = rv-n11.cs.washington.edu/128.95.1.25
> STARTUP_MSG:   args = []
> STARTUP_MSG:   vers ion = 1.0.3-SNAPSHOT
> STARTUP_MSG:   build =  -r ; compiled by 'knowall' on Tue Apr 17 10:43:14
> PDT 2012
> ************************************************************/
> 2012-05-14 21:29:05,608 INFO metrics2.impl.MetricsConfig: loaded
> properties from hadoop-metrics2.properties
> 2012-05-14 21:29:05,625 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source MetricsSystem,sub=Stats registered.
> 2012-05-14 21:29:05,626 INFO metrics2.impl.MetricsSystemImpl: Scheduled
> snapshot period at 10 second(s).
> 2012-05-14 21:29:05,626 INFO metrics2.impl.MetricsSystemImpl: NameNode
> metrics system started
> 2012-05-14 21:29:05,754 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source ugi registered.
> 2012-05-14 21:29:05,766 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source jvm registered.
> 2012-05-14 21:29:05,768 INFO metrics2.impl.MetricsSourceAdapter: MBean for
> source NameNode registered.
> 2012-05-14 21:29:05,802 INFO hdfs.util.GSet: VM type       = 64-bit
> 2012-05-14 21:29:05,802 INFO hdfs.util.GSet: 2% max memory = 17.77875 MB
> 2012-05-14 21:29:05,803 INFO hdfs.util.GSet: capacity      = 2^21 =
> 2097152 entries
> 2012-05-14 21:29:05,803 INFO hdfs.util.GSet: recommended=2097152,
> actual=2097152
> 2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
> fsOwner=knowall
> 2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
> supergroup=supergroup
> 2012-05-14 21:29:05,825 INFO hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=true
> 2012-05-14 21:29:05,833 INFO hdfs.server.namenode.FSNamesystem:
> dfs.block.invalidate.limit=100
> 2012-05-14 21:29:05,833 INFO hdfs.server.namenode.FSNamesystem:
> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
> accessTokenLifetime=0 min(s)
> 2012-05-14 21:29:06,005 INFO hdfs.server.namenode.FSNamesystem: Registered
> FSNamesystemStateMBean and NameNodeMXBean
> 2012-05-14 21:29:06,032 INFO hdfs.server.namenode.NameNode: Caching file
> names occuring more than 10 times
> 2012-05-14 21:29:06,044 INFO hdfs.server.common.Storage: Number of files =
> 52617
> 2012-05-14 21:29:07,319 INFO hdfs.server.common.Storage: Number of files
> under construction = 34
> 2012-05-14 21:29:07,328 INFO hdfs.server.common.Storage: Image file of
> size 15201146 loaded in 1 seconds.
> 2012-05-14 21:29:07,334 ERROR hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>         at hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1094)
>         at hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1106)
>         at hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1009)
>         at
> hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>         at hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:626)
>         at hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1015)
>         at hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:833)
>         at
> hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
>         at
> hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>         at
> hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>         at hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>         at hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>         at hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>         at hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>         at hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>
> 2012-05-14 21:29:07,334 INFO hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at
> rv-n11.cs.washington.edu/128.95.1.25
> ************************************************************/
>
>
> --
> Rob
>
>


--
Rob