Decommission in hadoop-0.12.2


Decommission in hadoop-0.12.2

Espen Amble Kolstad-2
Hi,

I'm trying to decommission a node with hadoop-0.12.2.
I use the property dfs.hosts.exclude, since the command hadoop
dfsadmin -decommission seems to be gone.
I then start the cluster with an empty exclude file, add the name of the node
to decommission, and run hadoop dfsadmin -refreshNodes.
The log then says:
2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node
81.93.168.215:50010

But nothing happens.
I've left it in this state overnight, but still nothing.

Am I missing something?

- Espen
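The steps above, sketched as shell commands. This is only a sketch: the conf path is a placeholder, and dfs.hosts.exclude must already be set in hadoop-site.xml before the namenode starts.

```shell
# hadoop-site.xml should point the namenode at an exclude file, e.g.:
#
#   <property>
#     <name>dfs.hosts.exclude</name>
#     <value>/opt/hadoop/conf/dfs.exclude</value>
#   </property>
#
# Start the cluster with the exclude file empty, then add the node to retire:
echo "81.93.168.215" >> /opt/hadoop/conf/dfs.exclude

# Tell the namenode to re-read its include/exclude lists:
bin/hadoop dfsadmin -refreshNodes
```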

Re: Decommission in hadoop-0.12.2

Andrzej Białecki-2
Espen Amble Kolstad wrote:

> Hi,
>
> I'm trying to decommission a node with hadoop-0.12.2.
> I use the property dfs.hosts.exclude, since the command hadoop
> dfsadmin -decommission seems to be gone.
> I then start the cluster with an empty exclude file, add the name of the node
> to decommission, and run hadoop dfsadmin -refreshNodes.
> The log then says:
> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node
> 81.93.168.215:50010
>
> But nothing happens.
> I've left it in this state overnight, but still nothing.
>
> Am I missing something?

What does dfsadmin -report say about this node? It takes time to
ensure that all blocks are replicated from this node to other nodes.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Decommission in hadoop-0.12.2

Espen Amble Kolstad-2
On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:

> Espen Amble Kolstad wrote:
> > Hi,
> >
> > I'm trying to decommission a node with hadoop-0.12.2.
> > I use the property dfs.hosts.exclude, since the command hadoop
> > dfsadmin -decommission seems to be gone.
> > I then start the cluster with an empty exclude file, add the name of the
> > node to decommission, and run hadoop dfsadmin -refreshNodes.
> > The log then says:
> > 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
> > node 81.93.168.215:50010
> >
> > But nothing happens.
> > I've left it in this state overnight, but still nothing.
> >
> > Am I missing something?
>
> What does dfsadmin -report say about this node? It takes time to
> ensure that all blocks are replicated from this node to other nodes.

Hi,

dfsadmin -report:

Name: 81.93.168.215:50010
State          : Decommission in progress
Total raw bytes: 1438871724032 (1.30 TB)
Used raw bytes: 270070137404 (0.24 TB)
% used: 18.76%
Last contact: Tue Mar 27 09:42:26 CEST 2007

In the web interface (dfshealth.jsp), no change can be seen in the % used or
the number of blocks on any of the nodes.

- Espen

Re: Decommission in hadoop-0.12.2

Andrzej Białecki-2
Espen Amble Kolstad wrote:

> On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
>> Espen Amble Kolstad wrote:
>>> Hi,
>>>
>>> I'm trying to decommission a node with hadoop-0.12.2.
>>> I use the property dfs.hosts.exclude, since the command hadoop
>>> dfsadmin -decommission seems to be gone.
>>> I then start the cluster with an empty exclude file, add the name of the
>>> node to decommission, and run hadoop dfsadmin -refreshNodes.
>>> The log then says:
>>> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
>>> node 81.93.168.215:50010
>>>
>>> But nothing happens.
>>> I've left it in this state overnight, but still nothing.
>>>
>>> Am I missing something?
>> What does dfsadmin -report say about this node? It takes time to
>> ensure that all blocks are replicated from this node to other nodes.
>
> Hi,
>
> dfsadmin -report:
>
> Name: 81.93.168.215:50010
> State          : Decommission in progress
> Total raw bytes: 1438871724032 (1.30 TB)
> Used raw bytes: 270070137404 (0.24 TB)
> % used: 18.76%
> Last contact: Tue Mar 27 09:42:26 CEST 2007
>
> In the web-interface (dfshealth.jsp) no change can be seen in % or the number
> of blocks on any of the nodes.

You may want to check the datanode logs for any reported exceptions.
Also, these things take time - I believe the datanodes synchronize their
block information piecewise so that they don't overwhelm the namenode. It
certainly takes some time in my case, even though the disk size per node
that I use is much smaller.

Regarding the number of blocks - if all blocks are already present on
other datanodes in at least one copy, then no new blocks need to be
created. I'm not sure when the namenode decides that these blocks
should get additional replicas: during decommissioning or after it's
complete ...

It would be nice to have a progress meter on the decommissioning
process, though.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Decommission in hadoop-0.12.2

Espen Amble Kolstad-2
On Tuesday 27 March 2007 10:03:41 Andrzej Bialecki wrote:

> Espen Amble Kolstad wrote:
> > On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
> >> Espen Amble Kolstad wrote:
> >>> Hi,
> >>>
> >>> I'm trying to decommission a node with hadoop-0.12.2.
> >>> I use the property dfs.hosts.exclude, since the command hadoop
> >>> dfsadmin -decommission seems to be gone.
> >>> I then start the cluster with an empty exclude file, add the name of
> >>> the node to decommission, and run hadoop dfsadmin -refreshNodes.
> >>> The log then says:
> >>> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
> >>> node 81.93.168.215:50010
> >>>
> >>> But nothing happens.
> >>> I've left it in this state overnight, but still nothing.
> >>>
> >>> Am I missing something?
> >>
> >> What does dfsadmin -report say about this node? It takes time to
> >> ensure that all blocks are replicated from this node to other nodes.
> >
> > Hi,
> >
> > dfsadmin -report:
> >
> > Name: 81.93.168.215:50010
> > State          : Decommission in progress
> > Total raw bytes: 1438871724032 (1.30 TB)
> > Used raw bytes: 270070137404 (0.24 TB)
> > % used: 18.76%
> > Last contact: Tue Mar 27 09:42:26 CEST 2007
> >
> > In the web-interface (dfshealth.jsp) no change can be seen in % or the
> > number of blocks on any of the nodes.
>
> You may want to check the datanode logs for any reported exceptions.
> Also, these things take time - I believe the datanodes synchronize their
> block information piecewise so that they don't overwhelm the namenode. It
> certainly takes some time in my case, even though the disk size per node
> that I use is much smaller.
>
> Regarding the number of blocks - if all blocks are already present on
> other datanodes in at least one copy, then no new blocks need to be
> created. I'm not sure when the namenode decides that these blocks
> should get additional replicas: during decommissioning or after it's
> complete ...
>
> It would be nice to have a progress meter on the decommissioning
> process, though.

Hi,

I have replication set to 1 for the whole HDFS, so there should not be any
other replicas.
I can't find any errors in my logs, and the namenode log looks like this (at
INFO level):
2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node
81.93.168.215:50010
2007-03-27 09:04:48,831 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 09:04:49,500 INFO  fs.FSNamesystem - Roll FSImage
2007-03-27 10:04:50,221 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 10:04:50,360 INFO  fs.FSNamesystem - Roll FSImage

- Espen

RE: Decommission in hadoop-0.12.2

Dhruba Borthakur-2
The decommission-in-progress state indicates that the Namenode is triggering
replication of the blocks that reside on the node being decommissioned. When
all those blocks get replicated to other Datanodes, the state should change
to "Decommissioned".

You can run bin/hadoop fsck -blocks -locations -files to list the
locations of all blocks in the fs (this might take a long time, depending on
the number of files). Please verify whether the blocks that reside on the
decommission-in-progress node have two replicas. Once all those blocks have
two replicas (because you have set the replication factor to 1), the
decommissioning should be complete.
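A concrete form of that fsck invocation might look like the following; the filesystem root and the node address are placeholders for this cluster:

```shell
# List every file with its blocks and the datanode locations of each replica.
# This walks the whole namespace, so it can take a long time on a big fs.
bin/hadoop fsck / -files -blocks -locations

# Filter for blocks that still have a replica on the node being
# decommissioned:
bin/hadoop fsck / -files -blocks -locations | grep 81.93.168.215
```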

Thanks,
dhruba


-----Original Message-----
From: Espen Amble Kolstad [mailto:[hidden email]]
Sent: Tuesday, March 27, 2007 1:23 AM
To: [hidden email]
Subject: Re: Decommission in hadoop-0.12.2

On Tuesday 27 March 2007 10:03:41 Andrzej Bialecki wrote:

> Espen Amble Kolstad wrote:
> > On Tuesday 27 March 2007 09:27:58 Andrzej Bialecki wrote:
> >> Espen Amble Kolstad wrote:
> >>> Hi,
> >>>
> >>> I'm trying to decommission a node with hadoop-0.12.2.
> >>> I use the property dfs.hosts.exclude, since the command haddop
> >>> dfsadmin -decommission seems to be gone.
> >>> I then start the cluster with an emtpy exclude-file, add the name of
> >>> the node to decommission and run hadoop dfsadmin -refreshNodes.
> >>> The log then says:
> >>> 2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning
> >>> node 81.93.168.215:50010
> >>>
> >>> But nothing happens.
> >>> I've left it in this state over night, but still nothing.
> >>>
> >>> Am I missing something ?
> >>
> >> What does the dfsadmin -report says about this node? It takes time to
> >> ensure that all blocks are replicated from this node to other nodes.
> >
> > Hi,
> >
> > dfsadmin -report:
> >
> > Name: 81.93.168.215:50010
> > State          : Decommission in progress
> > Total raw bytes: 1438871724032 (1.30 TB)
> > Used raw bytes: 270070137404 (0.24 TB)
> > % used: 18.76%
> > Last contact: Tue Mar 27 09:42:26 CEST 2007
> >
> > In the web-interface (dfshealth.jsp) no change can be seen in % or the
> > number of blocks on any of the nodes.
>
> You may want to check the datanode logs for any reported exceptions.
> Also, these things take time - I believe the datanodes synchronize their
> block information piecewise so that they don't overwhelm the namenode. It
> certainly takes some time in my case, even though the disk size per node
> that I use is much smaller.
>
> Regarding the number of blocks - if all blocks are already present on
> other datanodes in at least one copy, then no new blocks need to be
> created. I'm not sure when the namenode decides that these blocks
> should get additional replicas: during decommissioning or after it's
> complete ...
>
> It would be nice to have a progress meter on the decommissioning
> process, though.

Hi,

I have replication set to 1 for the whole HDFS, so there should not be any
other replicas.
I can't find any errors in my logs, and the namenode log looks like this (at
INFO level):
2007-03-27 08:42:59,168 INFO  fs.FSNamesystem - Start Decommissioning node
81.93.168.215:50010
2007-03-27 09:04:48,831 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 09:04:49,500 INFO  fs.FSNamesystem - Roll FSImage
2007-03-27 10:04:50,221 INFO  fs.FSNamesystem - Roll Edit Log
2007-03-27 10:04:50,360 INFO  fs.FSNamesystem - Roll FSImage

- Espen


Re: Decommission in hadoop-0.12.2

Andrzej Białecki-2
Dhruba Borthakur wrote:

> The decommission-in-progress state indicates that the Namenode is triggering
> replication of the blocks that reside on the node being decommissioned. When
> all those blocks get replicated to other Datanodes, the state should change
> to "Decommissioned".
>
> You can run bin/hadoop fsck -blocks -locations -files to list the
> locations of all blocks in the fs (this might take a long time, depending on
> the number of files). Please verify whether the blocks that reside on the
> decommission-in-progress node have two replicas. Once all those blocks have
> two replicas (because you have set the replication factor to 1), the
> decommissioning should be complete.

... though it would be nice if the report gave an "xx% complete"
figure ...


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Decommission in hadoop-0.12.2

Dhruba Borthakur-2
I agree. A decommission-meter would be a really helpful tool to monitor the
progress of a decommission command.

Thanks,
dhruba

-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Tuesday, March 27, 2007 9:45 AM
To: [hidden email]
Subject: Re: Decommission in hadoop-0.12.2

Dhruba Borthakur wrote:
> The decommission-in-progress state indicates that the Namenode is triggering
> replication of the blocks that reside on the node being decommissioned. When
> all those blocks get replicated to other Datanodes, the state should change
> to "Decommissioned".
>
> You can run bin/hadoop fsck -blocks -locations -files to list the
> locations of all blocks in the fs (this might take a long time, depending on
> the number of files). Please verify whether the blocks that reside on the
> decommission-in-progress node have two replicas. Once all those blocks have
> two replicas (because you have set the replication factor to 1), the
> decommissioning should be complete.

... though it would be nice if the report gave an "xx% complete"
figure ...


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Decommission in hadoop-0.12.2

Espen Amble Kolstad-2
Hi,

I changed the replication factor for the entire HDFS to 2 and then tried to
decommission again. That did the trick. The namenode log immediately
started printing:
2007-03-27 17:37:19,954 INFO  dfs.StateChange - BLOCK*
NameSystem.pendingTransfer: ask x.x.x.x:50010 to replicate
blk_9167696482646713604 to datanode(s) x.x.x.x:50010
2007-03-27 17:37:19,954 INFO  dfs.StateChange - BLOCK*
NameSystem.pendingTransfer: ask x.x.x.x:50010 to replicate
blk_9168899963250271798 to datanode(s) x.x.x.x:50010
and then finally:
2007-03-28 00:10:41,876 INFO  fs.FSNamesystem - Decommission complete for node
x.x.x.x:50010

Could it be that decommissioning doesn't work when replication is set to 1?

Thanks for your help!

- Espen
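For anyone hitting the same thing, raising the replication factor for the whole filesystem can be done roughly as below. This is a sketch: the -setrep syntax shown is from the old hadoop dfs shell and may vary between versions.

```shell
# Recursively set the replication factor to 2 for everything under /,
# so the namenode has a second copy of each block to fall back on
# when a datanode is decommissioned.
bin/hadoop dfs -setrep -R 2 /
```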

On Tuesday 27 March 2007 18:46:54 Dhruba Borthakur wrote:

> I agree. A decommission-meter would be a really helpful tool to monitor the
> progress of a decommission command.
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[hidden email]]
> Sent: Tuesday, March 27, 2007 9:45 AM
> To: [hidden email]
> Subject: Re: Decommission in hadoop-0.12.2
>
> Dhruba Borthakur wrote:
> > The decommission-in-progress state indicates that the Namenode is triggering
> > replication of the blocks that reside on the node being decommissioned. When
> > all those blocks get replicated to other Datanodes, the state should change
> > to "Decommissioned".
> >
> > You can run bin/hadoop fsck -blocks -locations -files to list the
> > locations of all blocks in the fs (this might take a long time, depending on
> > the number of files). Please verify whether the blocks that reside on the
> > decommission-in-progress node have two replicas. Once all those blocks have
> > two replicas (because you have set the replication factor to 1), the
> > decommissioning should be complete.
>
> ... though it would be nice if the report gave an "xx% complete"
> figure ...