HDFS replica management


HDFS replica management

alakshman
Is replica management built into HDFS? What I mean is: if I set the
replication factor to 3 and I lose 3 disks, is that data lost forever? All
3 disks dying at the same time is, I know, a far-fetched scenario, but if
they die over a certain period of time, does HDFS re-replicate the data to
ensure that there are always 3 copies in the system?

Thanks
A

Re: HDFS replica management

Ted Dunning-3


Assuming that you have many more than 3 disks, the chance that 3
simultaneous disk failures hit exactly the right 3 is much lower than the
chance of losing any 3 disks.  This is helped further by Hadoop's ability
to place replicas in different racks, since one of the few mechanisms that
correlates failures is losing an entire rack.

For example, if you have 20 disks, the chance of losing a particular three
disks, given that you are losing 3 disks, is about one in a thousand
(assuming failures land independently), and it should be impossible if the
failures are rack-aligned.

Remember, you can always increase the number of replicas if you like.
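
To sanity-check that arithmetic, here is a small stand-alone Java snippet
(illustrative only, not Hadoop code):

public class ReplicaLossOdds {
    // C(n, k): number of ways to choose k disks out of n.
    static long choose(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i; // stays integral at each step
        }
        return result;
    }

    public static void main(String[] args) {
        long subsets = choose(20, 3);
        // Given 3 independent failures, the chance they hit one
        // particular set of 3 disks is 1 / C(20,3).
        System.out.println("C(20,3) = " + subsets);        // 1140
        System.out.println("P       = " + 1.0 / subsets);  // ~0.00088
    }
}

C(20,3) = 1140, so the probability is about 1/1140, i.e. roughly one
chance in a thousand, as stated above.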




Re: HDFS replica management

alakshman
Here is the scenario I was concerned about. Consider three nodes in the
system, A, B and C, which are placed, say, in different racks. Let us say
that the disk on A fries today. Now the blocks that were stored on A are
not going to be re-replicated (this is my understanding, but I could be
wrong) to some other node or to the new disk with which you would bring A
back. A month later the disk on B could fry, and another month later the
disk on C could fry. This way you could slowly start losing data in the
absence of a replica synchronization algorithm like the one in S3. This
would never happen in S3, because a replica synchronization algorithm is
always running to guarantee that there will always be 3 replicas in the
system; if a disk fries, the data is re-replicated. Of course, there is no
way to protect yourself from the 3 machines that store the replicas all
losing their disks at the same time.

So I was wondering whether there is a replica synchronization algorithm in
place, or whether it is a feature planned for the future.

A



Re: HDFS replica management

Doug Cutting
Phantom wrote:
> Here is the scenario I was concerned about. Consider three nodes in the
> system, A, B and C, which are placed, say, in different racks. Let us say
> that the disk on A fries today. Now the blocks that were stored on A are
> not going to be re-replicated (this is my understanding, but I could be
> wrong) to some other node or to the new disk with which you would bring A
> back.

That's incorrect.  When a datanode fails to send a heartbeat to the
namenode in a timely manner, its data is assumed missing and is
re-replicated.  And when block corruption is detected, corrupt replicas
are removed and non-corrupt replicas are re-replicated to maintain the
desired level of replication.

Doug
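
A minimal sketch of the mechanism Doug describes, assuming a simple map
from datanodes to last-heartbeat times; the names and structure here are
illustrative, not the actual namenode internals:

import java.util.*;

class HeartbeatSketch {
    static final long EXPIRE_MS = 10L * 60 * 1000;  // illustrative threshold

    final Map<String, Long> lastHeartbeat = new HashMap<>();       // datanode -> last beat
    final Map<String, Set<String>> blocksOnNode = new HashMap<>(); // datanode -> its blocks
    final Queue<String> neededReplications = new ArrayDeque<>();   // blocks to re-replicate

    void heartbeat(String node) {
        lastHeartbeat.put(node, System.currentTimeMillis());
    }

    // Run periodically: a node silent past the threshold is presumed dead,
    // and every block it held is queued for re-replication elsewhere.
    void checkExpired() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > EXPIRE_MS) {
                neededReplications.addAll(
                        blocksOnNode.getOrDefault(e.getKey(), Collections.emptySet()));
            }
        }
    }
}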

Re: HDFS replica management

alakshman
That's awesome.

Thanks
A


Re: HDFS replica management

alakshman
I am sure re-replication is not done on every missed heartbeat, since that
would be very expensive and inefficient. At the same time, you cannot
really tell whether a node is partitioned away, crashed, or just slow. Is
it threshold-based, i.e., I missed N heartbeats, so re-replicate? Which
package in the source code could I look at to glean this information?

Thanks
A


Re: HDFS replica management

alakshman
The reason I ask is that in S3 and in the P2P storage systems I have been
involved with, we had a replica synchronization algorithm that ran once
every night and relied on techniques like Merkle tree comparisons. In any
case, understanding this would be beneficial. I don't mind reading through
the sources, but I would appreciate being pointed to the correct package.

Thanks
A
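
The Merkle-tree comparison mentioned above works roughly like this (a
generic sketch, not S3's or Hadoop's actual code):

import java.security.MessageDigest;
import java.util.*;

class MerkleSketch {
    static byte[] hash(MessageDigest md, byte[]... parts) {
        md.reset();
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    // Collapse one level: each parent hashes its two children
    // (an odd node out is paired with itself).
    static List<byte[]> level(MessageDigest md, List<byte[]> children) {
        List<byte[]> parents = new ArrayList<>();
        for (int i = 0; i < children.size(); i += 2) {
            byte[] left = children.get(i);
            byte[] right = (i + 1 < children.size()) ? children.get(i + 1) : left;
            parents.add(hash(md, left, right));
        }
        return parents;
    }

    // Two replicas exchange root hashes: equal roots mean the block sets
    // match, and a mismatch is narrowed by descending only into the
    // subtrees whose hashes disagree, instead of shipping every block.
    static byte[] root(List<byte[]> leafHashes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<byte[]> cur = leafHashes;
        while (cur.size() > 1) cur = level(md, cur);
        return cur.get(0);
    }
}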


Re: HDFS replica management

Doug Cutting
Phantom wrote:
> I am sure re-replication is not done on every missed heartbeat, since that
> would be very expensive and inefficient. At the same time, you cannot
> really tell whether a node is partitioned away, crashed, or just slow. Is
> it threshold-based, i.e., I missed N heartbeats, so re-replicate?

Yes, detection of datanode failure is threshold-based.  It is currently
ten minutes plus ten missed heartbeats.

> Which package in the
> source code could I look at to glean this information?

This is in dfs/FSNameSystem.java.

Doug

RE: HDFS replica management

Hairong Kuang
>> Which package in the
>> source code could I look at to glean this information?

>This is in dfs/FSNameSystem.java.

FSNameSystem.java is a huge chunk of source code. To be more specific,
datanode failure detection is done by the HeartbeatMonitor. Once a datanode
is detected as dead, all blocks that belonged to it are put in the
neededReplications queue. The ReplicationMonitor then starts to replicate
those under-replicated blocks. All the replication-target selection logic
is in dfs/ReplicationTargetChooser.java.

Hairong
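
In outline, the flow Hairong describes looks something like the following
sketch (class and method names are simplified for illustration, not copied
from the Hadoop sources):

import java.util.*;

class ReplicationFlowSketch {
    final Queue<String> neededReplications = new ArrayDeque<>(); // filled on node death
    final List<String> liveNodes = new ArrayList<>();

    // One pass of a ReplicationMonitor-style loop: for each under-replicated
    // block, pick a surviving holder as the source and a node that does not
    // already hold the block as the target (the real chooser also weighs
    // rack placement and load).
    void replicateOnePass(Map<String, List<String>> blockLocations) {
        String block;
        while ((block = neededReplications.poll()) != null) {
            List<String> holders =
                    blockLocations.getOrDefault(block, Collections.emptyList());
            if (holders.isEmpty()) continue; // no live replica left to copy from
            for (String target : liveNodes) {
                if (!holders.contains(target)) {
                    scheduleCopy(block, holders.get(0), target);
                    break;
                }
            }
        }
    }

    void scheduleCopy(String block, String from, String to) {
        System.out.println("re-replicate " + block + ": " + from + " -> " + to);
    }
}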


RE: HDFS replica management

Dhruba Borthakur-2
A datanode is declared dead if heartbeats have been missing for 10 minutes.
Datanodes typically send a heartbeat every 3 seconds.

Thanks,
dhruba
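
Putting Doug's and Dhruba's numbers together, a back-of-the-envelope check
(illustrative, not the exact namenode formula):

public class ExpiryMath {
    public static void main(String[] args) {
        long heartbeatSec = 3;                     // typical heartbeat interval
        long thresholdSec = 10 * 60;               // "ten minutes"
        long missedSec = 10 * heartbeatSec;        // "plus ten missed heartbeats"
        long expireSec = thresholdSec + missedSec; // 630 s
        System.out.println("declared dead after ~" + expireSec + " s (10 min 30 s)");
    }
}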
