Backing up HDFS

Backing up HDFS

dan.paulus

So I am administering a 10+ node Hadoop cluster and everything is going
swimmingly.  Unfortunately, some relatively critical data is now being
stored on the cluster and I am being asked to create a backup solution for
Hadoop in case of catastrophic failure of the data center or the application
corrupting data, and ultimately my company wants that warm fuzzy
feeling that only an offsite backup can provide.

So does anyone else actually back up HDFS?  After a quick Google and forum
search I found the following link that describes a full backup followed by
incremental backups; does anyone use this or something similar?

http://blog.rapleaf.com/dev/2009/06/05/backing-up-hadoops-hdfs/

Thanks in advance.
--
View this message in context: http://old.nabble.com/Backing-up-HDFS-tp29335698p29335698.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Backing up HDFS

Eric Sammer
Dan:

For backing up HDFS you have three options: two of them are application
based and one is tool based.

1. The distcp command will copy HDFS data in parallel between clusters. See
'hadoop distcp' for details.
2. Upon copying data into HDFS (on data ingestion / incoming ETL) you could
"fan out" the incoming data stream and send a copy to more than one cluster
and run the same processing in both places.
3. As part of the MR jobs that do any daily processing, you could write
"change log" style logs and ship them between clusters. This is similar to
what relational databases do and amounts to incremental log shipping and
replay.
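Option 1 can be sketched as a pair of commands. The NameNode hostnames and paths below are placeholders for illustration, not from any real cluster:

```shell
# Initial full copy from the primary cluster to the backup cluster.
# nn-primary, nn-backup, and /data are hypothetical names.
hadoop distcp hdfs://nn-primary:8020/data hdfs://nn-backup:8020/data

# Later runs: -update skips files that already exist unchanged on the
# destination, so repeated invocations behave like a coarse incremental copy.
hadoop distcp -update hdfs://nn-primary:8020/data hdfs://nn-backup:8020/data
```

These need a live pair of clusters, so they are shown as command fragments only.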

Of course, all of these options require a second cluster of similar size to
the first. Options 2 and 3 require custom development. In practice, people
who need this level of protection normally use a combination of techniques
based on their processing semantics. They each have trade-offs.

All of that said, what you're protecting against here is permanent loss of a
data center and human error. Disk, rack, and node level failures are already
handled by HDFS when properly configured. You have to do the cost/benefit
analysis for yourself to decide if it's worth the time, effort, complexity,
and maintenance.

On Tue, Aug 3, 2010 at 9:54 AM, dan.paulus <[hidden email]> wrote:

> <snip/>


--
Eric Sammer
twitter: esammer
data: www.cloudera.com

RE: Backing up HDFS

Michael Segel
In reply to this post by dan.paulus

Dan,

Here's a quick-and-dirty solution that works.
I'm assuming that your cloud is part of a larger corporate network, and that alongside the cloud you have 'cloud aware' machines: machines that have Hadoop installed but are not part of the cloud, and are where you launch jobs and applications from. These machines also have file system mounts to SANs or other network-attached (fibre channel) storage.

Step 1: make a copy of the files that you want to back up into a separate directory on HDFS.
Step 2: from a 'cloud aware' machine that has SAN disk, use 'hadoop fs -copyToLocal <file name>(s)', where the local destination is on the SAN.

Now let your normal backup policy take over. (Assuming that you have a policy for backing up data stored on the SAN.)
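Assuming a SAN mount at /mnt/san (a made-up path, as are the HDFS paths and date), the two steps might look like:

```shell
# Step 1: stage a stable copy inside HDFS so the backup set doesn't
# change underneath you while it is being pulled off the cluster
hadoop fs -cp /data/important /backup-staging/important-20100803

# Step 2: from the cloud-aware machine, pull the staged copy onto
# SAN-backed local disk, where the normal backup tooling can reach it
hadoop fs -copyToLocal /backup-staging/important-20100803 /mnt/san/hdfs-backup/
```

These are command fragments that assume a running cluster and a SAN mount.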

I saw Eric's post about a second Cloud. Not always possible and not always a good idea if all you want to do is to back up data sets for remote storage.

Note the following:
Performance will vary based on the number of data sets and sizes of the data sets you want to store.

HTH

-Mike


> Date: Tue, 3 Aug 2010 06:54:41 -0700
> From: [hidden email]
> To: [hidden email]
> Subject: Backing up HDFS
>
> <snip/>

Re: Backing up HDFS

Brian Bockelman
In reply to this post by Eric Sammer

On Aug 3, 2010, at 9:12 AM, Eric Sammer wrote:

> <snip/>
>
> All of that said, what you're protecting against here is permanent loss of a
> data center and human error. Disk, rack, and node level failures are already
> handled by HDFS when properly configured.

You've forgotten a third cause of loss: undiscovered software bugs.

The downside of spinning disks is that one completely fatal bug can destroy all your data in about a minute. (At my site, I famously deleted about 100TB in 10 minutes with a scratch-space cleanup script gone awry.  That was one nasty bug.)  This is why we keep good backups.

If you're very, very serious about archiving and have a huge budget, you would invest a few million into a tape silo at multiple sites, flip the write-protection tab on the tapes, eject them, and send them off to secure facilities.  This isn't for everyone though :)

Brian


Re: Backing up HDFS

Edward Capriolo
On Tue, Aug 3, 2010 at 10:42 AM, Brian Bockelman <[hidden email]> wrote:

> <snip/>

Since HDFS filesystems are usually very large, backing them up is a
challenge in itself. This is a financial issue as well as a
technical one. A standard DataNode/TaskTracker machine might have hardware
like this:

8 1TB disks
4X quad core CPU
32 GB RAM

Assuming you are taking the distcp approach, you can mirror your
cluster with some scripting/coding. However, your destination systems
can be more modest, assuming you wish to use the mirror ONLY for data, with
no job processing:

8 2TB disks
1x dual core (AMD for low power consumption)
2 GB RAM (if you can even find this little RAM on a server-class machine)
single power supply
(whatever else you can strip off to save $)
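One way to drive that mirror is a small cron-run script around distcp. All the hostnames and paths below are made up for illustration:

```shell
#!/bin/sh
# Nightly mirror of selected HDFS trees to a backup cluster.
# Hostnames and paths are placeholders; adjust for your environment.
SRC=hdfs://nn-primary:8020
DST=hdfs://nn-backup:8020

for path in /data /logs; do
    # -update only copies files that differ from what the mirror already has
    hadoop distcp -update "$SRC$path" "$DST$path" \
        || echo "distcp failed for $path" >&2
done
```

Again, this is a sketch that assumes two live clusters; it cannot run standalone.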

RE: Backing up HDFS

Michael Segel



> Date: Tue, 3 Aug 2010 11:02:48 -0400
> Subject: Re: Backing up HDFS
> From: [hidden email]
> To: [hidden email]
>

> Assuming you are taking the distcp approach you can mirror your
> cluster with some scripting/coding. However your destination systems
> can be more modest, assuming you wish to use it ONLY for data no job
> processing:
>

And that would be a waste. (Why build a cloud just to store data and not do any processing?)

You're not building your cloud in a vacuum. There are going to be SAN(s), other servers, tape??? available. The trick is getting the important data off the cloud to a place where it can be backed up via the corporation's standard IT practices.

Because of the size of the data, you may see people pulling data off the cloud into a SAN, then to either a tape drive or a SATA hot-swap drive for off-site storage.
It all depends on the value of the data.

Again, YMMV

HTH

-Mike

     

Re: Backing up HDFS

Edward Capriolo
On Tue, Aug 3, 2010 at 11:46 AM, Michael Segel
<[hidden email]> wrote:

>
> <snip/>
>

> You're not building your cloud in a vacuum. There are going to be SAN(s), other servers, tape??? available. The trick is getting the important data off the cloud to a place where it can be backed up via the corporation's standard IT practices.

Right, it all depends on what you want and your needs. In my example I
wanted near-line backups for a lot of data that I can recover
quickly, hence a distcp solution to a second cluster.

If you want to integrate with other backup software you can do local
copying or experiment with fuse-hadoop. Mount the filesystem and back it up via
traditional methods (I just hope you have a lot of tapes :)
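A rough sketch of the fuse approach. The FUSE binary name and mount syntax vary between Hadoop distributions, so treat these commands as assumptions to verify against your own install:

```shell
# Mount HDFS as a local filesystem via FUSE (binary name and URI scheme
# differ across distributions; this form is illustrative only)
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs

# Then hand the mount to conventional backup tools, e.g. a dated tarball
tar czf /backups/hdfs-$(date +%Y%m%d).tar.gz -C /mnt/hdfs data
```

This is a command fragment that assumes a running NameNode and a fuse-dfs package.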