HBase implementation question

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

HBase implementation question

James D Sadler
Hi All,

I'm interested in the architecture of HBase, in particular how it is
implemented on top of Hadoop DFS.  I understand that HDFS files are
write once: after they are initially created they are for all intents
and purpose immutable.  This being the case, how does HBase implement
its table storage on top of such a file system?  Do updates to an
HBase table cause new versions of the file backing the table to be
created (obviously not!)?  Or have I completely misunderstood how HDFS
works (most likely) ?

Best Regards,

James.
Reply | Threaded
Open this post in threaded view
|

RE: HBase implementation question

Jim Kellerman
There are some good diagrams in the most recent presentation
posted on the Wiki:

http://wiki.apache.org/lucene-hadoop/HBase/HBasePresentations


However, I'll provide a brief summary here.

HDFS files are indeed write once. HBase uses a Hadoop MapFile
( org.apache.hadoop.io.MapFile.java ) for storage, and a
SequenceFile for its redo log (note that the latter is a
current weakness in HBase as files don't currently persist
until they are closed (see HADOOP-1700)).

There are basically two operations, reads and writes. When
a write is received by the server, it first writes the
change to the redo log. The change is then stored in memory.

Periodically, the memory cache is flushed to disk creating
a new MapFile. Files are created on a per column basis so
any particular MapFile contains entries only for a particular
column.

When the number of MapFiles for a column exceeds a configurable
threshold, a background thread is started that merges the
existing MapFiles into one. This operation is called compaction.
Writes may continue while the compaction is in progress, and
may cause new MapFiles to be created if the cache is flushed
to disk. Any new MapFiles created after the compaction starts
will not be a part of the current compaction. Reads may also
continue during a compaction because all the files that currently
exist are immutable. At the end of the compaction the new file
created by merging several files together will be put in place
of the files that were a part of the compaction by temporarily
locking the column, moving the new file into place, and deleting
the old files. This takes very little time, so that read or
write operations on the column are stopped only briefly.

Reads are probably a bit more complicated than writes. A read
operation first checks the cache and may satisfy the request
directly from the cache. If not, the operation checks the
newest MapFile for the data, then the next to newest, ...,
to the oldest stopping when the requested data has been
retrieved. Because a random read (or even a sequential read
that is not a scan) can end up checking multiple files
for data they are considerably slower than either writes and
sequential scans (think of a scan as working with a cursor
in a traditional database).

There are other complicating factors like how a table obtains
more storage as it grows, but the above provides the basic
idea.

Hope this helps.
---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On
> Behalf Of James D Sadler
> Sent: Saturday, December 29, 2007 9:17 PM
> To: [hidden email]
> Subject: HBase implementation question
>
> Hi All,
>
> I'm interested in the architecture of HBase, in particular
> how it is implemented on top of Hadoop DFS.  I understand
> that HDFS files are write once: after they are initially
> created they are for all intents and purpose immutable.  This
> being the case, how does HBase implement its table storage on
> top of such a file system?  Do updates to an HBase table
> cause new versions of the file backing the table to be
> created (obviously not!)?  Or have I completely misunderstood
> how HDFS works (most likely) ?
>
> Best Regards,
>
> James.
>
Reply | Threaded
Open this post in threaded view
|

Re: HBase implementation question

James D Sadler
Thanks, Jim for pointing me in the right direction!
Reply | Threaded
Open this post in threaded view
|

Re: HBase implementation question

Stefan Groschupf
In reply to this post by Jim Kellerman
Hi,

> Reads are probably a bit more complicated than writes. A read
> operation first checks the cache and may satisfy the request
> directly from the cache. If not, the operation checks the
> newest MapFile for the data, then the next to newest, ...,
> to the oldest stopping when the requested data has been
> retrieved. Because a random read (or even a sequential read
> that is not a scan) can end up checking multiple files
> for data they are considerably slower than either writes and
> sequential scans (think of a scan as working with a cursor
> in a traditional database).

Sorry, just to double check I understand it correctly. The number of  
files need to be checked for a read is related to the compaction  
threshold, since all files are merged into one big sorted file after a  
given time by the compaction thread?
Any idea how many files usually need to checked in average?
Would it make any sense here to work with key-spaces similar to the  
map/reduce partitioner to keep the number of files that need to be  
read smaller?

Thanks,
Stefan

Reply | Threaded
Open this post in threaded view
|

RE: HBase implementation question

Jim Kellerman
> -----Original Message-----
> From: Stefan Groschupf [mailto:[hidden email]]
> Sent: Wednesday, January 02, 2008 3:46 AM
> To: [hidden email]
> Subject: Re: HBase implementation question
>
> Hi,
> > Reads are probably a bit more complicated than writes. A read
> > operation first checks the cache and may satisfy the
> request directly
> > from the cache. If not, the operation checks the newest MapFile for
> > the data, then the next to newest, ..., to the oldest stopping when
> > the requested data has been retrieved. Because a random
> read (or even
> > a sequential read that is not a scan) can end up checking multiple
> > files for data they are considerably slower than either writes and
> > sequential scans (think of a scan as working with a cursor in a
> > traditional database).
>
> Sorry, just to double check I understand it correctly. The
> number of files need to be checked for a read is related to
> the compaction threshold, since all files are merged into one
> big sorted file after a given time by the compaction thread?

Correct.

> Any idea how many files usually need to checked in average?

Not yet. We have been focused on getting things to work right
and increasing reliability before we focus on performance.

> Would it make any sense here to work with key-spaces similar
> to the map/reduce partitioner to keep the number of files
> that need to be read smaller?

Ok, well this gets into the part of the story that I left out of
my last message for the sake of simplicity.

Each table is broken into a series of row ranges called regions.
A HRegion object manages a single row range.

Initially there is only one region that spans the entire table.
As a region grows, it is split in half when it exceeds a
configurable threshold in size. We optimize splits so that
each child region shares one half of the parent until the
child does a compaction. When both children have done a
compaction, the parent is garbage collected.

Each region has all the columns for its row range. Columns
are stored separately and each column for a row range is
managed by an HStore object.

So a lookup for a particular row/column value is limited
by only checking the row range that contains the row and
then by only checking for the requested columns.

Negative lookups can be sped up by adding bloom filters.
Bloom filters are configured on a per column basis, so
not every column need have a bloom filter.

If you haven't already read Google's Bigtable paper, I
think you would find it informative as many of the
concepts employed by Bigtable have similar analogues
in HBase. You can find the Bigtable paper here:

http://labs.google.com/papers/bigtable.html

a c

> Thanks,
> Stefan
>
>
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.5.516 / Virus Database: 269.17.13/1206 - Release
> Date: 1/1/2008 12:09 PM
>
>

No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.5.516 / Virus Database: 269.17.13/1207 - Release Date: 1/2/2008 11:29 AM