Re: Improving recovery performance for degraded reads


Rakesh Radhakrishnan-2
Hi Roy,

>>>> (a) In your last email, I am sure you meant => "... submitting read requests to fetch "any" (instead of all) of the 'k' chunks (out of the k+m-x surviving chunks)"?
>>>> Do you have any optimization in place to decide which datanodes will be part of those "k"?

Answer:-
I assume you are familiar with the write path; I'm just adding a few details here to support the read explanation. While writing to an EC file, the dfs client writes data stripe by stripe (e.g. with a 64KB cell size) to multiple datanodes. For a (k, m) schema, the client writes the data cells to the first k datanodes and the parity cells to the remaining m datanodes, so one stripe is (k * cellSize + m * cellSize) bytes of data. While reading, the client fetches in the same order, stripe by stripe. The datanodes holding data blocks are contacted before the datanodes holding parity blocks, because that needs less EC block reconstruction work. Internally, the dfs client reads the stripes one by one and contacts the k data-block datanodes in parallel for each stripe. If there are any failures, it contacts the parity datanodes and does the reconstruction on the fly. 'DFSStripedInputStream' supports both positional read and reading an entire buffer (e.g. a file-sized buffer).
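
To make the degraded read path a bit more concrete, here is a minimal, self-contained sketch (illustrative only, not the actual DFSStripedInputStream logic; the class and method names are made up) of how the client can decide which cells of a stripe to fetch when some data cells are unavailable:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration of the stripe read plan described above:
    // data cells are indexed 0..k-1 and parity cells k..k+m-1; for every
    // failed data cell one parity cell is substituted so that k cells are
    // still available for on-the-fly RS decoding at the client.
    public class StripeReadPlan {
        static List<Integer> cellsToRead(int k, int m, List<Integer> failedDataCells) {
            List<Integer> toRead = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                if (!failedDataCells.contains(i)) {
                    toRead.add(i);              // healthy data cells are read directly
                }
            }
            int parity = k;
            for (int ignored : failedDataCells) {
                if (parity < k + m) {
                    toRead.add(parity++);       // one parity cell per lost data cell
                }
            }
            return toRead;
        }

        public static void main(String[] args) {
            // RS(6,3) stripe with data cell 2 lost: read cells 0,1,3,4,5 plus parity cell 6
            System.out.println(cellsToRead(6, 3, List.of(2)));  // [0, 1, 3, 4, 5, 6]
        }
    }

In the real client the reads to the k datanodes go out in parallel, and decoding only happens for the stripes that actually hit a failure.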

>>>> (b) Is any caching being done (as proposed for QFS in the previously attached "PPR" paper)?
Answer:-
There is an open jira, HDFS-9879, to discuss caching of striped blocks at the datanode. Perhaps the caching logic could be utilized similar to QFS, and the reconstruction could then choose the datanodes that have already cached the data in memory. This is an open improvement task as of now.
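
Just to illustrate the idea (purely hypothetical, since HDFS-9879 is still open; DataNodeInfo and pickSources below are made-up names, not HDFS classes), source node selection for reconstruction could simply prefer the nodes that already have the block cached:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Made-up type for the sketch; in HDFS the datanode and cache state
    // would come from the NameNode's block and cache reports.
    class DataNodeInfo {
        final String host;
        final boolean hasCachedBlock;
        DataNodeInfo(String host, boolean hasCachedBlock) {
            this.host = host;
            this.hasCachedBlock = hasCachedBlock;
        }
    }

    class SourceSelector {
        // Pick k source nodes for reconstruction, preferring ones with the
        // striped block already cached in memory (QFS-like idea).
        static List<DataNodeInfo> pickSources(List<DataNodeInfo> survivors, int k) {
            List<DataNodeInfo> sorted = new ArrayList<>(survivors);
            sorted.sort(Comparator.comparing((DataNodeInfo dn) -> !dn.hasCachedBlock));
            return sorted.subList(0, Math.min(k, sorted.size()));
        }

        public static void main(String[] args) {
            List<DataNodeInfo> survivors = List.of(
                new DataNodeInfo("dn1", false),
                new DataNodeInfo("dn2", true),
                new DataNodeInfo("dn3", false));
            // dn2 (cached) is chosen first, then one of the uncached nodes
            for (DataNodeInfo dn : pickSources(survivors, 2)) {
                System.out.println(dn.host);
            }
        }
    }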

>>>> (c) When you mentioned striping is being done, I assume it is probably to reduce the chunk sizes and hence k*c?
Answer:-
Yes, striping is done by dividing the block into several chunks, which we call cells (e.g. a 64KB cellSize). (k * c + m * c) is one stripe, and a block group comprises several stripes. I'd suggest you read the blog - http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/ - to understand more about the stripe, cell and block group terminology before reading the answer below.
   blk_0      blk_1      blk_2
     |          |          |
     v          v          v
  +------+   +------+   +------+
  |cell_0|   |cell_1|   |cell_2|
  +------+   +------+   +------+
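
As a small worked example of that layout (a hedged sketch with assumed parameters, RS-6-3 and a 64KB cell, not actual HDFS code), this maps a logical offset inside one block group to a stripe and a data block, following the round-robin cell placement shown in the diagram above:

    // Hedged sketch, assuming an RS-6-3 schema with 64KB cells; cells are laid
    // out round-robin across the k data blocks, as in the diagram above.
    public class CellLayout {
        public static void main(String[] args) {
            final int k = 6;                       // data blocks per block group
            final long cellSize = 64 * 1024;       // 64KB cell
            final long offset = 1_000_000;         // logical offset within the block group

            long cellIndex    = offset / cellSize; // global cell number across the group
            long stripeIndex  = cellIndex / k;     // which stripe the offset falls in
            long blockIndex   = cellIndex % k;     // which data block (blk_0 .. blk_5)
            long offsetInCell = offset % cellSize; // position inside that cell

            System.out.printf("offset %d -> stripe %d, blk_%d, offset-in-cell %d%n",
                offset, stripeIndex, blockIndex, offsetInCell);
            // prints: offset 1000000 -> stripe 2, blk_3, offset-in-cell 16960
        }
    }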

>>>> Now, if my object sizes are large (e.g. super HD images) where I would have to get data from multiple stripes to rebuild the images before I can display them to the
>>>> client, do you think striping would still help?
>>>> Is there a possibility that since I know that all the segments of the HD image would always be read together, by striping and distributing it on different nodes, I am ignoring
>>>> its spatial/temporal locality and further increasing any associated delays?

Answer:-
Since the client contacts all k data-block datanodes for each stripe, any slow or dead datanodes in a stripe will affect the read performance. AFAIK, for large files the contiguous layout is more suitable; that will be supported in phase-2, and the design discussions are still going on, please see the HDFS-8030 jira. On the other side, in theory the striping layout has a clear benefit: it enables the client to work with multiple datanodes in parallel, greatly enhancing the aggregate throughput (assuming that all the datanodes are healthy servers). But this needs to be tested in your cluster to understand the impact.
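
To put rough numbers on the large-image case (a back-of-the-envelope sketch with assumed values, a ~30MB image with RS-6-3 and 64KB cells, not measured results), you can estimate how many stripes such a read spans and how much data comes from each datanode:

    // Back-of-the-envelope sketch with assumed numbers; every stripe of the
    // read touches all k data-block datanodes, so one slow or dead node in
    // any stripe slows the whole read, while healthy nodes give k-way
    // parallelism on the aggregate throughput.
    public class LargeReadEstimate {
        public static void main(String[] args) {
            final int k = 6;                          // data blocks per stripe (RS-6-3)
            final long cellSize = 64 * 1024;          // 64KB cell
            final long stripeDataSize = k * cellSize; // 384KB of data per stripe
            final long imageSize = 30L * 1024 * 1024; // e.g. a ~30MB HD image

            long stripes = (imageSize + stripeDataSize - 1) / stripeDataSize;
            long bytesPerDatanode = stripes * cellSize;

            System.out.printf("%d stripes, ~%d KB read from each of the %d datanodes%n",
                stripes, bytesPerDatanode / 1024, k);
            // prints: 80 stripes, ~5120 KB read from each of the 6 datanodes
        }
    }

Roughly speaking, the same read with a contiguous layout would come from a single datanode, which is the trade-off between parallelism and locality you are asking about.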


Thanks,
Rakesh
Intel

On Sun, Jul 24, 2016 at 12:00 PM, Roy Leonard <[hidden email]> wrote:
Hi Rakesh,

Thanks for sharing your thoughts and updates.

(a) In your last email, I am sure you meant => "... submitting read
requests to fetch "any" (instead of all) of the 'k' chunks (out of the k+m-x
surviving chunks)"?
Do you have any optimization in place to decide which datanodes will be
part of those "k"?

(b) Is any caching being done (as proposed for QFS in the previously
attached "PPR" paper)?

(c) When you mentioned striping is being done, I assume it is probably to
reduce the chunk sizes and hence k*c?
Now, if my object sizes are large (e.g. super HD images) where I would have
to get data from multiple stripes to rebuild the images before I can
display them to the client, do you think striping would still help?
Is there a possibility that since I know that all the segments of the HD
image would always be read together, by striping and distributing it on
different nodes, I am ignoring its spatial/temporal locality and further
increasing any associated delays?

Just wanted to know your thoughts.
I am looking forward to the future performance improvements in HDFS.

Regards,
R.

On Fri, Jul 22, 2016 at 8:52 AM, Rakesh Radhakrishnan <[hidden email]>
wrote:

> I'm adding one more point to the above. In my previous mail reply, I've
> explained the striped block reconstruction task which will be triggered by
> the Namenode on identifying a missing/bad block. Similarly, in case of an hdfs
> client read failure, the hdfs client currently submits read requests
> internally to fetch all the 'k' chunks (belonging to the same stripe as the
> failed chunk) from k datanodes and performs decoding to rebuild the lost
> data chunk at the client side.
>
> Regards,
> Rakesh
>
> On Fri, Jul 22, 2016 at 5:43 PM, Rakesh Radhakrishnan <[hidden email]>
> wrote:
>
>> Hi Roy,
>>
>> Thanks for the interest in the hdfs erasure coding feature and helping us in
>> making the feature more attractive to the users by sharing performance
>> improvement ideas.
>>
>> Presently, the reconstruction work has been implemented in a centralized
>> manner, in which the reconstruction task is given to one datanode (the
>> first in the pipeline). For example, with a (k, m) erasure code schema,
>> assume one chunk (say c bytes) is lost because of a disk or server
>> failure; k * c bytes of data need to be retrieved from k servers to recover
>> the lost data. The reconstructing datanode will fetch k chunks (belonging
>> to the same stripe as the failed chunk) from k different servers and
>> perform decoding to rebuild the lost data chunk. Yes, this k-factor
>> increases the network traffic and causes reconstruction to be very slow.
>> IIUC, this point came up during the implementation, but I think priority
>> was given to supporting the basic functionality first. I could see
>> quite a few jira tasks, HDFS-7717 and HDFS-7344, which discuss
>> distributing the coding work to datanodes, which includes converting a
>> file to a striped layout, reconstruction, error handling etc. But I feel
>> there is still room for discussing/implementing new approaches to get
>> better performance results.
>>
>> In the shared doc, it's mentioned that the Partial-Parallel-Repair technique
>> was successfully implemented on top of the Quantcast File System (QFS) [30],
>> which supports RS-based erasure coded storage, and got promising results.
>> It's really an encouraging factor for us. I haven't gone through this doc
>> deeply; it would be really great if you (or I, or some other folks) could
>> come up with thoughts to discuss/implement similar mechanisms in HDFS
>> as well. Most likely, we will kick-start the performance improvement
>> activities after the much-awaited 3.0.0-alpha release :)
>>
>> >>>> Also, I would like to know what others have done to sustain good
>> >>>> performance even under failures (other than keeping fail-over
>> >>>> replicas).
>> I don't have much idea about this part; probably some other folks can
>> pitch in and share their thoughts.
>>
>> Regards,
>> Rakesh
>>
>> On Fri, Jul 22, 2016 at 2:03 PM, Roy Leonard <[hidden email]>
>> wrote:
>>
>>> Greetings!
>>>
>>> We are evaluating erasure coding on HDFS to reduce storage cost.
>>> However, the degraded read latency seems like a crucial bottleneck for
>>> our system.
>>> After exploring some strategies for alleviating the pain of degraded
>>> read latency, I found a "tree-like recovery" technique might be useful,
>>> as described in the following paper:
>>> "Partial-parallel-repair (PPR): a distributed technique for repairing
>>> erasure coded storage" (Eurosys-2016)
>>> http://dl.acm.org/citation.cfm?id=2901328
>>>
>>> My question is:
>>>
>>> Do you already have such tree-like recovery implemented in HDFS-EC? If
>>> not, do you have any plans to add a similar technique in the near future?
>>>
>>> Also, I would like to know what others have done to sustain good
>>> performance even under failures (other than keeping fail-over replicas).
>>>
>>> Regards,
>>> R.
>>>
>>
>>
>