Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst


Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Shady Xu
Hi,

It's widely known that we should mount each disk to a separate directory, without any RAID configuration, because that provides the best I/O performance.

However, I have lately run some tests with three different configurations and found this may not be the whole truth. Below are the configurations and the statistics reported by 'iostat -x'.

Configuration A: RAID 0 across all 12 disks, mounted as one directory
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.01     0.59  112.02   65.92 15040.07 15856.86   347.27     0.32    1.81    2.36    0.86   0.93  16.49

------------------------------------------------------------------------------

Configuration B: no RAID at all, each disk mounted to its own directory
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.01     0.12    2.88    5.23   364.54  1247.10   397.52     0.76   93.80    9.05  140.42   2.44   1.98
sdg               0.01     0.07    2.39    5.27   328.72  1246.51   410.93     0.75   97.88   10.93  137.33   2.63   2.02
sdl               0.01     0.07    2.59    5.46   340.61  1299.00   407.00     0.82  102.18    9.64  146.09   2.55   2.05
sdf               0.01     0.11    2.28    5.02   291.48  1197.00   407.99     0.72   99.23    9.15  140.12   2.62   1.91
sdb               0.01     0.07    2.69    5.23   334.19  1238.20   396.99     0.74   93.84    8.10  137.98   2.41   1.91
sde               0.01     0.11    2.81    5.27   376.54  1262.25   405.56     0.79   97.62   10.96  143.84   2.58   2.08
sdk               0.01     0.12    3.02    5.20   371.92  1244.48   392.93     0.79   96.07    8.63  146.85   2.48   2.04
sda               0.00     0.07    2.82    5.33   370.06  1260.68   400.52     0.78   96.09    9.72  141.74   2.49   2.03
sdi               0.01     0.11    3.09    5.30   378.19  1269.98   392.63     0.78   92.47    5.98  142.88   2.31   1.94
sdj               0.01     0.07    3.04    5.02   365.32  1185.24   385.01     0.74   92.22    6.31  144.29   2.40   1.93
sdh               0.01     0.07    2.74    5.34   356.22  1264.28   401.06     0.78   96.81   11.36  140.75   2.55   2.06
sdd               0.01     0.11    2.47    5.39   343.22  1292.23   416.20     0.76   96.48   10.26  135.96   2.54   1.99

------------------------------------------------------------------------------

Configuration C: each of the 12 disks as its own single-disk RAID 0, mounted to 12 different directories
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00     0.10    8.88    7.42  1067.65  1761.12   346.94     0.13    7.94    3.64   13.09   0.46   0.75
sdb               0.00     0.09    8.83    7.52  1066.16  1784.79   348.65     0.13    8.02    3.75   13.02   0.47   0.76
sdc               0.00     0.10    8.82    7.48  1073.74  1776.02   349.61     0.13    8.09    3.76   13.19   0.47   0.76
sde               0.00     0.10    8.74    7.46  1060.79  1771.46   349.63     0.13    7.80    3.53   12.81   0.45   0.73
sdg               0.00     0.10    8.93    7.46  1101.14  1772.73   350.64     0.13    7.81    3.70   12.71   0.47   0.77
sdf               0.00     0.09    8.75    7.46  1062.06  1772.08   349.73     0.13    8.03    3.78   13.00   0.46   0.75
sdh               0.00     0.10    9.09    7.45  1114.94  1770.07   348.76     0.13    7.83    3.69   12.89   0.47   0.77
sdi               0.00     0.10    8.91    7.43  1086.85  1761.30   348.48     0.13    7.93    3.64   13.07   0.46   0.75
sdj               0.00     0.10    9.04    7.46  1111.32  1768.79   349.15     0.13    7.79    3.64   12.82   0.46   0.76
sdk               0.00     0.10    9.12    7.51  1122.00  1783.41   349.49     0.13    7.82    3.72   12.80   0.48   0.79
sdl               0.00     0.10    8.91    7.49  1087.98  1777.77   349.49     0.13    7.89    3.69   12.89   0.46   0.75
sdm               0.00     0.09    8.97    7.52  1098.82  1787.10   349.95     0.13    7.96    3.79   12.94   0.47   0.78

It seems the configuration with RAID 0 across all disks mounted to one directory provides the best disk performance, and the no-RAID configuration provides the worst. That's the exact opposite of the widely known claim. Am I doing anything wrong?
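
For what it's worth, the three tables are not directly comparable line by line: Configuration A reports a single RAID device while B and C report twelve physical disks each, so the per-device numbers have to be summed before drawing conclusions. Below is a minimal aggregation sketch in plain Java, with the rkB/s and wkB/s figures hand-copied from the tables above; it assumes the workload behind each run was comparable, which iostat alone cannot confirm.

// Rough aggregation of the iostat tables above: sum rkB/s + wkB/s per configuration.
// The numbers are hand-copied snapshot values (kB/s) and say nothing about the
// workload that drove each test; treat the output as a sanity check, not a benchmark.
import java.util.Arrays;

public class IostatAggregate {

    static double totalKBs(double[] readKBs, double[] writeKBs) {
        return Arrays.stream(readKBs).sum() + Arrays.stream(writeKBs).sum();
    }

    public static void main(String[] args) {
        // Configuration A: one RAID 0 device (sdb)
        double a = totalKBs(new double[] {15040.07}, new double[] {15856.86});

        // Configuration B: twelve separate disks, no RAID
        double b = totalKBs(
            new double[] {364.54, 328.72, 340.61, 291.48, 334.19, 376.54,
                          371.92, 370.06, 378.19, 365.32, 356.22, 343.22},
            new double[] {1247.10, 1246.51, 1299.00, 1197.00, 1238.20, 1262.25,
                          1244.48, 1260.68, 1269.98, 1185.24, 1264.28, 1292.23});

        // Configuration C: twelve single-disk RAID 0 volumes
        double c = totalKBs(
            new double[] {1067.65, 1066.16, 1073.74, 1060.79, 1101.14, 1062.06,
                          1114.94, 1086.85, 1111.32, 1122.00, 1087.98, 1098.82},
            new double[] {1761.12, 1784.79, 1776.02, 1771.46, 1772.73, 1772.08,
                          1770.07, 1761.30, 1768.79, 1783.41, 1777.77, 1787.10});

        System.out.printf("A: %.1f MB/s, B: %.1f MB/s, C: %.1f MB/s%n",
                a / 1024, b / 1024, c / 1024);
    }
}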

Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Andrew Wright
Yes you are. 

If you lose any one of your disks with a RAID 0 spanning all drives, you will lose all the data in that directory.

And disks do die. 

Yes, you get better single-threaded performance, but you are putting that entire directory/data set at higher risk.

Cheers


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Shady Xu
Thanks Andrew. I know about the disk failure risk, and that it's one of the reasons why we should use JBOD. But JBOD provides worse performance than RAID 0. And taking into account that HDFS keeps other replicas and will create another replica on a different DataNode when a disk failure happens, why should we sacrifice performance to prevent data loss that HDFS can naturally recover from?


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

daemeon reiydelle
Have you considered the probability of a disk failure (mean time to failure), and then factored in that with RAID 0 a single failure is 12 times as likely to take out the whole volume? Then compare that with the time to re-replicate in degraded mode when you have such a large number of drives on each node.
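
A hedged back-of-the-envelope sketch of that point, in plain Java — the disk count matches this thread, but the per-disk capacity and the annual failure rate are invented example numbers, not measurements:

// Back-of-the-envelope: how likely a disk failure is per node per year, and how much
// data HDFS has to re-replicate afterwards. All inputs are illustrative assumptions.
public class FailureImpact {
    public static void main(String[] args) {
        int disks = 12;
        double diskTB = 4.0;             // assumed capacity per disk, in TB
        double annualFailureRate = 0.04; // assumed per-disk AFR (4%), purely an example

        // Probability that at least one of the 12 disks fails within a year.
        double pNodeHit = 1 - Math.pow(1 - annualFailureRate, disks);

        // Data that must be re-replicated after a single disk failure:
        double jbodLossTB  = diskTB;         // JBOD: only the failed disk's blocks are gone
        double raid0LossTB = disks * diskTB; // RAID 0 across all disks: the whole volume is gone

        System.out.printf("P(at least one disk failure per node per year) ~= %.1f%%%n",
                pNodeHit * 100);
        System.out.printf("Re-replication after one failure: JBOD ~%.0f TB, RAID0-all ~%.0f TB%n",
                jbodLossTB, raid0LossTB);
    }
}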

Secondly, I have a question about your configuration description: I suspect you are using software RAID (vs. hardware RAID controllers), yes? If so, you are consuming one or two cores to handle the RAID striping calculations for 12 drives, at a guess.

Lastly, using the vmstat/iostat output as a baseline for your question leaves out some other aspects of HDFS: HDFS multithreads the writes across all the devices and across all the nodes, rather than to one large single device, and that parallelization means more parallel I/O queues. So it seems to me that your question is a bit simplistic, no?


.......


Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872



Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Allen Wittenauer-2
In reply to this post by Shady Xu


On 2016-07-30 20:12 (-0700), Shady Xu <[hidden email]> wrote:
> Thanks Andrew, I know about the disk failure risk and that it's one of the
> reasons why we should use JBOD. But JBOD provides worse performance than
> RAID 0.

It's not about failure: it's about speed.  RAID 0 performance will drop like a rock if any one disk in the set is slow. When all the drives are performing at their peak, yes, it's definitely faster.  But over time drive speed will decline (sometimes to half speed or less!), usually prior to a failure. That failure may take a while, so in the meantime your cluster is getting slower ... and slower ... and slower ...

As a result, JBOD will be significantly faster over the _lifetime_ of the disks vs. a comparison made _today_.
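
A small toy model of why the slowest member dominates a stripe set, in plain Java; the per-disk speeds are invented example values, and the model deliberately ignores controller caches, queueing and request mix:

// Toy throughput model: a RAID 0 stripe moves in lock-step, so its aggregate rate is
// roughly N times its slowest member, while independent JBOD directories simply add up.
// The speeds below are made-up example values in MB/s, not measurements.
import java.util.Arrays;

public class StripeVsJbod {
    public static void main(String[] args) {
        // eleven healthy disks and one aging disk that has slowed down
        double[] diskMBs = {160, 160, 160, 160, 160, 160, 160, 160, 160, 160, 160, 70};

        double jbod  = Arrays.stream(diskMBs).sum();
        double raid0 = diskMBs.length * Arrays.stream(diskMBs).min().orElse(0);

        System.out.printf("JBOD aggregate   ~= %.0f MB/s%n", jbod);  // 1830 MB/s
        System.out.printf("RAID 0 aggregate ~= %.0f MB/s%n", raid0); // 840 MB/s, dragged down by the slow disk
    }
}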


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Dejan Menges
Sorry for jumping in, but regarding performance... it took us a while to figure out why, whatever disk/RAID 0 performance you have, when it comes to HDFS with replication enabled (replication factor greater than one), disk write speed drops to 100 Mbps. After long tests with Hortonworks, they found that the issue is that someone at some point in history hardcoded a value somewhere, and whatever setup you have, you were limited to it. Luckily we have quite a powerful testing environment, and the plan is to test the patch later this week. I'm not sure whether there's an official HDFS bug for this; I checked our internal history but didn't see anything like that.

This was quite disappointing, as whatever tuning, controllers, and setups you do, it all goes down the drain because of this.


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Shady Xu
Thanks Allen. I am aware of what you described and am wondering what the await and svctm values are on your cluster nodes. If there is no significant difference, maybe I should try other ways to tune my HBase.

And Dejan, I've never heard of or noticed what you described. If that's true, it's really disappointing; please notify us if there's any progress.


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Dejan Menges
Hi Shady,

We did extensive tests on this and received a fix from Hortonworks, which we are probably the first and only ones to test, most likely tomorrow evening. If the Hortonworks guys are reading this, maybe they know the official HDFS ticket ID for it, if there is one, as I can't find it in our correspondence.

Long story short: a single server had RAID controllers with 1G and 2G cache (both scenarios were tested). It started as a simple benchmark using TestDFSIO while trying to narrow down the best server-side configuration (discussions like this one, JBOD, RAID 0, benchmarking, etc.). However, with 10-12 disks in a single server and the mentioned controllers, we got 6-10 times higher write speed when not using replication (meaning replication factor one).

It really took months to narrow it down to a single hardcoded value, HdfsConstants.DEFAULT_DATA_SOCKET_SIZE (just looking into the patch). In the end, tcpPeerServer.setReceiveBufferSize(HdfsConstants.DEFAULT_DATA_SOCKET_SIZE) basically limited write speed to this constant when using replication, which is super annoying (especially in a context where more or less everyone now has network speed greater than 100 Mbps). This can be found in b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
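
For reference, here is a minimal sketch of the general shape such a fix could take. This is not the actual Hortonworks patch: it uses plain java.net to stay self-contained, the system property name is a made-up stand-in rather than a real HDFS configuration key, and 128 KB is what I believe HdfsConstants.DEFAULT_DATA_SOCKET_SIZE holds. The idea is simply to make the buffer size configurable and to skip the explicit setReceiveBufferSize() call when it is not set, so the kernel's TCP autotuning can size the window itself.

// Illustrative sketch only (not the real patch): apply an explicit receive buffer size
// only when one is configured; otherwise leave the socket alone so TCP autotuning can
// pick the window, instead of always forcing the old hardcoded ~128 KB constant.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReceiveBufferSketch {

    static void applyReceiveBuffer(ServerSocket server, int configuredSize) throws IOException {
        if (configuredSize > 0) {
            server.setReceiveBufferSize(configuredSize); // explicit cap, like the old constant
        }
        // configuredSize <= 0: no call at all, the OS sizes the TCP window dynamically
    }

    public static void main(String[] args) throws IOException {
        // hypothetical knob for the example; a real fix would read a dfs.* configuration key
        int size = Integer.getInteger("example.data.socket.recv.buffer", 0);

        try (ServerSocket server = new ServerSocket()) {   // unbound, so the buffer applies to accepts
            applyReceiveBuffer(server, size);
            server.bind(new InetSocketAddress(0));
            System.out.println("Effective receive buffer: " + server.getReceiveBufferSize() + " bytes");
        }
    }
}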


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Shady Xu
Hi Dejan,

I checked on GitHub and found that DEFAULT_DATA_SOCKET_SIZE is located in the hadoop-hdfs-project/hadoop-hdfs-client/ package in the Apache version of Hadoop, whereas it is in hadoop-hdfs-project/hadoop-hdfs/ in the Hortonworks version. I am not sure whether that means the parameter affects the performance of the Hadoop client in Apache HDFS but the performance of the DataNode in Hortonworks HDFS. If that's the case, maybe it's a bug introduced by Hortonworks?


Re: Surprisingly, RAID0 provides the best I/O performance whereas no RAID the worst

Dejan Menges
Hi Shady,

Great point, I didn't know that. Thanks a lot, will definitely check whether this was only related to the HWX distribution.

Thanks a lot, and sorry if I spammed this topic, it wasn't my intention at all.

Dejan
