HDFS2 vs MaprFS


HDFS2 vs MaprFS

Ascot Moss
Hi,

I read some (old?) articles on the Internet comparing MapR-FS with HDFS.

https://www.mapr.com/products/m5-features/no-namenode-architecture

It states that HDFS Federation has:

a) "Multiple Single Points of Failure" - is this really true?
Why does MapR compare against HDFS rather than HDFS2? Using HDFS makes the comparison unfair (or even misleading): HDFS there means the old Hadoop 1.x generation, while HDFS2 has been available since 2013-10-15 and has no single point of failure.

b) "Limit to 50-200 million files" - is this really true?
I have seen many real-world Hadoop clusters with over 10 PB of data, some even with 150 PB. If the "limit to 50-200 million files" were true of HDFS2, why are there so many production Hadoop clusters in the real world, and how do they manage that limit? For instance, Facebook's "Like" feature runs on HBase at web scale; I can imagine HBase generating a huge number of files in Facebook's Hadoop cluster, far more than 50-200 million.

From my point of view it is rather the opposite: MapR-FS is the one with a true limit (up to about 1 trillion files), while HDFS2 can handle an effectively unlimited number of files. Please correct me if I am wrong.

c) "Performance Bottleneck" - again, is this really true?
MapR-FS drops the namenode in order to gain file system performance. But without a namenode, MapR-FS would lose data locality, which is one of the beauties of Hadoop. If data locality is no longer available, a big data application running on MapR-FS might gain some file system performance, yet it would lose the much larger gain that data locality, provided via Hadoop's namenode, delivers (gain small, lose big).

d) "Commercial NAS required"
Is there any wiki/blog/discussion about commercial NAS and HDFS Federation?

regards
 



Re: HDFS2 vs MaprFS

Gavin Yue
Here is what I found on the Hortonworks website:


Namespace scalability

While HDFS cluster storage scales horizontally with the addition of datanodes, the namespace does not. Currently the namespace can only be vertically scaled on a single namenode. The namenode stores the entire file system metadata in memory. This limits the number of blocks, files, and directories supported on the file system to what can be accommodated in the memory of a single namenode. A typical large deployment at Yahoo! includes an HDFS cluster with 2700-4200 datanodes with 180 million files and blocks, and address ~25 PB of storage. At Facebook, HDFS has around 2600 nodes, 300 million files and blocks, addressing up to 60PB of storage. While these are very large systems and good enough for majority of Hadoop users, a few deployments that might want to grow even larger could find the namespace scalability limiting.






Re: HDFS2 vs MaprFS

daemeon reiydelle

There are indeed many tuning points here. If the namenodes and journal nodes can be made larger, perhaps even bonding multiple 10GbE NICs, one can scale quite easily. I did have one client where the file counts forced multiple clusters, but we were able to split them by airframe type, e.g. fixed wing in one, rotary subsonic in another, etc.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872




Re: HDFS2 vs MaprFS

Ascot Moss
Will the common pool of datanodes with namenode federation in HDFS2 be a more effective alternative than multiple clusters?




Re: HDFS2 vs MaprFS

Hayati Gonultas
Hi,

In most cases I think one cluster is enough. HDFS is a file system, and with federation you can have multiple namenodes serving different mount points: you might mount /images/facebook on namenode1 and /images/instagram on namenode2, much like file system mounts in Linux. With that approach you hardly ever need another cluster. (I do not know much about inter-namenode read/write requests, by the way.)
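
To make the mount-point idea concrete, a minimal client-side ViewFS sketch could look roughly like this (the cluster name, hostnames and paths are made-up placeholders, not a recommendation for any specific setup):

    <!-- client configuration (e.g. core-site.xml), illustrative values only -->
    <property>
      <name>fs.defaultFS</name>
      <value>viewfs://myCluster</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.myCluster.link./images/facebook</name>
      <value>hdfs://namenode1.example.com:8020/images/facebook</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.myCluster.link./images/instagram</name>
      <value>hdfs://namenode2.example.com:8020/images/instagram</value>
    </property>

Each mount point is served by its own namenode, while the pool of datanodes is shared across the federation.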

Additionally, having a namenode is good for performance. HDFS 2.x also supports SSDs and other storage types that can be used for caching, and many other configuration options arrive with HDFS 2.x as well.

Last but not least, the namenode had better run on redundant hardware: backed-up power supplies, RAID and other redundancy options are good for namenode hosts, in contrast to datanodes, which typically run on commodity hardware without RAID. So a NAS is not required for HDFS. Only the filesystem image and edit log are stored on disk; the rest of the namenode's work happens in RAM. It is also recommended to store a backup of the filesystem image in a safe location (for example an NFS mount), which can be configured. Using a SAN/NAS purely for reliability (to store the filesystem image and edit logs) therefore makes little sense: the working state is in RAM anyway, and if you back up your filesystem image and your hardware is reliable enough (RAID, redundant power supplies, multiple NICs, etc.), a SAN/NAS is not required at all, unless your filesystem image is too big to fit on a single server. (The filesystem image is similar to the system tables of a traditional ext3/ext4/FAT32/NTFS filesystem: it holds metadata, so it should fit on a single "good enough" server in most cases.)




--
Hayati Gonultas

Re: HDFS2 vs MaprFS

Hayati Gonultas
In reply to this post by Ascot Moss
I forgot to mention the file system limit.

Yes, HDFS has a limit: for performance reasons the filesystem image is read from disk into RAM and the rest of the namenode's work is done in RAM, so the RAM must be big enough to hold the filesystem image. But HDFS has options such as HAR files (Hadoop Archives) to work around this limitation.
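
For reference, a Hadoop Archive is built with the hadoop archive tool; a rough sketch (the directory names are only examples):

    # Pack the many small files under /user/logs/2016 into one archive
    # (one HAR is only a few HDFS files, so far fewer namenode objects)
    hadoop archive -archiveName logs-2016.har -p /user/logs 2016 /user/archives

    # The files stay readable through the har:// scheme
    hadoop fs -ls har:///user/archives/logs-2016.har/2016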


--
Hayati Gonultas

Re: HDFS2 vs MaprFS

Ascot Moss
HDFS2 "Limit to 50-200 million files", is it really true like what MapR says? 



Re: HDFS2 vs MaprFS

Hayati Gonultas

No, it is not true; it depends entirely on the server's RAM.

Assume each file takes about 1 KB of RAM and your server has 128 GB of RAM; then you would have 128 000 000 million files. But 1 KB is just an approximation: roughly, 1 GB of heap holds 1 million blocks, so if your server has 512 GB of RAM you can hold approximately 512 million blocks.



Re: HDFS2 vs MaprFS

Hayati Gonultas
In reply to this post by Ascot Moss

I wrote 128 000 000 million in my previous post; that was incorrect (million times million).

What I meant is 128 million.

1 GB roughly corresponds to 1 million.



Re: HDFS2 vs MaprFS

Hayati Gonultas

Another correction needs to be made, this time about terminology.

I said 1 GB = 1 million blocks. Pay attention to the term "block": a block is not a file. A file may consist of more than one block; with the default block size of 64 MB, a 640 MB file takes 10 blocks. Each file also has a name, permissions, path, creation date and so on, and this metadata is held in memory per file, not per block. So it is good to have files with many blocks rather than many small files.

In terms of file count, the worst-case scenario is therefore every file occupying only one block, which gives my 1 GB = 1 million files figure. Typically files have many blocks, and this count may increase.
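
As a back-of-envelope illustration of that rule of thumb (the 1 GB per ~1 million figure is an approximation from this thread, not a guarantee), a quick Python sketch:

    # Rough namenode capacity estimate, using the "~1 GB of namenode heap
    # per ~1 million blocks, worst case one block per file" rule of thumb
    # discussed above. Purely illustrative numbers.
    def rough_max_files(heap_gb, files_per_gb=1_000_000):
        return heap_gb * files_per_gb

    print(rough_max_files(128))   # ~128 million single-block files
    print(rough_max_files(512))   # ~512 million single-block files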



Re: HDFS2 vs MaprFS

Marcin Tustin
In reply to this post by Ascot Moss
The namenode architecture is a source of fragility in HDFS. While a high availability deployment (with two namenodes, and a failover mechanism) means you're unlikely to see service interruption, it is still possible to have a complete loss of filesystem metadata with the loss of two machines.

Secondly, because HDFS identifies datanodes by their hostname/ip, dns changes can cause havoc with HDFS (see my war story on this here: https://medium.com/handy-tech/renaming-hdfs-datanodes-considered-terribly-harmful-2bc2f37aabab).

Also, the namenode/datanode architecture probably does contribute to the small files problem being a problem. That said, there are a lot of practical solutions for the small files problem.

If you're just setting up a data infrastructure, I would say consider alternatives before you pick HDFS. If you run in AWS, S3 is a good alternative. If you run in some other cloud, it's probably worth considering whatever their equivalent storage system is.
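
If you do go the S3 route on AWS, a minimal s3a setup looks roughly like this (the bucket name and keys are placeholders, and the hadoop-aws module must be on the classpath):

    <!-- core-site.xml, illustrative values only -->
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>

    # then address the bucket directly by URI
    hadoop fs -ls s3a://my-bucket/warehouse/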




Re: HDFS2 vs MaprFS

mohajeri
It is very common practice to back up the metadata to some SAN store, so a complete loss of all the metadata is preventable. You could lose a day's worth of data if, for example, you back up the metadata once a day, but you could do it more frequently. I'm not saying S3 or Azure Blob are bad ideas.
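
Two common ways to do that (the paths below are just examples): write the namenode metadata to more than one directory, and/or pull a copy of the fsimage on a schedule:

    <!-- hdfs-site.xml: keep a second copy of the metadata, e.g. on an NFS mount -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
    </property>

    # or fetch the latest fsimage from the namenode periodically
    hdfs dfsadmin -fetchImage /backups/namenode/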




Re: HDFS2 vs MaprFS

Ascot Moss
I don't think HDFS2 needs a SAN; using the QuorumJournal approach is much better than the shared-edits-directory (SAN/NFS) approach.
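
For what it's worth, a minimal sketch of the QJM-based shared-edits configuration (the journal node hostnames are placeholders):

    <!-- hdfs-site.xml: edits go to a quorum of JournalNodes instead of an NFS/SAN directory -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/data/1/dfs/jn</value>
    </property>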






Re: HDFS2 vs MaprFS

Ascot Moss
Since MapR is proprietary, I find that it has many compatibility issues with Apache open source projects, or even worse, that it loses some of Hadoop's features. For instance, Hadoop has a built-in storage policy named COLD; where is that in MapR-FS? Not to mention that MapR-FS loses data locality.
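
For reference, this is roughly how it looks on the HDFS side (Hadoop 2.6+; the path is just an example, and the datanode data directories must be tagged with an ARCHIVE storage type first):

    hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
    hdfs storagepolicies -getStoragePolicy -path /data/archive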





Re: HDFS2 vs MaprFS

Aaron Eng
>Since MapR  is proprietary, I find that it has many compatibility issues in Apache open source projects

This is faulty logic. And rather than saying it has "many compatibility issues", perhaps you can describe one.

Both MapRFS and HDFS are accessible through the same API. The backend implementations are what differ.
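
For example, on a cluster with the MapR Hadoop client installed, the same shell commands (and the same FileSystem API underneath) work against either scheme; the URIs below are illustrative:

    # only the filesystem URI differs
    hadoop fs -ls hdfs://namenode.example.com:8020/user/alice
    hadoop fs -ls maprfs:///user/alice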

>Hadoop has a built-in storage policy named COLD, where is it in Mapr-FS?

Long before HDFS had storage policies, MapRFS had topologies. You can restrict particular types of storage to a topology and then assign a volume (a subset of the data stored in MapRFS) to that topology, so the data in that subset is served by whatever hardware is mapped into the topology.

>no to mention that Mapr-FS  loses Data-Locality.

This statement is false.



On Mon, Jun 6, 2016 at 8:32 AM, Ascot Moss <[hidden email]> wrote:
Since MapR is proprietary, I find that it has many compatibility issues in Apache open source projects, or even worse, loses Hadoop's features. For instance, Hadoop has a built-in storage policy named COLD; where is it in Mapr-FS? Not to mention that Mapr-FS loses Data-Locality.

On Mon, Jun 6, 2016 at 11:26 PM, Ascot Moss <[hidden email]> wrote:
I don't think HDFS2 needs a SAN; the QuorumJournal approach is much better than the shared-edits-directory-on-SAN approach.




On Monday, June 6, 2016, Peyman Mohajerian <[hidden email]> wrote:
It is very common practice to back up the metadata to some SAN store, so the idea of a complete loss of all the metadata is preventable. You could lose a day's worth of data if, e.g., you back up the metadata once a day, but you could do it more frequently. I'm not saying S3 or Azure Blob are bad ideas.






Re: HDFS2 vs MaprFS

Ascot Moss
In HDFS2, I can find "dfs.storage.policy"; for instance, HDFS2 allows applying the COLD storage policy to a directory. Where are these features in Mapr-FS?
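For reference, a minimal sketch (assuming a Hadoop 2.6+ client and a hypothetical /data/archive directory) of what applying the COLD policy to a directory looks like in HDFS2:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetColdPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            // COLD places all replicas on ARCHIVE storage. Equivalent CLI (Hadoop 2.6+):
            //   hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
            ((DistributedFileSystem) fs).setStoragePolicy(new Path("/data/archive"), "COLD");
        }
    }
}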







Re: HDFS2 vs MaprFS

Aaron Eng
As I said, MapRFS has topologies. You assign a volume (which is mounted at a directory path) to a topology, and in turn all the data in the volume (i.e. under that directory) is stored on the storage hardware assigned to the topology.

These topological labels provide the same benefits as dfs.storage.policy as well as enabling additional types of use cases.








Re: HDFS2 vs MaprFS

Ascot Moss
Hi Aaron, from the MapR site, [now for HDFS2] "Limit to 50-200 million files", is it really true?









Re: HDFS2 vs MaprFS

Aaron Eng
As others have answered, the number of blocks/files/directories that can be addressed by a NameNode is limited by the amount of heap space available to the NameNode JVM.  If you need more background on this topic, I'd suggest reviewing various materials from Hadoop JIRA and other vendors that supply and support HDFS.

For instance, this JIRA:

Or, for instance, Cloudera discusses this topic:

I don't intend to speak for Cloudera (obviously), but you can see on that page:
"Cloudera recommends 1 GB of NameNode heap space per million blocks to account for the namespace objects."

So, at roughly 1 GB per million blocks, 200 million blocks would need on the order of 200 GB of NameNode heap. Do you have >200 GB of memory to give to the NameNode JVM? And do you want to do that? If yes, then you could probably address more than 200 million blocks.
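To spell out the arithmetic, a back-of-envelope sketch (the 1 GB per million figure is just the rule of thumb quoted above, not an exact formula):

public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // Rule of thumb quoted above: ~1 GB of NameNode heap per million namespace objects.
        double gbPerMillionObjects = 1.0;
        long objectsInMillions = 200;  // e.g. ~200 million files/blocks/directories
        double estimatedHeapGb = objectsInMillions * gbPerMillionObjects;
        System.out.printf("Estimated NameNode heap: ~%.0f GB%n", estimatedHeapGb);  // ~200 GB
    }
}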
