Bulk chmod,chown operations on HDFS


Bulk chmod,chown operations on HDFS

ravi teja-2
Hi Community,

As part of the new authorisation changes, we need to change the permissions and owners of many files in HDFS (2.6.0) with chmod and chown.

To do this we need to stop processing on these directories to avoid inconsistencies in permissions, so we need to take downtime for the specific pipelines operating on these folders.


The total number of files/directories to be operated on is around 10 million.
A recursive chmod (chmod -R) on 160K objects took around 15 minutes.

At this rate the full operation will take a long time, and the downtime would stretch to many hours.
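To put a rough number on it, assuming the observed 160K-objects-in-15-minutes rate scales linearly, the back-of-envelope estimate is:

```shell
# Back-of-envelope estimate, assuming the observed rate scales linearly.
objects_done=160000
minutes_taken=15
total_objects=10000000

rate_per_min=$(( objects_done / minutes_taken ))     # ~10666 objects/minute
total_minutes=$(( total_objects / rate_per_min ))    # ~937 minutes
total_hours=$(( total_minutes / 60 ))                # ~15 hours
echo "${total_minutes} minutes (~${total_hours} hours)"
```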

A MapReduce program is one option, but chmod/chown are heavy operations and, done at this scale, would slow down the cluster for other users.

Are there any options for doing bulk permission changes (chmod/chown) that avoid these issues?
If not, are there alternative approaches to carry out the same operation at this scale, something like an admin backdoor to the fsimage?


Thanks,
Ravi Teja

Re: Bulk chmod,chown operations on HDFS

Chris Nauroth
Hello Ravi,

You might consider using DistCh.  In the same way that DistCp is a distributed copy implemented as a MapReduce job, DistCh is a MapReduce job that distributes the work of chmod/chown.

DistCh will become easier to access through convenient shell commands in Apache Hadoop 3.  In version 2.6.0, it's undocumented and hard to find, but it's still there, inside hadoop-extras.jar.  Here is an example invocation:

hadoop jar share/hadoop/tools/lib/hadoop-extras-*.jar org.apache.hadoop.tools.DistCh

It might take some fiddling with the classpath to get this right.  If so, then I recommend looking at how the shell scripts in trunk set up the classpath.
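For illustration, a sketch of what a full invocation might look like. The paths, user, and group below are hypothetical, and the argument format (colon-separated path:owner:group:permission, with empty fields left unchanged) is from my reading of the DistCh usage string; run the tool with no arguments to print the usage for the version you actually have.

```shell
# Hypothetical paths/owners; verify the argument format against the usage
# output of your DistCh build before running against production data.
hadoop jar share/hadoop/tools/lib/hadoop-extras-*.jar \
    org.apache.hadoop.tools.DistCh \
    /data/pipelines/raw:etluser:etlgroup:750 \
    /data/pipelines/staging::etlgroup:
```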


As you pointed out, this would generate higher NameNode traffic compared to your typical baseline load.  To mitigate this, I recommend that you start with a test run in a non-production environment to see how it reacts.

--Chris Nauroth

From: ravi teja <[hidden email]>
Date: Wednesday, June 15, 2016 at 8:33 PM
To: "[hidden email]" <[hidden email]>
Subject: Bulk chmod,chown operations on HDFS


Re: Bulk chmod,chown operations on HDFS

ravi teja-2
Thanks for the info, Chris. I will try DistCh.
Sorry for the late response.

For a chmod -R call on one directory, I see many calls to the NameNode, so I assume the recursion is done by the client.

Isn't it better for the recursion to be done by the NameNode while holding a re-entrant lock, instead of recursing over the network and taking the lock for every call?
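As a local-filesystem analogue of the client-side behaviour being described, the sketch below applies a permission change one entry at a time with plain find/chmod. On HDFS, each chmod in the loop would correspond to one setPermission RPC to the NameNode, each taking the namespace lock separately.

```shell
# Build a tiny tree, then apply the permission change one entry at a time,
# mimicking the client-side recursion of chmod -R on HDFS.
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
touch "$demo/a/f1" "$demo/a/b/f2"

# One chmod per path: on HDFS, each iteration would be a separate
# NameNode RPC, each acquiring and releasing the namespace lock.
find "$demo" | while IFS= read -r p; do
  chmod 750 "$p"
done
```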

On Thu, Jun 16, 2016 at 11:24 AM, Chris Nauroth <[hidden email]> wrote:


Re: Bulk chmod,chown operations on HDFS

Chris Nauroth
The community has nearly always steered away from NameNode RPC implementations that operate recursively on an entire sub-tree, instead implementing recursion on the client side.  (The getContentSummary RPC is a notable exception.)  This helps avoid unpredictable long execution time and long lock duration on the server side for any individual RPC.  Admittedly, this is an engineering trade-off, but the choice has worked well in practice.

--Chris Nauroth

From: ravi teja <[hidden email]>
Date: Wednesday, June 29, 2016 at 4:55 AM
To: Chris Nauroth <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Bulk chmod,chown operations on HDFS
