[PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Matt Foley-2
This discussion started in HADOOP-8924
<https://issues.apache.org/jira/browse/HADOOP-8924>, where it was proposed
to replace the build-time utility "saveVersion.sh" with a Python script.
This would require Python as a build-time dependency.  Here's the
background:

Those of us involved in the branch-1-win port of Hadoop to Windows without
use of Cygwin have faced the issue of frequent use of shell scripts
throughout the system, both at build time (e.g., the utility
"saveVersion.sh") and at run time (config files like "hadoop-env.sh" and
the start/stop scripts in "bin/*").  Similar usages exist throughout the
Hadoop stack, in all projects.

The vast majority of these shell scripts do not do anything platform
specific; they can be expressed in a POSIX-conforming way.  Therefore, it
seems to us that it makes sense to start using a cross-platform scripting
language, such as Python, in place of shell for these purposes.  For those
rare occasions where platform-specific functionality really is needed,
Python also supports quite a lot of platform-specific functionality on both
Linux and Windows; but where that is inadequate, one could still
conditionally invoke a platform-specific module written in shell (for
Linux/*nix) or PowerShell or bat (for Windows).
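
To make the last point concrete, here is a minimal sketch (not code from
HADOOP-8924 or the branch-1-win port) of a Python wrapper that stays
platform-neutral and only dispatches to a platform-specific helper when it
has to.  The function names, the helper path "bin/helper", and the
convention of keeping a ".sh" and a ".ps1" variant side by side are all
hypothetical, chosen only for illustration.

```python
import platform
import subprocess


def platform_command(script_base):
    """Build the command line for the platform-specific helper.

    script_base is a hypothetical path without extension, e.g. "bin/helper".
    Assumption: a PowerShell variant (.ps1) sits next to the shell one (.sh).
    """
    if platform.system() == "Windows":
        return ["powershell", "-File", script_base + ".ps1"]
    return ["sh", script_base + ".sh"]


def run_helper(script_base, *args):
    """Run the helper for the current platform and return its exit code."""
    return subprocess.call(platform_command(script_base) + list(args))
```

Everything outside such a dispatch point (path handling, environment
variables, launching the JVM) could stay in one shared Python code path, so
only the genuinely platform-specific pieces would exist twice.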

The primary motive for moving to a cross-platform scripting language is
maintainability.  The alternative would be to maintain two complete suites
of scripts, one for Linux and one for Windows (and perhaps others in the
future).  We want to avoid the need to update dual modules in two different
languages when functionality changes, especially given that many Linux
developers are not familiar with PowerShell or bat, and many Windows
developers are not familiar with shell or bash.

Regarding the choice of Python:

   - There are already a few instances of Python usage in Hadoop, such as
   the utility (currently broken) "relnotes.py", and massive usage of Python
   in the examples/ and contrib/ directories.
   - Python is also used in Bigtop at build time.
   - The Python language is available for free on essentially all
   platforms, under an Apache-compatible license
   <http://www.apache.org/legal/resolved.html>.
   - It is supported in Eclipse and similar IDEs.
   - Most importantly, it is widely accepted as a reasonably good OO
   scripting language, and it is easily learned by anyone who already knows
   shell or Perl, or other common scripting languages.
   - On the Tiobe index of programming language popularity
   <http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html>,
   which seeks to measure the relative number of software engineers who know
   and use each language, Python far exceeds Perl and Ruby.  The only more
   well-known scripting languages are PHP and Visual Basic, neither of which
   seems a prime candidate for this use.

For build-time usage, I think we should immediately approve Python as a
build-time dependency, and allow people who are motivated to do so to open
JIRAs for migrating existing build-time shell scripts to Python.

For run-time, there is likely to be a lot more discussion.  Lots of folks,
including me, aren't really happy with the use of active scripts for
configuration, and various others, including I believe some of the Bigtop
folks, have issues with the way the start/stop scripts work.  Nevertheless,
all those scripts exist today and are widely used.  And they present an
impediment to porting to Windows-without-Cygwin.

Nothing about run-time use of scripts has changed significantly over the
past three years, and I don't think we should hold up the Windows port
while we have a huge discussion about issues that veer dangerously into
religious/aesthetic domains. It would be fun to have that discussion, but I
don't want this decision to be dependent on it!

So I propose that we go ahead and also approve Python as a run-time
dependency, and allow the inclusion of Python scripts in place of current
shell-based functionality.  The unpleasant alternative is to spawn a bunch
of PowerShell scripts in parallel to the current shell scripts, with a very
negative impact on maintainability.  The Windows port must, after all, be
allowed to proceed.

Let's have a discussion, and then I'll put both issues, separately, to a
vote (unless we miraculously achieve consensus without a vote :-)

I also encourage members of the other Hadoop-related projects to carry
this discussion into those forums.  It would be very cool to agree on a
whole-stack solution for the scripting problem.

Best regards,
--Matt

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Alejandro Abdelnur
Hey Matt,

We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
its way out with the move of docs to APT)

Why not do a maven-plugin to do that?

Colin already has something to simplify all the cmake calls from the builds
using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887)

We could do the same with protoc, thus simplifying the POMs.

The saveVersion.sh seems like another prime candidate for a maven plugin,
and in this case it would not require external tools.

Does this make sense?

Thx

--
Alejandro

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Matt Foley
Hi Alejandro,
For build-time issues in branch-2 and beyond, this may make sense (although
I'm concerned about obscuring functionality in a way that only maven
experts will be able to understand).  In the particular case of
saveVersion.sh, I'd be happy to see it done automatically by the build
tools.

However, for build-time issues in the non-mavenized branch-1, and for
run-time issues in both worlds, the need for cross-platform scripting
remains.

Thanks,
--Matt


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Alejandro Abdelnur
Got it, thx.

BTW, for branch-1, how about doing an Ant task as part of the build that
does that?

Thx



--
Alejandro

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Konstantin Boudnik-2
In reply to this post by Alejandro Abdelnur
I like Alejandro's idea about Maven for a few reasons:
  - bringing in a scripting environment which is known for its inter-version
    idiosyncrasies just because Windows can't handle trivial shell scripting
    looks like overkill to me
  - related to the above, there's a chance that Python's prerequisites used
    in Hadoop might conflict with some other components in the stack.
    This would be a nightmare for the integrator projects, i.e. Bigtop
  - Maven is the de facto standard for Java stacks
  - Maven has a built-in scripting language (Groovy) if some plugins aren't
    sufficient for achieving whatever goals

Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it uses
Maven stuff such as deploy/install via custom Ant tasks. The same approach
would work for saveVersion.sh and others, I am sure.

Cos


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Radim Kolar-2
In reply to this post by Alejandro Abdelnur

> Why not do a maven-plugin to do that?
Maven plugins are difficult to maintain. It's better to use inline
scripts, with something like this:

http://docs.codehaus.org/display/GMAVEN/Home;jsessionid=E29093B96230BBB4461F02A1718A6B71

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Chris Nauroth
In reply to this post by Konstantin Boudnik-2
I worked on some of the Python build scripting that currently resides in
branch-trunk-win.  Initially, my goal was to keep a "pure" Maven
implementation to the greatest degree possible without external scripting,
but I encountered a few problems:

1. One approach is to try to express all of the build logic with existing
Maven plugins.  This turned out to be infeasible in some cases.  I don't
know of an existing plugin that does anything like the logic in
saveVersion.sh/.py for walking the source tree and checksumming the files.
For protoc, I saw a proposed plugin in open source, but it hadn't reached
release status yet.  For creation of the distribution tarballs, the Maven
Ant Plugin (and actually the underlying Ant tool) cannot preserve file
permissions or symlinks.

2. Considering that the first approach isn't possible, another possibility
is to write custom Maven plugins.  This would require significantly more
engineering time to write and test the code.  I think there are some
legitimate concerns too about supportability, because this approach would
put significant build logic into Maven plugin code instead of something
more easily visible to release engineers, like pom.xml and external
scripts.  Also, I'm actually not sure that we can implement everything with
a Maven plugin.  For example, I mentioned the problem of preserving file
permissions and symlinks in the distribution tarballs.  Ant hasn't been
able to fix that problem due to a Java limitation, so our Maven plugins
coded in Java (or another JVM language) likely would suffer the same fate.
We might be stuck with some amount of external scripting no matter what.
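
For illustration only, the two tasks named above (walking the source tree
to checksum files, and building a tarball that keeps permissions and
symlinks) might look roughly like this in Python.  This is a sketch under
assumptions, not the actual saveVersion.py from branch-trunk-win: the
function names, the ".java" filter, and the use of MD5 are hypothetical
choices.  What it relies on from the standard library is that os.walk
visits the tree, and that tarfile records each member's mode bits and
stores symlinks as links rather than following them.

```python
import hashlib
import os
import tarfile


def tree_checksum(root, suffix=".java"):
    """Return one MD5 hex digest covering every matching file under root.

    Assumption: MD5 and the ".java" filter are illustrative choices only.
    """
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so results are repeatable
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # Hash the relative path with "/" separators plus the contents,
            # so the digest does not depend on the platform's path style.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8"))
            with open(path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()


def make_dist_tarball(src_dir, tar_path):
    """Pack src_dir into a gzip tarball, keeping modes and symlinks."""
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(tar_path, "w:gz") as archive:
        # By default tarfile does not dereference symlinks, and it stores
        # the permission bits reported by the file system.
        archive.add(src_dir, arcname=top)
```

A build could call tree_checksum("src") to stamp the version info and
make_dist_tarball("build/hadoop-x.y.z", "hadoop-x.y.z.tar.gz") to package
the result (both paths are placeholders).  How much permission and symlink
information is available on Windows depends on the file system and on
account privileges, so that part would still need checking there.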

Thank you,
--Chris



Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Matt Foley
In reply to this post by Konstantin Boudnik-2
Cos,
Please see in-line.

On Wed, Nov 21, 2012 at 12:00 PM, Konstantin Boudnik <[hidden email]> wrote:

> I like Alejandro's idea about Maven for a few reasons:
>   - bringing in a scripting environment which is known for its
>     inter-version idiosyncrasies just because Windows can't handle
>     trivial shell scripting looks like overkill to me
>

Excuse me?  Can we at least try not to belittle other people's platforms on
a public Apache forum?  There's nothing trivial about implementing shell on
Windows, as cygwin regrettably proved.


>   - related to the above, there's a chance that Python's prerequisites
>     used in Hadoop might conflict with some other components in the
>     stack.  This would be a nightmare for the integrator projects,
>     i.e. Bigtop
>

Said Bigtop project actually uses python, does it not?


>   - Maven is de-facto standard for Java stacks
>

Sure -- except for when Ant was the de-facto standard for Java stacks.  And
let's remember what maven and ant are/were the de-facto standard for:
Doing builds.  Not scripting everything that needs scripting.


>   - Maven has a built-in scripting language (Groovy) if some plugins aren't
>     sufficient for achieving whatever goals
>

Are you proposing Groovy as a better scripting language than Python?


>
> Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it
> uses Maven stuff such as deploy/install via custom Ant tasks. The same
> approach would work for saveVersion.sh and others, I am sure.
>

Current ant scripts in Hadoop seem to use maven only for artifact
management via the maven repository.  If I'm missing something, please
point it out.  The ant build task currently calls out to saveVersion.sh.
Having it call out to maven, which then calls out to a plug-in and/or a
Groovy script, doesn't sound like an improvement to me.  And it's a way
different use of maven than currently in the Hadoop-1 line, not a
continuation of established practice.

--Matt


>
> Cos
>
> On Wed, Nov 21, 2012 at 11:25AM, Alejandro Abdelnur wrote:
> > Hey Matt,
> >
> > We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
> > its way out with the move of docs to APT)
> >
> > Why not do a maven-plugin to do that?
> >
> > Colin already has something to simplify all the cmake calls from the
> builds
> > using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887)
> >
> > We could do the same with protoc, thus simplifying the POMs.
> >
> > The saveVersion.sh seems like another prime candidate for a maven plugin,
> > and in this case it would not require external tools.
> >
> > Does this make sense?
> >
> > Thx
> >
> > On Wed, Nov 21, 2012 at 11:15 AM, Matt Foley <[hidden email]> wrote:
> >
> > > This discussion started in
> > > HADOOP-8924<https://issues.apache.org/jira/browse/HADOOP-8924>
> > > , where it was proposed to replace the build-time utility
> "saveVersion.sh"
> > > with a python script.  This would require Python as a build-time
> > > dependency.  Here's the background:
> > >
> > > Those of us involved in the branch-1-win port of Hadoop to Windows
> without
> > > use of Cygwin, have faced the issue of frequent use of shell scripts
> > > throughout the system, both in build time (eg, the utility
> > > "saveVersion.sh"),
> > > and run time (config files like "hadoop-env.sh" and the start/stop
> scripts
> > > in "bin/*" ).  Similar usages exist throughout the Hadoop stack, in all
> > > projects.
> > >
> > > The vast majority of these shell scripts do not do anything platform
> > > specific; they can be expressed in a posix-conforming way.  Therefore,
> it
> > > seems to us that it makes sense to start using a cross-platform
> scripting
> > > language, such as python, in place of shell for these purposes.  For
> those
> > > rare occasions where platform-specific functionality really is needed,
> > > python also supports quite a lot of platform-specific functionality on
> both
> > > Linux and Windows; but where that is inadequate, one could still
> > > conditionally invoke a platform-specific module written in shell (for
> > > Linux/*nix) or powershell or bat (for Windows).
> > >
> > > The primary motive for moving to a cross-platform scripting language is
> > > maintainability.  The alternative would be to maintain two complete
> suites
> > > of scripts, one for Linux and one for Windows (and perhaps others in
> the
> > > future).  We want to avoid the need to update dual modules in two
> different
> > > languages when functionality changes, especially given that many Linux
> > > developers are not familiar with powershell or bat, and many Windows
> > > developers are not familiar with shell or bash.
> > >
> > > Regarding the choice of python:
> > >
> > >    - There are already a few instances of python usage in Hadoop, such
> as
> > >    the utility (currently broken) "relnotes.py", and massive usage of
> > > python
> > >    in the examples/ and contrib/ directories.
> > >    - Python is also used in Bigtop build-time.
> > >    - The Python language is available for free on essentially all
> > >    platforms, under an Apache-compatible
> > > license<http://www.apache.org/legal/resolved.html>.
> > >
> > >    - It is supported in Eclipse and similar IDEs.
> > >    - Most importantly, it is widely accepted as a reasonably good OO
> > >    scripting language, and it is easily learned by anyone who already
> knows
> > >    shell or perl, or other common scripting languages.
> > >    - On the Tiobe index of programming language
> > > popularity<
> > > http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html>,
> > >    which seeks to measure the relative number of software engineers who
> > > know
> > >    and use each language, Python far exceeds Perl and Ruby.  The only
> more
> > >    well-known scripting languages are PHP and Visual Basic, neither of
> > > which
> > >    seems a prime candidate for this use.
> > >
> > > For build-time usage, I think we should immediately approve python as a
> > > build-time dependency, and allow people who are motivated to do so, to
> open
> > > jiras for migrating existing build-time shell scripts to python.
> > >
> > > For run-time, there is likely to be a lot more discussion.  Lots of
> folks,
> > > including me, aren't real happy with use of active scripts for
> > > configuration, and various others, including I believe some of the
> Bigtop
> > > folks, have issues with the way the start/stop scripts work.
>  Nevertheless,
> > > all those scripts exist today and are widely used.  And they present an
> > > impediment to porting to Windows-without-cygwin.
> > >
> > > Nothing about run-time use of scripts has changed significantly over
> the
> > > past three years, and I don't think we should hold up the Windows port
> > > while we have a huge discussion about issues that veer dangerously into
> > > religious/aesthetic domains. It would be fun to have that discussion,
> but I
> > > don't want this decision to be dependent on it!
> > >
> > > So I propose that we go ahead and also approve python as a run-time
> > > dependency, and allow the inclusion of python scripts in place of
> current
> > > shell-based functionality.  The unpleasant alternative is to spawn a
> bunch
> > > of powershell scripts in parallel to the current shell scripts, with a
> very
> > > negative impact on maintainability.  The Windows port must, after all,
> be
> > > allowed to proceed.
> > >
> > > Let's have a discussion, and then I'll put both issues, separately, to
> a
> > > vote (unless we miraculously achieve consensus without a vote :-)
> > >
> > > I also encourage members of the other Hadoop-related projects, to carry
> > > this discussion into those forums.  It would be very cool to agree on a
> > > whole-stack solution for the scripting problem.
> > >
> > > Best regards,
> > > --Matt
> > >
> >
> >
> >
> > --
> > Alejandro
>

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Radim Kolar-2
In reply to this post by Chris Nauroth
On 21.11.2012 22:03, Chris Nauroth wrote:
> For creation of the distribution tarballs, the Maven
> Ant Plugin (and actually the underlying Ant tool) cannot preserve file
> permissions or symlinks.
The Maven assembly plugin can deal with file permissions. Not sure about
symlinks; I do not remember the dist tar having symlinks inside.

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Konstantin Boudnik-2
In reply to this post by Radim Kolar-2
On Wed, Nov 21, 2012 at 09:46PM, Radim Kolar wrote:
>
> >Why not do a maven-plugin to do that?
> Maven plugins are difficult to maintain. It's better to use inline
> scripts, with something like this:
>
> http://docs.codehaus.org/display/GMAVEN/Home;jsessionid=E29093B96230BBB4461F02A1718A6B71

Exactly my point, thank you!

Cos



Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Chris Nauroth
In reply to this post by Radim Kolar-2
Sorry, to clarify my point a little more, Ant does allow you to make
declarations to explicitly set the desired file permissions via the
fileMode attribute of a tarfileset.  However, it does not have the
capability to preserve whatever permissions were naturally created on files
earlier in the build process.  This is a difference in maintainability, as
adding new files to the build may then require extra maintenance of the Ant
directives to apply the desired fileMode.  This is an easy thing to
overlook.  A solution that preserves the natural permissions requires less
maintenance overhead.

I couldn't find a way to make assembly plugin preserve permissions like
this either.  It just has explicit fileMode directives similar to Ant.
 (Let me know if I missed something though.)

To see symlinks show up in distribution tarballs, you need to build with
the native components, like libhadoop.so or bundled Snappy.
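
As a point of comparison only, Python's standard tarfile module does record the permissions already present on disk and stores symlinks as symlinks. The sketch below shows that behaviour; the function names make_dist_tarball and list_modes and the directory layout used with them are hypothetical, not part of the Hadoop build.

```python
import os
import tarfile


def make_dist_tarball(src_dir, tar_path):
    """Archive src_dir, keeping each file's existing mode and any symlinks."""
    with tarfile.open(tar_path, "w:gz") as tar:
        # add() takes permission bits from the filesystem and, because
        # dereference defaults to False, stores symlinks as symlinks.
        tar.add(src_dir, arcname=os.path.basename(src_dir))


def list_modes(tar_path):
    """Return {member name: (permission bits, is_symlink)} for inspection."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return {m.name: (m.mode & 0o777, m.issym()) for m in tar.getmembers()}
```

No per-file fileMode declarations are needed here: a file added to the build with the executable bit set keeps that bit in the archive.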

Thanks,
--Chris


On Wed, Nov 21, 2012 at 1:30 PM, Radim Kolar <[hidden email]> wrote:

> On 21.11.2012 22:03, Chris Nauroth wrote:
>
>> For creation of the distribution tarballs, the Maven
>> Ant Plugin (and actually the underlying Ant tool) cannot preserve file
>> permissions or symlinks.
>>
> The Maven assembly plugin can deal with file permissions. Not sure about
> symlinks; I do not remember the dist tar having symlinks inside.
>

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Konstantin Boudnik-2
In reply to this post by Matt Foley
Ditto...

On Wed, Nov 21, 2012 at 01:14PM, Matt Foley wrote:

> Cos,
> Please see in-line.
>
> On Wed, Nov 21, 2012 at 12:00 PM, Konstantin Boudnik <[hidden email]> wrote:
>
> > I like Alejandro's idea about Maven for a few of reasons:
> >   - bringing in a scripting environment which is known for its
> >   inter-version idiosyncrasies just because Windows can't handle trivial
> >   shell scripting looks like an overkill to me
>
> Excuse me?  Can we at least try not to belittle other people's platforms on
> a public Apache forum?  There's nothing trivial about implementing shell on
> Windows, as cygwin regrettably proved.
Belittle? Hardly ;) Because we all know very well why shell is so awkward to
implement on any Windows system.

> >   - relative to above, there's a chance that Python's pre-requisites used
> >   in Hadoop might get into a conflict with some other components in the
> >   stack.  This will be a nightmare for the integrator projects i.e. Bigtop
>
> Said Bigtop project actually uses python, does it not?

It does, Matt. My main concern is that at some point Hadoop's Python might
suddenly be a different version than the one in Bigtop, and all hell will
break loose compatibility-wise. What would be the solution then?

> >   - Maven is de-facto standard for Java stacks
> >
>
> Sure -- except for when Ant was the de-facto standard for Java stacks.  And

Arguable. Yet beside the point.

> let's remember what maven and ant are/were the de-facto standard for:
>  Doing builds.  Not scripting everything that needs scripting.

Arguable as well, due to the very definition of a build system.

> >   - Maven has built-in scripting language (Groovy) if some plugins aren't
> >     sufficient for achieving whatever goals
>
> Are you proposing Groovy as a better scripting language than Python?

I am proposing that Groovy is a better language than Python, in part because
it goes far beyond scripting and doesn't have perpetual runtime
backward-compatibility issues. When was the last time the JDK had
backward-compatibility problems?

> > Addressing Matt's later point about non-Mavenized Hadoop-1 line: it uses
> > Maven stuff such as deploy/install via custom ant tasks. Same approach
> > would work for saveVersion.sh and others, I am sure.
>
> Current ant scripts in Hadoop seem to use maven only for artifact
> management via the maven repository.  If I'm missing something, please
> point it out.  The ant build task currently calls out to saveVersion.sh.
> Having it call out to maven, which then calls out to a plug-in and/or a
> Groovy script, doesn't sound like an improvement to me.  And it's a way
At least it is guaranteed to work everywhere. And all we need in this case is
an extra jar file that can be pulled down through the same ivy/maven
dependency mechanism.

In the case of Python you'd have to make sure that you have the right version
of the interpreter and runtime. And you would have to do it manually or have
an extra requirement expressed via a system maintenance DSL.

> different use of maven than currently in the Hadoop-1 line, not a
> continuation of established practice.

The main point of my argument, expressed in fewer than 100 words: adding
Python, which is inconsistent across Linux distros and has a history of
backward incompatibilities (2.6 vs 2.5, 3.0 vs earlier, etc.), doesn't seem
to be justified by the benefit of a somewhat easier build on Windows.

Perhaps we can do a more formal benefit analysis by just comparing the
number of Hadoop installations on MS Windows vs. Unix.

Cos


Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Andy Isaacson
On Wed, Nov 21, 2012 at 1:50 PM, Konstantin Boudnik <[hidden email]> wrote:
> The main point of my argument, expressed in fewer than 100 words: adding
> Python, which is inconsistent across Linux distros and has a history of
> backward incompatibilities (2.6 vs 2.5, 3.0 vs earlier, etc.), doesn't seem
> to be justified by the benefit of a somewhat easier build on Windows.

This seems mostly like a red herring to me. I'll grant that it's hard
to build a complete Python stack that's compatible between Python 2.x
and 2.y, but it's very easy by following best practices to keep python
*scripts* compatible across all reasonable Python 2.x versions.
Simply pick an oldest-supported-version like 2.4 and be reasonably
disciplined to not use newer constructs or libraries. I wouldn't wish
to try to build a complete system in such a limited dialect [1], but
for "we need a reasonable replacement for /bin/sh scripts" it's just
fine.
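
One way to enforce such a floor is a guard at the top of each script. This is only a sketch: the (2, 4) minimum mirrors the example version mentioned above, and the names OLDEST_SUPPORTED, is_supported and require_supported_python are made up for illustration. It deliberately sticks to constructs that behave the same across interpreter versions.

```python
import sys

# Assumed floor for illustration; the project would pick its own.
OLDEST_SUPPORTED = (2, 4)


def is_supported(version_info, oldest=OLDEST_SUPPORTED):
    """True if the (major, minor) of version_info is at least `oldest`."""
    return tuple(version_info[:2]) >= tuple(oldest)


def require_supported_python():
    """Exit early with a clear message instead of failing mid-script."""
    if not is_supported(sys.version_info):
        sys.stderr.write(
            "Python %d.%d or newer is required; found %d.%d\n"
            % (OLDEST_SUPPORTED + tuple(sys.version_info[:2]))
        )
        sys.exit(1)


if __name__ == "__main__":
    require_supported_python()
```

A guard like this only reports a too-old interpreter; staying within the common subset still depends on the discipline described above.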

Ignore Python 3 for the time being, it's a completely different
language with incompatible syntax and semantics that doesn't support
several currently-important platforms.  Maybe in a few years sane
people can consider moving to it, but for now it's best to just stick
with the compatible subset of Python 2.x.

[1] the Mercurial project has had a pretty good experience with this
scheme; http://mercurial.selenic.com/wiki/SupportedPythonVersions they
currently support 2.4 - 2.7 with a few required libraries. They
dropped 2.2 and 2.3 support a few years ago due to specific
shortcomings on those versions.

-andy

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Radim Kolar-2
In reply to this post by Chris Nauroth
On 21.11.2012 22:44, Chris Nauroth wrote:
> Sorry, to clarify my point a little more, Ant does allow you to make
> declarations to explicitly set the desired file permissions via the
> fileMode attribute of a tarfileset.
There are just 2 directories, /bin and /sbin, with executable files. It's
probably possible to set the file mode per directory in the Maven assembly
plugin.

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Radim Kolar-2
In reply to this post by Andy Isaacson
> Ignore Python 3 for the time being, it's a completely different
> language with incompatible syntax and semantics that doesn't support
> several currently-important platforms.  Maybe in a few years sane
> people can consider moving to it, but for now it's best to just stick
> with the compatible subset of Python 2.x.
>
> [1] the Mercurial project has had a pretty good experience with this
> scheme; http://mercurial.selenic.com/wiki/SupportedPythonVersions they
> currently support 2.4 - 2.7 with a few required libraries. They
> dropped 2.2 and 2.3 support a few years ago due to specific
> shortcomings on those versions.

I know that Python compatibility can be worked around. I used Python for a
few years and wrote about 70k LOC in it, until it started to irritate me
that every new version has incompatibilities (2.3 vs 2.4 vs 2.5), which
makes maintaining and testing way harder than it should be. It's not just
missing library functions; sometimes even an expression evaluated to a
different value under a new version. This was similar to the PHP 4 to PHP 5
migration. Today I have 3 versions of Python installed because of software
requirements.

For simple scripts it can probably work if you stick to some common subset.

Scripting via a Maven plugin has the advantage that the user does not need
to install anything, and there are a couple of languages available: Scala,
Groovy, Jelly, JRuby. Maybe Jython too.

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Chris Nauroth
In reply to this post by Radim Kolar-2
Unfortunately, there are a couple of spots where it gets really messy and
directory-wide rules fail to cover it.  The trickiest maintenance issue is
hadoop-hdfs-httpfs, where we unpack and repack a Tomcat.  Initially, I
tried to do this using only the ant plugin, but I wound up with a ton of
different tarfileset directives with different fileMode values to reapply
the same permissions that were present in the original Tomcat distribution.
 This also would have been a brittle solution, because changes in the
Tomcat package would risk invalidating our ant rules.  A solution that
preserves the original permissions reduces this kind of maintenance work.
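
For comparison, a repack step that carries the original metadata through can be sketched in a few lines of Python, since the tarfile module exposes each archive member's header directly. The function name repack and the extra_files hook are hypothetical; this is not what the httpfs build does today.

```python
import tarfile


def repack(src_tar, dst_tar, extra_files=()):
    """Copy every member of src_tar into dst_tar with metadata unchanged.

    extra_files is a hypothetical hook: (path_on_disk, name_in_archive)
    pairs to add on top of the original contents.
    """
    with tarfile.open(src_tar, "r:*") as src, \
            tarfile.open(dst_tar, "w:gz") as dst:
        for member in src:
            # Regular files need their data stream; directories, symlinks
            # and other entries are fully described by the member header.
            data = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, data)
        for path, arcname in extra_files:
            dst.add(path, arcname=arcname)
```

Because the member headers are copied as-is, a change of permissions in a newer upstream package flows through without touching any build rules.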

Thanks,
--Chris

On Wed, Nov 21, 2012 at 3:15 PM, Radim Kolar <[hidden email]> wrote:

> On 21.11.2012 22:44, Chris Nauroth wrote:
>
>> Sorry, to clarify my point a little more, Ant does allow you to make
>> declarations to explicitly set the desired file permissions via the
>> fileMode attribute of a tarfileset.
>>
> There are just 2 directories, /bin and /sbin, with executable files. It's
> probably possible to set the file mode per directory in the Maven assembly
> plugin.
>

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Konstantin Boudnik-2
In reply to this post by Radim Kolar-2
On Thu, Nov 22, 2012 at 12:58AM, Radim Kolar wrote:

> I know that Python compatibility can be worked around. I used Python
> for a few years and wrote about 70k LOC in it, until it started to
> irritate me that every new version has incompatibilities (2.3 vs 2.4
> vs 2.5), which makes maintaining and testing way harder than it
> should be. It's not just missing library functions; sometimes even an
> expression evaluated to a different value under a new version. This
> was similar to the PHP 4 to PHP 5 migration. Today I have 3 versions
> of Python installed because of software requirements.
>
> For simple scripts it can probably work if you stick to some common subset.
>
> Scripting via a Maven plugin has the advantage that the user does not
> need to install anything, and there are a couple of languages
> available: Scala, Groovy, Jelly, JRuby. Maybe Jython too.

Pretty much all of the j* in JSR 223 land is an abomination of one sort or
another, actually :)

Cos

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Radim Kolar-2
In reply to this post by Chris Nauroth
On 22.11.2012 1:14, Chris Nauroth wrote:
> The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack and repack a Tomcat.
Why is it not possible to just ship a WAR file? It seems to be a
special-purpose app, and such apps need manual security setup anyway, plus
integration with existing firewall/web infrastructure.

Did you consider using Jetty? It has really good Maven support:
http://wiki.eclipse.org/Jetty/Feature/Jetty_Maven_Plugin
I am using Jetty 8 instead of Tomcat and run it with java -jar start.jar;
no extra file permissions like the x bit are needed.

If you really need to create a tar by hand, there is a Java library for
doing it, http://code.google.com/p/jtar/, and it can be used from any
JVM-based scripting language, so you have plenty of choices.

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Radim Kolar-2
In reply to this post by Konstantin Boudnik-2

> Pretty much all of the j* in JSR 223 land is an abomination of one sort or
> another, actually :)
JRuby is good because you can run Rails applications on standard Java
infrastructure, which is way easier to maintain than obscure Ruby
application servers.

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Chris Nauroth
In reply to this post by Radim Kolar-2
This predates me, so I don't know the rationale for repackaging Tomcat
inside HTTPFS.  I suspect that there was a desire to create a fully
stand-alone distribution package, including a full web server.  The Maven
Jetty plugin isn't directly applicable to this use case.  I don't know why
it was decided to use Tomcat instead of Jetty.  (If anyone else out there
has the background, please respond.)  Regardless, if the desire is to
package a full web server instead of just the war, then switching to Jetty
would not change the challenges of the build process.  We'd still need to
preserve whatever permissions are present in the Jetty distribution.

In general, when I was working on this, I did not question whether the
current packaging was "correct".  I assumed that whatever changes I made
for Windows compatibility must yield the exact same distribution without
changes on currently supported platforms like Linux.  If there are
questions around actually changing the output of the build process, then
that will steer the conversation in another direction and increase the
scope of this effort.

It seems like the trickiest issue is preservation of permissions and
symlinks in tar files.  I suspect that any JVM-based solution like custom
Maven plugins, Groovy, or jtar would be limited in this respect.  According
to Ant documentation, it's a JDK limitation, so I suspect all of these
would have the same problem.  I haven't tried any of them though.  (If
there was a feasible solution, then Ant likely would have incorporated it
long ago.)  If anyone wants to try though, we might learn something from
that.

Thank you,
--Chris


On Wed, Nov 21, 2012 at 5:55 PM, Radim Kolar <[hidden email]> wrote:

> On 22.11.2012 1:14, Chris Nauroth wrote:
>
>> The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack
>> and repack a Tomcat.
>>
> Why is it not possible to just ship a WAR file? It seems to be a
> special-purpose app, and such apps need manual security setup anyway, plus
> integration with existing firewall/web infrastructure.
>
> Did you consider using Jetty? It has really good Maven support:
> http://wiki.eclipse.org/Jetty/Feature/Jetty_Maven_Plugin
> I am using Jetty 8 instead of Tomcat and run it with java -jar start.jar;
> no extra file permissions like the x bit are needed.
>
> If you really need to create a tar by hand, there is a Java library for
> doing it, http://code.google.com/p/jtar/, and it can be used from any
> JVM-based scripting language, so you have plenty of choices.
>