Change proposal for FileInputFormat isSplitable

classic Classic list List threaded Threaded
25 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Change proposal for FileInputFormat isSplitable

Niels Basjes
Hi,

Last week I ran into this problem again
https://issues.apache.org/jira/browse/MAPREDUCE-2094

What happens here is that the default implementation of the isSplitable
method in FileInputFormat is so unsafe that just about everyone who
implements a new subclass is likely to get this wrong. The effect of
getting this wrong is that all unit tests succeed and running it against
'large' input files (>>64MiB) that are compressed using a non-splittable
compression (often Gzip) will cause the input to be fed into the mappers
multiple time (i.e. you get garbage results without ever seeing any
errors).

Last few days I was at Berlin buzzwords talking to someone about this bug
and this resulted in the following proposal which I would like your
feedback on.

1) This is a change that will break backwards compatibility (deliberate
choice).
2) The FileInputFormat will get 3 methods (the old isSplitable with the
typo of one 't' in the name will disappear):
    (protected) isSplittableContainer --> true unless compressed with
non-splittable compression.
    (protected) isSplittableContent --> abstract, MUST be implemented by
the subclass
    (public)      isSplittable --> isSplittableContainer &&
isSplittableContent

The idea is that only the isSplittable is used by other classes to know if
this is a splittable file.
The effect I hope to get is that a developer writing their own
fileinputformat (which I alone have done twice so far) is 'forced' and
'helped' getting this right.

The reason for me to propose this as an incompatible change is that this
way I hope to eradicate some of the existing bugs in custom implementations
'out there'.

P.S. If you agree to this change then I'm willing to put my back into it
and submit a patch.

--
Best regards,

Niels Basjes
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Steve Loughran-3
On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:

> Hi,
>
> Last week I ran into this problem again
> https://issues.apache.org/jira/browse/MAPREDUCE-2094
>
> What happens here is that the default implementation of the isSplitable
> method in FileInputFormat is so unsafe that just about everyone who
> implements a new subclass is likely to get this wrong. The effect of
> getting this wrong is that all unit tests succeed and running it against
> 'large' input files (>>64MiB) that are compressed using a non-splittable
> compression (often Gzip) will cause the input to be fed into the mappers
> multiple time (i.e. you get garbage results without ever seeing any
> errors).
>
> Last few days I was at Berlin buzzwords talking to someone about this bug
>

that was me, I recall.


> and this resulted in the following proposal which I would like your
> feedback on.
>
> 1) This is a change that will break backwards compatibility (deliberate
> choice).
> 2) The FileInputFormat will get 3 methods (the old isSplitable with the
> typo of one 't' in the name will disappear):
>     (protected) isSplittableContainer --> true unless compressed with
> non-splittable compression.
>     (protected) isSplittableContent --> abstract, MUST be implemented by
> the subclass
>     (public)      isSplittable --> isSplittableContainer &&
> isSplittableContent
>
> The idea is that only the isSplittable is used by other classes to know if
> this is a splittable file.
> The effect I hope to get is that a developer writing their own
> fileinputformat (which I alone have done twice so far) is 'forced' and
> 'helped' getting this right.
>

I could see making the attributes more explicit would be good -but stopping
everything that exists from working isn't going to fly.

what about some subclass, AbstractSplittableFileInputFormat that implements
the container properly, requires that content one -and then calculates
IsSplitable() from the results? Existing code: no change, new formats can
descend from this (and built in ones retrofitted).



> The reason for me to propose this as an incompatible change is that this
> way I hope to eradicate some of the existing bugs in custom implementations
> 'out there'.
>
> P.S. If you agree to this change then I'm willing to put my back into it
> and submit a patch.
>
> --
> Best regards,
>
> Niels Basjes
>

--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Matt Fellows
I could be missing something, but couldn't you just deprecate isSplitable (spelled incorrectly) and create a new isSplittable as described?


On Thu, May 29, 2014 at 10:34 AM, Steve Loughran <[hidden email]> wrote:
On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:

> Hi,
>
> Last week I ran into this problem again
> https://issues.apache.org/jira/browse/MAPREDUCE-2094
>
> What happens here is that the default implementation of the isSplitable
> method in FileInputFormat is so unsafe that just about everyone who
> implements a new subclass is likely to get this wrong. The effect of
> getting this wrong is that all unit tests succeed and running it against
> 'large' input files (>>64MiB) that are compressed using a non-splittable
> compression (often Gzip) will cause the input to be fed into the mappers
> multiple time (i.e. you get garbage results without ever seeing any
> errors).
>
> Last few days I was at Berlin buzzwords talking to someone about this bug
>

that was me, I recall.


> and this resulted in the following proposal which I would like your
> feedback on.
>
> 1) This is a change that will break backwards compatibility (deliberate
> choice).
> 2) The FileInputFormat will get 3 methods (the old isSplitable with the
> typo of one 't' in the name will disappear):
>     (protected) isSplittableContainer --> true unless compressed with
> non-splittable compression.
>     (protected) isSplittableContent --> abstract, MUST be implemented by
> the subclass
>     (public)      isSplittable --> isSplittableContainer &&
> isSplittableContent
>
> The idea is that only the isSplittable is used by other classes to know if
> this is a splittable file.
> The effect I hope to get is that a developer writing their own
> fileinputformat (which I alone have done twice so far) is 'forced' and
> 'helped' getting this right.
>

I could see making the attributes more explicit would be good -but stopping
everything that exists from working isn't going to fly.

what about some subclass, AbstractSplittableFileInputFormat that implements
the container properly, requires that content one -and then calculates
IsSplitable() from the results? Existing code: no change, new formats can
descend from this (and built in ones retrofitted).



> The reason for me to propose this as an incompatible change is that this
> way I hope to eradicate some of the existing bugs in custom implementations
> 'out there'.
>
> P.S. If you agree to this change then I'm willing to put my back into it
> and submit a patch.
>
> --
> Best regards,
>
> Niels Basjes
>

--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.



--

First Option Software Ltd
Signal House
Jacklyns Lane
Alresford
SO24 9JJ

____________________________________________________

This is confidential, non-binding and not company endorsed - see full terms at www.fosolutions.co.uk/emailpolicy.html 
First Option Software Ltd Registered No. 06340261
Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
____________________________________________________

Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
In reply to this post by Steve Loughran-3
My original proposal (from about 3 years ago) was to change the isSplitable
method to return a safe default ( you can see that in the patch that is
still attached to that Jira issue).
For arguments I still do not fully understand this was rejected by Todd and
Doug.

So that is why my new proposal is to deprecate (remove!) the old method
with the typo in Hadoop 3.0 and replace it with something correct and less
error prone.
Given the fact that this would happen in a major version jump I thought
that would be the right time to do that.

Niels


On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <[hidden email]>wrote:

> On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:
>
> > Hi,
> >
> > Last week I ran into this problem again
> > https://issues.apache.org/jira/browse/MAPREDUCE-2094
> >
> > What happens here is that the default implementation of the isSplitable
> > method in FileInputFormat is so unsafe that just about everyone who
> > implements a new subclass is likely to get this wrong. The effect of
> > getting this wrong is that all unit tests succeed and running it against
> > 'large' input files (>>64MiB) that are compressed using a non-splittable
> > compression (often Gzip) will cause the input to be fed into the mappers
> > multiple time (i.e. you get garbage results without ever seeing any
> > errors).
> >
> > Last few days I was at Berlin buzzwords talking to someone about this bug
> >
>
> that was me, I recall.
>
>
> > and this resulted in the following proposal which I would like your
> > feedback on.
> >
> > 1) This is a change that will break backwards compatibility (deliberate
> > choice).
> > 2) The FileInputFormat will get 3 methods (the old isSplitable with the
> > typo of one 't' in the name will disappear):
> >     (protected) isSplittableContainer --> true unless compressed with
> > non-splittable compression.
> >     (protected) isSplittableContent --> abstract, MUST be implemented by
> > the subclass
> >     (public)      isSplittable --> isSplittableContainer &&
> > isSplittableContent
> >
> > The idea is that only the isSplittable is used by other classes to know
> if
> > this is a splittable file.
> > The effect I hope to get is that a developer writing their own
> > fileinputformat (which I alone have done twice so far) is 'forced' and
> > 'helped' getting this right.
> >
>
> I could see making the attributes more explicit would be good -but stopping
> everything that exists from working isn't going to fly.
>
> what about some subclass, AbstractSplittableFileInputFormat that implements
> the container properly, requires that content one -and then calculates
> IsSplitable() from the results? Existing code: no change, new formats can
> descend from this (and built in ones retrofitted).
>
>
>
> > The reason for me to propose this as an incompatible change is that this
> > way I hope to eradicate some of the existing bugs in custom
> implementations
> > 'out there'.
> >
> > P.S. If you agree to this change then I'm willing to put my back into it
> > and submit a patch.
> >
> > --
> > Best regards,
> >
> > Niels Basjes
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



--
Best regards / Met vriendelijke groeten,

Niels Basjes
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Matt Fellows
As someone who doesn't really contribute, just lurks, I could well be misinformed or under-informed, but I don't see why we can't deprecate a method which could cause dangerous side effects?  
People can still use the deprecated methods for backwards compatibility, but are discouraged by compiler warnings, and any changes they write to their code can start to use the new functionality?

*Apologies if I'm stepping into a Hadoop holy war here


On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <[hidden email]> wrote:
My original proposal (from about 3 years ago) was to change the isSplitable
method to return a safe default ( you can see that in the patch that is
still attached to that Jira issue).
For arguments I still do not fully understand this was rejected by Todd and
Doug.

So that is why my new proposal is to deprecate (remove!) the old method
with the typo in Hadoop 3.0 and replace it with something correct and less
error prone.
Given the fact that this would happen in a major version jump I thought
that would be the right time to do that.

Niels


On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <[hidden email]>wrote:

> On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:
>
> > Hi,
> >
> > Last week I ran into this problem again
> > https://issues.apache.org/jira/browse/MAPREDUCE-2094
> >
> > What happens here is that the default implementation of the isSplitable
> > method in FileInputFormat is so unsafe that just about everyone who
> > implements a new subclass is likely to get this wrong. The effect of
> > getting this wrong is that all unit tests succeed and running it against
> > 'large' input files (>>64MiB) that are compressed using a non-splittable
> > compression (often Gzip) will cause the input to be fed into the mappers
> > multiple time (i.e. you get garbage results without ever seeing any
> > errors).
> >
> > Last few days I was at Berlin buzzwords talking to someone about this bug
> >
>
> that was me, I recall.
>
>
> > and this resulted in the following proposal which I would like your
> > feedback on.
> >
> > 1) This is a change that will break backwards compatibility (deliberate
> > choice).
> > 2) The FileInputFormat will get 3 methods (the old isSplitable with the
> > typo of one 't' in the name will disappear):
> >     (protected) isSplittableContainer --> true unless compressed with
> > non-splittable compression.
> >     (protected) isSplittableContent --> abstract, MUST be implemented by
> > the subclass
> >     (public)      isSplittable --> isSplittableContainer &&
> > isSplittableContent
> >
> > The idea is that only the isSplittable is used by other classes to know
> if
> > this is a splittable file.
> > The effect I hope to get is that a developer writing their own
> > fileinputformat (which I alone have done twice so far) is 'forced' and
> > 'helped' getting this right.
> >
>
> I could see making the attributes more explicit would be good -but stopping
> everything that exists from working isn't going to fly.
>
> what about some subclass, AbstractSplittableFileInputFormat that implements
> the container properly, requires that content one -and then calculates
> IsSplitable() from the results? Existing code: no change, new formats can
> descend from this (and built in ones retrofitted).
>
>
>
> > The reason for me to propose this as an incompatible change is that this
> > way I hope to eradicate some of the existing bugs in custom
> implementations
> > 'out there'.
> >
> > P.S. If you agree to this change then I'm willing to put my back into it
> > and submit a patch.
> >
> > --
> > Best regards,
> >
> > Niels Basjes
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



--
Best regards / Met vriendelijke groeten,

Niels Basjes



--

First Option Software Ltd
Signal House
Jacklyns Lane
Alresford
SO24 9JJ

____________________________________________________

This is confidential, non-binding and not company endorsed - see full terms at www.fosolutions.co.uk/emailpolicy.html 
First Option Software Ltd Registered No. 06340261
Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
____________________________________________________

Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

jay vyas
I think breaking backwards compat is sensible since It's easily caught by the compiler and  in this case where the alternative is a
Runtime error that can result in terabytes of mucked up output.

> On May 29, 2014, at 6:11 AM, Matt Fellows <[hidden email]> wrote:
>
> As someone who doesn't really contribute, just lurks, I could well be misinformed or under-informed, but I don't see why we can't deprecate a method which could cause dangerous side effects?  
> People can still use the deprecated methods for backwards compatibility, but are discouraged by compiler warnings, and any changes they write to their code can start to use the new functionality?
>
> *Apologies if I'm stepping into a Hadoop holy war here
>
>
>> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <[hidden email]> wrote:
>> My original proposal (from about 3 years ago) was to change the isSplitable
>> method to return a safe default ( you can see that in the patch that is
>> still attached to that Jira issue).
>> For arguments I still do not fully understand this was rejected by Todd and
>> Doug.
>>
>> So that is why my new proposal is to deprecate (remove!) the old method
>> with the typo in Hadoop 3.0 and replace it with something correct and less
>> error prone.
>> Given the fact that this would happen in a major version jump I thought
>> that would be the right time to do that.
>>
>> Niels
>>
>>
>> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <[hidden email]>wrote:
>>
>> > On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:
>> >
>> > > Hi,
>> > >
>> > > Last week I ran into this problem again
>> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094
>> > >
>> > > What happens here is that the default implementation of the isSplitable
>> > > method in FileInputFormat is so unsafe that just about everyone who
>> > > implements a new subclass is likely to get this wrong. The effect of
>> > > getting this wrong is that all unit tests succeed and running it against
>> > > 'large' input files (>>64MiB) that are compressed using a non-splittable
>> > > compression (often Gzip) will cause the input to be fed into the mappers
>> > > multiple time (i.e. you get garbage results without ever seeing any
>> > > errors).
>> > >
>> > > Last few days I was at Berlin buzzwords talking to someone about this bug
>> > >
>> >
>> > that was me, I recall.
>> >
>> >
>> > > and this resulted in the following proposal which I would like your
>> > > feedback on.
>> > >
>> > > 1) This is a change that will break backwards compatibility (deliberate
>> > > choice).
>> > > 2) The FileInputFormat will get 3 methods (the old isSplitable with the
>> > > typo of one 't' in the name will disappear):
>> > >     (protected) isSplittableContainer --> true unless compressed with
>> > > non-splittable compression.
>> > >     (protected) isSplittableContent --> abstract, MUST be implemented by
>> > > the subclass
>> > >     (public)      isSplittable --> isSplittableContainer &&
>> > > isSplittableContent
>> > >
>> > > The idea is that only the isSplittable is used by other classes to know
>> > if
>> > > this is a splittable file.
>> > > The effect I hope to get is that a developer writing their own
>> > > fileinputformat (which I alone have done twice so far) is 'forced' and
>> > > 'helped' getting this right.
>> > >
>> >
>> > I could see making the attributes more explicit would be good -but stopping
>> > everything that exists from working isn't going to fly.
>> >
>> > what about some subclass, AbstractSplittableFileInputFormat that implements
>> > the container properly, requires that content one -and then calculates
>> > IsSplitable() from the results? Existing code: no change, new formats can
>> > descend from this (and built in ones retrofitted).
>> >
>> >
>> >
>> > > The reason for me to propose this as an incompatible change is that this
>> > > way I hope to eradicate some of the existing bugs in custom
>> > implementations
>> > > 'out there'.
>> > >
>> > > P.S. If you agree to this change then I'm willing to put my back into it
>> > > and submit a patch.
>> > >
>> > > --
>> > > Best regards,
>> > >
>> > > Niels Basjes
>> > >
>> >
>> > --
>> > CONFIDENTIALITY NOTICE
>> > NOTICE: This message is intended for the use of the individual or entity to
>> > which it is addressed and may contain information that is confidential,
>> > privileged and exempt from disclosure under applicable law. If the reader
>> > of this message is not the intended recipient, you are hereby notified that
>> > any printing, copying, dissemination, distribution, disclosure or
>> > forwarding of this communication is strictly prohibited. If you have
>> > received this communication in error, please contact the sender immediately
>> > and delete it from your system. Thank You.
>> >
>>
>>
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>
>
>
> --
>
> First Option Software Ltd
> Signal House
> Jacklyns Lane
> Alresford
> SO24 9JJ
> Tel: +44 (0)1962 738232
> Mob: +44 (0)7710 160458
> Fax: +44 (0)1962 600112
> Web: www.bespokesoftware.com
>
> ____________________________________________________
>
> This is confidential, non-binding and not company endorsed - see full terms at www.fosolutions.co.uk/emailpolicy.html
> First Option Software Ltd Registered No. 06340261
> Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
> ____________________________________________________
>
jay vyas
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
This is exactly why I'm proposing a change that will either 'fix silently'
(my original patch from 3 years ago) or 'break loudly' (my current
proposal) old implementations.
I'm convinced that ther are atleast 100 companies world wide that have a
custom implementation with this bug and have no clue they have been basing
descision upon silently corrupted data.


On Thu, May 29, 2014 at 1:21 PM, Jay Vyas <[hidden email]> wrote:

> I think breaking backwards compat is sensible since It's easily caught by
> the compiler and  in this case where the alternative is a
> Runtime error that can result in terabytes of mucked up output.
>
> > On May 29, 2014, at 6:11 AM, Matt Fellows <
> [hidden email]> wrote:
> >
> > As someone who doesn't really contribute, just lurks, I could well be
> misinformed or under-informed, but I don't see why we can't deprecate a
> method which could cause dangerous side effects?
> > People can still use the deprecated methods for backwards compatibility,
> but are discouraged by compiler warnings, and any changes they write to
> their code can start to use the new functionality?
> >
> > *Apologies if I'm stepping into a Hadoop holy war here
> >
> >
> >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <[hidden email]> wrote:
> >> My original proposal (from about 3 years ago) was to change the
> isSplitable
> >> method to return a safe default ( you can see that in the patch that is
> >> still attached to that Jira issue).
> >> For arguments I still do not fully understand this was rejected by Todd
> and
> >> Doug.
> >>
> >> So that is why my new proposal is to deprecate (remove!) the old method
> >> with the typo in Hadoop 3.0 and replace it with something correct and
> less
> >> error prone.
> >> Given the fact that this would happen in a major version jump I thought
> >> that would be the right time to do that.
> >>
> >> Niels
> >>
> >>
> >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <
> [hidden email]>wrote:
> >>
> >> > On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Last week I ran into this problem again
> >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094
> >> > >
> >> > > What happens here is that the default implementation of the
> isSplitable
> >> > > method in FileInputFormat is so unsafe that just about everyone who
> >> > > implements a new subclass is likely to get this wrong. The effect of
> >> > > getting this wrong is that all unit tests succeed and running it
> against
> >> > > 'large' input files (>>64MiB) that are compressed using a
> non-splittable
> >> > > compression (often Gzip) will cause the input to be fed into the
> mappers
> >> > > multiple time (i.e. you get garbage results without ever seeing any
> >> > > errors).
> >> > >
> >> > > Last few days I was at Berlin buzzwords talking to someone about
> this bug
> >> > >
> >> >
> >> > that was me, I recall.
> >> >
> >> >
> >> > > and this resulted in the following proposal which I would like your
> >> > > feedback on.
> >> > >
> >> > > 1) This is a change that will break backwards compatibility
> (deliberate
> >> > > choice).
> >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable with
> the
> >> > > typo of one 't' in the name will disappear):
> >> > >     (protected) isSplittableContainer --> true unless compressed
> with
> >> > > non-splittable compression.
> >> > >     (protected) isSplittableContent --> abstract, MUST be
> implemented by
> >> > > the subclass
> >> > >     (public)      isSplittable --> isSplittableContainer &&
> >> > > isSplittableContent
> >> > >
> >> > > The idea is that only the isSplittable is used by other classes to
> know
> >> > if
> >> > > this is a splittable file.
> >> > > The effect I hope to get is that a developer writing their own
> >> > > fileinputformat (which I alone have done twice so far) is 'forced'
> and
> >> > > 'helped' getting this right.
> >> > >
> >> >
> >> > I could see making the attributes more explicit would be good -but
> stopping
> >> > everything that exists from working isn't going to fly.
> >> >
> >> > what about some subclass, AbstractSplittableFileInputFormat that
> implements
> >> > the container properly, requires that content one -and then calculates
> >> > IsSplitable() from the results? Existing code: no change, new formats
> can
> >> > descend from this (and built in ones retrofitted).
> >> >
> >> >
> >> >
> >> > > The reason for me to propose this as an incompatible change is that
> this
> >> > > way I hope to eradicate some of the existing bugs in custom
> >> > implementations
> >> > > 'out there'.
> >> > >
> >> > > P.S. If you agree to this change then I'm willing to put my back
> into it
> >> > > and submit a patch.
> >> > >
> >> > > --
> >> > > Best regards,
> >> > >
> >> > > Niels Basjes
> >> > >
> >> >
> >> > --
> >> > CONFIDENTIALITY NOTICE
> >> > NOTICE: This message is intended for the use of the individual or
> entity to
> >> > which it is addressed and may contain information that is
> confidential,
> >> > privileged and exempt from disclosure under applicable law. If the
> reader
> >> > of this message is not the intended recipient, you are hereby
> notified that
> >> > any printing, copying, dissemination, distribution, disclosure or
> >> > forwarding of this communication is strictly prohibited. If you have
> >> > received this communication in error, please contact the sender
> immediately
> >> > and delete it from your system. Thank You.
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards / Met vriendelijke groeten,
> >>
> >> Niels Basjes
> >
> >
> >
> > --
> >
> > First Option Software Ltd
> > Signal House
> > Jacklyns Lane
> > Alresford
> > SO24 9JJ
> > Tel: +44 (0)1962 738232
> > Mob: +44 (0)7710 160458
> > Fax: +44 (0)1962 600112
> > Web: www.bespokesoftware.com
> >
> > ____________________________________________________
> >
> > This is confidential, non-binding and not company endorsed - see full
> terms at www.fosolutions.co.uk/emailpolicy.html
> > First Option Software Ltd Registered No. 06340261
> > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
> > ____________________________________________________
> >
>



--
Best regards / Met vriendelijke groeten,

Niels Basjes
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
I forgot to ask a relevant question: What made the original proposed
solution "incompatible"?
To me it still seems to be a clean backward compatible solution that fixes
this issue in a simple way.

Perhaps Todd can explain why?

Niels
On May 29, 2014 2:17 PM, "Niels Basjes" <[hidden email]> wrote:

> This is exactly why I'm proposing a change that will either 'fix silently'
> (my original patch from 3 years ago) or 'break loudly' (my current
> proposal) old implementations.
> I'm convinced that ther are atleast 100 companies world wide that have a
> custom implementation with this bug and have no clue they have been basing
> descision upon silently corrupted data.
>
>
> On Thu, May 29, 2014 at 1:21 PM, Jay Vyas <[hidden email]> wrote:
>
>> I think breaking backwards compat is sensible since It's easily caught by
>> the compiler and  in this case where the alternative is a
>> Runtime error that can result in terabytes of mucked up output.
>>
>> > On May 29, 2014, at 6:11 AM, Matt Fellows <
>> [hidden email]> wrote:
>> >
>> > As someone who doesn't really contribute, just lurks, I could well be
>> misinformed or under-informed, but I don't see why we can't deprecate a
>> method which could cause dangerous side effects?
>> > People can still use the deprecated methods for backwards
>> compatibility, but are discouraged by compiler warnings, and any changes
>> they write to their code can start to use the new functionality?
>> >
>> > *Apologies if I'm stepping into a Hadoop holy war here
>> >
>> >
>> >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <[hidden email]>
>> wrote:
>> >> My original proposal (from about 3 years ago) was to change the
>> isSplitable
>> >> method to return a safe default ( you can see that in the patch that is
>> >> still attached to that Jira issue).
>> >> For arguments I still do not fully understand this was rejected by
>> Todd and
>> >> Doug.
>> >>
>> >> So that is why my new proposal is to deprecate (remove!) the old method
>> >> with the typo in Hadoop 3.0 and replace it with something correct and
>> less
>> >> error prone.
>> >> Given the fact that this would happen in a major version jump I thought
>> >> that would be the right time to do that.
>> >>
>> >> Niels
>> >>
>> >>
>> >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <
>> [hidden email]>wrote:
>> >>
>> >> > On 28 May 2014 20:50, Niels Basjes <[hidden email]> wrote:
>> >> >
>> >> > > Hi,
>> >> > >
>> >> > > Last week I ran into this problem again
>> >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094
>> >> > >
>> >> > > What happens here is that the default implementation of the
>> isSplitable
>> >> > > method in FileInputFormat is so unsafe that just about everyone who
>> >> > > implements a new subclass is likely to get this wrong. The effect
>> of
>> >> > > getting this wrong is that all unit tests succeed and running it
>> against
>> >> > > 'large' input files (>>64MiB) that are compressed using a
>> non-splittable
>> >> > > compression (often Gzip) will cause the input to be fed into the
>> mappers
>> >> > > multiple time (i.e. you get garbage results without ever seeing any
>> >> > > errors).
>> >> > >
>> >> > > Last few days I was at Berlin buzzwords talking to someone about
>> this bug
>> >> > >
>> >> >
>> >> > that was me, I recall.
>> >> >
>> >> >
>> >> > > and this resulted in the following proposal which I would like your
>> >> > > feedback on.
>> >> > >
>> >> > > 1) This is a change that will break backwards compatibility
>> (deliberate
>> >> > > choice).
>> >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable
>> with the
>> >> > > typo of one 't' in the name will disappear):
>> >> > >     (protected) isSplittableContainer --> true unless compressed
>> with
>> >> > > non-splittable compression.
>> >> > >     (protected) isSplittableContent --> abstract, MUST be
>> implemented by
>> >> > > the subclass
>> >> > >     (public)      isSplittable --> isSplittableContainer &&
>> >> > > isSplittableContent
>> >> > >
>> >> > > The idea is that only the isSplittable is used by other classes to
>> know
>> >> > if
>> >> > > this is a splittable file.
>> >> > > The effect I hope to get is that a developer writing their own
>> >> > > fileinputformat (which I alone have done twice so far) is 'forced'
>> and
>> >> > > 'helped' getting this right.
>> >> > >
>> >> >
>> >> > I could see making the attributes more explicit would be good -but
>> stopping
>> >> > everything that exists from working isn't going to fly.
>> >> >
>> >> > what about some subclass, AbstractSplittableFileInputFormat that
>> implements
>> >> > the container properly, requires that content one -and then
>> calculates
>> >> > IsSplitable() from the results? Existing code: no change, new
>> formats can
>> >> > descend from this (and built in ones retrofitted).
>> >> >
>> >> >
>> >> >
>> >> > > The reason for me to propose this as an incompatible change is
>> that this
>> >> > > way I hope to eradicate some of the existing bugs in custom
>> >> > implementations
>> >> > > 'out there'.
>> >> > >
>> >> > > P.S. If you agree to this change then I'm willing to put my back
>> into it
>> >> > > and submit a patch.
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > >
>> >> > > Niels Basjes
>> >> > >
>> >> >
>> >> > --
>> >> > CONFIDENTIALITY NOTICE
>> >> > NOTICE: This message is intended for the use of the individual or
>> entity to
>> >> > which it is addressed and may contain information that is
>> confidential,
>> >> > privileged and exempt from disclosure under applicable law. If the
>> reader
>> >> > of this message is not the intended recipient, you are hereby
>> notified that
>> >> > any printing, copying, dissemination, distribution, disclosure or
>> >> > forwarding of this communication is strictly prohibited. If you have
>> >> > received this communication in error, please contact the sender
>> immediately
>> >> > and delete it from your system. Thank You.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Best regards / Met vriendelijke groeten,
>> >>
>> >> Niels Basjes
>> >
>> >
>> >
>> > --
>> >
>> > First Option Software Ltd
>> > Signal House
>> > Jacklyns Lane
>> > Alresford
>> > SO24 9JJ
>> > Tel: +44 (0)1962 738232
>> > Mob: +44 (0)7710 160458
>> > Fax: +44 (0)1962 600112
>> > Web: www.bespokesoftware.com
>> >
>> > ____________________________________________________
>> >
>> > This is confidential, non-binding and not company endorsed - see full
>> terms at www.fosolutions.co.uk/emailpolicy.html
>> > First Option Software Ltd Registered No. 06340261
>> > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
>> > ____________________________________________________
>> >
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Doug Cutting
In reply to this post by Niels Basjes
On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]> wrote:
> For arguments I still do not fully understand this was rejected by Todd and
> Doug.

Performance is a part of compatibility.

Doug
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
Hi,

The way I see the effects of the original patch on existing subclasses:
- implemented isSplitable
   --> no performance difference.
- did not implement isSplitable
   --> then there is no performance difference if the container is either
not compressed or uses a splittable compression.
   --> If it uses a common non splittable compression (like gzip) then the
output will suddenly be different (which is the correct answer) and the
jobs will finish sooner because the input is not processed multiple times.

Where do you see a performance impact?

Niels
On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:

> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]> wrote:
> > For arguments I still do not fully understand this was rejected by Todd
> and
> > Doug.
>
> Performance is a part of compatibility.
>
> Doug
>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Tim-2
Remove

On Fri, May 30, 2014 at 3:03 PM, Niels Basjes <[hidden email]> wrote:

> Hi,
> The way I see the effects of the original patch on existing subclasses:
> - implemented isSplitable
>    --> no performance difference.
> - did not implement isSplitable
>    --> then there is no performance difference if the container is either
> not compressed or uses a splittable compression.
>    --> If it uses a common non splittable compression (like gzip) then the
> output will suddenly be different (which is the correct answer) and the
> jobs will finish sooner because the input is not processed multiple times.
> Where do you see a performance impact?
> Niels
> On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:
>> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]> wrote:
>> > For arguments I still do not fully understand this was rejected by Todd
>> and
>> > Doug.
>>
>> Performance is a part of compatibility.
>>
>> Doug
>>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Doug Cutting
In reply to this post by Niels Basjes
I was trying to explain my comment, where I stated that, "changing the
default implementation to return false would be an incompatible
change".  The patch was added 6 months after that comment, so the
comment didn't address the patch.

The patch does not appear to change the default implementation to
return false unless the suffix of the file name is that of a known
unsplittable compression format.  So the folks who'd be harmed by this
are those who used a suffix like ".gz" for an Avro, Parquet or
other-format file.  Their applications might suddenly run much slower
and it would be difficult for them to determine why.  Such folks are
probably few, but perhaps exist.  I'd prefer a change that avoided
that possibility entirely.

Doug

On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <[hidden email]> wrote:

> Hi,
>
> The way I see the effects of the original patch on existing subclasses:
> - implemented isSplitable
>    --> no performance difference.
> - did not implement isSplitable
>    --> then there is no performance difference if the container is either
> not compressed or uses a splittable compression.
>    --> If it uses a common non splittable compression (like gzip) then the
> output will suddenly be different (which is the correct answer) and the
> jobs will finish sooner because the input is not processed multiple times.
>
> Where do you see a performance impact?
>
> Niels
> On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:
>
>> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]> wrote:
>> > For arguments I still do not fully understand this was rejected by Todd
>> and
>> > Doug.
>>
>> Performance is a part of compatibility.
>>
>> Doug
>>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
Ok, got it.

If someone has an Avro file (foo.avro) and gzips that ( foo.avro.gz) then
the frame work will select the GzipCodec which is not capable of splitting
and which will cause the problem. So by gzipping a splittable file it
becomes non splittable.

At my workplace we have applied gzip to avro but then the compression
applies to the blocks inside the avro file. So that are multiple gzipped
blocks inside an avro container which is a splittable file without any
changes.

How would someone create the situation you are referring to?
On May 31, 2014 1:06 AM, "Doug Cutting" <[hidden email]> wrote:

> I was trying to explain my comment, where I stated that, "changing the
> default implementation to return false would be an incompatible
> change".  The patch was added 6 months after that comment, so the
> comment didn't address the patch.
>
> The patch does not appear to change the default implementation to
> return false unless the suffix of the file name is that of a known
> unsplittable compression format.  So the folks who'd be harmed by this
> are those who used a suffix like ".gz" for an Avro, Parquet or
> other-format file.  Their applications might suddenly run much slower
> and it would be difficult for them to determine why.  Such folks are
> probably few, but perhaps exist.  I'd prefer a change that avoided
> that possibility entirely.
>
> Doug
>
> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <[hidden email]> wrote:
> > Hi,
> >
> > The way I see the effects of the original patch on existing subclasses:
> > - implemented isSplitable
> >    --> no performance difference.
> > - did not implement isSplitable
> >    --> then there is no performance difference if the container is either
> > not compressed or uses a splittable compression.
> >    --> If it uses a common non splittable compression (like gzip) then
> the
> > output will suddenly be different (which is the correct answer) and the
> > jobs will finish sooner because the input is not processed multiple
> times.
> >
> > Where do you see a performance impact?
> >
> > Niels
> > On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:
> >
> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]> wrote:
> >> > For arguments I still do not fully understand this was rejected by
> Todd
> >> and
> >> > Doug.
> >>
> >> Performance is a part of compatibility.
> >>
> >> Doug
> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Chris Douglas
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <[hidden email]> wrote:
> How would someone create the situation you are referring to?

By adopting a naming convention where the filename suffix doesn't
imply that the raw data are compressed with that codec.

For example, if a user named SequenceFiles foo.lzo and foo.gz to
record which codec was used, then isSplittable would spuriously return
false. -C

> On May 31, 2014 1:06 AM, "Doug Cutting" <[hidden email]> wrote:
>
>> I was trying to explain my comment, where I stated that, "changing the
>> default implementation to return false would be an incompatible
>> change".  The patch was added 6 months after that comment, so the
>> comment didn't address the patch.
>>
>> The patch does not appear to change the default implementation to
>> return false unless the suffix of the file name is that of a known
>> unsplittable compression format.  So the folks who'd be harmed by this
>> are those who used a suffix like ".gz" for an Avro, Parquet or
>> other-format file.  Their applications might suddenly run much slower
>> and it would be difficult for them to determine why.  Such folks are
>> probably few, but perhaps exist.  I'd prefer a change that avoided
>> that possibility entirely.
>>
>> Doug
>>
>> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <[hidden email]> wrote:
>> > Hi,
>> >
>> > The way I see the effects of the original patch on existing subclasses:
>> > - implemented isSplitable
>> >    --> no performance difference.
>> > - did not implement isSplitable
>> >    --> then there is no performance difference if the container is either
>> > not compressed or uses a splittable compression.
>> >    --> If it uses a common non splittable compression (like gzip) then
>> the
>> > output will suddenly be different (which is the correct answer) and the
>> > jobs will finish sooner because the input is not processed multiple
>> times.
>> >
>> > Where do you see a performance impact?
>> >
>> > Niels
>> > On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:
>> >
>> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]> wrote:
>> >> > For arguments I still do not fully understand this was rejected by
>> Todd
>> >> and
>> >> > Doug.
>> >>
>> >> Performance is a part of compatibility.
>> >>
>> >> Doug
>> >>
>>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
The Hadoop framework uses the filename extension  to automatically insert
the "right" decompression codec in the read pipeline.
So if someone does what you describe then they would need to unload all
compression codecs or face decompression errors. And if it really was
gzipped then it would not be splittable at all.

Niels
On May 31, 2014 11:12 PM, "Chris Douglas" <[hidden email]> wrote:

> On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <[hidden email]> wrote:
> > How would someone create the situation you are referring to?
>
> By adopting a naming convention where the filename suffix doesn't
> imply that the raw data are compressed with that codec.
>
> For example, if a user named SequenceFiles foo.lzo and foo.gz to
> record which codec was used, then isSplittable would spuriously return
> false. -C
>
> > On May 31, 2014 1:06 AM, "Doug Cutting" <[hidden email]> wrote:
> >
> >> I was trying to explain my comment, where I stated that, "changing the
> >> default implementation to return false would be an incompatible
> >> change".  The patch was added 6 months after that comment, so the
> >> comment didn't address the patch.
> >>
> >> The patch does not appear to change the default implementation to
> >> return false unless the suffix of the file name is that of a known
> >> unsplittable compression format.  So the folks who'd be harmed by this
> >> are those who used a suffix like ".gz" for an Avro, Parquet or
> >> other-format file.  Their applications might suddenly run much slower
> >> and it would be difficult for them to determine why.  Such folks are
> >> probably few, but perhaps exist.  I'd prefer a change that avoided
> >> that possibility entirely.
> >>
> >> Doug
> >>
> >> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <[hidden email]> wrote:
> >> > Hi,
> >> >
> >> > The way I see the effects of the original patch on existing
> subclasses:
> >> > - implemented isSplitable
> >> >    --> no performance difference.
> >> > - did not implement isSplitable
> >> >    --> then there is no performance difference if the container is
> either
> >> > not compressed or uses a splittable compression.
> >> >    --> If it uses a common non splittable compression (like gzip) then
> >> the
> >> > output will suddenly be different (which is the correct answer) and
> the
> >> > jobs will finish sooner because the input is not processed multiple
> >> times.
> >> >
> >> > Where do you see a performance impact?
> >> >
> >> > Niels
> >> > On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:
> >> >
> >> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]>
> wrote:
> >> >> > For arguments I still do not fully understand this was rejected by
> >> Todd
> >> >> and
> >> >> > Doug.
> >> >>
> >> >> Performance is a part of compatibility.
> >> >>
> >> >> Doug
> >> >>
> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Chris Douglas
On Sat, May 31, 2014 at 10:53 PM, Niels Basjes <[hidden email]> wrote:
> The Hadoop framework uses the filename extension  to automatically insert
> the "right" decompression codec in the read pipeline.

This would be the new behavior, incompatible with existing code.

> So if someone does what you describe then they would need to unload all
> compression codecs or face decompression errors. And if it really was
> gzipped then it would not be splittable at all.

Assume an InputFormat configured for a job assumes that isSplitable
returns true because it extends FileInputFormat. After the change, it
could spuriously return false based on the suffix of the input files.
In the prenominate example, SequenceFile is splittable, even if the
codec used in each block is not. -C

> Niels
> On May 31, 2014 11:12 PM, "Chris Douglas" <[hidden email]> wrote:
>
>> On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <[hidden email]> wrote:
>> > How would someone create the situation you are referring to?
>>
>> By adopting a naming convention where the filename suffix doesn't
>> imply that the raw data are compressed with that codec.
>>
>> For example, if a user named SequenceFiles foo.lzo and foo.gz to
>> record which codec was used, then isSplittable would spuriously return
>> false. -C
>>
>> > On May 31, 2014 1:06 AM, "Doug Cutting" <[hidden email]> wrote:
>> >
>> >> I was trying to explain my comment, where I stated that, "changing the
>> >> default implementation to return false would be an incompatible
>> >> change".  The patch was added 6 months after that comment, so the
>> >> comment didn't address the patch.
>> >>
>> >> The patch does not appear to change the default implementation to
>> >> return false unless the suffix of the file name is that of a known
>> >> unsplittable compression format.  So the folks who'd be harmed by this
>> >> are those who used a suffix like ".gz" for an Avro, Parquet or
>> >> other-format file.  Their applications might suddenly run much slower
>> >> and it would be difficult for them to determine why.  Such folks are
>> >> probably few, but perhaps exist.  I'd prefer a change that avoided
>> >> that possibility entirely.
>> >>
>> >> Doug
>> >>
>> >> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <[hidden email]> wrote:
>> >> > Hi,
>> >> >
>> >> > The way I see the effects of the original patch on existing
>> subclasses:
>> >> > - implemented isSplitable
>> >> >    --> no performance difference.
>> >> > - did not implement isSplitable
>> >> >    --> then there is no performance difference if the container is
>> either
>> >> > not compressed or uses a splittable compression.
>> >> >    --> If it uses a common non splittable compression (like gzip) then
>> >> the
>> >> > output will suddenly be different (which is the correct answer) and
>> the
>> >> > jobs will finish sooner because the input is not processed multiple
>> >> times.
>> >> >
>> >> > Where do you see a performance impact?
>> >> >
>> >> > Niels
>> >> > On May 30, 2014 8:06 PM, "Doug Cutting" <[hidden email]> wrote:
>> >> >
>> >> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <[hidden email]>
>> wrote:
>> >> >> > For arguments I still do not fully understand this was rejected by
>> >> Todd
>> >> >> and
>> >> >> > Doug.
>> >> >>
>> >> >> Performance is a part of compatibility.
>> >> >>
>> >> >> Doug
>> >> >>
>> >>
>>
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Doug Cutting
In reply to this post by Niels Basjes
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <[hidden email]> wrote:
> How would someone create the situation you are referring to?

By renaming files.  I believe I can easily write a test using public
APIs that will succeed without this patch and will fail with this
patch.  Given the number of Hadoop-based applications now in the
world, the likelihood that some perform such file renamings is
non-negligible.

Doug
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
In reply to this post by Chris Douglas
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas <[hidden email]> wrote:

> On Sat, May 31, 2014 at 10:53 PM, Niels Basjes <[hidden email]> wrote:
> > The Hadoop framework uses the filename extension  to automatically insert
> > the "right" decompression codec in the read pipeline.
>
> This would be the new behavior, incompatible with existing code.
>

You are right, I was wrong. It is the LineRecordReader that inserts it.

Looking at this code and where it is used I noticed that the bug I'm trying
to prevent is present in the current trunk.
The NLineInputFormat does not override the isSplitable and used the
LineRecordReader that is capable of reading gzipped input. Overall effect
is that this inputformat silently produces garbage (missing lines +
duplicated lines) when when ran against a gzipped file. I just verified
this.

> So if someone does what you describe then they would need to unload all
> compression codecs or face decompression errors. And if it really was
> gzipped then it would not be splittable at all.

Assume an InputFormat configured for a job assumes that isSplitable
> returns true because it extends FileInputFormat. After the change, it
> could spuriously return false based on the suffix of the input files.
> In the prenominate example, SequenceFile is splittable, even if the
> codec used in each block is not. -C
>

and if you then give the file the .gz extension this breaks all common
sense / conventions about file names.


Let's reiterate the options I see now:
1) isSplitable --> return true
    Too unsafe, I say "must change". I alone hit my head twice so far on
this, many others have too, event the current trunk still has this bug in
there.

2) isSplitable --> return false
    Safe but too slow in some cases. In those cases the actual
implementation can simply override it very easily and regain their original
performance.

3) isSplitable --> true (same as the current implementation) unless you use
a file extension that is associated with a non splittable compression codec
(i.e.  .gz or something like that).
    If a custom format want to break with well known conventions about
filenames then they should simply override the isSplitable with their own.

4) isSplitable --> abstract
    Compatibility breaker. I see this as the cleanest way to force the
developer of the custom fileinputformat to think about their specific case.

I hold "correct data" much higher than performance and scalability; so the
performance impact is a concern but it is much less important than the list
of bugs we are facing right now.

The safest way would be either 2 or 4. Solution 3 would effectively be the
same as the current implementation, yet it would catch the problem
situations as long as people stick to normal file name conventions.
Solution 3 would also allow removing some code duplication in several
subclasses.

I would go for solution 3.

Niels Basjes
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Chris Douglas
On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <[hidden email]> wrote:
> and if you then give the file the .gz extension this breaks all common
> sense / conventions about file names.

That the suffix for all compression codecs in every context- and all
future codecs- should determine whether a file can be split is not an
assumption we can make safely. Again, that's not an assumption that
held when people built their current systems, and they would be justly
annoyed with the project for changing it.

> I hold "correct data" much higher than performance and scalability; so the
> performance impact is a concern but it is much less important than the list
> of bugs we are facing right now.

These are not bugs. NLineInputFormat doesn't support compressed input,
and why would it? -C

> The safest way would be either 2 or 4. Solution 3 would effectively be the
> same as the current implementation, yet it would catch the problem
> situations as long as people stick to normal file name conventions.
> Solution 3 would also allow removing some code duplication in several
> subclasses.
>
> I would go for solution 3.
>
> Niels Basjes
Reply | Threaded
Open this post in threaded view
|

Re: Change proposal for FileInputFormat isSplitable

Niels Basjes
On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas <[hidden email]> wrote:

> On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <[hidden email]> wrote:
> > and if you then give the file the .gz extension this breaks all common
> > sense / conventions about file names.
>


> That the suffix for all compression codecs in every context- and all
> future codecs- should determine whether a file can be split is not an
> assumption we can make safely. Again, that's not an assumption that
> held when people built their current systems, and they would be justly
> annoyed with the project for changing it.


That's not what I meant. What I understood from what was described is that
sometimes people use an existing file extension (like .gz) for a file that
is not a gzipped file.
If a file is splittable or not depends greatly on the actual codec
implementation that is used to read it. Using the default GzipCodec a .gz
file is not splittable, but that can be changed with a different
implementation like for example this
https://github.com/nielsbasjes/splittablegzip
So given a file extension the file 'must' be a file that is the format that
is described by the file name extension.

The flow is roughly as follows
- What is the file extension
- Get the codec class registered to that extension
- Is this a splittable codec ? (Does this class implement the
splittablecodec interface)

> I hold "correct data" much higher than performance and scalability; so the
> > performance impact is a concern but it is much less important than the
> list
> > of bugs we are facing right now.
>
> These are not bugs. NLineInputFormat doesn't support compressed input,
> and why would it? -C
>

I'm not saying it should (in fact, for this one I agree that it shouldn't).
The reality is that it accepts the file, decompresses it and then produces
output that 'looks good' but really is garbage.

I consider "silently producing garbage" one of the worst kinds of problem
to tackle.
Because many custom file based input formats have stumbled (getting
"silently produced garbage") over the current isSplitable implementation I
really want to avoid any more of this in the future.
That is why I want to change the implementations in this area of Hadoop in
such a way that this "silently producing garbage" effect is taken out.

So the question remains: What is the way this should be changed?
I'm willing to build it and submit a patch.




> > The safest way would be either 2 or 4. Solution 3 would effectively be
> the
> > same as the current implementation, yet it would catch the problem
> > situations as long as people stick to normal file name conventions.
> > Solution 3 would also allow removing some code duplication in several
> > subclasses.
> >
> > I would go for solution 3.
> >
> > Niels Basjes
>



--
Best regards / Met vriendelijke groeten,

Niels Basjes
12