[jira] Created: (HADOOP-759) TextInputFormat should allow different treatment on carriage return char '\r'

Kenneth William Krugler (Jira)
TextInputFormat should allow different treatment on carriage return char '\r'
-----------------------------------------------------------------------------

                 Key: HADOOP-759
                 URL: http://issues.apache.org/jira/browse/HADOOP-759
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Runping Qi



The current implementation treats both '\r' and '\n' as line breaks. However, in some cases it is desirable to use '\n' strictly as the sole line break and to treat '\r' as part of the data in a line.

One way to do this is to make the readline function a member function, so that a user can create a subclass and override it with the desired behavior.
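
A minimal sketch of that idea, for illustration only. The class names OverridableLineReader and LfOnlyLineReader and the isLineBreak hook are hypothetical and are not taken from the Hadoop source; the sketch only shows how an instance-level readLine lets a subclass change what counts as a line break.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader (not Hadoop's class): readLine is an instance method
// that asks an overridable hook whether a byte ends the current line.
class OverridableLineReader {
    // Default policy mirrors the behavior described in the issue:
    // both CR and LF end a line.
    protected boolean isLineBreak(int b) {
        return b == '\n' || b == '\r';
    }

    // Returns the next line without its terminator, or null at end of input.
    public String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && !isLineBreak(b)) {
            buf.write(b);
            b = in.read();
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Convenience helper for trying the reader on an in-memory string.
    public List<String> readAll(String text) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        List<String> lines = new ArrayList<>();
        try {
            for (String line = readLine(in); line != null; line = readLine(in)) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}

// Subclass with the behavior requested in the issue: only LF ends a line,
// so CR stays in the data.
class LfOnlyLineReader extends OverridableLineReader {
    @Override
    protected boolean isLineBreak(int b) {
        return b == '\n';
    }
}
```

With this shape, new OverridableLineReader().readAll("a\rb\nc\n") splits at the CR, while new LfOnlyLineReader().readAll(...) keeps "a\rb" together as one line.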



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Commented: (HADOOP-759) TextInputFormat should allow different treatment on carriage return char '\r'

Kenneth William Krugler (Jira)
    [ http://issues.apache.org/jira/browse/HADOOP-759?page=comments#action_12454544 ]
           
eric baldeschwieler commented on HADOOP-759:
--------------------------------------------

Is there a situation where we want to treat CR-LF ("\r\n", right?) as two line breaks?  If we can afford the extra processing, perhaps we should just check for this case when we see a CR in get line?  In the common case of only "\n" this will not cost us anything, and we'll get CR-LF right for PC files.  I don't think there is a case we will get wrong, and we'll only incur extra processing for CR-only files, which I expect are rather rare, since Apple abandoned this convention with OS X and I'm not aware of any current system that uses it.
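
To make the suggestion concrete, here is a rough sketch of the CR check, assuming a java.io.PushbackInputStream so that one byte can be pushed back after peeking. The class name CrLfAwareReader is hypothetical and this is not the Hadoop implementation. On a CR, the reader looks at the next byte and swallows it only if it is an LF, so "\r\n" counts as a single line break while a lone "\r" or "\n" still ends a line.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not Hadoop code) that treats LF, CR and CR-LF
// each as exactly one line break.
class CrLfAwareReader {
    // Returns the next line without its terminator, or null at end of input.
    static String readLine(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && b != '\n' && b != '\r') {
            buf.write(b);
            b = in.read();
        }
        if (b == '\r') {
            // Extra work happens only after a CR: peek at the following byte.
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next); // not an LF, so it belongs to the next line
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Convenience helper for trying the reader on an in-memory string.
    static List<String> readAll(String text) {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        try {
            for (String line = readLine(in); line != null; line = readLine(in)) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Note that this approach still breaks a line at a bare CR, so it does not keep a CR inside a record as data.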

Just getting this right seems simpler than adding extra methods and complexity to the interface.

Thoughts?



       

[jira] Commented: (HADOOP-759) TextInputFormat should allow different treatment on carriage return char '\r'

Kenneth William Krugler (Jira)
    [ http://issues.apache.org/jira/browse/HADOOP-759?page=comments#action_12454665 ]
           
Runping Qi commented on HADOOP-759:
-----------------------------------

The case at hand is a bit different. We have a file consisting of a sequence of records separated by LF ('\n'):
REC1\nREC2\n...

It is possible that some records contain '\r'.
Thus, it is wrong to interpret '\r' as a line break.
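
A small illustration of the problem, using only the standard library rather than Hadoop code. java.io.BufferedReader.readLine also ends a line at a CR, so it shows the same effect on a record that contains '\r'. The class name RecordSplitDemo and the sample data are made up for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: compares CR-or-LF splitting with LF-only splitting.
class RecordSplitDemo {
    // BufferedReader.readLine ends a line at LF, CR, or CR followed by LF.
    static List<String> splitOnCrOrLf(String data) {
        List<String> out = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new StringReader(data));
        try {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                out.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // LF is the only record separator; any CR stays inside the record.
    static List<String> splitOnLfOnly(String data) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length(); i++) {
            if (data.charAt(i) == '\n') {
                out.add(data.substring(start, i));
                start = i + 1;
            }
        }
        if (start < data.length()) {
            out.add(data.substring(start)); // final record without a trailing LF
        }
        return out;
    }
}
```

For the input "REC\r1\nREC2\n", splitOnCrOrLf yields three lines ("REC", "1", "REC2") and so cuts the first record in two, while splitOnLfOnly yields the two intended records.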


