[jira] Created: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


[jira] Created: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs

Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs
----------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-1470
                 URL: https://issues.apache.org/jira/browse/HADOOP-1470
             Project: Hadoop
          Issue Type: Improvement
          Components: fs
    Affects Versions: 0.12.3
            Reporter: Hairong Kuang
            Assignee: Hairong Kuang
             Fix For: 0.14.0


Comment from Doug in HADOOP-1134:
I'd prefer it if the CRC code could be shared with CheckSumFileSystem. In particular, it seems to me that FSInputChecker and FSOutputSummer could be extended to support pluggable sources and sinks for checksums, respectively, and DFSDataInputStream and DFSDataOutputStream could use these. The advantages of this are: (a) a single implementation of checksum logic to debug and maintain; (b) it keeps checksumming as close as possible to data generation and use. This patch computes checksums after data has been buffered, and validates them before it is buffered. We sometimes use large buffers and would like to guard against in-memory errors. The current checksum code catches a lot of such errors. So we should compute checksums after minimal buffering (just bytesPerChecksum, ideally) and validate them at the last possible moment (e.g., through the use of a small final buffer with a larger buffer behind it). I do not think this will significantly affect performance, and data integrity is a high priority.
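A hedged sketch of the pluggable-sink idea on the write side (the names below are hypothetical, not the existing Hadoop API): a generic summer buffers only one bytesPerChecksum chunk, checksums it, and hands the chunk plus its CRC to an FS-specific sink, so ChecksumFileSystem and DFS could share the summing logic.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Hypothetical sketch: checksums are computed after minimal buffering
// (one bytesPerChecksum chunk), and storage is delegated to a pluggable sink.
abstract class OutputSummerSketch extends OutputStream {
    private final CRC32 sum = new CRC32();
    private final byte[] chunk;
    private int count;

    OutputSummerSketch(int bytesPerChecksum) {
        this.chunk = new byte[bytesPerChecksum];
    }

    // FS-specific sink: ChecksumFileSystem would append to a .crc file,
    // while DFS would ship the chunk and its CRC to the datanode.
    protected abstract void writeChunk(byte[] b, int len, long crc) throws IOException;

    @Override
    public void write(int b) throws IOException {
        chunk[count++] = (byte) b;
        if (count == chunk.length) {
            flushChunk();
        }
    }

    private void flushChunk() throws IOException {
        sum.update(chunk, 0, count);       // checksum the completed chunk
        writeChunk(chunk, count, sum.getValue());
        sum.reset();
        count = 0;
    }
}
{code}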


[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502126 ]

Hairong Kuang commented on HADOOP-1470:
---------------------------------------

I am moving Raghu and Doug's discussion here since it is more relevant to this issue:

Raghu Angadi - [06/Jun/07 02:23 PM]
Attaching an implementation of readBuffer() that handles the retries similarly to ChecksumFileSystem. I am planning to use this in my development. IMHO the complexity of this should be compared with what is required for HADOOP-1470 (both under fs and dfs).

Doug Cutting - [06/Jun/07 03:00 PM]
> complexity of this should be compared with what is required for HADOOP-1470
Or perhaps this can be used as a template for the generic version. The only DFS-specific bits are in the catch clause, and they could be factored into an abstract method, no?
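To make that suggestion concrete, here is a minimal sketch (hypothetical names, not the attached ReadBuffer.java): the retry loop stays generic, and the DFS-specific catch-clause logic collapses into one abstract hook.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.ChecksumException;

// Sketch: a generic retry-on-checksum-failure loop; only the recovery step
// is filesystem-specific, per the suggestion above.
abstract class InputCheckerSketch {
    // Reads and verifies one buffer; throws ChecksumException on mismatch.
    protected abstract int readChecked(byte[] buf, int off, int len) throws IOException;

    // FS-specific recovery: DFS would report the corrupt replica and seek to
    // another datanode; a local filesystem may have no alternative source.
    protected abstract boolean seekToNewSource() throws IOException;

    public int readBuffer(byte[] buf, int off, int len) throws IOException {
        while (true) {
            try {
                return readChecked(buf, off, len);
            } catch (ChecksumException ce) {
                if (!seekToNewSource()) {
                    throw ce;  // no source left to retry against
                }
            }
        }
    }
}
{code}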



[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502137 ]

Hairong Kuang commented on HADOOP-1470:
---------------------------------------

The file ReadBuffer.java that Raghu submitted to HADOOP-1134 shows only the ChecksumException handling code. What about the checksum verification and checksum generation parts of the code? I think they should also belong to the generic classes.

Let's suppose that the generic classes are Checker for reading and Summer for writing. Should these two classes contain two streams, one for data and one for checksums? Should these two streams be an abstraction of a block or of a file?
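To make the two-stream question concrete, a sketch of that shape (all names hypothetical; CRC32 and a 4-byte stored sum are assumptions):

{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical Checker built from two streams: one for data, one for sums.
// Whether each stream abstracts a block (DFS) or a file (ChecksumFS) is
// exactly the open question above.
class CheckerSketch {
    private final InputStream dataIn;      // raw data bytes
    private final DataInputStream sumsIn;  // one stored CRC per chunk

    CheckerSketch(InputStream dataIn, InputStream sumsIn) {
        this.dataIn = dataIn;
        this.sumsIn = new DataInputStream(sumsIn);
    }

    // Reads one chunk and verifies it against the stored checksum.
    int readChunk(byte[] buf, int bytesPerSum) throws IOException {
        int n = dataIn.read(buf, 0, bytesPerSum);
        if (n > 0) {
            CRC32 crc = new CRC32();
            crc.update(buf, 0, n);
            long stored = sumsIn.readInt() & 0xffffffffL;
            if (crc.getValue() != stored) {
                throw new IOException("checksum mismatch in chunk");
            }
        }
        return n;
    }
}
{code}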




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502140 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------


ReadBuffer.java's purpose was not to show how checksumming can be handled in general by a small function. IMO it was to show that handling DFS-internal checksum errors internally is not difficult. The source patch attached to HADOOP-1134 handles the rest of the checksum verification and creation that is internal to DFS.




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502145 ]

Hairong Kuang commented on HADOOP-1470:
---------------------------------------

In the block-level-crc dfs, do we allow different values of bytesPerSum for blocks in a file? Do we allow different block sizes for blocks in a file?

If data are checksummed at the FileSystem level, there is another complexity in the block-level-crc dfs. When the block size is not a multiple of bytesPerSum, we also need to output/verify a checksum at the end of each block. So it is harder to decide checksumChunk boundaries.
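A small worked example of that boundary complexity (illustrative numbers only): with bytesPerSum = 512 and blockSize = 1000, every block ends with a short 488-byte chunk, so chunk length depends on the position within the block rather than only on the file offset.

{code:java}
// Illustrative arithmetic: the next chunk ends at whichever comes first,
// the bytesPerSum boundary or the block boundary.
class ChunkBoundarySketch {
    static int chunkLength(long posInFile, long blockSize, int bytesPerSum) {
        long posInBlock = posInFile % blockSize;              // offset within block
        long toBlockEnd = blockSize - posInBlock;             // bytes left in block
        long toChunkEnd = bytesPerSum - (posInBlock % bytesPerSum);
        return (int) Math.min(toChunkEnd, toBlockEnd);
        // blockSize=1000, bytesPerSum=512: chunks of 512 and 488 in every block
    }
}
{code}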



[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502148 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

> In the block-level-crc dfs, do we allow different values of bytesPerSum for blocks in a file?
Yes, though there is no way to do this now. Also, we don't enforce that each block of a file have the same {{bps}}, so a theoretical yes.

> Do we allow different block sizes for blocks in a file?
No, just like the current DFS; i.e., this is not a block-level-crc-specific issue.

bq. If data are checksummed at the FileSystem level, there is another complexity in the block-level-crc dfs. When the block size is not a multiple of bytesPerSum, we also need to output/verify a checksum at the end of each block. So it is harder to decide checksumChunk boundaries.

Good point, yes. This will further influence the ChecksumFileSystem implementation. This interdependency between ChecksumFS and DFS could potentially increase further with this jira.





[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502160 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

From my understanding, this seems to have similarities with the discussion about ToolBase in HADOOP-1424, where ToolBase imposes a _specific structure_ on the implementation in order to provide a utility (in this case, checksum comparison and retries). Of course, ToolBase is a very simple case.




[jira] Issue Comment Edited: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502160 ]

Raghu Angadi edited comment on HADOOP-1470 at 6/6/07 5:57 PM:
--------------------------------------------------------------

From my understanding, this seems to have similarities with the discussion about ToolBase in HADOOP-1425, where ToolBase imposes a _specific structure_ on the implementation in order to provide a utility (in this case, checksum comparison and retries). Of course, ToolBase is a very simple case.

Edit: Changed Jira number.


 was:
From my understanding, this seems to have similarities with the discussion about ToolBase in HADOOP-1424, where ToolBase imposes a _specific structure_ on the implementation in order to provide a utility (in this case, checksum comparison and retries). Of course, ToolBase is a very simple case.




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502169 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

Maybe two streams for the InputChecker is not very generic. Maybe it could be a single stream, or not a stream at all but discrete chunks. I have a feeling that any generic method will end up requiring an extra memory copy of the data for at least some of the FS implementations.
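One way to read the discrete-chunks alternative, sketched with hypothetical names: instead of a pair of streams, the FS supplies one chunk of data and its stored checksum per call, and the generic checker only verifies.

{code:java}
import java.io.IOException;

// Hypothetical chunk-oriented source: no stream abstraction, and the FS is
// free to fetch each chunk however best avoids extra copies.
interface ChunkSourceSketch {
    // Fills 'data' with the chunk at 'chunkPos' and 'sums' with its stored
    // checksum bytes; returns the number of data bytes read, or -1 at end.
    int readChunk(long chunkPos, byte[] data, byte[] sums) throws IOException;
}
{code}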




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502215 ]

Doug Cutting commented on HADOOP-1470:
--------------------------------------

Raghu, I think it is worth trying to keep as much as is reasonable generic, and no more.  If it indeed makes things considerably more complex to share the code, as you fear, then we probably should not.  But we should first try, no?

I'm not convinced that your ReadBuffer.java really handles all of the cases that might arise; it's pseudocode that hasn't been tested yet. Things sometimes have a way of getting more complicated as they're used and debugged. It doesn't even show the checksumming, just the handling of one exception and the retry. It also doesn't support an important property that we desire, where checksums are verified as late as possible, so that we catch more memory errors, not just disk errors. So I don't really see how it illustrates the point I think you intend, that a non-generic version will be simpler.

I'm also not sure what your analogy is with HADOOP-1425. In that case some folks don't like subclassing, and prefer an interface and static methods. That's fine, and we could do that here if you think such a design would be cleaner. But no one is arguing there that we shouldn't share as much logic as possible; rather, that discussion was about how the logic is shared.

Finally, I think it's okay to throw an exception in the client when the configured blocksize is not a multiple of the configured bytesPerSum.  So, if we think it will considerably simplify implementation, I don't see a problem with adding this restriction.
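That restriction could be a one-line client-side check; a sketch (the configuration key named in the message is an assumption from memory):

{code:java}
import java.io.IOException;

// Sketch of the proposed restriction: reject a create() whose block size is
// not a multiple of bytesPerChecksum.
class ChecksumParamCheckSketch {
    static void checkChecksumParams(long blockSize, int bytesPerChecksum) throws IOException {
        if (bytesPerChecksum <= 0 || blockSize % bytesPerChecksum != 0) {
            throw new IOException("io.bytes.per.checksum (" + bytesPerChecksum
                + ") must be positive and evenly divide the block size (" + blockSize + ")");
        }
    }
}
{code}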





[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502388 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------


readBuffer.java is tested in my dev environment. Checksumming closest to the user, I think, can be solved independently. I agree that it is very important and will be solved for HADOOP-1134.

Some more considerations for InputChecker design:

If the purpose of this Jira is to provide a generic InputChecker, I wonder why two streams (which impose, for example, that every checksum block be of the same size) is very generic. Does it support different types of checksums (DFS already uses two types)? Are we confident this serves future filesystems well? Or is it OK to modify the InputChecker (and probably all the existing FSes) then?

When the InputChecker reads 4k (say, by some equivalent of readFully()), should DFS necessarily read 256 MB of block data in the case where bytesPerChecksum is 64K? Of course not. This is just one example of the various things that DFS support for this InputChecker needs to handle.
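To illustrate that point with a hedged sketch (hypothetical helpers, not DFS code): serving a small read only requires fetching and verifying the chunk range that encloses it, never the rest of the block.

{code:java}
// Hypothetical alignment helpers: a 4k read with bytesPerChecksum = 64K needs
// only the one enclosing 64K chunk fetched and verified.
class ChunkAlignSketch {
    static long enclosingChunkStart(long pos, int bytesPerChecksum) {
        return pos - (pos % bytesPerChecksum);
    }

    static long enclosingChunkEnd(long pos, int len, int bytesPerChecksum) {
        long lastByte = pos + len - 1;
        return enclosingChunkStart(lastByte, bytesPerChecksum) + bytesPerChecksum;
    }
}
{code}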

I have probably thought more than I should about this :). Back to HADOOP-1134.




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502446 ]

Hairong Kuang commented on HADOOP-1470:
---------------------------------------

> Finally, I think it's okay to throw an exception in the client when the configured blocksize is not a multiple of the configured bytesPerSum. So, if we think it will considerably simplify implementation, I don't see a problem with adding this restriction.

Yes, I agree that this restriction would greatly simplify this issue. For backward compatibility, could we enforce this during the current dfs -> block-level-crc dfs upgrade?



[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502452 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

This does not fix existing data. I have written quite a bit of code in the Block-Level CRCs Upgrade to handle this case well.

It looks to me like we might be subconsciously trying to fit the InputChecker to match what FSInputChecker already does in ChecksumFS. But I was under the impression that we want to have a generic InputChecker.





[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502477 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

> For backward compatibility, could we enforce this during the current dfs -> block-level-crc dfs upgrade?
Hmm... changing the block size during an upgrade. That will surely not be a picnic.



[jira] Updated: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


     [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1470:
----------------------------------

    Attachment: genericChecksum.patch

This is an initial patch for review. It assumes that the block size is a multiple of bytesPerSum, and it contains checksum generation/verification and checksum error handling. Do we need any other functionality? I still need to work on ChecksumFileSystem so that it does not assume that it is a wrapper around a raw fs.




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502492 ]

Doug Cutting commented on HADOOP-1470:
--------------------------------------

> changing the block size during an upgrade. That will surely not be a picnic.

I don't think that should be required. Are there actually any large data sets in HDFS whose blockSize is not a multiple of their bytesPerChecksum? I'd be surprised if many, if any, have configured things this way. If we do encounter this somewhere, then we can either (a) discard the checksums or (b) recompute the checksums (first validating with the old ones). I think this is an edge case that we need not invest too much effort in.



[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502499 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

Doug, I personally like the condition that the block size should be a multiple of {{bpc}}. I even filed HADOOP-1259 about it; I guess the discussion there is still relevant.

But my point is that this generic InputChecker forces a condition that does not make sense for either of its users (ChecksumFS and DFS). This is just one example of the things it imposes. The current uploaded patch even takes 'blockSize' for the output summer, though it is not yet used.

Do you think the attached patch is generic?




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502508 ]

Doug Cutting commented on HADOOP-1470:
--------------------------------------

> Do you think the attached patch is generic?

Nothing is generic until it's used in more than one place. The above isn't complete (it needs, e.g., 'read()' and 'seek()' implementations), but it looks like a good start. But the real question is for you: could you use the above? Can you present the data and checksums as input streams that support read(byte[], int, int) and seek(long)?

It also assumes a particular checksum implementation, CRC32. If we wish to allow for others, that aspect could be generalized by adding a codec-like interface for checksummers. But I think we should probably skip that for this iteration.
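A sketch of what such a codec-like interface might look like (hypothetical; java.util.zip.Checksum already provides the update/getValue half):

{code:java}
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical checksummer codec: CRC32 becomes one pluggable choice rather
// than a baked-in assumption.
interface ChecksumCodecSketch {
    Checksum newChecksum();  // a fresh checksummer per chunk
    int checksumSize();      // bytes stored per chunk, e.g. 4 for CRC32
}

class Crc32Codec implements ChecksumCodecSketch {
    public Checksum newChecksum() { return new CRC32(); }
    public int checksumSize() { return 4; }
}
{code}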




[jira] Commented: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs


    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502511 ]

Raghu Angadi commented on HADOOP-1470:
--------------------------------------

> could you use the above?
_Can_ I? Yeah, sure; my argument has never been that it _cannot_ be done. We can even rechecksum to hide a mismatch between {{bpc}} and {{blockSize}}.

Though I don't agree with 'it is good enough until it is extremely difficult to use', whether that is implied here or not. I surely don't think it is an improvement over FSInputChecker, which, I think, was not explicitly designed to be a general-purpose checker.

I sure hope this is not a blocker for HADOOP-1134.

