[jira] Created: (HADOOP-1491) After successful distcp, couple of checksum error files


Tim Allison (Jira)
After successful distcp, couple of checksum error files
-------------------------------------------------------

                 Key: HADOOP-1491
                 URL: https://issues.apache.org/jira/browse/HADOOP-1491
             Project: Hadoop
          Issue Type: Bug
          Components: util
    Affects Versions: 0.12.3
            Reporter: Koji Noguchi


Tried copying 700,000 files with distcp, using 8 mappers per node and a single dfs.client.buffer.dir.
Distcp ran as a MapReduce job on 25 nodes.

A couple of tasks failed, but the job was successful.

When checked, 12 files were corrupted (checksum error).

This is repeatable.

I'll add more information as we find it.





--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504914 ]

Raghu Angadi commented on HADOOP-1491:
--------------------------------------

My impression from looking at one case from Koji's investigation:

Two files are involved: A and B. On the source side of distcp both are fine. On the destination side A (A_dest) is fine, but B_dest is corrupted. .B_dest.crc is the same as .B_src.crc, but B_dest has the same content as A_src. Both A and B are small and have only one block. It looks like, while writing B_dest, it somehow wrote the block corresponding to A.

One possible bug that can result in this situation is HADOOP-1396. If both A_dest and B_dest were created around the same time, then it is an even more likely culprit (we can check this from the creation times of the blocks).
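
To illustrate why this combination shows up as a checksum error, here is a minimal sketch. It is not Hadoop's checksum code: the real .crc files hold per-chunk checksums in their own format, whereas this uses one java.util.zip.CRC32 over the whole content as a simplification. The class name, method names and file contents are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class CrcMismatchSketch {
    // Simplification: one CRC32 over the whole content, not Hadoop's per-chunk format.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // A read "verifies" when the recomputed checksum equals the stored one.
    static boolean verifies(byte[] data, long storedCrc) {
        return crcOf(data) == storedCrc;
    }

    public static void main(String[] args) {
        // Made-up contents standing in for the two small source files.
        byte[] aSrc = "contents of file A".getBytes(StandardCharsets.UTF_8);
        byte[] bSrc = "contents of file B".getBytes(StandardCharsets.UTF_8);

        // The checksum stored for B_dest was computed from B's data ...
        long bDestStoredCrc = crcOf(bSrc);
        // ... but the block written for B_dest holds A's bytes.
        byte[] bDestData = aSrc.clone();

        System.out.println("A_dest verifies: " + verifies(aSrc, crcOf(aSrc)));
        System.out.println("B_dest verifies: " + verifies(bDestData, bDestStoredCrc));
    }
}
```

The point of the sketch is only that a correct checksum paired with the wrong data fails verification on read, which matches the symptom described above.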


[jira] Commented: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505336 ]

Koji Noguchi commented on HADOOP-1491:
--------------------------------------

To confirm Dhruba and Raghu's analysis,
I inserted one debug print statement inside DFSClient.newBackupFile to print out "result" and "src".

On one node, distcp started two mappers at (almost) the same time.
They were definitely clashing on the temporary file names.
Attaching the two userlogs.
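
As a rough illustration of how two writers sharing one buffer directory can end up with the same temporary file name, here is a minimal sketch. It does not reproduce DFSClient.newBackupFile, whose naming logic is not shown in this thread. The timestamp-based scheme, the class name and the method names are hypothetical. The second method uses the standard File.createTempFile, which creates a new, uniquely named file in the given directory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class BackupFileNameSketch {
    // Hypothetical scheme: the name depends only on a timestamp, so two
    // writers that start in the same millisecond pick the same file.
    static File timestampNamed(File bufferDir, long millis) {
        return new File(bufferDir, "client-" + millis);
    }

    // Alternative: let the JDK create a new, uniquely named file in the directory.
    static File uniquelyNamed(File bufferDir) throws IOException {
        return File.createTempFile("client-", ".tmp", bufferDir);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a single shared dfs.client.buffer.dir.
        File bufferDir = Files.createTempDirectory("buffer-dir").toFile();

        long now = System.currentTimeMillis();
        File first = timestampNamed(bufferDir, now);
        File second = timestampNamed(bufferDir, now);
        System.out.println("timestamp names clash: " + first.equals(second));

        File a = uniquelyNamed(bufferDir);
        File b = uniquelyNamed(bufferDir);
        System.out.println("createTempFile names clash: " + a.equals(b));

        // Clean up the files created by the sketch.
        a.delete();
        b.delete();
        bufferDir.delete();
    }
}
```

If two mappers write their backup data to the same local file, one of them can end up uploading the other's bytes, which would be consistent with the clash seen in the userlogs.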


I picked files involved in the clash and fetched them with dfs -get from the source and target clusters. ls -l showed:

-rw-r--r--  1 knoguchi users 133142 Jun 15 10:46 part-270-source
-rw-r--r--  1 knoguchi users 133848 Jun 15 10:47 part-270-target
-rw-r--r--  1 knoguchi users 133848 Jun 15 10:48 part-277-source
-rw-r--r--  1 knoguchi users 133848 Jun 15 10:47 part-277-target

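After the copy, the part-270 file was corrupted: its target copy is 133848 bytes, the size of part-277, rather than the 133142 bytes of its own source.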
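
The size comparison in the listing above can be expressed as a small check. This is only a sketch with the four byte counts from the ls -l output hard-coded, and the class and method names are made up. Equal sizes do not prove a file is intact, so this is a cheap first filter rather than a replacement for checksum verification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SizeMismatchSketch {
    // Returns the names whose source and target byte counts differ.
    // Each value is {sourceBytes, targetBytes}.
    static List<String> mismatched(Map<String, long[]> sizes) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, long[]> e : sizes.entrySet()) {
            long[] pair = e.getValue();
            if (pair[0] != pair[1]) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Byte counts copied from the ls -l listing.
        Map<String, long[]> sizes = new LinkedHashMap<>();
        sizes.put("part-270", new long[] {133142L, 133848L});
        sizes.put("part-277", new long[] {133848L, 133848L});

        System.out.println(mismatched(sizes)); // prints [part-270]
    }
}
```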




[jira] Updated: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated HADOOP-1491:
---------------------------------

    Attachment: mapper1.txt

[jira] Issue Comment Edited: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505336 ]

Koji Noguchi edited comment on HADOOP-1491 at 6/15/07 11:21 AM:
----------------------------------------------------------------

To confirm Dhruba and Raghu's analysis,
I inserted one debug print statement inside DFSClient.newBackupFile to print out "result" and "src".

On one node, distcp started two mappers at (almost) the same time.
They were definitely clashing on the temporary file names.
Attaching the two userlogs.


I picked files involved in the clash and fetched them with dfs -get from the source and target clusters. ls -l showed:

-rw-r--r--  1 knoguchi users 133142 Jun 15 10:46 part-270-source
-rw-r--r--  1 knoguchi users 133848 Jun 15 10:47 part-270-target
-rw-r--r--  1 knoguchi users 133848 Jun 15 10:48 part-277-source
-rw-r--r--  1 knoguchi users 133848 Jun 15 10:47 part-277-target

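After the copy, the part-270 file was corrupted: its target copy is 133848 bytes, the size of part-277, rather than the 133142 bytes of its own source.
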
[jira] Updated: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated HADOOP-1491:
---------------------------------

    Attachment: mapper2.txt

[jira] Resolved: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HADOOP-1491.
--------------------------------------

    Resolution: Duplicate

Duplicate of HADOOP-1396.

[jira] Assigned: (HADOOP-1491) After successful distcp, couple of checksum error files

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur reassigned HADOOP-1491:
----------------------------------------

    Assignee: dhruba borthakur
