Created: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)
Tool to recover partial fetcher output
--------------------------------------

                 Key: NUTCH-451
                 URL: https://issues.apache.org/jira/browse/NUTCH-451
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki
         Assigned To: Andrzej Bialecki
             Fix For: 0.9.0
         Attachments: LocalFetchRecover.java

This class may help you to recover partial data from a failed Fetcher run.

NOTE 1: this works ONLY if you ran Fetcher on the "local" file system, i.e. you didn't use DFS. Partial output to DFS is permanently lost if a process fails to properly close the output streams.

NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial SequenceFile-s will be corrupted at the end. This means that it won't be possible to recover all data from them - most likely only the data up to the last sync marker can be recovered.
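
To illustrate why only the readable prefix of a truncated file can be recovered, here is a minimal sketch. It assumes a simplified format of length-prefixed records, which is NOT the real SequenceFile layout (no header, no sync markers), and the class name TruncatedRecordReader is hypothetical. The idea is the same, though: read complete records until the damaged tail is reached, then stop and keep what was read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in: length-prefixed records, NOT the real SequenceFile layout.
class TruncatedRecordReader {

    // Reads records until the stream ends or a record is cut short.
    // Only records that were read completely are returned.
    static List<byte[]> readComplete(DataInputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        while (true) {
            try {
                int length = in.readInt();   // throws EOFException at end of data
                if (length < 0) {
                    return records;          // corrupted length field: stop here
                }
                byte[] payload = new byte[length];
                in.readFully(payload);       // throws EOFException if the record is truncated
                records.add(payload);
            } catch (EOFException e) {
                return records;              // keep the readable prefix, drop the tail
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (String s : new String[] {"alpha", "beta", "gamma"}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }
        // Simulate a crash: cut off the last three bytes of the final record.
        byte[] truncated = Arrays.copyOf(buffer.toByteArray(), buffer.size() - 3);
        List<byte[]> recovered =
            readComplete(new DataInputStream(new ByteArrayInputStream(truncated)));
        System.out.println(recovered.size() + " complete records recovered");
    }
}
```

In this toy format the last, partially written record is simply dropped. With real SequenceFile-s the details differ, but the practical consequence is the one described in NOTE 2.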

The recovery process requires some preparation:

* determine the map directories corresponding to the map task outputs of the failed job. These map directories contain SequenceFile-s consisting of pairs of <Text, FetcherOutput>, named e.g. part-0.out, or file.out, or spill0.out.

* create the new input directory, let's say input/. Copy all SequenceFile-s into this directory, renaming them sequentially like this:
  input/part-00000
  input/part-00001
  input/part-00002
  input/part-00003
  ...
 
* specify the "input" directory as the input to this tool.

If all goes well, a new segment will be created as a subdirectory of the output dir.
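
The copy-and-rename step above can be sketched as follows. This is only an illustration using the standard java.nio.file API; the class name PrepareRecoveryInput and its command-line arguments are hypothetical and are not part of LocalFetchRecover or Nutch. It copies the given files into the input directory in the order given, naming them part-00000, part-00001, and so on.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of LocalFetchRecover or Nutch.
class PrepareRecoveryInput {

    // Copies each source file into inputDir as part-00000, part-00001, ...
    // in the order given, and returns the paths that were created.
    static List<Path> stage(List<Path> sources, Path inputDir) throws IOException {
        Files.createDirectories(inputDir);
        List<Path> staged = new ArrayList<>();
        for (int i = 0; i < sources.size(); i++) {
            Path target = inputDir.resolve(String.format(Locale.ROOT, "part-%05d", i));
            // Copy rather than move, so the original map outputs stay untouched.
            Files.copy(sources.get(i), target, StandardCopyOption.REPLACE_EXISTING);
            staged.add(target);
        }
        return staged;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: PrepareRecoveryInput <inputDir> <file>...");
            System.exit(1);
        }
        List<Path> sources = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            sources.add(Paths.get(args[i]));
        }
        for (Path p : stage(sources, Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

The files are copied instead of moved on purpose: if the recovery run fails, the original map task outputs are still available for another attempt.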


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Updated: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-451:
------------------------------------

    Attachment: LocalFetchRecover.java

Updated: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathijs Homminga updated NUTCH-451:
-----------------------------------

    Attachment: LocalFetchRecover-0.8.1.java

This version works with Nutch 0.8.1.

Commented: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480098 ]

Mathijs Homminga commented on NUTCH-451:
----------------------------------------

While fetching a segment with 4M documents, we ran out of disk space.
We managed to recover most of our data using the LocalFetchRecover tool.

* First, the tool needed some modifications in order to work with Nutch 0.8.1 (see file attached)

* We copied the map task output file of the failed job to the tool's input directory (we had only one input file)
cp /tmp/hadoop/mapred/local/map_p7xlb2/part-0.out /tmp/recovery/input

* Next, we ran the LocalFetchRecover tool. After a few hours we got an EOFException because our input file had not been closed properly. LocalFetchRecover uses an IdentityMapper, so the output of its map tasks is exactly the same as the input, only split into more parts. Knowing this, we ran the tool again, using the newly created map task output files as input.

* Before we ran the tool again, we had to remove the last map output file, because it would have caused another EOFException.

* Done! Our segment was created successfully in /tmp/recovery/output/20070228225939/
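
The "drop the last map output file" step can be sketched like this. It is only an illustration: the class name DropLastMapOutput is hypothetical, and it ASSUMES that the "last" file is the one that sorts last by name, which may not hold for every setup (check which file was being written when the run failed). The sketch only lists the files to keep and deletes nothing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of LocalFetchRecover or Nutch.
class DropLastMapOutput {

    // Returns the regular files in dir sorted by name, without the last one.
    // ASSUMPTION: the file that sorts last is the incomplete one.
    static List<Path> allButLast(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile).forEach(files::add);
        }
        Collections.sort(files);
        if (!files.isEmpty()) {
            files.remove(files.size() - 1);
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: DropLastMapOutput <mapOutputDir>");
            System.exit(1);
        }
        // Print the files that would be used as input for the second run.
        for (Path p : allButLast(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```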




[jira] Updated: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-451:
------------------------------------

    Priority: Minor  (was: Major)

[jira] Closed: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-451.
-----------------------------------

    Resolution: Won't Fix

> Tool to recover partial fetcher output
> --------------------------------------
>
>                 Key: NUTCH-451
>                 URL: https://issues.apache.org/jira/browse/NUTCH-451
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java

[jira] Commented: (NUTCH-451) Tool to recover partial fetcher output

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633370#action_12633370 ]

Andrzej Bialecki  commented on NUTCH-451:
-----------------------------------------

I'm closing this issue, as the tool is not general enough to be included in Nutch. The code stays here, so anyone can still use it.
