[jira] [Commented] (NUTCH-2370) Saving mapping of dumped file to URL


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977818#comment-15977818 ]

ASF GitHub Bot commented on NUTCH-2370:

smadha commented on issue #180: fix for NUTCH-2370 contributed by [hidden email]
URL: https://github.com/apache/nutch/pull/180#issuecomment-295974051
   @chrismattmann can you review please?

> Saving mapping of dumped file to URL
> ------------------------------------
>                 Key: NUTCH-2370
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2370
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Madhav Sharan
>            Priority: Minor
> - nutch dump [0] is a great tool for dumping all the crawled files from Nutch segments.
> - After a dump, we lose the information about the URL from which each file was crawled. The URL is used to name the dumped file, but that information is encrypted.
> - With the `reverseUrlDirs` option, one can figure out the URL by checking the file path, but even working back from the file path is a little more complicated than consulting a simple mapping file.
> - With `flatdir`, there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled segment that maps each file path to its URL.
> [0] https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java
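
The idea described in the issue, a per-segment JSON file mapping each dumped file path to its source URL, could look roughly like the sketch below. This is not the code from the PR: the class `DumpMappingSketch` and its methods `escape`, `toJson`, and `writeMapping` are hypothetical names chosen for illustration, the example paths and URLs are made up, and the JSON is written by hand only to keep the example free of third-party dependencies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper, not part of Nutch: serializes a
// (dumped file path -> source URL) map for one segment as a JSON object.
class DumpMappingSketch {

  // Escape the characters that must be escaped inside a JSON string.
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters use the \\uXXXX form.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  // Build a JSON object with one "path": "url" member per entry,
  // in the iteration order of the given map.
  static String toJson(Map<String, String> pathToUrl) {
    StringBuilder sb = new StringBuilder("{\n");
    Iterator<Map.Entry<String, String>> it = pathToUrl.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String> e = it.next();
      sb.append("  \"").append(escape(e.getKey())).append("\": \"")
        .append(escape(e.getValue())).append('"');
      if (it.hasNext()) {
        sb.append(',');
      }
      sb.append('\n');
    }
    return sb.append('}').toString();
  }

  // Write the mapping for one segment to the given file as UTF-8.
  static void writeMapping(Path out, Map<String, String> pathToUrl) throws IOException {
    Files.write(out, toJson(pathToUrl).getBytes(StandardCharsets.UTF_8));
  }
}
```

In this sketch a dumper would add one entry to a `LinkedHashMap` each time it writes a file, then call `writeMapping` once per segment. A real implementation would more likely use a JSON library than hand-written escaping.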

This message was sent by Atlassian JIRA