[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090044#comment-16090044 ]

ASF GitHub Bot commented on NUTCH-1465:

sebastian-nagel opened a new pull request #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202
   (applied Markus' patch as of 2017-07-05)
   - add SitemapProcessor
   - upgrade dependency crawler-commons to 0.8
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[hidden email]

> Support sitemaps in Nutch
> -------------------------
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>         Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
> I recently came across this rather stagnant codebase [0], which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps, as per the discussion here [1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

This message was sent by Atlassian JIRA