[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090038#comment-16090038 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Thanks, [~[hidden email]]! I tested the patch on a small set of sitemaps and it looks good to me. I've only improved the property descriptions and done some code clean-up (patch / pull request to follow). Please go ahead and commit it! We can later make it more robust, or make more sophisticated use of the last-modified times and priorities provided in sitemaps. Thanks!

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)