[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440114#comment-13440114 ]

Markus Jelsma commented on NUTCH-1233:
--------------------------------------

Any comments here? I think we can commit this and remove the whitespace collapsing on Nutch's side once Tika 1.3 is released, since TIKA-975 will be part of it.
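
For readers unfamiliar with the term, whitespace collapsing of anchor text could look roughly like the sketch below. This is only an illustration under assumptions: the class name AnchorText and the method collapse are hypothetical and are not taken from the attached patches or from Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Nutch or Tika code base.
final class AnchorText {
    // Matches one or more whitespace characters (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private AnchorText() {}

    // Replaces every run of whitespace with a single space and trims both ends.
    // A null input is treated as empty anchor text.
    static String collapse(String text) {
        if (text == null) {
            return "";
        }
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Anchor text as it might come out of markup spread over several lines.
        System.out.println(collapse("  Apache \n\t Nutch   home  "));
    }
}
```

If the parser already delivers normalized anchor text, a step like this on the caller's side becomes redundant and can be dropped.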
               

> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, NUTCH-1233-1.6-2.patch
>
>
> Tika provides outlink extraction features that Nutch does not use. To use them in Nutch, we need Tika to return the rel attribute value of each link, which it currently doesn't. There's a patch for Tika 1.1. Once that patch is included in Tika and we upgrade to the new version, this issue can be worked on. Here's preliminary code that performs both Tika and the current outlink extraction. It also includes parts of the Boilerpipe code.
