[jira] [Updated] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin

Cristian Vat (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2449:
-----------------------------------
    Fix Version/s: 1.17

> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.17
>
>
> The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier for extracting the language from the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it does not even fail gracefully on them: in my experience, Chinese was recognized as Italian.
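
To illustrate the second point, here is a minimal sketch of one way a caller could avoid a misleading result for CJK text: check the Unicode script of the input before handing it to an identifier that only covers other scripts. This is not the Nutch plugin's code and not part of the Tika API. The class name `ScriptGuard`, the method `isMostlyCjk`, and the "more than half of the letters" threshold are all hypothetical choices made for illustration, and only the JDK is used.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Nutch or Tika.
public class ScriptGuard {

    /**
     * Returns true if more than half of the letters in the text belong to
     * a CJK script (Han, Hiragana, Katakana or Hangul).
     * The 50% threshold is an assumption chosen for illustration.
     */
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        // Iterate by code point so supplementary characters are handled.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation and whitespace
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN
                    || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HANGUL) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }

    public static void main(String[] args) {
        System.out.println(isMostlyCjk("这是一个中文句子"));
        System.out.println(isMostlyCjk("Questa è una frase italiana"));
    }
}
```

A caller could use such a check to leave the language field unset, or to route the text to a detector that does cover CJK, instead of storing a wrong language code.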


