[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967019#action_12967019 ]

Robert Muir commented on SOLR-1979:

bq. Yeah, that makes sense, however, I believe Tika returns 639.

Right, but ISO 639 is just a subset of RFC 3066 and its successors.

So, ignore what Tika does: its 639 identifiers are also valid 3066 tags.

Our API should at least be 3066. Java 7 and ICU already support BCP 47 locale identifiers, so you get the normalization there for free.
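
As a rough illustration of the "normalization for free" point, here is a minimal sketch that relies only on java.util.Locale from Java 7. The class and method names (LanguageTagNormalizer, normalize) are made up for this example and are not part of the patch; accepting underscores as separators is also an assumption.

```java
import java.util.Locale;

// Hypothetical helper, not part of the SOLR-1979 patch.
final class LanguageTagNormalizer {

    /**
     * Returns the BCP 47 form of a language tag as produced by java.util.Locale.
     * Underscores are accepted as separators (an assumption of this sketch).
     */
    static String normalize(String tag) {
        // Locale.forLanguageTag normalizes case: language lowercased,
        // script title-cased, region uppercased.
        return Locale.forLanguageTag(tag.replace('_', '-')).toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(normalize("EN"));         // en
        System.out.println(normalize("zh-hant-tw")); // zh-Hant-TW
        System.out.println(normalize("en_us"));      // en-US
        System.out.println(normalize("und"));        // und (undetermined)
    }
}
```

A plain two-letter 639 code passes through unchanged, which is why accepting 3066/BCP 47 in the API costs nothing for callers that only have what Tika returns.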

It would probably also be nice to be able to map a number of languages to a single field. Say you have a single analyzer that can handle CJK; then you may want that whole collection of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle it differently, so there should also be an optional catch-all that handles all languages not specifically mapped.
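
The two ideas above (many languages mapped to one field, plus an optional catch-all) could look roughly like the following. This is only a sketch: LanguageFieldMapper, map and fieldFor are hypothetical names, and matching on the primary language subtag is an assumption about how such a mapping might be keyed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of the SOLR-1979 patch.
final class LanguageFieldMapper {
    private final Map<String, String> suffixByLanguage = new HashMap<String, String>();
    private final String catchAllSuffix; // null means no catch-all is configured

    LanguageFieldMapper(String catchAllSuffix) {
        this.catchAllSuffix = catchAllSuffix;
    }

    /** Maps every given language to the same field suffix, e.g. "_cjk". */
    LanguageFieldMapper map(String suffix, String... languages) {
        for (String language : languages) {
            suffixByLanguage.put(primaryLanguage(language), suffix);
        }
        return this;
    }

    /** Returns the target field for a detected language tag. */
    String fieldFor(String baseField, String languageTag) {
        String suffix = suffixByLanguage.get(primaryLanguage(languageTag));
        if (suffix == null) {
            suffix = catchAllSuffix; // language detected but not specifically mapped
        }
        return suffix == null ? baseField : baseField + suffix;
    }

    // Only the primary language subtag is used as the key, so "zh-Hant-TW" matches "zh".
    private static String primaryLanguage(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    public static void main(String[] args) {
        LanguageFieldMapper mapper = new LanguageFieldMapper("_general")
                .map("_cjk", "zh", "ja", "ko")
                .map("_en", "en");
        System.out.println(mapper.fieldFor("text", "ja")); // text_cjk
        System.out.println(mapper.fieldFor("text", "fi")); // text_general
    }
}
```

With this shape, an undetermined result ("und") simply falls through to the catch-all unless it is mapped explicitly.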

Both of these are good reasons why we must avoid 639-1.
We should be able to use things like macrolanguages and the undetermined language code (und).

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
> We need the ability to detect the language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
> {code:xml}
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code}
> It will then read the text from inputFields name and subject, perform language identification and output the ISO code for the detected language in the outputField. If no language was detected, fallback language is used.
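
The flow described in the issue (read inputFields, identify the language, write outputField, use fallback when nothing is detected) can be sketched as below. This is not the code from the attached patch: the Detector interface stands in for the Tika LanguageIdentifier, the document is modelled as a plain Map, and all names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the described flow; not taken from SOLR-1979.patch.
final class LanguageTaggingSketch {

    /** Stand-in for a real detector; returns a language code, or null if unsure. */
    interface Detector {
        String detect(String text);
    }

    /** Concatenates the input fields, detects the language, and stores the result. */
    static void tag(Map<String, Object> doc, List<String> inputFields,
                    String outputField, String fallback, Detector detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        String language = detector.detect(text.toString().trim());
        // Fall back to the configured language when nothing was detected.
        doc.put(outputField, language != null ? language : fallback);
    }
}
```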

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
