[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968777#action_12968777 ]

Grant Ingersoll commented on SOLR-1979:
---------------------------------------

Sorry, you are right.  See http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties 

{quote}
name.da=Danish
name.de=German
name.et=Estonian
name.el=Greek
name.en=English
name.es=Spanish
name.fi=Finnish
name.fr=French
name.hu=Hungarian
name.is=Icelandic
name.it=Italian
name.nl=Dutch
name.no=Norwegian
name.pl=Polish
name.pt=Portuguese
name.ru=Russian
name.sv=Swedish
name.th=Thai
{quote}

Kind of random that Thai is thrown in there!

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act upon it, for example by indexing the content into language-aware fields. Another use case is filtering or faceting by language on unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
> {code:xml}
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code}
> It will then read the text from the inputFields name and subject, perform language identification, and write the ISO code of the detected language to the outputField. If no language is detected, the fallback language is used.
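
The flow described above (read the input fields, identify the language, write the code or the fallback to the output field) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the class `LanguageFieldSketch`, the `process` method, and the `toyDetector` stand-in are hypothetical names, the document is modeled as a plain `Map`, and no Solr or Tika classes are used. The `idField` setting is ignored here.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch only: not the SOLR-1979 patch and not Solr's or Tika's API.
class LanguageFieldSketch {

    /**
     * Joins the text of the input fields, asks the detector for a language code,
     * and stores that code (or the fallback) in the output field.
     */
    static void process(Map<String, Object> doc,
                        List<String> inputFields,
                        String outputField,
                        String fallback,
                        Function<String, Optional<String>> detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(value);
            }
        }
        // No text at all, or no confident detection: use the fallback code.
        String language = text.length() == 0
                ? fallback
                : detector.apply(text.toString()).orElse(fallback);
        doc.put(outputField, language);
    }

    /** Hypothetical toy detector: recognizes only a few marker words. */
    static Optional<String> toyDetector(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        if (padded.contains(" und ") || padded.contains(" der ")) {
            return Optional.of("de");
        }
        if (padded.contains(" the ") || padded.contains(" and ")) {
            return Optional.of("en");
        }
        return Optional.empty();
    }
}
```

In a real processor the detector argument would presumably delegate to the Tika LanguageIdentifier mentioned in the description; passing it in as a function keeps the fallback logic testable on its own.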

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

