[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643901#comment-13643901 ]

Christian Moen commented on LUCENE-4956:
----------------------------------------

The Korean analyzer should be named {{org.apache.lucene.analysis.kr.KoreanAnalyzer}}, and we'll provide a ready-to-use field type {{text_kr}} in {{schema.xml}} for Solr users, which is consistent with what we do for other languages.
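
As a rough sketch only, a {{text_kr}} field type in {{schema.xml}} could look something like the fragment below. It assumes the analyzer class name proposed in this comment; the {{positionIncrementGap}} value and the {{title_kr}} field are illustrative assumptions, not part of the attached patch.

```xml
<!-- Hypothetical field type; the analyzer class name is the one proposed in this comment. -->
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.kr.KoreanAnalyzer"/>
</fieldType>

<!-- Example field using the type; the field name is an assumption. -->
<field name="title_kr" type="text_kr" indexed="true" stored="true"/>
```

Referencing the analyzer by class keeps the field type short; an explicit tokenizer and filter chain would be the alternative if users need to tune individual steps.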

As for where the analyzer code itself lives, I think it's fine to put it in {{lucene/analysis/arirang}}.  The file {{lucene/analysis/README.txt}} documents what these modules are and the code is easily and directly retrievable in IDEs by looking up {{KoreanAnalyzer}} (the source code paths will be set up by {{ant eclipse}} and {{ant idea}}).

One reason analyzers have not been put in {{lucene/analysis/common}} in the past is that they require dictionaries that are several megabytes in size.

Overall, I don't think the scheme we are using is all that problematic, but it's true that {{MorfologikAnalyzer}} and {{SmartChineseAnalyzer}} don't align with it.  The scheme doesn't easily lend itself to different implementations for one language. That's not a common case today, although it might become more common in the future.

In the case of Norwegian (no), there are ISO language codes for both Bokmål (nb) and Nynorsk (nn), and one way of supporting this is to treat these as options to {{NorwegianAnalyzer}}, since both languages are Norwegian.  See SOLR-4565 for thoughts on how to extend support in {{NorwegianMinimalStemFilter}} for this.

A similar overall approach might make sense when there are multiple implementations for a language; end-users could use an analyzer named {{<Language>Analyzer}} without having to study the differences between implementations first.  I also see problems with this, but it's just a thought...

I'm all for improving our scheme, but perhaps we can open up a separate JIRA for this and keep this one focused on Korean?




               

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-4956
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4956
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.2
>            Reporter: SooMyung Lee
>              Labels: newbie
>         Attachments: kr.analyzer.4x.tar
>
>
> The Korean language has specific characteristics. When developing a search service with Lucene and Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene and Solr. If you develop a search service with Lucene in Korean, the Korean analyzer is the best choice.
