Multi-language Tokenizers / Filters recommended?


Multi-language Tokenizers / Filters recommended?

dma_bamboo
Hi

I'm now considering how to improve query results for a set of languages and
would like to hear your thoughts based on your experience with this.

I'm using the HTMLStripWhitespaceTokenizerFactory tokenizer with the
WordDelimiterFilterFactory, LowerCaseFilterFactory and
RemoveDuplicatesTokenFilterFactory filters as my default config.
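For reference, the field type I have in mind looks roughly like this in
schema.xml (the fieldType name and the WordDelimiter attributes are just
illustrative):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- strip HTML markup, then split on whitespace -->
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- split on case changes, digits, punctuation, etc. -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>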

I need to deal with:
    English (OK)
    Spanish
    Welsh
    Chinese Simplified
    Russian
    Arabic

For Spanish and Russian I'm using the SnowballPorterFilterFactory plus the
defaults. Should I use any specific TokenizerFactory? Which one?
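Roughly, the Spanish and Russian field types I have in mind look like this
(names and attributes are only examples; the chain is the default one above
plus the Snowball stemmer):

    <!-- Spanish: default chain plus the Spanish Snowball stemmer -->
    <fieldType name="text_es" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Russian: identical chain, only the Snowball language differs -->
    <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>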

For Chinese I'm going to use a TokenizerFactory that returns the
CJKTokenizer (as I read in a previous discussion about it) plus the default
filters. Is that OK, or are those filters inadequate for Chinese?
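Concretely, I mean something like one of the two sketches below: either a
small custom factory wrapping CJKTokenizer (the factory class name is just a
placeholder for whatever I end up writing), so the default filters still
apply, or pointing the field type at Lucene's CJKAnalyzer directly, in which
case no extra Solr filters are used:

    <!-- Option 1: placeholder custom factory wrapping CJKTokenizer, default filters kept -->
    <fieldType name="text_zh" class="solr.TextField">
      <analyzer>
        <tokenizer class="com.example.solr.CJKTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Option 2: use Lucene's bundled CJKAnalyzer as a whole -->
    <fieldType name="text_zh_cjk" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
    </fieldType>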

For Welsh I'm using the defaults and would like to know if you have any
recommendations for it.

For Arabic should I use the AraMorph Analyzer (
http://www.nongnu.org/aramorph/english/lucene.html)? What other processing
should I do to get better query results?
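If AraMorph exposes a Lucene Analyzer, I assume it could be plugged in the
same way as the CJK option above, i.e. by pointing the field type at the
analyzer class; the class name below is only my guess and needs checking
against the AraMorph distribution:

    <!-- Arabic via AraMorph: analyzer class name unverified -->
    <fieldType name="text_ar" class="solr.TextField">
      <analyzer class="gpl.pierrick.brihaye.aramorph.lucene.ArabicStemAnalyzer"/>
    </fieldType>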

Does anyone have stop-words and synonyms for languages other than English?
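(I assume such lists would just be wired in through the usual filters with
per-language files, e.g. something like the following; the file names are
made up:)

    <!-- per-language stop word and synonym files (file names are placeholders) -->
    <filter class="solr.StopFilterFactory" words="stopwords_es.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_es.txt" ignoreCase="true" expand="true"/>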

I think this discussion could become a documentation topic with examples,
how-tos and stop-words / synonyms for each language, so it would be much
simpler for those who need to deal with non-English content. What do you
think about that?

Regards,
Daniel



RE: Multi-language Tokenizers / Filters recommended?

T. Kuro Kurosaka
Hi Daniel,
As you know, Chinese and Japanese do not use
spaces or any other delimiters to separate words.
To overcome this problem, CJKTokenizer uses a method
called bi-gramming, where a run of ideographic (=Chinese)
characters is turned into tokens of two neighboring
characters.  So a run of five characters ABCDE
will result in four tokens: AB, BC, CD, and DE.

As a result, a search for "BC" will hit this text,
even if AB is one word and CD is another word.
That is, it increases the noise in the hits.
I don't know how much of a real problem this would be
for Chinese.  But for Japanese, my native language,
it is a problem.  Because of this, a search
for Kyoto will include false hits on documents
that contain Tokyoto, i.e. Tokyo prefecture.

There is another method called morphological
analysis, which uses dictionaries and grammar
rules to break text down into real words.  You
might want to consider this method.

-kuro  

RE: Multi-language Tokenizers / Filters recommended?

Xuesong Luo
In reply to this post by dma_bamboo
For Chinese search, you may also consider
org.apache.lucene.analysis.cn.ChineseTokenizer.

From the ChineseTokenizer description: it extracts tokens from the stream
using Character.getType(), and the rule is that each Chinese character
becomes a single token. The difference between the ChineseTokenizer and
the CJKTokenizer is that they have different token parsing logic. Let me
use an example. If a Chinese text "C1C2C3C4" is indexed, the tokens
returned from the ChineseTokenizer are C1, C2, C3, C4, while the tokens
returned from the CJKTokenizer are C1C2, C2C3, C3C4. Therefore the index
the CJKTokenizer creates is much larger. The problem is that when
searching for C1, C1C2, C1C3, C4C2, C1C2C3 ... the ChineseTokenizer
works, but the CJKTokenizer will not.
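(If you want to try it from schema.xml, one option is to point a field type
at the bundled ChineseAnalyzer, which combines ChineseTokenizer with
ChineseFilter; the field type name is just an example:)

    <!-- one token per Chinese character -->
    <fieldType name="text_zh_uni" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
    </fieldType>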





Re: Multi-language Tokenizers / Filters recommended?

Otis Gospodnetic-2
In reply to this post by dma_bamboo
So then you write a tokenizer that creates a token stream consisting of both uni-grams (e.g. C1, C2) and bi-grams (e.g. C1C2, C2C3), and you get both.  I already pointed to the n-gram tokenizers I wrote a while back and put under Lucene's contrib/analyzers/...
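(A schema-level way to get a similar effect without writing a new tokenizer
would be to index the same text into two fields, one tokenized per character
and one bi-grammed, e.g. with field types like the ChineseAnalyzer and
CJKAnalyzer sketches earlier in the thread, and to query both fields; the
field names here are illustrative:)

    <!-- same source text indexed twice: uni-grams and bi-grams -->
    <field name="body_zh_uni" type="text_zh_uni" indexed="true" stored="false"/>
    <field name="body_zh_cjk" type="text_zh_cjk" indexed="true" stored="false"/>
    <copyField source="body" dest="body_zh_uni"/>
    <copyField source="body" dest="body_zh_cjk"/>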

Otis
--
Lucene Consulting -- http://lucene-consulting.com/

