The best Chinese Analyzer?

Bob Cheung-2
I have a question for those who have used Lucene to index and search for
Chinese Characters, what is the best Analyzer for the job?

I know all these three can do the job:

1. StandardAnalyzer
2. CJKAnalyzer
3. ChineseAnalyzer

What are the differences between these 3 analyzers?

TIA.

Regards,
Bob

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: The best Chinese Analyzer?

saturnism
Hi Bob,

In short, I use a slightly modified ChineseAnalyzer to index Chinese text.
The three analyzers differ mainly in the way they tokenize the text.

StandardAnalyzer is intended for use w/ Latin-based languages, in which each
word is made up of multiple characters and words are separated by
special markers such as a space ' ', a comma, a period, a new line,
etc. So "C1C2C3" (space) "C4C5C6" will be tokenized into 2 terms:
"C1C2C3" and "C4C5C6"

CJKAnalyzer tokenizes Chinese text into 2-grams (from
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java?rev=165565&view=markup)
"C1C2C3C4" -> "C1C2" "C2C3" "C3C4"

ChineseAnalyzer tokenizes Chinese text into 1-grams
(http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java?rev=353930&view=markup)
"C1C2C3C4" -> "C1" "C2" "C3" "C4"

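The most obvious effect of these 3 tokenization strategies is on the
search results.
Suppose you search for "C2C3" in the text "C1C2C3C4". Going by the
examples above, ChineseAnalyzer finds it (as the adjacent terms "C2" and
"C3") and CJKAnalyzer finds it (as the 2-gram "C2C3"), but
StandardAnalyzer, tokenizing as described, does not, because it indexed
only the whole run as a single term.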
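
The three strategies described above can be sketched in plain Java. This is a toy illustration, not the Lucene classes: the class and method names are hypothetical, each ASCII letter stands in for one Chinese character (as C1, C2, ... do in this post), and the real analyzers do more work (lower-casing, stop words, script detection).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy versions of the three tokenization strategies; NOT the Lucene classes. */
final class TokenizationSketch {

    /** Split on whitespace and common punctuation, keeping each run whole. */
    static List<String> bySeparator(String text) {
        List<String> terms = new ArrayList<>();
        for (String run : text.split("[\\s,.;!?]+")) {
            if (!run.isEmpty()) {
                terms.add(run);
            }
        }
        return terms;
    }

    /** Overlapping 2-grams within each run: "ABCD" -> "AB", "BC", "CD". */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (String run : bySeparator(text)) {
            int[] cps = run.codePoints().toArray();
            if (cps.length == 1) {
                terms.add(run); // a lone character is kept as its own term
            }
            for (int i = 0; i + 1 < cps.length; i++) {
                terms.add(new String(cps, i, 2));
            }
        }
        return terms;
    }

    /** 1-grams: one term per character. */
    static List<String> unigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (String run : bySeparator(text)) {
            run.codePoints()
               .forEach(cp -> terms.add(new String(Character.toChars(cp))));
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "ABC DEF"; // stands in for "C1C2C3" (space) "C4C5C6"
        System.out.println("separator: " + bySeparator(text));
        System.out.println("2-grams:   " + bigrams(text));
        System.out.println("1-grams:   " + unigrams(text));
    }
}
```

Comparing the three term lists for the same input shows which query terms each strategy is able to match.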

ray,


Subject indexing and searching documents with multiple languages

pbatcoi
Hello,

We need to index and search documents in multiple languages.

Our current approach is:

Determine the language of each document before passing it to Lucene and use
a Lucene index for each language. This seems to be necessary because the
IndexWriter takes an analyzer as a parameter. Thus we can pass the English
documents to the IndexWriter created with the English analyzer, and so on.

Our problem is the search: we would like to be able to search in only one or
in all language-specific indexes. That is not a problem in itself, because we
can use the MultiSearcher. But the MultiSearcher takes one query as a
parameter, and the query is generated using an analyzer. We would need to
generate differently analyzed queries for the different indexes.

Has anybody found a solution to this problem and can point us in a direction
to investigate further?

Greetings

Peter and Stefan

Re: Subject indexing and searching documents with multiple languages

Grant Ingersoll
We wrote our own MultiSearcher type class that manages this problem.  It
takes in a query in the user's native language and then feeds it to the
searcher for that language, which uses a machine translation component
to create a query for that index using that language's Analyzer.
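
The core of that idea, analyzing the same raw query once per language index, might look roughly like the sketch below. It is a plain-Java toy under stated assumptions: the class, the two stand-in analyzers, and the language codes are all hypothetical, and the machine translation step is left out entirely. With Lucene itself, the equivalent step would be parsing the query once per index with that language's Analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch: analyze one raw query once per language index. */
final class PerLanguageQuerySketch {

    /** Stand-in for a Latin-script analyzer: lower-case, split on whitespace. */
    static final Function<String, List<String>> WHITESPACE = text -> {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    };

    /** Stand-in for a 1-gram analyzer: one term per non-space character. */
    static final Function<String, List<String>> UNIGRAM = text ->
            text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());

    /** Run the same raw query through each language's analyzer. */
    static Map<String, List<String>> analyzePerLanguage(
            String rawQuery,
            Map<String, Function<String, List<String>>> analyzersByLanguage) {
        Map<String, List<String>> queries = new LinkedHashMap<>();
        analyzersByLanguage.forEach((language, analyzer) ->
                queries.put(language, analyzer.apply(rawQuery)));
        return queries;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> analyzers = new LinkedHashMap<>();
        analyzers.put("en", WHITESPACE); // hypothetical language keys
        analyzers.put("zh", UNIGRAM);
        // Each entry would then be sent to the index for that language,
        // and the hits merged by the caller.
        System.out.println(analyzePerLanguage("Lucene Index", analyzers));
    }
}
```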

-Grant


--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886


Re: Subject indexing and searching documents with multiple languages

Karl Wettin-3
On Mon, 2006-05-08 at 08:34 -0400, Grant Ingersoll wrote:
> This seems to be necessary because the IndexWriter takes an analyzer
> as parameter. Thus we can pass the English documents to the
> IndexWriter created with the English analyzer and so on.

/**
 * Adds a document to this index, using the provided analyzer
 * instead of the value of {@link #getAnalyzer()}.  If the
 * document contains more than {@link #setMaxFieldLength(int)}
 * terms for a given field, the remainder are discarded.
 */
 public void addDocument(Document doc, Analyzer analyzer)
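
To illustrate what a per-document analyzer buys you, here is a toy single index in plain Java in which every document is added with its own analyzer. It only models the idea behind the overload quoted above; the class, its methods, and the two stand-in analyzers are hypothetical and are not Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Toy inverted index: one index, analyzer chosen per document. */
final class PerDocumentAnalyzerSketch {

    /** Stand-in analyzer: lower-case words split on whitespace. */
    static final Function<String, List<String>> WORDS = text -> {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    };

    /** Stand-in analyzer: one term per non-space character. */
    static final Function<String, List<String>> UNIGRAMS = text -> {
        List<String> terms = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    };

    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Index the text with the given analyzer and return the new document id. */
    int addDocument(String text, Function<String, List<String>> analyzer) {
        int docId = nextDocId++;
        for (String term : analyzer.apply(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Ids of all documents that contain the exact term. */
    Set<Integer> docsWithTerm(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        PerDocumentAnalyzerSketch index = new PerDocumentAnalyzerSketch();
        index.addDocument("Hello World", WORDS); // e.g. an English document
        index.addDocument("ABCD", UNIGRAMS);     // letters stand in for Chinese characters
        System.out.println(index.docsWithTerm("hello"));
        System.out.println(index.docsWithTerm("B"));
    }
}
```

The query side still has to be analyzed the same way as the documents it is meant to match, so choosing the analyzer per document does not remove the need to know the language at search time.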



Re: Subject indexing and searching documents with multiple languages

Karl Wettin-3
On Mon, 2006-05-08 at 16:08 +0200, karl wettin wrote:
> On Mon, 2006-05-08 at 08:34 -0400, Grant Ingersoll wrote:
> > This seems to be necessary because the IndexWriter takes an analyzer
> > as parameter. Thus we can pass the English documents to the
> > IndexWriter created with the English analyzer and so on.

>  public void addDocument(Document doc, Analyzer analyzer)  

Hmm, perhaps I misunderstood?


Re: Subject indexing and searching documents with multiple languages

pbatcoi
Karl,

No, you didn't misunderstand. We have to admit that we were not aware of the
possibility of using different analyzers for the documents in an index. It
seems that we were working too close to the examples and did not spend enough
time to RTFM. Thank you for the hint!

Grant,

Considering the answer from Karl, it seems that we have the choice to put all
the documents in one index or to use an index for each language. You are using
an index for each language. We are currently discussing the pros and cons
of both solutions. Thus we would be very interested to find out about your
reasons for using a separate index for each language.

Thanks again for taking the time to answer our question!

Greetings

Stefan and Peter


Re: Subject indexing and searching documents with multiple languages

Karl Wettin-3
On Tue, 2006-05-09 at 10:18 +0200, [hidden email] wrote:
>
> considering the answer from Karl, it seems that we have the choice to
> put all the documents in one index or use an index for each language.
> You are using an index for each language. We are currently discussing
> the pros and cons for both solutions. Thus we would be very interested
> to find out about your reasons to use a separate index for each
> language.


If you ask me, you should only choose multiple indices when there is no
other choice. Well, at least if you plan to search more than one
language at a time.

Why? It is really, really slow.


Re: Subject indexing and searching documents with multiple languages

Grant Ingersoll
[hidden email] wrote:
> Grant,
>
> considering the answer from Karl, it seems that we have the choice to put all
> the documents in one index or use an index for each language. You are using
> an index for each language. We are currently discussing the pros and cons
> for both solutions. Thus we would be very interested to find out about your
> reasons to use a separate index for each language.
>
>  
Hmmm, it was a while ago and I am not 100% convinced I would make the
same decision again, but I seem to recall a few reasons:

1. I thought it would be easier to manage them separately, as they all
come from distinct collections.  We can easily take one language/index
off line w/o affecting the other indexes.  I think this has been proved
out over time in our case.  We often index the same set of documents
several different ways (different stemmers, adding proper nouns,
phrases, transliteration, translation, etc.), trying to evaluate which
one gives us the best results.  Having them all in the same collection
makes this harder.  What works for you probably depends on how often you
update/delete, etc.

2. I am not certain of what it means to have an English query match
against Arabic documents that contain English in them (or any other
language).  On the surface this seems fine, since it is a term match,
but I am not sure if it is meaningful when it comes to meeting a user's
information need.  This is just a hunch and I am not sure it is a
correct one.  I think it would probably warrant more study from a user's
perspective.  I would like to hear others' opinions on this.

3. Logically, to me, our collections are separate.  They come from
different sources, they are in different languages, they have different
styles, different authors, etc.  It just _feels_ like they should be
kept separate.  You may not care about this distinction.

One question I have always had is whether storing everything in the same
index skews the IDF values by giving a term more importance than it
really warrants.  My guess is it doesn't b/c this would happen evenly
for all terms.  However, sometimes you have terms that occur in
both collections, so the IDF for these may be smaller than it would be
if the collections had been indexed separately.  Is this valid or
am I talking crazy?
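
One way to get a feel for the size of this effect is to plug numbers into an IDF formula. The sketch below assumes idf = 1 + ln(numDocs / (docFreq + 1)), which is my understanding of Lucene's default similarity at the time; treat the formula as an assumption. The class name, the collection sizes, and the document frequencies are all made up for illustration.

```java
/** Compares a term's IDF in a separate index vs. one combined index. */
final class IdfSketch {

    /**
     * Assumed formula: idf = 1 + ln(numDocs / (docFreq + 1)).
     * Believed to match Lucene's default similarity of the time; not verified here.
     */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int englishDocs = 1000;  // made-up collection sizes
        int arabicDocs = 1000;
        int dfEnglish = 10;      // docs containing the term in the English collection

        // Three made-up cases: the term is absent from, equally rare in,
        // or very common in the Arabic collection.
        int[] dfArabicCases = {0, 10, 500};

        double separate = idf(dfEnglish, englishDocs);
        for (int dfArabic : dfArabicCases) {
            double combined = idf(dfEnglish + dfArabic, englishDocs + arabicDocs);
            System.out.printf("dfArabic=%d separate=%.3f combined=%.3f%n",
                    dfArabic, separate, combined);
        }
    }
}
```

Under this formula, a term that is absent from the other collection gets a higher IDF in the combined index, a term that is equally rare in both stays roughly where it was, and a term that is common in the other collection gets a noticeably lower one. So the shift is not uniform across terms.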
