Lucene for chinese search

Lucene for chinese search

Lee Li Bin
Hi,

I would like to know whether StandardAnalyzer supports searching for Chinese
words.

To support Chinese search, is any particular encoding needed when
developing the application?

I'm currently using Jetty as the web server and JSP for the application;
search results are saved in an XML file and displayed using XSL. Is an
encoding needed for any of these files (XML, XSL, etc.), as well as when
parsing the query?

Thanks a lot.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Lucene for chinese search

chrislusf
There are three things to watch out for with Chinese or other CJK languages:

1. The content source or database needs to be encoded in UTF-8.
2. StandardAnalyzer doesn't support Chinese words well. Use either
ChineseAnalyzer or CJKAnalyzer. In my experience, CJKAnalyzer is a
little better.
3. The user's query should be encoded in UTF-8.
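
Points 1 and 3 above can be sketched in plain Java. This is a minimal
illustration, not code from the thread: the class and method names
(Utf8Input, decodeContent, readContent, decodeQuery) are made up, and it
assumes the query arrives percent-encoded, as in a URL. The idea is to name
UTF-8 explicitly wherever bytes become text, instead of relying on the
platform default charset.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: names UTF-8 explicitly at each bytes-to-text boundary.
class Utf8Input {

    // Decode raw content bytes as UTF-8, never the platform default charset.
    static String decodeContent(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Read a whole content file (for example the XML data source) as UTF-8.
    static String readContent(Path file) {
        try {
            return decodeContent(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode a percent-encoded query string, assuming the client sent UTF-8.
    static String decodeQuery(String rawQuery) {
        try {
            return URLDecoder.decode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, Utf8Input.decodeQuery("%E4%B8%AD%E6%96%87") yields the two
characters U+4E2D U+6587, which can then be handed to the query parser.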

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes



RE: Lucene for chinese search

Lee Li Bin
Hi,

I still have a problem searching for Chinese words.
The XML file that is the data source, and the analyzer, have already been
encoded. I have tested StandardAnalyzer, CJKAnalyzer, and ChineseAnalyzer,
but I still can't get any results.

1. Do we need any encoding configuration in Apache Tomcat for Chinese
search using Lucene?

2. Do we need to use JSP meta / page encoding? What is the encoding
for JSP?


 
Regards,
Lee Li Bin


Re: Lucene for chinese search

Mathieu Lecarme
Try a simple JUnit test first; after that you can fight with the UTF-8
parameters.

M.


RE: Lucene for chinese search

Lee Li Bin

Hi,

Indexing is not a problem: when I open the index file in Notepad, it
contains Chinese text similar to my data source (XML).

I tried using UTF-8 in the JSP, and getting the byte array as 'utf-8',
ISO8859_1 or Cp1252 in the Java servlet, but the search still fails: no
results are displayed for Chinese terms.

My data source mixes English and Chinese text. Search works for English
terms, but Chinese characters display as '???' in the result output.

Please advise, or send a sample or solution.
 
Thanks.


Re: Lucene for chinese search

chrislusf
Basically, wherever you see an encoding setting, it should be UTF-8.

The servlet also has an encoding setting. In your case, change the
Tomcat setting.
The encoding also matters when the JSP page is rendered.
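
The '???' symptom described in the question is what a charset mismatch
typically looks like. The sketch below reproduces it in plain Java, with no
servlet container involved; the class and method names are illustrative
only. It shows two cases: text encoded into a charset that has no Chinese
characters (the text is lost and becomes '?'), and UTF-8 bytes decoded as
ISO-8859-1 (the text is garbled but still recoverable). Whether either case
matches the poster's actual setup is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: reproduces two common charset mismatches.
class EncodingMismatch {

    // Encoding Chinese text as ISO-8859-1 replaces every unmappable
    // character with '?', so the original text cannot be recovered.
    static String lossyRoundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    // What a value looks like when its UTF-8 bytes are decoded as
    // ISO-8859-1, for example by a container using its default charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo misdecode(): ISO-8859-1 maps every byte to one character,
    // so the original UTF-8 bytes can be restored and decoded correctly.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }
}
```

With the two-character input U+4E2D U+6587, lossyRoundTrip returns "??",
while repair(misdecode(...)) returns the input unchanged. Setting UTF-8
consistently at every stage avoids needing the repair step at all.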


Re: Lucene for chinese search

Karl Wettin
A year or two ago I hacked Lucene to use UTF-16 instead of UTF-8, as CJK
characters take 3 bytes in UTF-8 and 2 bytes in UTF-16. It is a simple
hack.

It did not save me much, however, as I had a mixed Latin and CJK corpus,
and I reverted. I still think it is something worth considering. Perhaps
it would be worth implementing a per-index, per-document or per-field
string encoding strategy.
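
The byte counts mentioned above are easy to check. The following sketch
only measures encoded sizes with the standard library; it says nothing
about how Lucene itself stores strings, and the class name EncodedSize is
made up for the example. UTF-16BE is used so that no byte-order mark is
added to the count.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: compares encoded sizes of the same string.
class EncodedSize {

    // Number of bytes the string occupies in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16, big-endian, without a byte-order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

Two CJK characters take 6 bytes in UTF-8 and 4 in UTF-16, whereas the
ASCII word "Lucene" takes 6 bytes in UTF-8 and 12 in UTF-16. For mixed
text the gain on CJK characters is offset by the loss on Latin ones, which
is consistent with the small saving reported for a mixed corpus.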





Re: Lucene for chinese search

chrislusf
Hi, Karl,

Thanks for sharing this experience.

I did find that CJKAnalyzer somehow behaves differently from
ChineseAnalyzer. When I tried to highlight the matched term,
ChineseAnalyzer somehow didn't work, but I didn't investigate it.

This is a useful clue.


Re: Lucene for chinese search

Karl Wettin
Don't they differ in tokenization? One of them uses n-grams and the other
does not, right? That would be another thing that might mess it up. But
I have never looked at the highlighter, so I can only guess.
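
To make the tokenization difference concrete, here is a sketch of the two
strategies in plain Java. It imitates the general idea only and is not the
Lucene implementation: as far as is generally documented, CJKAnalyzer emits
overlapping two-character tokens and ChineseAnalyzer emits one token per
character. The class and method names are hypothetical, and the sketch
assumes the input consists only of CJK characters from the Basic
Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: two ways to split a run of CJK characters into terms.
class CjkTokenSketch {

    // One token per character.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping two-character tokens: positions (0,1), (1,2), (2,3), ...
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }
}
```

A four-character string gives four tokens with unigrams and three
overlapping tokens with bigrams, so the two strategies produce different
terms and different offsets for the same text.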

--
karl


RE: Lucene for chinese search

Lee Li Bin
In reply to this post by chrislusf
Hi,

Thanks, guys, for helping me.

I forgot to use the same analyzer for searching as for indexing; that's
why I couldn't search for Chinese words. :)
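
Why a mismatched analyzer returns nothing can be shown with a toy index.
The sketch below is not Lucene code: the set-based "index", the two
tokenizers, and the all-terms-must-match rule are simplifying assumptions
made for the illustration, and the thread does not say which analyzers
were actually mixed up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: a toy term index showing why index-time and
// query-time analysis must agree.
class AnalyzerMismatch {

    // One token per character.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // "Index" a document: keep the set of terms the analyzer produced.
    static Set<String> index(String doc, Function<String, List<String>> analyzer) {
        return new HashSet<>(analyzer.apply(doc));
    }

    // A query matches when it yields terms and all of them are indexed.
    static boolean matches(Set<String> indexed, String query,
                           Function<String, List<String>> analyzer) {
        List<String> terms = analyzer.apply(query);
        return !terms.isEmpty() && indexed.containsAll(terms);
    }
}
```

If a four-character document is indexed with bigrams, a two-character
query analyzed with bigrams matches, while the same query analyzed with
unigrams does not, because single-character terms were never indexed.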

 


Re: Lucene for chinese search

Otis Gospodnetic-2
In reply to this post by Lee Li Bin
Regarding point #2, in case none of those work for you for some reason, you could always try using this:

$ ll analyzers/src/java/org/apache/lucene/analysis/ngram/
total 48
-rw-rw-r--  1 otis otis 4934 Mar  2 16:32 EdgeNGramTokenFilter.java
-rw-rw-r--  1 otis otis 4617 Feb 21 15:33 EdgeNGramTokenizer.java
-rw-rw-r--  1 otis otis 3257 Mar  2 17:12 NGramTokenFilter.java
-rw-rw-r--  1 otis otis 3103 Mar  2 16:33 NGramTokenizer.java
drwxrwxr-x  7 otis otis 4096 May 31 10:11 .svn/

Otis
--
Lucene Consulting -- http://lucene-consulting.com/

