Indexing multiple languages

Indexing multiple languages

Tansley, Robert
Hi all,

DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and the extracted full-text content of documents
stored in it.  Now that the system is being used globally, it needs to
support multi-language indexing.

I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?

I don't fully understand the consequences in terms of performance for
1/, but I can see that false hits could turn up where one word appears
in different languages (stemming could increase the chances of this).
Also, some languages' analyzers are quite dramatically different (e.g.
the Chinese one, which just treats every character as a separate
token/word).

On the other hand, if people are searching for proper nouns in metadata
(e.g. "DSpace") it may be advantageous to search all languages at once.


I'm also not sure of the storage and performance consequences of 2/.

Approach 3/ seems like it might be the most complex from an
implementation/code point of view.  

Does anyone have any thoughts or recommendations on this?

Many thanks,

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/


Re: Indexing multiple languages

jian chen
Hi,

Interesting topic. I have thought about this as well. I want to index
Chinese text mixed with English, i.e., I want to treat the English text
embedded in Chinese text as English tokens rather than as Chinese tokens.

Right now I think I may have to write a special analyzer that takes
the text input and checks whether each character is ASCII: if it is,
assemble the run of ASCII characters into a single token; if not, emit
the Chinese character as a token by itself.

So the bottom line is: just one analyzer for all the text, with that
if/else logic inside the analyzer.
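
A rough sketch of what I mean, written against the Lucene 1.4-era
Tokenizer API (the class name is made up, and this is untested):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    // Hypothetical tokenizer: a run of ASCII letters/digits becomes one
    // token; every other letter (e.g. a Chinese character) becomes a
    // token by itself.
    public class AsciiOrCjkTokenizer extends Tokenizer {
        private int offset = 0;
        private int pending = -2;  // -2 means no buffered look-ahead char

        public AsciiOrCjkTokenizer(Reader in) {
            super(in);
        }

        private int read() throws IOException {
            if (pending != -2) {
                int c = pending;
                pending = -2;
                return c;
            }
            return input.read();
        }

        public Token next() throws IOException {
            int c = read();
            // skip whitespace, punctuation, etc.
            while (c != -1 && !Character.isLetterOrDigit((char) c)) {
                offset++;
                c = read();
            }
            if (c == -1) return null;
            int start = offset;
            if (c < 128) {
                // ASCII: accumulate the whole run into a single token
                StringBuffer sb = new StringBuffer();
                while (c != -1 && c < 128 && Character.isLetterOrDigit((char) c)) {
                    sb.append((char) c);
                    offset++;
                    c = read();
                }
                pending = c;  // push back the first char after the run
                return new Token(sb.toString().toLowerCase(), start, offset);
            }
            // non-ASCII (e.g. Chinese): one character, one token
            offset++;
            return new Token(String.valueOf((char) c), start, offset);
        }
    }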

I would like to learn more thoughts about this!

Thanks,

Jian



Re: Indexing multiple languages

Erik Hatcher
Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
will keep English as-is (removing stop words, lowercasing, and such)
and will also split CJK characters into separate single-character tokens.
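
For example, a minimal sketch against the 1.4-era analysis API - the
sample string is arbitrary:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MixedTextDemo {
        public static void main(String[] args) throws Exception {
            // English words come out lowercased and stop-word-filtered;
            // each CJK character comes out as its own token.
            String text = "Lucene is a 全文检索 library";
            TokenStream tokens = new StandardAnalyzer()
                    .tokenStream("contents", new StringReader(text));
            for (Token t = tokens.next(); t != null; t = tokens.next()) {
                System.out.println(t.termText());
            }
        }
    }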

     Erik





Re: Indexing multiple languages

Erik Hatcher
In reply to this post by Tansley, Robert
Robert,

I'm very likely going to be using DSpace and some related  
technologies from the SIMILE project very soon :)


On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:

> What if we're trying to index multiple languages in the same site?  Is
> it best to have:
>
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so
> searches can be constrained to a particular language
> 3/ separate indices for each language?

I would vote for option #2 as it gives the most flexibility - you can
query with or without concern for language.

> I'm also not sure of the storage and performance consequences of 2/.

Adding an additional field will be of little consequence.

> Approach 3/ seems like it might be the most complex from an
> implementation/code point of view.

I don't think #3 is much more complex to implement than the other
options, except if you want to search across all languages - but the
MultiSearcher can handle that.
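
As a minimal sketch, assuming one index per language under hypothetical
paths and the 1.4-era API:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class AllLanguagesSearch {
        public static void main(String[] args) throws Exception {
            // One sub-searcher per per-language index (paths are made up)
            Searchable[] perLanguage = {
                new IndexSearcher("/indexes/en"),
                new IndexSearcher("/indexes/cn"),
                new IndexSearcher("/indexes/fr")
            };
            MultiSearcher searcher = new MultiSearcher(perLanguage);
            // Proper nouns like "dspace" can then match in any language
            Query query = QueryParser.parse("dspace", "contents",
                                            new StandardAnalyzer());
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " hits across all languages");
        }
    }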

> Does anyone have any thoughts or recommendations on this?

It's tough to give a general recommendation - it really depends on
how each of these solutions fits into the architecture and what needs
you have in terms of querying across multiple languages and such.

     Erik



Re: Indexing multiple languages

jian chen
In reply to this post by Erik Hatcher
Hi, Erik,

Thanks for your info.

No, I haven't tried it yet. I will give it a try and maybe put a
Chinese/English text search demo online.

Currently I use Lucene as the indexing engine for a Velocity mailing
list search; I have a demo at www.jhsystems.net.

It is yet another mailing list search for Velocity, but it combines
date search with full-text search.

I use Lucene only for indexing the textual content, and combine
database search with Lucene search when returning the results.

The other interesting thought I have is that it might be possible to
use Lucene's segment-merging mechanism to write a simple Java-based
file system, one which would not require a constant compaction
operation. The file system could be based on a single file, where
segments are just parts of that big file. It might be really efficient
for workloads that add and delete objects all the time.

Lastly, any comments on the www.jhsystems.net Velocity search are welcome.

Thanks,

Jian
www.jhsystems.net



Re: Indexing multiple languages

Paul Libbrecht
In reply to this post by Erik Hatcher
On 1 June 2005, at 01:12, Erik Hatcher wrote:
>> 1/ one index for all languages
>> 2/ one index for all languages, with an extra language field so
>> searches can be constrained to a particular language
>> 3/ separate indices for each language?
> I would vote for option #2 as it gives the most flexibility - you can
> query with or without concern for language.

The way I've solved this is to use a different field name per language,
since our documents can be multilingual.
What's then done is query expansion at query time: given a term query
for text, I duplicate it for each accepted language of the user, with a
boost factor related to the preference for that language (e.g. the q
factor in the Accept-Language HTTP header). Presumably I could use
solution 2/ as well if my queries become too big, making a separate
document for each language of the document.
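
Concretely, the expansion looks roughly like this sketch (field names
like "text_en" are just our own convention, not anything Lucene
mandates; the boolean-clause flags are the 1.4-era API):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class LanguageExpansion {
        // One optional clause per accepted language, boosted by the
        // q factor taken from the user's Accept-Language header.
        public static Query expand(String term, String[] langs, float[] q) {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < langs.length; i++) {
                TermQuery clause =
                    new TermQuery(new Term("text_" + langs[i], term));
                clause.setBoost(q[i]);
                query.add(clause, false, false);  // optional, not prohibited
            }
            return query;
        }
    }

For example, expand("carrot", new String[]{"en", "fr"}, new
float[]{1.0f, 0.8f}) prefers English matches but still finds French ones.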

I think it's very important to take care in guessing the accepted
languages of the user. Typically, the default behaviour of Google is to
give you matches only in your primary language, but then to allow
expansion to any language.

>> On the other hand, if people are searching for proper nouns in
>> metadata (e.g. "DSpace") it may be advantageous to search all
>> languages at once.

This one may need particular treatment.

Tell us your success!

paul



RE: Indexing multiple languages

Tansley, Robert
In reply to this post by Tansley, Robert
Thanks all for the useful comments.

It seems that there are even more options --

4/ One index, with a separate Lucene document for each (item,language) combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that include the language (e.g. title_en, title_cn)

I quite like 4, because you can search with no language constraint, or with one as Paul suggests.  However, some "non-language-specific" data might need to be repeated (e.g. dates), unless we had an extra Lucene document for all of that.  I wonder what the various pros and cons in terms of index size and performance would be in each case?  I really don't have enough knowledge of Lucene to have any idea...
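
For concreteness, here's a sketch of what one (item, language) document
in option 4 might look like with the Lucene 1.4 Document API - the
field names are made up, and my Lucene may well be off:

    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ItemLanguageDocument {
        // One Lucene document per (item, language) pair; the shared
        // handle ties the per-language documents back to the same item.
        public static Document build(String handle, String lang,
                                     String title, Reader fullText) {
            Document doc = new Document();
            doc.add(Field.Keyword("handle", handle));   // e.g. "hdl:123/456"
            doc.add(Field.Keyword("language", lang));   // e.g. "en"
            doc.add(Field.Text("title", title));
            doc.add(Field.Text("contents", fullText));  // tokenized, unstored
            return doc;
        }
    }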

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/



RE: Indexing multiple languages

Bob Cheung
In reply to this post by Tansley, Robert
Hi Erik,

I am a newcomer to this list, so please allow me to ask a dumb
question.

For the StandardAnalyzer, will it have to be modified to accept
different character encodings?

We have customers in China, Taiwan and Hong Kong.  Chinese data may come
in 3 different encodings:  Big5, GB and UTF-8.

What is the default encoding for the StandardAnalyzer?

Btw, I did try running the Lucene demo (web template) to index the HTML
files after I added one containing both English and Chinese characters.
I was not able to search for any Chinese in that HTML file (the search
returned no hits).  I wonder whether I need to change some of the Java
programs to index Chinese and/or accept Chinese as a search term.  I was
able to find the HTML file if I searched for an English word that
appeared in it.

Thanks,

Bob





Re: Indexing multiple languages

Andy Roberts-3
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
> For the StandardAnalyzer, will it have to be modified to accept
> different character encodings?
>
> We have customers in China, Taiwan and Hong Kong.  Chinese data may come
> in 3 different encodings:  Big5, GB and UTF-8.
>
> What is the default encoding for the StandardAnalyzer?

The analysers themselves do not worry about encodings per se. Java uses
Unicode strings throughout, which is adequate for describing all
languages.  When reading in text files, it's a matter of letting the
reader know which encoding the file is in; this lets Java read in the
text and essentially map that encoding to Unicode. All the string
operations, like analysing, are done on these Unicode strings.

So the task is making sure the file reader you use to open a document
for indexing is given the required information for correctly decoding
your file.  If you don't specify an encoding, Java will use a default
based on your OS locale. For me, that's Latin-1, as I'm in Britain. This
is clearly inadequate for non-Latin texts: a Latin-1 reader wouldn't be
able to read in Chinese texts properly, as that encoding doesn't support
such characters. You need to specify Big5 yourself. Read the info on
InputStreamReaders:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html
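
For example, a minimal sketch (the file name is hypothetical;
Field.Text taking a Reader is the Lucene 1.4 API):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class Big5Indexing {
        public static void main(String[] args) throws Exception {
            // Decode the file as Big5 explicitly; relying on the platform
            // default (e.g. Latin-1) would mangle the Chinese characters.
            Reader reader = new BufferedReader(new InputStreamReader(
                    new FileInputStream("chinese-doc.txt"), "Big5"));
            Document doc = new Document();
            doc.add(Field.Text("contents", reader));  // analysed as Unicode
            // ... then add doc to an IndexWriter as usual ...
        }
    }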

Andy



Re: Indexing multiple languages

Erik Hatcher
In reply to this post by Bob Cheung

On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:

> Btw, I did try running the Lucene demo (web template) to index the HTML
> files after I added one containing both English and Chinese characters.
> I was not able to search for any Chinese in that HTML file (the search
> returned no hits).  I wonder whether I need to change some of the Java
> programs to index Chinese and/or accept Chinese as a search term.  I was
> able to find the HTML file if I searched for an English word that
> appeared in it.

Bob - Andy provided thorough information on the StandardAnalyzer
issue (in short, it deals with Unicode directly, not encodings).  As
for the Lucene demo - you will have to adjust it to read the files in
the proper encoding.  The IndexFiles program indexes files using the
default encoding, which won't be sufficient for your purpose.  The two
files to check are HtmlDocument and FileDocument; these files read
the HTML and text files that the demo indexes.

     Erik



Re: Indexing multiple languages

Grant Ingersoll
In reply to this post by Tansley, Robert
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages




Re: Indexing multiple languages

Paul Libbrecht
In reply to this post by Tansley, Robert
Robert,

On 2 June 2005, at 21:42, Tansley, Robert wrote:
> It seems that there are even more options --
> 4/ One index, with a separate Lucene document for each (item,language)
> combination, with one field that specifies the language
> 5/ One index, one Lucene document per item, with field names that
> include the language (e.g. title_en, title_cn)
> I quite like 4, because you can search with no language constraint, or
> with one as Paul suggests below.

You can in both cases. In the second, you need to expand the query (i.e.
searching for carrot would search text_en:carrot OR text_cn:carrot),
which, I think, is fair as long as you don't have a two-kilometre list
of languages.

> However, some "non language-specific" data might need to be repeated
> (e.g. dates), unless we had an extra Lucene document for all that.  I
> wonder what the various pros and cons in terms of index size and
> performance would be in each case?  I really don't have enough
> knowledge of Lucene to have any idea...

If you separate the indices you won't, as far as I know, be able to
query them simultaneously (e.g. for some text which is, at the same
time, new enough...).

paul



RE: Indexing multiple languages

Max Pfingsthorn
In reply to this post by Tansley, Robert
Hi

You could use the ParallelReader for this if you have all documents in
all languages. Then the metadata fields can be stored in one of the
parallel indexes, while each language gets its own index...
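
Something like this sketch - the paths are hypothetical, ParallelReader
is quite new so check your Lucene version, and note that it requires
the parallel indexes to stay in lock-step, with document N in every
index describing the same item:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.search.IndexSearcher;

    public class ParallelLanguageIndexes {
        public static void main(String[] args) throws Exception {
            // Shared metadata in one index, each language in its own
            ParallelReader reader = new ParallelReader();
            reader.add(IndexReader.open("/indexes/metadata"));
            reader.add(IndexReader.open("/indexes/text_en"));
            reader.add(IndexReader.open("/indexes/text_cn"));
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... metadata and language fields now searchable together ...
        }
    }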

max




Re: Indexing multiple languages

Doug Cutting
In reply to this post by Tansley, Robert
Tansley, Robert wrote:
> What if we're trying to index multiple languages in the same site?  Is
> it best to have:
>
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so searches
> can be constrained to a particular language
> 3/ separate indices for each language?

I'd use 2/.  In particular, use the same field for the content, title,
etc., even when produced by different analyzers.  Have a "lang" field
that names the language of the document.

At query time, use an analyzer selected by the user's environment (e.g.,
the Accept-Language HTTP header).  If folks are getting false positives,
where a term in another language that means something different is
matching their query, they can use a "lang" pulldown to remove documents
from other languages, implemented as a Lucene Filter.
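
A minimal sketch of such a filter, using the 1.4-era QueryFilter (the
searcher, the user's query, and the "lang" field name are assumed):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class LanguageFilterSearch {
        // Restrict any user query to documents whose "lang" field
        // matches the language chosen in the pulldown.
        public static Hits search(IndexSearcher searcher, Query userQuery,
                                  String lang) throws Exception {
            Filter languageFilter =
                new QueryFilter(new TermQuery(new Term("lang", lang)));
            return searcher.search(userQuery, languageFilter);
        }
    }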

Doug


RE: Indexing multiple languages

Bruce Ritchie
In reply to this post by Tansley, Robert
> Tansley, Robert wrote:
> > What if we're trying to index multiple languages in the same site?  Is
> > it best to have:
> >
> > 1/ one index for all languages
> > 2/ one index for all languages, with an extra language field so
> > searches can be constrained to a particular language
> > 3/ separate indices for each language?
>
> I'd use 2/.  In particular, use the same field for the content, title,
> etc., even when produced by different analyzers.  Have a "lang" field
> that names the language of the document.

We use 2/ and use filters when we want to search only within a
particular language. Just be sure to use the same analyzer when indexing
and searching within a particular language.


Regards,

Bruce Ritchie


Re: Indexing multiple languages

sergiu gordea
In reply to this post by Tansley, Robert
Tansley, Robert wrote:

>What if we're trying to index multiple languages in the same site?  Is
>it best to have:
>
>1/ one index for all languages
>2/ one index for all languages, with an extra language field so searches
>can be constrained to a particular language
>3/ separate indices for each language?
>
>[...]
>
>Approach 3/ seems like it might be the most complex from an
>implementation/code point of view.
>
But this will be the most robust solution. You have to differentiate
between languages anyway, and as you pointed out, you can differentiate
either by adding a Keyword field for the language or by creating
different indexes.

If you need to use complex search strings over multiple fields and
indexes, then I recommend using the QueryParser to build the query. When
you instantiate a QueryParser you will need to provide an analyzer,
which will be different for different languages.
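
For instance, a sketch of picking the analyzer per language (the
language codes and the field name are just an example; GermanAnalyzer
ships with Lucene):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class PerLanguageParsing {
        // Use the same analyzer at query time that the language was
        // indexed with, then let QueryParser build the query.
        public static Query parse(String userInput, String lang)
                throws Exception {
            Analyzer analyzer;
            if ("de".equals(lang)) {
                analyzer = new GermanAnalyzer();
            } else {
                analyzer = new StandardAnalyzer();
            }
            QueryParser parser = new QueryParser("contents", analyzer);
            return parser.parse(userInput);
        }
    }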

I think the differences in performance won't be noticeable between the
2nd and 3rd solutions, but from a maintenance point of view, I would
choose the third solution.

Of course, there are other factors that must be taken into account when
designing such an application: the number of documents to be indexed,
the number of document fields, the index change frequency, the server
load (number of concurrent sessions), etc.

 Hope these hints help you a little,

 Best,

 Sergiu



>

