Index multiple languages with multiple analyzers with the same field

Index multiple languages with multiple analyzers with the same field

Wu, Daniel
Hi,

I know this has probably been asked before, but I was not able to find
it in the mailing list, so forgive me if I am repeating the question.

We are trying to build a search application that supports multiple
languages.  Users can potentially query in any language.  Our first
thought was to index the text of all languages in the same field,
using a language-specific analyzer for each.  As all the data would be
indexed in the same field, a query would simply find results in the
language that matches it.

Looking at the Solr schema, it seems each field can have one and only
one analyzer.  Is it possible to have multiple analyzers for the same field?

Or are there any other approaches that can achieve the same thing?
 
Daniel

Re: Index multiple languages with multiple analyzers with the same field

Mike Klaas
On 28-Sep-07, at 11:13 AM, Wu, Daniel wrote:

> Hi,
>
> I know this probably has been asked before, but I was not able to find
> it in the mailing list.  So forgive me if I repeated the same question.

This thread hashes out the issues in quite a lot of detail:

<http://www.nabble.com/Multi-language-indexing-and-searching-tf3885324.html#a11012939>

-Mike

Re: Index multiple languages with multiple analyzers with the same field

Thom Nelson
I had the same problem, but never found a good solution.  The best
solution would be a more dynamic way of determining which analyzer to
return, such as some kind of conditional expression evaluation in
the fieldType/analyzer element, where either the document or the query
request could be used as the comparison object.

<!-- Proposed syntax only: Solr does not support an "expression" attribute. -->
<fieldtype name="textMultiLingual" class="solr.TextField">
    <analyzer type="query" expression="request.lang == 'EN'">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldtype>

Analyzers could still be cached by adding the expression to the cache key.

Unfortunately I have switched jobs, so I don't have the time or
motivation to do this, but it would be a very useful addition.

- Thom

RE: Index multiple languages with multiple analyzers with the same field

Lance Norskog-2
Other people custom-create a separate dynamic field for each language they
want to support.  The spellchecker in Solr 1.2 wants just one field to use
as its word source, so this fits.

We have a more complex version of this problem: we have content with both
English and other languages. Searching is one problem; we also want to have
spelling correction dictionaries for each language. We have many world
languages which need very different handling and semantics, like CJK
processing. We will have to use the multiple-field trick; I don't think we
can shoehorn our complexity into this technique. It is a valiant effort,
though.

It's possible we could separate out the different-language words in the
document, put each set in its own field (words_en_text, words_sp_text, etc.)
and build the default search field with
        <copyField source="*_text" dest="defaultText"/>
Hmm.....
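
As a rough illustration of the multiple-field approach, a schema.xml fragment along these lines could define one field type per language and funnel everything into a catch-all field.  It is a sketch under assumptions, not a tested configuration: the type names text_en, text_sp and text_general, the dynamic-field patterns, and the reading of the "sp" suffix as Spanish are made up for the example.  Only defaultText, the *_text pattern and the words_en_text style of naming come from the message above.

```xml
<!-- schema.xml fragment (sketch). Type names and patterns are assumptions. -->

<!-- Inside <types>: one field type per language. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_sp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes "sp" stands for Spanish. -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>
<!-- Language-neutral type for the catch-all field: no stemming. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Inside <fields>: words_en_text matches *_en_text, and so on. -->
<dynamicField name="*_en_text" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_sp_text" type="text_sp" indexed="true" stored="true"/>
<field name="defaultText" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- After </fields>: copy the raw text of every *_text field into defaultText. -->
<copyField source="*_text" dest="defaultText"/>
```

Note that copyField copies the original text before analysis, so defaultText is processed only by its own field type.  Languages that need very different tokenization, such as CJK, would still not be handled well by a single catch-all type.  How a wildcard source interacts with dynamic fields should be checked against the Solr version in use.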

Lance

Re: Index multiple languages with multiple analyzers with the same field

dma_bamboo
Same here.

But I can't see how to make this work UNLESS you create an analyzer that
takes a language parameter and applies a set of filters based on it
(and sometimes you want a different, but compatible, set of filters at
index time and at query time). It would work, but in doing so we would
lose the advantage of the Solr config, where we can change and experiment
with alternative compositions of analyzers, tokenizers and filters.

What I've done is create one specific text field per language and one
dismax request handler per language (named after the language or its ISO
code). It is very flexible and appropriate for each language.

For management simplicity I've also created a dismax handler that lets me
query all documents no matter which language they are in. It may be useful
for you too.
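
A setup like this might look roughly as follows in solrconfig.xml.  This is only a sketch: the handler names (dismax_en, dismax_es, dismax_all) and the field names (text_en, text_es) are invented for the example, and the fields are assumed to be defined in schema.xml with a language-specific field type each.  It uses the solr.DisMaxRequestHandler class from Solr 1.2; later releases configure dismax through the defType parameter on solr.SearchHandler instead.

```xml
<!-- solrconfig.xml fragment (sketch). Handler and field names are assumptions. -->

<!-- English only: select with qt=dismax_en -->
<requestHandler name="dismax_en" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">text_en</str>
  </lst>
</requestHandler>

<!-- Spanish only: select with qt=dismax_es -->
<requestHandler name="dismax_es" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">text_es</str>
  </lst>
</requestHandler>

<!-- All languages: the query is run against every per-language field,
     each of which applies its own query analyzer. -->
<requestHandler name="dismax_all" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">text_en text_es</str>
  </lst>
</requestHandler>
```

A client would then pick the handler per request, for example /select?qt=dismax_en&q=running+shoes, and each language's analysis chain can still be changed in schema.xml without touching code.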

Regards,
Daniel Alheiros




Re: Index multiple languages with multiple analyzers with the same field

Ryan McKinley
>
> But I can't see how to fit into this UNLESS you are going to create an
> analyzer to handle a language parameter and based on it would be able to
> apply a set of filters (and sometimes you want a different - but compatible
> - set of filters in indexing/query time).

I don't think this is what you are asking for, but...  in Solr 1.3 you can
configure an UpdateRequestProcessor -- this lets you add custom code
just before AddUpdateCommand is called.  It is a good place to implement
a conditional copyField.  See SOLR-269.
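
For orientation, wiring such a processor into solrconfig.xml could look something like the fragment below.  Treat it as an assumption rather than a reference: the configuration syntax changed while 1.3 was in development, so the element names here follow the chain syntax of released versions and should be checked against SOLR-269 and the example solrconfig.xml for your version.  The class com.example.ConditionalCopyProcessorFactory and the chain name langcopy are hypothetical.  The class stands for custom Java code that would copy a source field into a language-specific field based on, say, a language value in the document.

```xml
<!-- solrconfig.xml fragment (sketch). The first processor class is hypothetical. -->
<updateRequestProcessorChain name="langcopy">
  <!-- Custom code: decide per document which language field to copy into. -->
  <processor class="com.example.ConditionalCopyProcessorFactory"/>
  <!-- Standard processor that actually executes the add/update command. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The custom processor has to come before the processor that runs the update, since that is the point at which the document is handed to the index.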

ryan