Multilingual - Search against the appropriate field

Multilingual - Search against the appropriate field

Saïd Radhouani
Hi,

I know this topic has been discussed many times in the (distant) past, but I wonder whether there are newer best practices or tendencies.

In my application, I'm dealing with documents in different languages. Each document is monolingual; it has some fields containing free text and a set of fields that do not require any text analysis. For the free text, we need to apply a specific analysis based on the language of the document.

I'm in favor of using a single index for all the documents instead of one index per language (any objection?). Thus, in schema.xml, I need to declare a separate field for each language (text_fr, text_en, etc.), each with its own appropriate analysis. Then, during indexing, I need to assign the free-text content of each document to the appropriate field. As a result, for each document, only one of the free-text fields would be populated.
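For illustration, the per-language declarations in schema.xml could look like the following sketch (the field-type names and analyzer chains here are assumptions, to be adapted to your actual languages):

```xml
<!-- One field per language, each with its own analysis chain -->
<field name="text_en" type="text_en" indexed="true" stored="true"/>
<field name="text_fr" type="text_fr" indexed="true" stored="true"/>

<!-- English field type with English-specific stopwords and stemming -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- French field type: elision handling (l', d', ...) plus French stemming -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

At index time, your indexing code would then route each document's free text into the one field matching its language.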

My question is, at search time, what is the best solution to search against the appropriate field?

I know that using dismax, we can define in "qf" the set of fields we want to search against, e.g., <str name="qf">text_fr text_en</str>

With this solution, does Solr choose the appropriate analysis for the query? That is, if a query is compared to a document having English free text (text_en is populated), does Solr analyze the query as if it were in English?

One problem with this approach is that each query will be compared to all the available documents; i.e., a query in English would also be compared to documents in French. As far as I know, if we know the query language, this can be avoided either by searching against the appropriate field (e.g., text_fr:query) or by using a filter to select only documents having text in that language. Am I correct? Or is there a better solution?
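To make the two variants concrete, the requests could look like this (a sketch; the host/core and the stored "language" metadata field are assumptions):

```
# Query language known to be French: search only the French field
http://localhost:8983/solr/select?defType=dismax&qf=text_fr&q=cheval

# Alternatively, search several fields but restrict the result set
# with a filter query on a "language" metadata field
http://localhost:8983/solr/select?defType=dismax&qf=text_fr%20text_en&q=cheval&fq=language:fr
```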

Thanks,
-Saïd

Re: Multilingual - Search against the appropriate field

Jan Høydahl / Cominvent
Hi,

I have chosen the same approach as you, indexing content into text_<language> fields with custom analysis, and it works great. Solr does not have any overhead with this even if there are hundreds of languages, due to the schema-less nature of Lucene.

And if you know which language is being searched, you can select only the fields in question, and you'd still be as fast as in the mono-language case. But you'd only get documents in that language returned.

Say you want to match across languages; it could be that you search for "obama", which is written the same in all languages. How do you achieve this? I see two approaches:
a) Search across all languages with proper analysis, as you suggest: qf=text_fr text_en^10 (you can even boost the preferred languages).
b) Index all content in a "text_all" field with no stemming involved and search with qf=text_all (you will match "obama" in all languages but lose stemming).
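For b), the "text_all" field can be filled with copyField rules in schema.xml (a sketch; "text_unstemmed" is an assumed field type that does only tokenization and lowercasing, no stemming):

```xml
<!-- Unstemmed catch-all field; only indexed, no need to store it -->
<field name="text_all" type="text_unstemmed" indexed="true" stored="false" multiValued="true"/>

<!-- Whichever per-language field is populated gets copied into text_all -->
<copyField source="text_en" dest="text_all"/>
<copyField source="text_fr" dest="text_all"/>
```

Explicit per-language sources are listed here rather than a wildcard, so that text_all itself is not accidentally matched as a copy source.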

My feeling is that a) would work if you have a limited set of languages, but b) might be necessary if you have dozens of languages to search across, due to the reduced query performance of such a large disMax query.

Of course, with a) there may be ambiguities where an English word gets stemmed and hits the same stem as a totally different French word. I don't have any hands-on examples, but I'm sure the issue exists. Then it is probably better to search the other languages unstemmed, as a hybrid approach:

c) Search the query language stemmed and all others unstemmed (qf=text_en^10 text_all, giving increased recall).
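If the query language is detected per request, approach c) could also be baked into per-language request handlers in solrconfig.xml (a sketch; the handler name is an assumption):

```xml
<requestHandler name="/search_en" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- English stemmed and boosted; everything else matched unstemmed via text_all -->
    <str name="qf">text_en^10 text_all</str>
  </lst>
</requestHandler>
```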

The downside of a text_all field is that, in the worst case, you almost double the size of your index.

Then you have the issue of displaying the results in the front end.
Which title do you pick: title_en or title_fr? Here I also see two solutions, and I have tried both:
1) Add a title_display field which is stored, while the title_<language> fields are only indexed, not stored. Use title_display in the front end.
2) Make a wrapper around the QueryResult class, so that when the front end asks for "title", you pull out title_XY, where XY comes from the document's "language" metadata.
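Since only one title_<language> field is populated per document, option 1) can be implemented with plain copyField rules (a sketch; field and type names are assumptions):

```xml
<!-- Per-language title fields: indexed for search, not stored -->
<field name="title_en" type="text_en" indexed="true" stored="false"/>
<field name="title_fr" type="text_fr" indexed="true" stored="false"/>

<!-- Display field: stored for the front end, not searched -->
<field name="title_display" type="string" indexed="false" stored="true"/>

<!-- Only the populated language field contributes a value per document -->
<copyField source="title_en" dest="title_display"/>
<copyField source="title_fr" dest="title_display"/>
```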

I think which one you choose is a matter of taste; each has its pros and cons.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1. juli 2010, at 12.26, Saïd Radhouani wrote:

> […]


Re: Multilingual - Search against the appropriate field

Saïd Radhouani
Hi Jan,

I totally agree with what you said.

In a), you talked about boosting. I guess you meant to boost at the client side, right?

I still have a question:

>> Does Solr choose the appropriate analysis for the query? I.e., if a query is compared to a document having English free text (text_en is populated), does Solr analyze it as if it were in English?


Thanks,
-Saïd

On Jul 1, 2010, at 1:26 PM, Jan Høydahl / Cominvent wrote:

> […]
