Help with multi-lang searches

Sambhav Kothari (BLOOMBERG/ LONDON)
Hi,

We have a problem with searches with multiple languages.
Our schema looks something like this:

____
field_en = English content for field

field_es = Spanish

field_it = Italian

etc.
____

When a user searches for a keyword, e.g. "brexit", they can also specify which languages they want to see in the response, and the query is run against the fields for all the requested languages.

The issue is that for "brexit", Italian results get boosted more: a term like "Brexit" is rare in the Italian field, so its IDF shoots up and less relevant Italian docs rank higher than the English ones.

Is there some way to deal with this problem?

The current solutions we can think of:

1. Create a catch-all copyField and use that to score the docs. But this creates problems when a word is present in another language (e.g. English) and not in the language of the resulting document (Italian). We would also have to pay for the extra disk space of the copyField and deal with analysis problems across multiple languages.
2. Create a new scorer called "IDFGroupScorer" that wraps multiple fields and computes an aggregate IDF (by averaging or taking the min/max) across the fields in the group.
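
As a rough sketch of what option 2 would compute (plain Python, not an actual Lucene Similarity or Scorer): the IDF of the term is calculated per field and then collapsed into one shared value for the whole group. The function names and the document counts below are made up for illustration, and the IDF formula is the BM25-style one, which is an assumption about the similarity in use.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def group_idf(field_stats, mode="avg"):
    """Aggregate one term's IDF across a group of fields.

    field_stats maps field name -> (doc_freq, doc_count) for that term.
    Fields with no documents are skipped.
    """
    idfs = [bm25_idf(df, n) for df, n in field_stats.values() if n > 0]
    if not idfs:
        return 0.0
    if mode == "min":
        return min(idfs)
    if mode == "max":
        return max(idfs)
    if mode == "avg":
        return sum(idfs) / len(idfs)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical numbers: "brexit" is common in field_en and rare in field_it.
stats = {"field_en": (5000, 100000), "field_it": (3, 20000)}
per_field = {f: bm25_idf(df, n) for f, (df, n) in stats.items()}
shared = group_idf(stats, mode="min")
```

With a shared value, a match in field_it no longer gets a larger IDF than a match in field_en; only term frequency and length normalization would still differ per field.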

Any thoughts on any other solutions or any suggestions on how we could possibly implement the IDFGroupScorer?

Thanks,

Sambhav

Re: Help with multi-lang searches

Alexandre Rafalovitch
Additional possibilities:
1) omitNorms and maybe omitTermFreqAndPositions for the fields to
avoid frequency of term mattering
http://lucene.apache.org/solr/guide/7_5/defining-fields.html#optional-field-type-override-properties
2) Constant score:
http://lucene.apache.org/solr/guide/7_5/the-standard-query-parser.html#constant-score-with
3) If your languages are ranked (English first, Italian after), you
can boost the English field
4) https://www.manning.com/books/relevant-search may have some ideas.
The examples use ES, but it also has Solr discussion, and Solr now has
some additional capabilities to match (e.g. the eDisMax sow parameter).
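
A minimal sketch of how points 2) and 3) might look as request parameters, built in Python. It assumes the search term is a single plain keyword that needs no escaping; the helper names and the boost values are invented for the example.

```python
from urllib.parse import urlencode


def constant_score_query(term, fields, score=1.0):
    """One constant-score clause per field, e.g. field_en:brexit^=1.0."""
    return " OR ".join(f"{field}:{term}^={score}" for field in fields)


def boosted_edismax_params(term, field_boosts):
    """eDisMax parameters with a per-field boost in qf."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": term, "qf": qf}


# Point 2: every matching field contributes the same fixed score.
q = constant_score_query("brexit", ["field_en", "field_it"])

# Point 3: rank English ahead of Italian with a higher boost (values are guesses).
params = boosted_edismax_params("brexit", {"field_en": 2.0, "field_it": 1.0})
query_string = urlencode(params)
```

The constant-score form throws away IDF (and everything else) for those clauses, so it fits best when a match/no-match signal per language is enough.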

Hope it helps,
   Alex.



On Mon, 22 Oct 2018 at 11:56, Sambhav Kothari (BLOOMBERG/ LONDON)
<[hidden email]> wrote:

>
> Hi,
>
> We have a problem with searches with multiple languages.

Re: Help with multi-lang searches

Tim Casey
Hi Sambhav,

Calculate the frequency (as a percentage) of each letter pair per
language in the index.  Then, given the letter pairs in the incoming
token, find the closest "match" among the languages in the index.

Even on a small number of tokens you will get close to the intended
language.  You can also calculate the "source language model" in an
index-neutral way, say from a known corpus of language-specific tokens
plus their frequencies.
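
A small sketch of the letter-pair idea in Python, assuming the per-language profiles are built from a list of sample tokens. The tiny word lists are placeholders, not a real corpus, and summing the pair frequencies is just one simple way to pick the closest match.

```python
from collections import Counter


def letter_pairs(token):
    """Adjacent letter pairs of a token, e.g. "the" -> ["th", "he"]."""
    token = token.lower()
    return [token[i:i + 2] for i in range(len(token) - 1)]


def bigram_profile(tokens):
    """Relative frequency of each letter pair over a list of tokens."""
    counts = Counter(pair for t in tokens for pair in letter_pairs(t))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}


def closest_language(token, profiles):
    """Language whose profile gives the token's letter pairs the most weight."""
    def score(lang):
        profile = profiles[lang]
        return sum(profile.get(pair, 0.0) for pair in letter_pairs(token))
    return max(profiles, key=score)


# Placeholder "corpora"; a real profile needs far more text per language.
profiles = {
    "en": bigram_profile(["the", "this", "that"]),
    "it": bigram_profile(["zio", "pizza", "grazie"]),
}
guess = closest_language("thing", profiles)
```

On ties (including a token that matches no profile at all) this returns whichever language comes first in the dict, so a real version would want a threshold or an "unknown" result.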

Generally this is a tricky thing to do.  Any kind of recall/precision
trade-off requires measuring the results for the given data, so it is
hard to give general advice.  Sometimes the language segmentation is not
done on a document (index term here) basis; instead, the incoming data is
segmented by something like a paragraph or sentence.  So there is that as
well.

I would expect this to be done where the source document is stored raw.
Then, alongside the document, there is a set of probable languages.  From
there, you can pivot the results based on the user's expectations.

tim

On Mon, Oct 22, 2018 at 11:18 AM Alexandre Rafalovitch <[hidden email]>
wrote:

> Additional possibilities: