Designing a multilingual index


Designing a multilingual index

pagod
Hi everyone!

I'm about to build a search engine that will handle documents in several languages (4 for now, but the number will increase in the near future). To index them properly and offer the best user experience, I automatically recognize the language of each document so that I can apply the appropriate language-dependent analysis: e.g. "the" is recognized as a stopword in an English document, but is interpreted as potentially meaning "thé" in a French document. I also want stemming rules to be applied depending on the document's language, so that a search for "bank" matches "Banken" in a German document and "banks" in an English document -- and not the other way round.
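
To illustrate what I mean by language-dependent analysis, here is a minimal sketch using Lucene's 3.x contrib analyzers (the exact token output depends on the analyzer version, so the comments are assumptions):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalysisDemo {
        static void print(Analyzer analyzer, String text) throws Exception {
            TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.close();
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "the quality of the tea";
            // English analysis should drop "the" as a stopword and stem the rest...
            print(new EnglishAnalyzer(Version.LUCENE_35), text);
            // ...while French analysis keeps "the" (it could be a misspelt "thé").
            print(new FrenchAnalyzer(Version.LUCENE_35), text);
        }
    }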

Now I've thought of two possible architectures to achieve this, perhaps some experienced Lucene user might give me some advice on which approach would be better -- or is there yet another one which would be even more appropriate?

The first method, which I've already partially implemented, is to have several indices -- one per language. Upon recognizing the language of a document, a flag is set in my internal data structure, and the indexing module uses it to decide which index the document should be added to. There is an additional index for documents whose language could not be recognized -- in that case, a simple tokenizer is used.
As I see it, one advantage of this approach is that all indices share the same structure, which makes it easier to build queries. They can also all be searched in parallel, though I'm not sure that's a great advantage. One drawback might be that searching several smaller indices is not as efficient as searching one big index containing all documents. Besides, I'm planning to improve the processing to allow a single document to be assigned several languages (e.g. paragraph-based recognition), and this architecture would mean that, to analyze the parts properly, a multilingual document would have to be split across several indices, either repeating common information (e.g. the document name) or requiring yet another index for language-independent information. This could quickly become rather cumbersome to manage.
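
Concretely, the routing in my indexing module boils down to something like this (a sketch; the writer setup and the language-recognition step are simplified away):

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class LanguageRouter {
        private final Map<String, IndexWriter> writersByLang; // "en", "fr", ..., "xx"

        public LanguageRouter(Map<String, IndexWriter> writersByLang) {
            this.writersByLang = writersByLang;
        }

        // 'lang' is the flag set by the language-recognition step; documents
        // with no recognized language go to the "xx" index (simple tokenizer).
        public void index(Document doc, String lang) throws IOException {
            IndexWriter writer = writersByLang.get(lang);
            if (writer == null) {
                writer = writersByLang.get("xx");
            }
            writer.addDocument(doc);
        }
    }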

The second method I've thought of is to have all languages in the same index and use different analyzers on fields that require analysis. To do that, I was thinking of extending the names of the fields with the names of the languages -- e.g. "content-en" vs "content-fr" vs "content-xx" (for "no language recognized"). Then, using a customized analyzer, the field name would be parsed in the tokenStream method and the proper language-dependent analyzer selected.
The drawback of this method, as I see it, is that the number of fields in the index increases drastically, which in turn makes building queries rather cumbersome -- but still doable, assuming (as is the case) that I know the exact list of languages I'm dealing with. It also means that Lucene would be searching non-existent fields in most documents, as I doubt many of them would contain *all* languages. But it keeps the complete information about a document gathered in one place and requires searching only one index.

As I said, I implemented the first method some time ago and it works fine. I only thought of the second one when I read about PerFieldAnalyzerWrapper, which does just what I need for the second method. Since my index won't be that big at first, I doubt either architecture would prove much more efficient than the other; however, I want a scalable design right from the start, so I was wondering whether some Lucene gurus might give me some insight into which approach they consider better -- or whether there is a different, much better technique I haven't thought of.
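
For reference, the PerFieldAnalyzerWrapper setup for the second method would look roughly like this (a sketch against the 3.x-era API; the analyzer choices are placeholders):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.util.Version;

    public class MultilingualAnalyzer {
        // Unlisted fields (e.g. "content-xx") fall back to the simple default;
        // the same wrapper is passed to both IndexWriter and the query parser.
        public static Analyzer build() {
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(Version.LUCENE_35));
            wrapper.addAnalyzer("content-en", new EnglishAnalyzer(Version.LUCENE_35));
            wrapper.addAnalyzer("content-fr", new FrenchAnalyzer(Version.LUCENE_35));
            wrapper.addAnalyzer("content-de", new GermanAnalyzer(Version.LUCENE_35));
            return wrapper;
        }
    }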

Thanks a lot in advance for your support and ideas!

David





Re: Designing a multilingual index

Paul Libbrecht
David,

I'm doing exactly that.
And I think there's one crucial advantage besides: multilingual queries. If your user requests "segment", you have no way to know which language he is searching in; well, you do have the user's language(s) (through the browser's Accept-Language header, for example), so you can assume he meant to search in French but would also accept matches in other languages, just less boosted.

So I "expand" the query "segment" in a French environment to:

   title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-fr:segment^1.2 wor text-en:segment^1.1 ...

(wor is my name for the weighted-or, which is the normal behaviour of a "should" boolean query)

Surprisingly, I haven't seen many people talk about "query expansion", but I think it is rather systematic and could become more a part of the culture of search engines...

paul



Re: Designing a multilingual index

pagod
Hi,

thanks, Paul, for your input. I'm going to try the "localized field" variant and see how it works for me.

I think your idea of automatically boosting the user's language is neat, but it should definitely be possible to disable this boosting... Most users have no idea about the language settings in their browser, which drive the contents of the "Accept-Language" header, and e.g. here in Switzerland there's many a foreigner whose preferred language is not French, German or Italian, so forcing a boost on the user could well result in a poor user experience.

Does anyone have any technical arguments why the one (several indices) or the other (localized fields in a single index) method might be better?

Cheers,

David




Re: Designing a multilingual index

henrib
In reply to this post by pagod
Hi,
I worked some time ago on a similar system (using Solr) and went the multiple-indices route (the multicore feature in Solr). In our case, the "same" document could exist in different languages: different localized versions of the same information (same Solr unique id for each l10n version).

This allowed us to have the same index structure across locales but different settings for each (synonyms, stemmers, etc.). Maintenance was easier this way: when refining or updating the settings (say, adding synonyms or stemmers), you may need to reindex, and smaller indices allow faster deployments. It's also "dead easy" to add a new language (especially compared to the one-index solution). It also makes replication or partitioning easier. Overall, IMO, this is a more scalable architecture than the single-index one.

Users were able to set which languages they were "fluent" in (the default being the browser locale), so queries would only be performed in those, and results were "clustered" per locale (no need to return results that cannot be understood...). Besides, IMO, scoring / ordering documents in different languages is a bit like comparing apples and oranges.

Finally, query expansion can also be used in the multiple indices case and might even use automated/guided translation.

In my experience, multiple indices had many advantages over the single-index solution, be they functional or operational. YMMV.
Hope this helps,
Henrib
 

Re: Designing a multilingual index

Paul Libbrecht
How?

paul


Le 01-avr.-10 à 14:19, henrib a écrit :

> Finally, query expansion can also be used in the multiple indices case and might even use automated/guided translation.



Re: Designing a multilingual index

henrib

By issuing multiple queries, one against each localized index, results being clustered by locale.
You can further refine by translating the end-user input query terms for each locale and issuing "translated" queries against the respective indices. I've seen satisfying results with "key" terms dictionaries.
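
In code, that is roughly the following (a sketch; one searcher per localized index is assumed to already be open):

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class LocalizedSearch {
        // e.g. "en" -> searcher over the English index, "fr" -> the French one...
        private final Map<String, IndexSearcher> searchersByLocale;

        public LocalizedSearch(Map<String, IndexSearcher> searchersByLocale) {
            this.searchersByLocale = searchersByLocale;
        }

        // One query per locale the user understands; results stay clustered.
        public Map<String, TopDocs> search(Query query, Iterable<String> locales)
                throws IOException {
            Map<String, TopDocs> byLocale = new LinkedHashMap<String, TopDocs>();
            for (String locale : locales) {
                IndexSearcher searcher = searchersByLocale.get(locale);
                if (searcher != null) {
                    byLocale.put(locale, searcher.search(query, 10));
                }
            }
            return byLocale;
        }
    }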





Re: Designing a multilingual index

pagod
In reply to this post by henrib
Hi,

thanks for sharing your experience with us. I'm happy to see that both methods I've thought of are apparently sensible ;-)

However, though it might be due to my lack of experience in the domain, some of your arguments in favor of a multi-index solution seem to me to be equally compatible with a single index, or to apply only in a particular situation:

> Maintenance was easier this way; when refining/updating the settings (say adding synonyms or stemmers for instance), you may need to reindex and smaller indices allow faster deployments.
If I understand this correctly, that is true if you're indexing distinct collections of documents that are each in a single language known in advance. In that case, you would of course only need to reindex one of these collections into the corresponding index to update it. In my case, though, there is no such separation, and only upon processing a document can I decide which language it is in. I therefore cannot decide to update only documents in one language; it's an "all or nothing".

> It's also "dead-easy" to add a new language (esp. compared to the one index solution).
I'm not really sure why adding new languages should be complicated in the one-index solution: the languages I can process (i.e. languages for which I have recognition data and the proper analyzer) are listed in a configuration file. Upon recognizing one of these languages, the module stores it in the internal data structure, which is eventually passed to the indexing module. The latter retrieves the language from the structure and uses it to create the corresponding fields in the index dynamically (e.g. "content-de", "title-en", etc.). It's all fully automatic.
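
In code, the indexing module essentially does this (a sketch; 3.x-era Field API, field names as in my earlier mail):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DynamicFields {
        // 'lang' comes from the recognition step ("en", "de", ... or "xx");
        // field names are derived from it, so a new language needs no schema change.
        public static Document build(String lang, String title, String content) {
            Document doc = new Document();
            doc.add(new Field("title-" + lang, title,
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("content-" + lang, content,
                    Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }
    }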

> It also makes replication or partitioning easier.
Not sure what you mean by that...

> Users were able to set in which language they were "fluent" (default being browser locale) so queries would only be performed in those and results "clustered" per locale (no need to return results that can not be understood...). Besides, IMO, scoring / ordering documents in different languages is a bit like comparing apples and oranges.

So if I understand correctly, you'd always search in only some of the indices? In that case, I can understand why the multi-index solution works better for you. However, assuming most of my users speak at least English *and* one, two or three of the other languages, all searches have to be conducted in all languages. Besides, the various languages are not localized versions of the same content; documents may really be anything in any language, depending on the person who wrote them and their mood at the time of writing. Some documents also contain parts in English and parts in German... I'm definitely thinking about making it possible to enable/disable some languages, but parallel searching will definitely be required.

> Besides, IMO, scoring / ordering documents in different languages is a bit like comparing apples and oranges.
Not too sure about that. If, for instance, you were to search for "firewall" in a German/English index, the word may appear in documents in both languages. Lucene's ranking algorithm is based on the number of tokens and the number of occurrences in fields, and while of course the ratio may vary (e.g. German tends to collate several words into one single compound, resulting in a single token, while English rather uses phrases, resulting in several tokens), I still think, assuming the user can understand both, that it makes sense to rank a short German document where "firewall" occurs 10 times higher than an English document where the same word occurs only 5 times.

> Finally, query expansion can also be used in the multiple indices case and might even use automated/guided translation.
I do agree with that, and I guess query expansion is much easier when dealing with a single, consistent index structure. I think writing an algorithm to automatically expand a query to all languages should be OK, but I'm worried about the performance of such a query if the number of languages grows larger...

One other problem I'm seeing with the multi-index solution is the ranking of documents that span several indices (multilingual documents). If, say, my search term matches a document in the English index and the same document in the French index (which would quite often be the case for e.g. proper names), then how do I go about merging the two rankings? (I don't want to display the same result twice.) I think using a single index would solve that problem, since the ranking would already take all fields (hence all languages) into account.

Cheers,

David


Re: Designing a multilingual index

henrib
In reply to this post by pagod
Hi David,

pagod wrote
... apply only in a particular situation:
Very true, as is often the case in the IR field :-). In our case, the "same" document existed in different locales; these were localized technical docs, which also meant the dictionary of (important) terms was limited and could be used to influence scoring.

pagod wrote
henrib wrote
This allowed to have the same index structure across locales but different settings for each (synonyms, stemmers, etc). Maintenance was easier this way; when refining/updating the settings (say adding synonyms or stemmers for instance), you may need to reindex and smaller indices allow faster deployments.
If I understand this correctly, that is true if you're indexing distinct collections of documents that are all in the same language, which you know in advance. In that case, you would of course only need to reindex one of these collections into the corresponding index to update it.
Correct.
pagod wrote
 In my case though, there is no such separation and only upon processing a document can I decide what language it is in. I therefore cannot decide to update only documents in one language, it's a "all or nothing".
As you write later, you perform language detection on the document before you index it. Thus you know -- the virtue of one lang <=> one index -- which index a given document needs to go into, and you can maintain the list of documents that exist for a given language (intrinsically, those in the 'FR' index are all in French). Provided you still have access to the raw content, you can then reindex just those.

pagod wrote
henrib wrote
It's also "dead-easy" to add a new language (esp. compared to the one index solution).
I'm not really sure why adding new languages should be complicated in the one-index solution: the languages I can process (i.e. languages for which I have recognition data and the proper analyzer) are listed in a configuration file.
Smart solution! With Solr, if you need to add a new field/language that was not in the original schema/list, you need to reindex everything. In the multi-index solution, you just create a new index (Solr core).
pagod wrote
 
Upon recognizing one of these languages, the module stores it in the internal data structure, which is then eventually passed to the indexing module. The latter retrieves the language from the structure and uses it to create the corresponding fields in the index dynamically (e.g. "content-de", "title-en" etc). It's all really automatic.
OK. I'm biased by Solr, which needs a schema defining all fields beforehand; you can't just decide to add a document which "declares" a new field. You need a new schema / index / core that declares the field, etc.

pagod wrote
henrib wrote
It also makes replication or partitioning easier.
Not sure what you mean by that...
Our solution was built on top of Solr; all the query rewriting/expansion occurred before Solr's localized searches. Since our server was sending the queries and formatting the results, it was very easy to put indexes on different machines. Solr also comes with replication and sharding (if you need those); having multiple indexes makes things a tad easier to configure with respect to the number of speakers of a given language.

pagod wrote
So if I understand correctly you'd always only search in one of the indices? In that case, I can understand why the multi-index solution works better for you. However, assuming most of my users speak at least English *and* one, two or three of the other languages, all searches have to be conducted in all languages.
We were issuing one query per language the user selected as able to comprehend. So if you spoke English, German and French, 3 queries were performed. So yes, all searches have to be conducted in all needed languages.

pagod wrote
 Besides, the various languages are not localized versions of the same content, they may really be anything in any language, depending on the person who wrote them and their mood at the time of writing. Some documents also contains parts in English and parts in German... I'm definitely thinking about making it possible to enable/disable some languages, but parallel searching will definitely be required.
Mixed-language documents are not something I had to cope with... But I suppose whatever solution you use to determine which part of the doc is in which language can be used by both indexing solutions.
pagod wrote
> Besides, IMO, scoring / ordering documents in different
> languages is a bit like comparing apples and oranges.
...Lucene's ranking algorithm is based on the number of tokens and the number of occurrences in fields,...
Scoring is indeed a complex issue; tf/idf is not the sole scoring method (BM25F comes to mind). Since the score is a ratio (relative to the best score), taking a practical stance over an academic one, you can still use it as an ordering norm for the sake of a better user experience, even if it comes from different indexes. You might also want to bias result ordering with other factors like last-updated, ranking, etc.

pagod wrote
> Finally, query expansion can also be used in the multiple indices case and
> might even use automated/guided translation.
I do agree about that, and I guess query expansion is much easier when dealing with a single, consistent index structure. I'm thinking writing an algorithm to automatically expand a query to all languages might be ok, but I'm worried about the performance of performing such a query if the number of languages grows larger....
This algorithm would work for single and multiple indexes; at least in the multiple-index case, you can partition easily and, should the need arise, put localized indexes on different machines.

pagod wrote
One other problem I'm seeing with the multi-index solution is the ranking of documents that are spanned across several indices (multilingual documents). If say my search term matches a document in the English index and the same document in the French index (which would quite often be the case for e.g. proper names), then how do I get about mixing the two rankings? (as I don't want to display the same result twice) I think using a single index would solve that problem, since the ranking would already take all fields (hence all languages into account).
If we use a unique id for the same doc in different indexes (another Solr-recommended feature), nothing stops you from aggregating the scores coming from each index (a weighted average, for instance).
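
A rough sketch of such an aggregation (the per-locale score maps and the weights are hypothetical inputs, keyed by the shared unique id):

    import java.util.HashMap;
    import java.util.Map;

    public class ScoreAggregator {
        // Weighted average of the scores a document got in each index,
        // so a multilingual document is shown once with a single score.
        public static Map<String, Float> merge(
                Map<String, Map<String, Float>> scoresByLocale, // locale -> (id -> score)
                Map<String, Float> weights) {                   // locale -> weight
            Map<String, Float> sums = new HashMap<String, Float>();
            Map<String, Float> weightSums = new HashMap<String, Float>();
            for (Map.Entry<String, Map<String, Float>> perLocale : scoresByLocale.entrySet()) {
                float w = weights.get(perLocale.getKey());
                for (Map.Entry<String, Float> doc : perLocale.getValue().entrySet()) {
                    Float s = sums.get(doc.getKey());
                    Float ws = weightSums.get(doc.getKey());
                    sums.put(doc.getKey(), (s == null ? 0f : s) + w * doc.getValue());
                    weightSums.put(doc.getKey(), (ws == null ? 0f : ws) + w);
                }
            }
            for (Map.Entry<String, Float> e : sums.entrySet()) {
                e.setValue(e.getValue() / weightSums.get(e.getKey()));
            }
            return sums;
        }
    }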

Anyway, I'm relating an experience in a field that might be too different from yours to be applicable.
Hope it helps.
Cheers,
Henrib

Re: Designing a multilingual index

Paul Libbrecht
In reply to this post by henrib
Le 01-avr.-10 à 16:29, henrib a écrit :
> By issuing multiple queries, one against each localized index, results being clustered by locale.
> You can further refine by translating the end-user input query terms for each locale and issue "translated" queries against the respective indices.
> I've seen satisfying results with "key" terms dictionaries.


What's funny here is how "uncertainty" can be pushed to different levels.
I believe automated translation only makes sense if you know the exact source language, which is often not the case for me, so I'm merging all results matching all languages; add to that the possibility of typos and phonetic matching...

Btw, can anyone report on successful language matching using soundex, metaphone or double-metaphone for phonetic matching?
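
(To make the question concrete, this is the kind of matching I mean, sketched with commons-codec's DoubleMetaphone; the encodings in the comments are what I would expect, not verified output:)

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class PhoneticDemo {
        public static void main(String[] args) {
            DoubleMetaphone dm = new DoubleMetaphone();
            // Spelling variants of the same name should encode identically...
            System.out.println(dm.doubleMetaphone("Meier"));  // presumably "MR"
            System.out.println(dm.doubleMetaphone("Meyer"));  // presumably "MR"
            // ...which is what a phonetic field would exploit at search time.
            System.out.println(dm.isDoubleMetaphoneEqual("Meier", "Meyer"));
        }
    }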

thanks in advance

paul




Re: Designing a multilingual index

henrib
I agree that if you don't know the "source" language -- or can't determine it -- there is a lot of uncertainty in trying to transmogrify the query from one language to another! Tika and Nutch do have language-detection tools, though (n-gram profiles, if I'm not mistaken). And you can also interact with the end user before issuing the query to confirm the language if necessary (a "did you mean" kind of feature).
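
With Tika, for instance, detection is a one-liner (a sketch; assumes tika-core with its bundled n-gram profiles on the classpath):

    import org.apache.tika.language.LanguageIdentifier;

    public class LangDetect {
        public static void main(String[] args) {
            LanguageIdentifier identifier = new LanguageIdentifier(
                "Le renard brun rapide saute par-dessus le chien paresseux");
            System.out.println(identifier.getLanguage());         // expected: "fr"
            System.out.println(identifier.isReasonablyCertain()); // confidence check
        }
    }
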
Assuming you can determine the query language and you have "dictionaries" of important terms per field, I tend to think you increase precision.

The simple route is to ignore the language, use n-grams, forget stemmers et al. and just fire; recall will likely be good, precision not so much.

Cheers

Henrib

Re: Designing a multilingual index

heikki
hello,

I would like to have your opinions on the impact on relevance scoring in the scenario where multiple languages are indexed in a single index.

>> Besides, IMO, scoring / ordering documents in different languages is a bit like comparing apples and oranges.
> Not too sure about that. If, for instance, you were to search for "firewall" in a German/English index, the word may appear in documents in both languages. Lucene's ranking algorithm is based on the number of tokens and the number of occurrences in fields, and while of course the ratio may vary (e.g. German tends to collate several words into one single compound, resulting in a single token, while English rather uses phrases, resulting in several tokens), I still think, assuming the user can understand both, that it makes sense to rank a short German document where "firewall" occurs 10 times higher than an English document where the same word occurs only 5 times.
In our case, it is "known" in which language the user is searching (because he tells us, and if he doesn't, we use the current GUI language). Results are returned so that those in the requested language are ordered on top and, within that, ordered by relevance. Results in other languages are also returned, presented after the requested-language results and ordered by relevance.

If the results in the requested language contain, say, one which has term A and one which has term B, their positions in the relevance ranking (within the requested-language results on top) can be influenced by occurrences of terms A and B in the other languages, if a single search is used.

I agree with the apples/oranges remark: if a term occurs in more than one language, its IDF is likely different for each language, so to get the best relevance ranking there should be separate indexes for each language. And the searches should be really separate searches (no MultiSearcher, which would produce combined relevance scores). So the results should also be presented as several, separate result sets.

Does anyone have experience with this? Opinions? Is the improved relevance per language worth the "hassle" of having separate indexes, doing separate searches and presenting results per language? We already take care of using appropriate stopwords/different analyzers when indexing and searching a particular language, but that's a different issue, obviously.

thanks in advance,

Heikki Doeleman

Re: Designing a multilingual index

Paul Libbrecht-4

Le 3 janv. 2012 à 13:56, heikki a écrit :

> In our case, it is "known" in which language the user is searching (because
> he tells us, and if he doesn't, we use the current GUI language).

On the web it is often hard to trust such signals (e.g. because of people working in multiple languages, internet cafés...) but... it is your choice.

> Results
> are returned so that results in the requested language are ordered on top,
> and within that, ordered by relevance. Results in other languages are also
> returned, and presented after the requested-language results, ordered by
> relevance.

After?
Would "shallow matches" in the right language come after "precise matches" in a wrong language?

> If the results in the requested language contain say one which has term A
> and one which has term B, their positions in the relevance ranking (within
> the requested-language results on top) can be influenced by occurrences of
> terms A and B in the other languages, if a single search is used.
>
> I agree to the apples/oranges remark: if a term occurs in more than one
> language, likely its IDF frequency is different for each language, so to
> have the best relevance ranking there should be separate indexes for each
> language. And searches should be really separate searches (no MultiSearcher
> which would produce combined relevance scores). So the results should also
> be presented as several, separate result sets.


I believe the right solution for this is simple: use different fields per language.

In both Solr and plain Lucene, using different fields allows different analyzers; that's how you want things (e.g. a different stemmer per language).

Using different indexes is certainly a hassle, different fields not really.

The important bit is to use query expansion.
Given a user query (with params or not, with text queries), expand it to a query where the "normal text" is expected to be in the right language, but may also be in one of the other languages (those the browser lists and your platform supports), with less weight of course.

Query expansion is done by post-processing the result of the query-parser in my case.

Then you can also differentiate fields by match precision: make one field with exact matches (using the whitespace tokenizer), one field with stemmed matches (e.g. using the Porter family), and one field with phonetic matches.
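
Field-wise, that gives something like this (a sketch against the 3.x-era API; the phonetic analyzer is left as a parameter since it depends on the phonetic filter you pick):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.util.Version;

    public class PrecisionTiers {
        public static Analyzer build(Analyzer phoneticAnalyzer) {
            // Exact matches: whitespace tokenization only (the default below).
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(Version.LUCENE_35));
            // Stemmed matches: a Porter-family stemmer.
            wrapper.addAnalyzer("text-stemmed-en",
                new SnowballAnalyzer(Version.LUCENE_35, "English"));
            // Phonetic matches: whatever phonetic chain you choose.
            wrapper.addAnalyzer("text-phonetic-en", phoneticAnalyzer);
            return wrapper;
        }
    }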

Hope it helps.

paul


Re: Designing a multilingual index

heikki
hi,

thanks for your response:

> On the web it is often hard to trust such signals (e.g. because of people working in multiple languages, internet cafés...) but... it is your choice.

our web app has a language selector for the user to choose the GUI language

> After?
> Would "shallow matches" in the right language come after "precise matches" in a wrong language?

yes, that's the idea. Either that, or present the results per language in separate result sets (with sorting options per result set, etc.)

> In both Solr and plain Lucene, using different fields allows different analyzers; that's how you want things (e.g. a different stemmer per language).

yes, in the single-index solution we do use different analyzers for different fields

> The important bit is to use query expansion.
> Given a user query, expand it to a query where the "normal text" is expected to be in the right language, but may also be in one of the other languages (those the browser lists and your platform supports), with less weight of course.

something like that is what we do now in the single-index solution -- results in the requested language are boosted enough that they're always on top

I don't think, though, that this addresses my main point: the frequency of terms in different domains (here, different languages) is different for each domain. This means that if the domains are chunked together in one index, the IDF value for a term is less "accurate" than if multiple, separate indexes were used. A term is more or less frequent in one domain or another for a reason... Relevance ranking is impacted by that, and is more accurate if separate indexes are used -- I think this seems logical.

I just don't know how much impact it really has, and whether it is worth dealing with by presenting separate result sets from separate index searches...


thanks for your reply!

Heikki Doeleman






Re: Designing a multilingual index

Paul Libbrecht-4
Heikki,

it does solve your main concern: a term in Lucene is a pair of a token and a field name.
The term frequency is thus the frequency of a token within a field.

So the term-frequency of text-stemmed-de:firewall is independent of the term-frequency of text-stemmed-en:firewall (for example).

But using the query expansion mechanism, it is likely that both term queries will be present and both contribute to the score. Which is correct, I think.
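
You can check this directly against an index (a sketch; the field names follow the example above):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DocFreqCheck {
        // docFreq is counted per term, i.e. per (field, token) pair, so the
        // German and English statistics do not pollute each other.
        public static void print(IndexReader reader) throws IOException {
            System.out.println(reader.docFreq(new Term("text-stemmed-de", "firewall")));
            System.out.println(reader.docFreq(new Term("text-stemmed-en", "firewall")));
        }
    }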

paul



Re: Designing a multilingual index

heikki
hi Paul,

yes, but my concern isn't the term frequency but rather the inverse document frequency, which is also used in the relevance score and which takes into account all documents in the index... In this way the relevance score of one document is influenced by the contents of all other documents in the same index. This is why it seems logical to me that if different domains use separate indexes, the relevance scoring is more accurate.


Kind regards,
Heikki Doeleman





Re: Designing a multilingual index

Paul Libbrecht-4
I think the idf is also about terms and not about bare tokens.
Maybe an expert can confirm my belief, or we'll have to invent a test.

paul



Re: Designing a multilingual index

Robert Muir
On Tue, Jan 3, 2012 at 10:10 AM, Paul Libbrecht <[hidden email]> wrote:
> I think the idf is also about terms and not about tokens.
> Maybe an expert can confirm my belief or we have to invent a test.
>

idf is computed from docFreq and maxDoc.

docFreq is per-field, maxDoc is not. This might not even matter though.

If you are concerned about it in a situation where you have multiple languages in different fields and some are sparse, you can look at Lucene's trunk, which has a per-field maxDoc (Terms.docCount): the count of all documents that have at least one indexed term for the field.
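
e.g. something like this (a sketch against trunk-era APIs, so names may shift; the idf formula is the classic DefaultSimilarity one):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;

    public class PerFieldIdf {
        // Classic idf, but with the per-field doc count standing in for
        // maxDoc, so sparse language fields aren't penalized by index size.
        public static double idf(IndexReader reader, String field, String token)
                throws IOException {
            int docFreq = reader.docFreq(new Term(field, token));
            Terms terms = MultiFields.getTerms(reader, field);
            int fieldDocCount = (terms == null) ? 0 : terms.getDocCount();
            return 1 + Math.log(fieldDocCount / (double) (docFreq + 1));
        }
    }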

--
lucidimagination.com
