Multi-language indexing and searching


Multi-language indexing and searching

dma_bamboo
Hi,

I'm just starting to use Solr and so far, it has been a very interesting
learning process. I wasn't a Lucene user, so I'm learning a lot about both.

My problem is:
I have to index and search content in several languages.

My scenario is a bit different from others I've already read in this
forum: the same client searches content in any language, and the language
could be identified using a dedicated field.

My questions are more focused on how to keep the benefits of all the
protwords, stopwords and synonyms in a multilanguage situation....

Should I create new Analyzers that can deal with the "language" field of the
document? What do you recommend?

Regards,
Daniel


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
                                       

Re: Multi-language indexing and searching

Walter Underwood, Netflix
I'm not sure what sort of "field" you mean for defining the
language.

If you plan to use a single search UI regardless of language:
we used to do this in Ultraseek, but it doesn't really work.
Queries are too short for reliable language ID (is "die" in
German, English, or Latin?), and language-specific processing
can be pretty different.

We ran into surface words that collided in different languages.
As I remember, "mobile" is a plural noun in Dutch but a verb in
English.

Finally, Solr's linguistic support is OK for English, but not as
good for more heavily inflected languages. For German, you
really need to decompose compound words, something not available
in Solr.

The only semi-successful cross-language search seems to be with
n-gram indexing. That usually produces a larger index and somewhat
slower performance (because of the number of terms), but at least
it works.

wunder

On 6/7/07 10:47 AM, "Daniel Alheiros" <[hidden email]> wrote:

> I have to index and search content in several languages.
>
> My scenario is a bit different from others I've already read in this
> forum: the same client searches content in any language, and the language
> could be identified using a dedicated field.


Re: Multi-language indexing and searching

dma_bamboo
Thank you for your reply.

Yes, I realize that running a query against the whole content would come
with these problems, but what I'm trying to say is that I will always
narrow by language (from my users' point of view). I would like to know if
it is possible (and appropriate) to have all my content in one index for
administrative reasons (batch indexing, queries based on ID or date,
centralized maintenance).

My index will be something around 4 GB (initially) and maybe in 5 years time
it will reach 8 GB.

What do you think about it?

Regards,
Daniel



Re: Multi-language indexing and searching

Henrib-2
In reply to this post by dma_bamboo
Hi Daniel,
If it is functionally 'ok' to search in only one lang at a time, you could try having one index per lang. Each per-lang index would have one schema where you would describe field types (the lang part coming through stemming/snowball analyzers, per-lang stopwords et al.), and the same field names could be used in each of them.
You could deploy that solution either through multiple web-apps (one per lang) or by trying the patch for issue SOLR-215.
Regards,
Henri
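
To illustrate Henri's one-index-per-language deployment, here is a minimal indexing-side sketch that routes each document to a per-language Solr webapp. The URL layout, webapp names, and the `language` field name are assumptions for illustration, not anything from the thread.

```python
# Sketch: route documents to a per-language Solr webapp (one index per lang).
# The /solr-<lang>/update URL layout and the "language" field are hypothetical.

LANG_WEBAPPS = {
    "english": "http://localhost:8983/solr-english/update",
    "french": "http://localhost:8983/solr-french/update",
    "chinese": "http://localhost:8983/solr-chinese/update",
}

def update_url_for(doc):
    """Pick the update URL for a document based on its 'language' field."""
    lang = doc["language"].lower()
    try:
        return LANG_WEBAPPS[lang]
    except KeyError:
        raise ValueError("no index deployed for language: %s" % lang)

doc = {"id": "42", "language": "French", "title": "Le Petit Prince"}
print(update_url_for(doc))
```

The same lookup would drive the query side: the client picks the webapp for the user's language and sends an unchanged query there.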


Re: Multi-language indexing and searching

dma_bamboo
Hi Henri.

Thanks for your reply.
I've just looked at the patch you referred to, but doing this I will lose
the out-of-the-box Solr installation... I'll have to create my own Solr
application responsible for creating the multiple cores, and I'll have to
change my indexing process to something able to send content to a
specific core.

Can't I have the same index, using one single core, same field names being
processed by language specific components based on a field/parameter?

I will try to draw what I'm thinking, please forgive me if I'm not using the
correct terms but I'm not an IR expert.

Thinking in a workflow:
    Indexing:
        Multilanguage indexer receives some documents
            for each document, check the "language" field
                if language = "English" then process using the EnglishIndexer
                else if language = "Chinese" then process using the ChineseIndexer
                else if ...

    Querying:
        Multilanguage request handler receives a request
            if parameter language = "English" then process using the English request handler
            else if parameter language = "Chinese" then process using the Chinese request handler
            else if ...

I can see that in the schema field definitions we have some
language-dependent parameters... It can be a problem, as I would like to
have the same fields for all requests...

Sorry to bother you, but before I split all my data this way I would like
to be sure that it's the best approach for me.

Regards,
Daniel        



Re: Multi-language indexing and searching

Chris Hostetter-3

: Can't I have the same index, using one single core, same field names being
: processed by language specific components based on a field/parameter?

Yes, but you don't really need the complexity you describe below ... you
don't need separate request handlers per language, just separate fields
per language.  Assuming you care about 3 concepts: title, author, body ...
in a single-language index those might correspond to three fields; in your
index they correspond to 3*N fields, where N is the number of languages you
want to support...

   title_french
   title_english
   title_german
   ...
   author_french
   author_english
   ...

Documents which are in English only get values for the English fields,
documents in French for the French fields, etc. ... unless perhaps you
want to support "translations" of the documents, in which case you can
have values in the fields for multiple languages; it's up to you.  When a
user wants to query in French, you take their input and query against the
body_french field and display the title_french field, etc...

-Hoss
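
The 3*N-field layout above can be sketched on the indexing side as follows. This is only an illustration: the concept names come from Hoss's message, while the helper names and the shape of the source document are made up here.

```python
# Sketch of the fields-per-language layout: for each concept
# (title, author, body) a document only fills the field for its own language.

CONCEPTS = ["title", "author", "body"]

def language_field(concept, language):
    """Map a generic concept to its per-language Solr field name."""
    return "%s_%s" % (concept, language)

def to_solr_doc(source, language):
    """Build a Solr document that puts each concept in the per-language field."""
    solr_doc = {"id": source["id"], "language": language}
    for concept in CONCEPTS:
        if concept in source:
            solr_doc[language_field(concept, language)] = source[concept]
    return solr_doc

print(to_solr_doc({"id": "1", "title": "Die Verwandlung", "body": "..."}, "german"))
```

A translated document would simply be passed through `to_solr_doc` once per language it has values for, accumulating fields in the same Solr document.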


Re: Multi-language indexing and searching

Henrib-2
In reply to this post by dma_bamboo
Hi Daniel,
Trying to recap: you are indexing documents that can be in different languages. On the query side, users will only search in one language at a time and get results in that language.

Setting aside the webapp deployment problem, the alternatives are thus:
option 1: one schema with all fields of all languages pre-defined;
option 2: one schema per lang with the same field names (but a different type).

You indicate that your documents do have a field carrying the language. Is the Solr document format the authoring format of the documents you index, or do they require some pre-processing to extract those fields? For instance, are the source documents in HTML, pre-processed using some XPath magic to generate the fields?
In that case, using option 1, the pre-processing transformation needs to know which fields to generate according to the language; option 2 needs you to know which core to target based on the lang. And it goes the same way for querying: option 1 needs a query with different fields for each language, option 2 requires targeting the correct core.
In the other case, i.e. if the Solr document format is the source format, indexing requires some script (curl or else) to send the documents to Solr; having the script determine which core to target doesn't seem (from afar) a hard task (grep/awk to the rescue :-)).

On the maintenance side, if you were to change the schema, re-index one lang, or add a lang, option 1 seems to have a 'wider' impact, the functional grain being coarser. Besides, if your collections are huge or grow fast, it might be nice to have an easy way to partition the workload across different machines, which seems easier with option 2, directing indexing & queries to a site based on the lang.

On the webapp deployment side, option 1 is a breeze; option 2 requires multiple web-apps (forgetting the SOLR-215 patch, which is unlikely to be reviewed and accepted soon since its functional value is not shared).

Hope this helps in your choice, regards,
Henri







Re: Multi-language indexing and searching

dma_bamboo
In reply to this post by Chris Hostetter-3
This sounds OK.

I can create a field-name mapping structure to change the requests /
responses so that my client doesn't need to be aware of the different fields.

Thanks for the directions,
Daniel
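
A field-name mapping like the one Daniel describes could look something like this on the query side. It is a rough sketch only: the `qf`-style parameter, the `_language` suffix convention, and the helper names are assumptions, not an actual Solr API.

```python
# Sketch: translate a client's generic field names into language-specific
# Solr fields on the way in, and strip the suffix again on the way out,
# so the client never sees the per-language layout.

def localize_fields(fields, language):
    """Generic field names -> language-suffixed Solr field names."""
    return ["%s_%s" % (f, language) for f in fields]

def delocalize_doc(solr_doc, language):
    """Strip the language suffix from field names in a result document."""
    suffix = "_" + language
    out = {}
    for name, value in solr_doc.items():
        out[name[:-len(suffix)] if name.endswith(suffix) else name] = value
    return out

params = {"q": "petit prince", "qf": localize_fields(["title", "body"], "french")}
hit = {"id": "42", "title_french": "Le Petit Prince"}
print(params["qf"], delocalize_doc(hit, "french"))
```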



Re: Multi-language indexing and searching

dma_bamboo
In reply to this post by Henrib-2
Hi Henri,

Thanks again; your considerations will surely help my decision.
Now I'll do my homework to check document volume/growth, expected index
sizes, and query load.

Regards,
Daniel Alheiros



RE: Multi-language indexing and searching

T. Kuro Kurosaka
In reply to this post by dma_bamboo
Daniel,
I was reading your email and responses to it with great
interest.

I was aware that Solr has an implicit assumption that
a field is mono-lingual per system. But your mail and
its correspondence made me wonder whether this limitation
is practical for multi-lingual search applications.  For bi-lingual
or tri-lingual search, we can have parallel fields (title_en,
title_fr, title_de, for example), but this wouldn't scale well.

Assume we are making a search application for a multi-lingual
library in a university in Japan, for example:
the application would have a book title field in Japanese,
perhaps another title field in English for visiting
scholars, and a title field in the original language.
The last field's language would vary among more than 50 modern
languages (and not-so-modern languages like Latin).  Solr
may need some re-architecting in this area.

I work for a company called Basis Technology,
(www.basistech.com) which develops a suite of language
processing software and I've written a module to integrate
this with Solr (and Lucene in general).  The module is
made of a universal Tokenizer and Analyzers for English and
Japanese, but they can be modified easily to handle any of
the 16 languages we can handle. (Source code is provided.)

When I was developing this module, I thought of writing
a super Analyzer that automatically detects the language
and does the right thing.  But I've found this won't fit
well with the design of Lucene and Solr.  For one thing,
there is no way to save the detected language in the field
if the language is detected within the Analyzer: Lucene and Solr
require that the language be known before an Analyzer can be
instantiated, and it's the Analyzer that detects the language in my
design....  A second obstacle is that the kinds of Filters
the Analyzer uses depend on the language, so they must be
dynamically changed. This could be done programmatically, but
it's not easy.  My big hope is that we can work together to
come up with some way so that the language detected within
the Analyzer can somehow be retrieved and stored in the field.

Anyway, if you are interested in trying my multi-lingual
Analyzers, please contact me in private email.

Regards,
-kuro

Re: Multi-language indexing and searching

Yonik Seeley-2
On 6/12/07, Teruhiko Kurosaka <[hidden email]> wrote:
> For bi-lingual
> or tri-lingual search, we can have parallel fields (title_en,
> title_fr, title_de, for example) but this wouldn't scale well.

Due to search across multiple fields, or due to increased index size?

> Lucene and Solr
> require that the language be known before an Analyzer can be
> instantiated, and it's the Analyzer that detects the language in my
> design....  A second obstacle is that the kinds of Filters
> the Analyzer uses depend on the language, so they must be
> dynamically changed. This could be done programmatically, but
> it's not easy.  My big hope is that we can work together to
> come up with some way so that the language detected within
> the Analyzer can somehow be retrieved and stored in the field.

Something could be done for the indexing side of things, but then how
do you query?
Would you be able to do language detection on single word queries, or
do you apply multiple analyzers and query the same field multiple ways
(which seems very close to the multiple field approach)?

Also, would multiple languages in a single field perhaps cause idf skew?

50 languages is a lot... perhaps a simple analyzer that could just try
to break into words and lowercase?


-Yonik
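
The fallback Yonik suggests, an analyzer that just breaks text into words and lowercases, can be approximated in a few lines. This is a sketch only, not Solr's actual analyzer API, and the regex is a crude stand-in for real Unicode word breaking.

```python
import re

# Crude "break into words and lowercase" tokenizer: \w+ with re.UNICODE
# matches runs of letters/digits in many scripts, though it cannot segment
# languages written without spaces (Chinese, Japanese, Thai).
WORD = re.compile(r"\w+", re.UNICODE)

def simple_tokenize(text):
    return [tok.lower() for tok in WORD.findall(text)]

print(simple_tokenize("Die Verwandlung, von Franz Kafka"))
# -> ['die', 'verwandlung', 'von', 'franz', 'kafka']
```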

RE: Multi-language indexing and searching

kkrugler
In reply to this post by T. Kuro Kurosaka

One idea I thought about here, if any given document/field set would
only contain text for a single language, was to write out a special
token with the language name. E.g. have your analyzer add a
"my-special-token-prefix-esperanto" token to the field, and then at
query time (assuming you know the language) make this a required term.
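To illustrate the idea (the field and search term here are hypothetical), the analyzer would append the marker token while indexing, and the client would then AND it into the query as a required term:

```
body:(+my-special-token-prefix-esperanto +kastelo)
```

Only documents whose body field was indexed with the Esperanto marker can match, so results are effectively filtered by language without a separate language field.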

-- Ken

>
>I work for a company called Basis Technology,
>(www.basistech.com) which develops a suite of language
>processing software and I've written a module to integrate
>this with Solr (and Lucene in general).  The module is
>made of a universal Tokenizer and Analyzers for English and
>Japanese, but they can be modified easily to handle any of
>the 16 languages we can handle. (Source code is provided.)
>
>When I was developing this module, I thought of writing
>a super Analyzer that automatically detects the language
>and does the right thing.  But I've found this won't fit
>well with the design of Lucene and Solr.  For one thing,
>there is no way to save the detected language in the field,
>if the language is detected within the Analyzer.  Lucene and Solr
>require that the language be known before an Analyzer can be
>instantiated, and it's the Analyzer that detects the language in my
>design....  A second obstacle is that the kinds of Filters
>the Analyzer uses depend on the language, so they must be
>dynamically changed. This could be done programmatically but
>it's not easy.  My big hope is that we can work together to
>come up with some way so that the language detected within
>the Analyzer can somehow be retrieved and made into the field.
>
>Anyway, if you are interested in trying my multi-lingual
>Analyzers, please contact me in private email.
>
>Regards,
>-kuro


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Multi-language indexing and searching

dma_bamboo
In reply to this post by Yonik Seeley-2
Hi Yonik.

About how to handle the index at query time:

I think that if you don't specify a language, you can return any document
matching the term, regardless of language (if that's possible), or, if it
suits your solution, you can define a default language to be used when none
is given explicitly in the query.

So the analyzer has to be able to deal with a language-agnostic situation
(which I think is only acceptable at query time)...

Do you think it's doable?

It could be applied to the scenario Kuro explained (documents translated
into different languages) or to my actual scenario (different content with
the same structure in different languages).
 

Regards,
Daniel


On 12/6/07 16:30, "Yonik Seeley" <[hidden email]> wrote:

> On 6/12/07, Teruhiko Kurosaka <[hidden email]> wrote:
>> For bi-lingual
>> or tri-lingual search, we can have parallel fields (title_en,
>> title_fr, title_de, for example) but this wouldn't scale well.
>
> Due to search across multiple fields, or due to increased index size?
>
>> Lucene and Solr
>> require that the language be known before an Analyzer can be
>> instantiated, and it's the Analyzer that detects the language in my
>> design....  A second obstacle is that the kinds of Filters
>> the Analyzer uses depend on the language, so they must be
>> dynamically changed. This could be done programmatically but
>> it's not easy.  My big hope is that we can work together to
>> come up with some way so that the language detected within
>> the Analyzer can somehow be retrieved and made into the field.
>
> Something could be done for the indexing side of things, but then how
> do you query?
> Would you be able to do language detection on single word queries, or
> do you apply multiple analyzers and query the same field multiple ways
> (which seems very close to the multiple field approach)?
>
> Also, would multiple languages in a single field perhaps cause idf skew?
>
> 50 languages is a lot... perhaps a simple analyzer that could just try
> to break into words and lowercase?
>
>
> -Yonik



RE: Multi-language indexing and searching

T. Kuro Kurosaka
In reply to this post by Yonik Seeley-2
Hi Yonik,
> On 6/12/07, Teruhiko Kurosaka <[hidden email]> wrote:
> > For bi-lingual
> > or tri-lingual search, we can have parallel fields (title_en,
> > title_fr, title_de, for example) but this wouldn't scale well.
>
> Due to search across multiple fields, or due to increased index size?

Due to the proliferation of the number of fields.  Say we want
the field "title" to hold the title of the book in
its original language.  But because Solr has this implicit
assumption of one language per field, we would have to have
the artificial fields title_fr, title_de, title_en, title_es,
etc. for each supported language, only one of
which has a real value per document.  This sounds silly, doesn't it?



> Something could be done for the indexing side of things, but
> then how do you query?
> Would you be able to do language detection on single word
> queries, or do you apply multiple analyzers and query the
> same field multiple ways (which seems very close to the
> multiple field approach)?

You are right that language auto-detection does not
work on queries. The search user would have to specify the
language somehow.  One commercial search engine vendor
does this by prefixing a query term with "$lang=en ".
I would do this with a drop-down list.  Each user or session
would have a default language that is configurable.



> Also, would multiple languages in a single field perhaps
> cause idf skew?

Sorry, I don't know enough about the internals of search engines
to discuss this.


> 50 languages is a lot... perhaps a simple analyzer that could
> just try to break into words and lowercase?

This won't work because:
(1) The concept of lowercase doesn't apply to all languages.
(2) Even among languages that use Latin script,
    there can be different normalization rules.  For many
    European languages, accent marks can be dropped ("ü" becomes
    "u"), but for German, "ü" may better be mapped to "ue",
    which is the alternative spelling of "ü" in German
    writing.
(3) Some languages, such as Chinese and Japanese, do not
    even use spaces or other delimiters to indicate word
    boundaries.  Language-specific rules have to be applied
    just to extract words from a run of text.
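Point (2) can be made concrete with a small sketch (pure illustration, not Solr code; the function name and language codes are made up for the example):

```python
import unicodedata

def fold(term, lang):
    """Language-dependent accent folding: what a per-language filter would do."""
    if lang == "de":
        # German umlauts have an alternative spelling with a trailing 'e'
        return (term.replace("ü", "ue").replace("ö", "oe")
                    .replace("ä", "ae").replace("ß", "ss"))
    # For many other European languages, simply strip the diacritics
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold("über", "de"))  # ueber
print(fold("über", "fr"))  # uber
```

The same surface form normalizes differently per language, which is why a single catch-all analyzer falls short.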

-kuro

RE: Multi-language indexing and searching

Chris Hostetter-3

: Due to the proliferation of the number of fields.  Say we want
: the field "title" to hold the title of the book in
: its original language.  But because Solr has this implicit
: assumption of one language per field, we would have to have
: the artificial fields title_fr, title_de, title_en, title_es,
: etc. for each supported language, only one of
: which has a real value per document.  This sounds silly, doesn't it?

Not really, I have indexes with *thousands* of fields ... if you turn
field norms off it's extremely efficient, but even with norms, 50*n fields
(where n is the number of "real" fields you have: title, author, etc.)
should work fine.

Furthermore, declaration of these fields can be simple -- if you have a
language you want to treat specially, then presumably you have a special
analyzer for it.  dynamicFields, where the field name is the wildcard
and the language is set, can be used to handle all of the different
"indexed" fields:

<dynamicField name="*_english" type="english" />
<dynamicField name="*_french" type="french" />
<dynamicField name="*_spanish" type="spanish" />
...more like the above for each language you want to support...
<copyField source="*_english" dest="english" />
<copyField source="*_french" dest="french" />
<copyField source="*_spanish" dest="spanish" />
...more like the above for each language you want to support...

and now you can index documents with fields like this...

   author_english = Mr. Chris Hostetter
   author_spanish = Senor Cristobol Hostetter
   body_english = I can't Believe It's not butter
   body_spanish = No puedo creer que no es mantaquea
   title_english = One Man's Disbelief

...and you can search on english:Chris, spanish:Cristobol,
author_spanish:Cristobol, etc...

you could even add dynamicFields with the field name set and the language
wildcarded to handle any fields used solely for display with even less
declaration (one per field instead of one per language) ...

<dynamicField name="display_title_*" type="string" />
...




-Hoss


Re: Multi-language indexing and searching

dma_bamboo
Hi Hoss

One bad thing in having fields specific for your language (in my point of
view) is that you will have to re-index your content when you add a new
language (some will need to start with one language and in future will have
others added). But OK, let's say the indexing is done.

So using dynamic fields and creating all language variations for the field
types that may need language-aware processing could work. But this way
you are going to have a different "interface", as the system will receive and
return a different set of fields in queries, won't it?
It could be avoided by transforming the request / response to a language-aware
/ unaware format:
requests: transforming  fieldName => fieldName_language
responses: transforming  fieldName_language => fieldName

And still you will not be able to search across all your documents... It may be
interesting to search for the most recently published content (no matter which
language it is in)...

What do you think about it?

Regards,
Daniel

On 12/6/07 19:50, "Chris Hostetter" <[hidden email]> wrote:

>
> : Due to the proliferation of the number of fields.  Say we want
> : the field "title" to hold the title of the book in
> : its original language.  But because Solr has this implicit
> : assumption of one language per field, we would have to have
> : the artificial fields title_fr, title_de, title_en, title_es,
> : etc. for each supported language, only one of
> : which has a real value per document.  This sounds silly, doesn't it?
>
> Not really, I have indexes with *thousands* of fields ... if you turn
> field norms off it's extremely efficient, but even with norms, 50*n fields
> (where n is the number of "real" fields you have: title, author, etc.)
> should work fine.
>
> Furthermore, declaration of these fields can be simple -- if you have a
> language you want to treat specially, then presumably you have a special
> analyzer for it.  dynamicFields, where the field name is the wildcard
> and the language is set, can be used to handle all of the different
> "indexed" fields:
>
> <dynamicField name="*_english" type="english" />
> <dynamicField name="*_french" type="french" />
> <dynamicField name="*_spanish" type="spanish" />
> ...more like the above for each language you want to support...
> <copyField source="*_english" dest="english" />
> <copyField source="*_french" dest="french" />
> <copyField source="*_spanish" dest="spanish" />
> ...more like the above for each language you want to support...
>
> and now you can index documents with fields like this...
>
>    author_english = Mr. Chris Hostetter
>    author_spanish = Senor Cristobol Hostetter
>    body_english = I can't Believe It's not butter
>    body_spanish = No puedo creer que no es mantaquea
>    title_english = One Man's Disbelief
>
> ...and you can search on english:Chris, spanish:Cristobol,
> author_spanish:Cristobol, etc...
>
> you could even add dynamicFields with the field name set and the language
> wildcarded to handle any fields used solely for display with even less
> declaration (one per field instead of one per language) ...
>
> <dynamicField name="display_title_*" type="string" />
> ...
>
>
>
>
> -Hoss
>



Re: Multi-language indexing and searching

Chris Hostetter-3

: One bad thing in having fields specific for your language (in my point of
: view) is that you will have to re-index your content when you add a new
: language (some will need to start with one language and in future will have
: others added). But OK, let's say the indexing is done.

I don't see any way you could possibly avoid that, unless you index each
"language version" of your original documents as separate Solr Documents
(which would still work with the same type of schema); then, if you add a
new translation for a "document", you only have to index the new text --
but the trade-off is you can't do queries for things like "all documents
written between 2004 and 2005", because you'll get every translation of
every document.

: you are going to have a different "interface", as the system will receive and
: return a different set of fields in queries, won't it?
: It could be avoided by transforming the request / response to a language-aware
: / unaware format:
: requests: transforming  fieldName => fieldName_language
: responses: transforming  fieldName_language => fieldName

Sure ... the fields have the language in them; if you want a downstream
client to be able to refer to them without the language in the name,
you'll need to strip it off.

: And still you will not be able to search across all your documents... It may be
: interesting to search for the most recently published content (no matter which
: language it is in)...

Why wouldn't you be able to do that?  "Publish date" would be a field that
would be completely independent of the language, so you would just have it
as a regular field in your schema (not dynamic, no language in the field
name).
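For instance, such a language-independent field could be declared once in schema.xml (the field name here is illustrative):

```xml
<!-- one language-neutral field shared by documents in every language -->
<field name="published" type="date" indexed="true" stored="true" />
```

A range query like published:[2004-01-01T00:00:00Z TO 2005-12-31T23:59:59Z] would then work across documents in all languages.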



-Hoss


Re: Multi-language indexing and searching

dma_bamboo
Hi Hoss.

Yes, the idea is indexing each document independently (in my scenario they
are not translations, they are just documents with the same structure but in
different languages). So the considerations you made about range queries
wouldn't be a problem in this case. The real issue I can see in this
approach is related to Analyzers... how to make them deal with different
languages properly using one Solr instance, with the same set of fields being
used by documents in different languages....

Good, I just forgot that some fields don't need special treatment depending
on language (like date or long)... Thanks for that.

Looks like my best alternative then is using dynamic fields, having then a
set of fields for each language. But anyway I think I'll still need a way to
apply different analyzers at query time so I can deal with each language's
details. Is that correct?

Regards,
Daniel


On 16/6/07 04:31, "Chris Hostetter" <[hidden email]> wrote:

>
> : One bad thing in having fields specific for your language (in my point of
> : view) is that you will have to re-index your content when you add a new
> : language (some will need to start with one language and in future will have
> : others added). But OK, let's say the indexing is done.
>
> I don't see any way you could possibly avoid that, unless you index each
> "language version" of your original documents as separate Solr Documents
> (which would still work with the same type of schema); then, if you add a
> new translation for a "document", you only have to index the new text --
> but the trade-off is you can't do queries for things like "all documents
> written between 2004 and 2005", because you'll get every translation of
> every document.
>
> : you are going to have a different "interface", as the system will receive and
> : return a different set of fields in queries, won't it?
> : It could be avoided by transforming the request / response to a language-aware
> : / unaware format:
> : requests: transforming  fieldName => fieldName_language
> : responses: transforming  fieldName_language => fieldName
>
> Sure ... the fields have the language in them; if you want a downstream
> client to be able to refer to them without the language in the name,
> you'll need to strip it off.
>
> : And still you will not be able to search across all your documents... It may be
> : interesting to search for the most recently published content (no matter which
> : language it is in)...
>
> Why wouldn't you be able to do that?  "Publish date" would be a field that
> would be completely independent of the language, so you would just have it
> as a regular field in your schema (not dynamic, no language in the field
> name).
>
>
>
> -Hoss
>



Re: Multi-language indexing and searching

Chris Hostetter-3

: range wouldn't be a problem in this case. The real issue I can see in this
: approach is related to Analyzers... how to make them deal with different
: languages properly using one Solr instance, with the same set of fields being
: used by documents in different languages....

I would still use the same type of schema I suggested before ... one
fieldtype per language and dynamic fields per language ... it's just that
now you don't bother indexing text in both the english_title and
french_title fields ... you use one or the other depending on what the
language of this particular translation is, just as you guessed...

: Looks like my best alternative then is using dynamic fields, having then a
: set of fields for each language. But anyway I think I'll still need a way to
: apply different analyzers at query time so I can deal with each language's
: details. Is that correct?

No, you just need your client to query the right field based on the
language ... if the end user wants to search for "playa" in all "spanish"
documents, your client code should query for something
like "spanish_title:playa^3 spanish_body:playa" ... you could even have a
dismax handler instance configured per language to make this transparent
to the client...

      q=playa&qt=spanish

...since you now only have one language per "document", the response docs
your client gets back are even easier to deal with, because you can have a
single stored field for each conceptual field for display purposes using
the canonical name ... the client can display the "title" field and the
"author" field and not have to know/remember that the search is in
spanish.
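A per-language dismax handler along these lines could be declared in solrconfig.xml (the handler name, field names, and qf weights are just an illustration):

```xml
<requestHandler name="spanish" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- search the Spanish-analyzed fields, boosting title matches -->
    <str name="qf">spanish_title^3 spanish_body</str>
  </lst>
</requestHandler>
```

The client then only sends q=playa&qt=spanish; the field routing stays server-side.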



-Hoss


Re: Multi-language indexing and searching

dma_bamboo
Hi Hoss

Thanks again for your attention.

Looks like after your last instructions I thought the same way as you :)

What I did yesterday:
1. Created the schema with fields with language variations (created as
concrete fields anyway, because in this case using dynamic fields wouldn't
be any better for me):
    - defined field types: txt_en, txt_fr (with proper analyzers for each
language)
    - created fields title_en, title_fr (with their respective field types)
    - created a field title (that receives a copy of title_en and title_fr -
as they are exclusive, they won't override each other)
 
2. Created two dismax RequestHandlers, one per language. So the only thing
my client has to change when searching in a specific language is the "qt"
parameter.
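A sketch of that setup in schema.xml might look like this (the analyzer choices and stopword file names are assumptions; only txt_en is shown in full):

```xml
<types>
  <fieldType name="txt_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords_en.txt"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <!-- txt_fr: same shape, with French stopwords and stemmer -->
</types>

<fields>
  <field name="title_en" type="txt_en" indexed="true" stored="true"/>
  <field name="title_fr" type="txt_fr" indexed="true" stored="true"/>
  <!-- language-neutral copy; only one source is ever populated per document -->
  <field name="title" type="string" indexed="true" stored="true"/>
</fields>

<copyField source="title_en" dest="title"/>
<copyField source="title_fr" dest="title"/>
```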

So far it sounds good for my needs; now I'm going to check whether my other
features still work (I'm worried about highlighting, as I'm going to return a
different field)...

I'll come back to you with my results.

Regards,
Daniel


On 19/6/07 19:19, "Chris Hostetter" <[hidden email]> wrote:

>
> : range wouldn't be a problem in this case. The real issue I can see in this
> : approach is related to Analyzers... how to make them deal with different
> : languages properly using one Solr instance, with the same set of fields being
> : used by documents in different languages....
>
> I would still use the same type of schema I suggested before ... one
> fieldtype per language and dynamic fields per language ... it's just that
> now you don't bother indexing text in both the english_title and
> french_title fields ... you use one or the other depending on what the
> language of this particular translation is, just as you guessed...
>
> : Looks like my best alternative then is using dynamic fields, having then a
> : set of fields for each language. But anyway I think I'll still need a way to
> : apply different analyzers at query time so I can deal with each language's
> : details. Is that correct?
>
> No, you just need your client to query the right field based on the
> language ... if the end user wants to search for "playa" in all "spanish"
> documents, your client code should query for something
> like "spanish_title:playa^3 spanish_body:playa" ... you could even have a
> dismax handler instance configured per language to make this transparent
> to the client...
>
>       q=playa&qt=spanish
>
> ...since you now only have one language per "document", the response docs
> your client gets back are even easier to deal with, because you can have a
> single stored field for each conceptual field for display purposes using
> the canonical name ... the client can display the "title" field and the
> "author" field and not have to know/remember that the search is in
> spanish.
>
>
>
> -Hoss
>

