multi language search engine in solr


multi language search engine in solr

Mugeesh Husain
Hi,

I am working on a multi-language search engine covering Arabic, English, Bengali, Hindi, and Malay, and I have a separate database for each of them. Can anybody guide me on how to configure the Solr schema?

1.) Should I configure all the languages in a single
shard/collection?
2.) Should I configure a separate shard/collection for each
language?

I am looking for suggestions at the architecture level of this project.
Please suggest and guide me in defining the schema and architecture.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: multi language search engine in solr

Rick Leir-2
Mugeesh,
One important question: will the typical document have a mix of English and Bangla and Hindi? If so, you would probably have them all in one collection.

Another thing to think about is the tokenizer. Are all words separated by white space? If not, then you might need to think about which tokenizer to use.
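A quick way to see why the white space question matters is to split a few sample strings on white space and count the tokens. This is only a rough sketch in Python, not Solr's tokenizer: the sample strings are illustrative, and the Thai sample is an assumption added purely as a contrast, since Thai is not one of the languages in this thread.

```python
# Illustrative samples; Thai is included only as a contrast case because it
# is normally written without spaces between words.
samples = {
    "English": "multi language search",
    "Hindi": "बहु भाषा खोज",
    "Arabic": "بحث متعدد اللغات",
    "Thai": "ค้นหาหลายภาษา",
}


def whitespace_tokens(text: str) -> list[str]:
    """Split on Unicode white space only, roughly what a whitespace tokenizer does."""
    return text.split()


# Number of tokens produced per language by white space splitting alone.
token_counts = {lang: len(whitespace_tokens(text)) for lang, text in samples.items()}

for lang, count in token_counts.items():
    print(f"{lang}: {count} token(s)")
```

If a language comes out as one long token, white space splitting alone is probably not enough for it.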

As for character sets, I think you should make sure all the inputs are in UTF-8, then there should be no problem.
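One way to check this before indexing is to try decoding each input as UTF-8 and set aside anything that fails. The sketch below assumes the inputs arrive as raw bytes; the function name `decode_utf8_or_none` is made up for illustration.

```python
def decode_utf8_or_none(raw: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Valid UTF-8 input decodes cleanly.
good = "हिन्दी".encode("utf-8")

# The same kind of text exported as Latin-1 is not valid UTF-8
# and should be re-encoded before it is sent to Solr.
bad = "café".encode("latin-1")

print(decode_utf8_or_none(good))
print(decode_utf8_or_none(bad))
```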

There will be other things to consider but this is a start.
Cheers -- Rick



--
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: multi language search engine in solr

Mugeesh Husain
Thank you, Rick, for your response.

Each document is in a single language rather than a mix of
Arabic, English, Bengali, Hindi, and Malay.
 




Re: multi language search engine in solr

Tommaso Teofili
Another thing to consider is what users would expect: would an English-speaking
user search over English docs only? If yes, the most important task would be to
set up accurate per-language analyzers. Otherwise, you may also consider
machine translation, either of the search queries or of the resulting docs.
A friend of mine and I gave a talk on this at Berlin Buzzwords (bbuzz) this
year [1].
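For the "English user searches English docs only" case, the query side could look roughly like the sketch below. It assumes a hypothetical `language` field holding a per-document language code and hypothetical per-language fields named `title_<lang>` and `body_<lang>`; `defType`, `qf`, and `fq` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_params(user_query: str, lang: str) -> dict:
    """Build /select parameters that search only one language's fields.

    The field names (language, title_<lang>, body_<lang>) are assumptions
    for illustration, not names from an existing schema.
    """
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": f"title_{lang} body_{lang}",  # search the per-language fields
        "fq": f"language:{lang}",           # restrict results to that language
    }


params = build_select_params("solr schema", "eng")
query_string = urlencode(params)
print(query_string)
```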

My 2 cents,
Tommaso

[1] :
https://berlinbuzzwords.de/17/session/embracing-diversity-searching-over-multiple-languages

On Mon, Sep 11, 2017 at 03:46, Mugeesh Husain <[hidden email]>
wrote:

> Thank you, Rick, for your response.
>
> Each document is in a single language rather than a mix of
> Arabic, English, Bengali, Hindi, and Malay.
>
> I could not find any tokenizer for Malay. Can you suggest one if you
> know of any, please?

RE: multi language search engine in solr

Junte Zhang
In reply to this post by Mugeesh Husain
Having the language already separated makes it a lot easier.

You could add a language suffix (e.g. a 3-letter ISO 639-2/B code, see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) to each field that holds language-specific text. Alternatively, you could copy a single field into several language-analyzed fields and hope that is good enough for matching.
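On the indexing side, the suffix idea could be sketched as below. The language codes are the ISO 639-2/B codes for the five languages in this thread; the function name, the extra `language` field, and the sample document are assumptions for illustration only.

```python
# ISO 639-2/B codes for the languages named in this thread.
LANG_CODES = {
    "Arabic": "ara",
    "English": "eng",
    "Bengali": "ben",
    "Hindi": "hin",
    "Malay": "may",
}


def suffix_text_fields(doc: dict, lang_code: str, text_fields: set) -> dict:
    """Rename text fields to '<name>_<lang_code>'; other fields pass through.

    Also records the code in a hypothetical 'language' field so that
    queries can filter on it.
    """
    out = {}
    for name, value in doc.items():
        target = f"{name}_{lang_code}" if name in text_fields else name
        out[target] = value
    out["language"] = lang_code
    return out


hindi_doc = suffix_text_fields(
    {"id": "1", "title": "नमस्ते"}, LANG_CODES["Hindi"], {"title"}
)
print(hindi_doc)
```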

I think Malay should be very similar to Indonesian (https://wiki.apache.org/solr/LanguageAnalysis#Indonesian). However, you could extend this by adding your own dictionary (keywords) and stopwords (if that is desirable).
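As a rough sketch of treating Malay like Indonesian, the snippet below builds a Schema API request body that maps per-language dynamic fields to field types, with Malay pointed at the Indonesian type. The field type names are assumptions: `text_ar`, `text_en`, `text_hi`, and `text_id` appear in Solr's sample configsets, but check that each one (`text_bn` in particular) exists in your own schema. The body would be POSTed to the collection's `/schema` endpoint; nothing is sent here.

```python
import json

# Field suffix -> assumed field type name. Verify these against your schema.
SUFFIX_TO_FIELD_TYPE = {
    "ara": "text_ar",
    "eng": "text_en",
    "ben": "text_bn",
    "hin": "text_hi",
    "may": "text_id",  # Malay reuses the Indonesian analysis chain
}


def dynamic_field_commands(mapping: dict) -> dict:
    """Build one 'add-dynamic-field' command per language suffix."""
    return {
        "add-dynamic-field": [
            {"name": f"*_{suffix}", "type": field_type, "indexed": True, "stored": True}
            for suffix, field_type in mapping.items()
        ]
    }


commands = dynamic_field_commands(SUFFIX_TO_FIELD_TYPE)
payload = json.dumps(commands, indent=2)
print(payload)
```

A custom Malay stopword list could then be attached to a copy of the Indonesian field type if the default one turns out to be a poor fit.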

/JZ


RE: multi language search engine in solr

Mugeesh Husain
Thanks, Junte Zhang, it's really helpful for me.


