Internationalization


Jörg Pfründer
Hello,

Does anyone have experience with internationalization in Solr?

How do you set up a multi-language index? Should we use dynamic fields like text_en, text_fr, text_es?

Is there a GermanPorterFilterFactory or FrenchPorterFilterFactory?

Thank you very much.

Jörg Pfründer

_____________________________________________________
Free email account with 2 GB of storage -
10 SMS - http://www.xemail.de
Spam? mailto:[hidden email]


Re: Internationalization

Bertrand Delacretaz
Hi Jörg,

On 1/16/07, Jörg Pfründer <[hidden email]> wrote:
> ...is there anyone who has experience on internationalization (internationalisation) with SOLR?...

I've been setting up a French-language index over the last few months, and
it works very well.

There are some pointers on how to analyze French text in my article at
xml.com (see http://wiki.apache.org/solr/SolrResources).

> ...How do you setup a multi language data index?  Should we use a dynamic field like
> text_en, text_fr, text_es?...

Yes — I don't think you can currently mix languages in the same field,
so having fields named after the language is probably the easiest approach.

> Is there a GermanPorterFilterFactory or FrenchPorterFilterFactory?...

The new SnowballPorterFilterFactory supports a language parameter; see
http://issues.apache.org/jira/browse/SOLR-27
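
For a German field, the setup might look something like this in schema.xml (a sketch — the type name and tokenizer/filter choices here are illustrative, not prescribed by SOLR-27):

```xml
<!-- Hypothetical German text type using the Snowball stemmer -->
<fieldtype name="text_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldtype>
```

The same pattern should work for French with language="French".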

Hope this helps,
-Bertrand
Re: Internationalization

Bess Sadler
In reply to this post by Jörg Pfründer
Hi, Jörg.

At the Tibetan Himalayan Digital Library, we are working with XML
files that have fields that might be in Tibetan, Chinese, Nepalese,
or English. Our Solr schema.xml looks like this:

    <dynamicField name="*_eng" type="string" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_chi" type="string" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_tib" type="string" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_nep" type="string" indexed="true" stored="true" multiValued="true"/>

I run all of our XML data through an XSLT transformation that puts it
into Solr-indexable form, figures out which language each field is in,
and gives the field an appropriate name, e.g. "location_eng" or
"formalname_tib". So far this is working very well for us.

Currently, we are assigning all fields, regardless of language, to
the type string, defined as:

<fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>

This does exact string matching very well, but doesn't apply stop words,
stemming, or anything fancy. We are toying with the idea of a
custom Tibetan indexer to better break Tibetan text into discrete
words, but for this particular project (because it mostly has to do
with proper names, not long passages of text) this hasn't been a
problem yet, and the above solution seems to be doing the trick.
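
For comparison, a tokenized per-language type with stop words and stemming might look something like this (a sketch — the type name, stop-word file, and filter choices are illustrative, not from the project's actual schema):

```xml
<!-- Hypothetical tokenized English type with stop words and stemming -->
<fieldtype name="text_eng" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<dynamicField name="*_eng" type="text_eng" indexed="true" stored="true" multiValued="true"/>
```

A query for "locations" would then match a field containing "location", which exact string matching cannot do.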

I hope this helps.

Good luck!

Bess

On Jan 16, 2007, at 10:23 AM, Jörg Pfründer wrote:

> is there anyone who has experience on internationalization with SOLR? [...]

Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[hidden email]
(434) 243-2305


Re: Internationalization

Erik Hatcher
Way to go Bess!   This is great stuff you're sharing.

I have a question though...

On Jan 16, 2007, at 11:48 AM, Bess Sadler wrote:

> Currently, we are assigning all fields, no matter what language to  
> type string, defined as
>
> <fieldtype name="string" class="solr.StrField"  
> sortMissingLast="true"/>
>
> This does string matching very well, but doesn't do any stop words,  
> or stemming, or anything fancy. We are toying with the idea of a  
> custom Tibetan indexer to better break up the Tibetan into discrete  
> words, but for this particular project (because it mostly has to do  
> with proper names, not long passages of text) this hasn't been a  
> problem yet, and the above solution seems to be doing the trick.

Why are you assigning all fields to a "string" type?  That indexes  
each field as-is, with no tokenization at all.  How are you using  
that field from the front-end?   I'd think you'd want to copyField  
everything into a "text" field.
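
The copyField approach suggested above might be sketched like this in schema.xml (field and type names are illustrative — they assume a tokenized "text" type is already defined):

```xml
<!-- Hypothetical catch-all field: copy every per-language field
     into one tokenized field for general-purpose search -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="*_eng" dest="text"/>
<copyField source="*_chi" dest="text"/>
<copyField source="*_tib" dest="text"/>
<copyField source="*_nep" dest="text"/>
```

The original per-language string fields stay intact for exact matching and display, while queries can hit the tokenized "text" field.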

> Elizabeth (Bess) Sadler
> Head, Technical and Metadata Services
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904

Just two floors down.... what amazing folks we have on this!

        Erik

Re: Internationalization

Bess Sadler

On Jan 17, 2007, at 3:07 AM, Erik Hatcher wrote:

> Why are you assigning all fields to a "string" type?  That indexes  
> each field as-is, with no tokenization at all.  How are you using  
> that field from the front-end?   I'd think you'd want to copyField  
> everything into a "text" field.

The short answer is that there is no good reason for this. I guess I just
hadn't thought too hard yet about the difference between string and
text. This particular project is a gazetteer, so we're mostly
indexing proper names (e.g. "China" and "中国"), which are mostly one
word and so don't need much tokenization anyway. But of course this
isn't true for all our fields, and even some proper names (e.g., "lha
sa") might benefit from tokenization.

I've been planning to separately index all our Chinese text with the
ChineseAnalyzer (à la pages 142-145 in Lucene in Action), and Ed
Garrett (who I think is also on this list... hi, Ed!) at U Michigan
is working on a Tibetan analyzer that I also want to use; I just
haven't gotten that far yet.
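
One way to wire a Lucene analyzer like the ChineseAnalyzer into Solr is to reference the class directly in schema.xml (a sketch — the type name is illustrative, and the analyzer's jar would need to be on Solr's classpath):

```xml
<!-- Hypothetical Chinese text type backed by Lucene's ChineseAnalyzer -->
<fieldtype name="text_chi" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
</fieldtype>
<dynamicField name="*_chi" type="text_chi" indexed="true" stored="true" multiValued="true"/>
```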

So now I'm all motivated to go rewrite this thing so that it processes
each language properly. Maybe I'll write something up for the wiki
when I'm done.

Thanks again, Erik.

Bess

