French stemming / size of synonyms file

Emmanuel Bégué-2
Hello,

According to the wiki http://wiki.apache.org/solr/LanguageAnalysis,
the light stemmers for French (solr.FrenchLightStemFilterFactory and
solr.FrenchMinimalStemFilterFactory) are only available for SOLR 3.1.

Is there a way to make them work with 1.4.1?

- - -

Additionally, there is an "official" list of inflected word forms for
the French language, produced by a government agency (this being
France...). It's called "Morphalou":
http://www.cnrtl.fr/lexiques/morphalou/ and it contains over 540,000
inflected forms.

It's a 162 MB XML file; it would not be very hard to transform it into
the format used for SOLR synonyms files, but that would result in a
rather huge text file (probably smaller than the original XML, but
still around 100 MB). How large can a synonyms file be? Does it
depend on the Java heap size...?

Or is there a better way to use such a list than a synonyms file?

Thanks,
Regards,
EB

Re: French stemming / size of synonyms file

Robert Muir
2010/12/15 Emmanuel Bégué <[hidden email]>:
> Hello,
>
> According to the wiki http://wiki.apache.org/solr/LanguageAnalysis,
> the light stemmers for French (solr.FrenchLightStemFilterFactory and
> solr.FrenchMinimalStemFilterFactory) are only available for SOLR 3.1.
>
> Is there a way to make them work with 1.4.1?

You could take the source code and backport it to Solr 1.4.1... but see below:

>
> - - -
>
> Additionally, there is an "official" list of inflected word forms for
> the French language produced by a government agency (this being
> France...) It's called "Morphalou":
> http://www.cnrtl.fr/lexiques/morphalou/ and it contains over 540 k
> inflected forms.
>
> Or is there a better way to use such a list than a synonyms file?

In this case I would also recommend considering StemmerOverrideFilter
(again, only in 3.1+, sorry).
See http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory

The StemmerOverrideFilter will "stem" based on a tab-separated
dictionary. When it does so, it also marks the word with
KeywordAttribute, which tells any later stemmer in the chain to ignore it.

So with this approach you can have a StemmerOverrideFilter with your
dictionary, then followed by a stemmer which will only work on words
that aren't in your dictionary.
The words that hit the dictionary will be completely ignored by the stemmer.
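
To make the control flow concrete, here is a toy model of that chain in
Python. It is only a sketch of the idea, not Lucene code: the Token class,
the "keyword" flag (standing in for KeywordAttribute) and the toy_stemmer
that strips a trailing "s" are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    keyword: bool = False  # stands in for Lucene's KeywordAttribute


def override_filter(tokens, dictionary):
    """Replace dictionary hits with their mapped stem and mark them as keywords."""
    for tok in tokens:
        stem = dictionary.get(tok.text)
        if stem is not None:
            yield Token(stem, keyword=True)
        else:
            yield tok


def toy_stemmer(tokens):
    """Hypothetical stemmer: strips a trailing 's' but skips keyword tokens."""
    for tok in tokens:
        if tok.keyword or not tok.text.endswith("s"):
            yield tok
        else:
            yield Token(tok.text[:-1])


def analyze(words, dictionary):
    """Run the words through the override filter, then through the stemmer."""
    tokens = (Token(w) for w in words)
    return [t.text for t in toy_stemmer(override_filter(tokens, dictionary))]
```

With a dictionary such as {"chevaux": "cheval", "fils": "fils"}, the word
"chevaux" is mapped by the dictionary, "fils" is protected from the toy
stemmer because it hit the dictionary, and "maisons" falls through to the
stemmer.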

This should also be much more RAM-efficient than using SynonymFilter.
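
Producing the dictionary file itself could look roughly like the sketch
below. It assumes you have already extracted (inflected form, lemma) pairs
from the Morphalou XML (that parsing step is left out because the XML layout
is not described here), and it assumes the inflected form goes in the first
column and the stem in the second. The function name
write_override_dictionary is hypothetical.

```python
import io


def write_override_dictionary(pairs, out):
    """Write one 'inflected<TAB>lemma' line per pair to the text stream `out`.

    Blank entries are skipped, and only the first mapping seen for a given
    inflected form is kept. Returns the number of lines written.
    """
    seen = set()
    count = 0
    for inflected, lemma in pairs:
        inflected, lemma = inflected.strip(), lemma.strip()
        if not inflected or not lemma or inflected in seen:
            continue
        seen.add(inflected)
        out.write(f"{inflected}\t{lemma}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demonstration; a real run would write to a UTF-8 file.
    buf = io.StringIO()
    pairs = [("chevaux", "cheval"), ("mangeons", "manger"), ("chevaux", "cheval")]
    write_override_dictionary(pairs, buf)
    print(buf.getvalue(), end="")
```

Identity pairs (a form mapped to itself) are deliberately kept, since a
dictionary hit is what stops the following stemmer from touching the word.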