Synonyms and stemming revisited

Synonyms and stemming revisited

Christian Vogler-3
I apologize for beating a dead horse, but upon searching the archives,
I found no satisfactory resolution. According to the archives, Hoss
recommends in multiple messages that the synonym filter be put before
the stemmer, and that synonym stemming at query time should then work
as expected. Unfortunately, this is only true for the first word that
appears in the synonym list.

Consider the following simplified index-time configuration:

      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="test_synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>


Furthermore, consider the following synonym definition:

reise,urlaub

(These mean travel and vacation, respectively)

Both words can appear with many different endings, such as:

reise, reisen, reist, ...
urlaub,urlaube,urlauben, ...

The stemmer reduces all these to "reis" and "urlaub", respectively.

Now, suppose that a document contains "reise" at index time. According
to the filter order, this will be expanded by the synonym filter to:

reise urlaub

and then stemmed as:

reis urlaub

So far, so good. In this case, queries for urlaube, reisen, etc., will
all hit the indexed document.

However, consider a document that contains "reisen" at index time. As
the synonym filter comes first, there is no match for the synonym, and
the analyzer progresses to index this document with "reisen" -> "reis"
only, with "urlaub" missing.

Hence, queries such as "reisen" or "reist" will hit, but "urlaub",
"urlaube", etc. will not.
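
The behaviour described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: toy_stem is a hypothetical suffix stripper that stands in for the Snowball German2 stemmer and handles just the example words, the SYNONYMS table mirrors the one-line synonym file, and the WordDelimiterFilter step is left out because it does not affect these tokens.

```python
# Toy model of the index-time chain: synonyms first, then stemming.
# SYNONYMS mirrors the single line "reise,urlaub" from the synonym file.
SYNONYMS = [{"reise", "urlaub"}]


def toy_stem(word):
    """Hypothetical stand-in for the German2 stemmer (example words only)."""
    for suffix in ("en", "e", "t"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_synonyms(tokens):
    """Replace a token by its whole synonym group only on an exact match."""
    expanded = []
    for token in tokens:
        group = next((g for g in SYNONYMS if token in g), None)
        expanded.extend(sorted(group) if group else [token])
    return expanded


def index_terms(text):
    """Whitespace tokenize, lowercase, expand synonyms, stem, deduplicate."""
    tokens = text.lower().split()
    return {toy_stem(t) for t in expand_synonyms(tokens)}


print(sorted(index_terms("reise")))   # ['reis', 'urlaub']
print(sorted(index_terms("reisen")))  # ['reis']
```

The second call shows the gap: "reisen" is not listed in the synonym file, so it reaches the stemmer without "urlaub" ever being added.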

I see two solutions:

1. Put all possible endings in the synonym file. I do not really like
this solution, as it would make the file very large, and it is also
too easy to miss some specific ending.

2. Run the stemmer before the synonym filter, in which case the
synonym definitions need to appear in their stemmed forms. Am I
missing something, or does the conversion of the synonym text file
need to be done by hand at the moment? I suppose that it would not be
too difficult to write some code that does this conversion
automatically, so that the synonym definition

reise,urlaub

is converted to

reis,urlaub

which then should solve all problems.
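
A conversion of this kind might look roughly like the following sketch. It assumes simple comma-separated, single-word synonym lines and treats lines starting with "#" as comments; explicit "=>" mappings and multi-word entries are not handled. The helper names toy_stem and stem_synonym_line are hypothetical, and toy_stem is only a stand-in that would have to be replaced by the same stemmer the analyzer uses.

```python
def toy_stem(word):
    """Hypothetical stand-in for the German2 stemmer (example words only)."""
    for suffix in ("en", "e", "t"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_synonym_line(line):
    """Stem every term of one comma-separated synonym line.

    Blank lines and '#' comment lines are passed through unchanged.
    Terms that collapse to the same stem are kept only once.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line.rstrip("\n")
    stems = []
    for term in stripped.split(","):
        stem = toy_stem(term.strip().lower())
        if stem and stem not in stems:
            stems.append(stem)
    return ",".join(stems)


print(stem_synonym_line("reise,urlaub"))  # reis,urlaub
```

With the stemmed file in place and the stemmer moved in front of the synonym filter, every inflected form would be reduced to "reis" or "urlaub" before the synonym lookup happens.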

Best regards
- Christian
--
Christian Vogler, Ph.D.
Institute for Language and Speech Processing
Athens, Greece

Re: Synonyms and stemming revisited

hossman

: I see two solutions:
:
: Either put all possible endings in the synonym file - I do not really
: like this solution, as it would make the file very large, and it also
: is too easy to miss some specific ending. Or run the stemmer before
: the synonym filter, in which case the synonym definitions need to
: appear in their stemmed forms. Am I missing something, or does the

Based on my understanding of your description of your problem, i think i
agree with you.

If i've given different advice in the past, I'm sure i had a good reason
for it -- possibly due to some aspect of those problems that is subtly
different from yours ... can you post links to the specific messages
you're referring to? it might help jog my memory.

: conversion of the synonym text file need to be done by hand at the
: moment? I suppose that it would not be too difficult to write some

A recently added feature is that when configuring SynonymFilterFactory
you can give it the name of a TokenizerFactory to use when parsing the
synonym file.  This could be used to stem words *if* you write a
TokenizerFactory that calls out to your Stemmer.

(see SOLR-319 for the background on why you can only specify a Tokenizer
and not a full "fieldType" to get the analysis chain from ... in a
nutshell: 1. it would have been harder to implement; 2. the only use cases
people could think of were Tokenization based.)


-Hoss

Re: Synonyms and stemming revisited

Christian Vogler-3
> If i've given different advice in the past, I'm sure i had a good reason
> for it -- possibly due to some aspect of those problems that is subtly
> different from yours ... can you post links to the specific messages
> you're referring to? it might help jog my memory.

One thread is: http://www.nabble.com/synonyms-td16284520.html

Based on my reading of that thread, I believe that the issue raised there is
the same as the one I just raised, but the original post was not entirely
clear and perhaps easy to misunderstand.

Another thread is:
http://www.nabble.com/stemming-the-synonyms-to16945953.html#a16945953

> A recently added feature is that when configuring SynonymFilterFactory
> you can give it the name of a TokenizerFactory to use when parsing the
> synonym file.  This could be used to stem words *if* you write a
> TokenizerFactory that calls out to your Stemmer.

Ah, cool. I will give the SOLR 1.3 nightlies a spin, once I make it past my
current deadlines and obligations.

> (see SOLR-319 for the background on why you can only specify a Tokenizer
> and not a full "fieldType" to get the analysis chain from ... in a
> nutshell: 1. it would have been harder to implement; 2. the only use cases
> people could think of were Tokenization based.)

There probably needs to be a chain of tokenizers, because in the German
language compound words need to be split before stemming. I will take a stab
at writing the TokenizerFactory that chains them. Should not be too
difficult.

Best regards
- Christian

Re: Synonyms and stemming revisited

hossman

: One thread is: http://www.nabble.com/synonyms-td16284520.html
:
: Based on my reading of that thread, I believe that the issue raised there is
: the same as the one I just raised, but the original post was not entirely
: clear and perhaps easy to misunderstand.
:
: Another thread is:
: http://www.nabble.com/stemming-the-synonyms-to16945953.html#a16945953

Ahhh... see, in those threads there is no explicit mention of processing
synonyms at index time.  i think i must have assumed SynonymFilter
was being used at query time. (in which case using synonyms first and then
stemming should work ... right?)
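
The query-time variant can be checked with the same kind of toy model. This is a sketch under assumptions, not a statement about Solr itself: documents are only stemmed at index time, queries go through synonym expansion and then stemming, and toy_stem is a hypothetical stand-in for the German2 stemmer that covers just the example words.

```python
# Toy model: synonyms applied at query time only, followed by stemming.
SYNONYMS = [{"reise", "urlaub"}]


def toy_stem(word):
    """Hypothetical stand-in for the German2 stemmer (example words only)."""
    for suffix in ("en", "e", "t"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_synonyms(tokens):
    """Replace a token by its whole synonym group only on an exact match."""
    expanded = []
    for token in tokens:
        group = next((g for g in SYNONYMS if token in g), None)
        expanded.extend(sorted(group) if group else [token])
    return expanded


def index_terms(text):
    """Index-time chain in this model: tokenize, lowercase, stem."""
    return {toy_stem(t) for t in text.lower().split()}


def query_terms(text):
    """Query-time chain in this model: tokenize, lowercase, synonyms, stem."""
    return {toy_stem(t) for t in expand_synonyms(text.lower().split())}


def matches(query, document):
    """A query hits a document if they share at least one stemmed term."""
    return bool(query_terms(query) & index_terms(document))


print(matches("reise", "urlaube"))   # True
print(matches("reisen", "urlaube"))  # False
```

Under these assumptions, a query term that is literally listed in the synonym file finds all inflected forms of its synonyms, while an inflected query term such as "reisen" still bypasses the expansion.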

: Ah, cool. I will give the SOLR 1.3 nightlies a spin, once I make it past my
: current deadlines and obligations.

if you wait a few days you won't need to use a nightly, 1.3 should
be official by then.



-Hoss