Synonyms list breaks solr

matt connolly
I'm setting up Solr to run on a web site I'm working on.

Basically, if I use no synonym file, Solr works really well for finding text; the Porter stemmer filter is great.

It also works with a small synonym file, like the one in the example, which defines Television,TV.

But when I add a large synonym file (approximately 7000 synonyms), everything breaks down. Even queries for exact words don't return any results.

Could it be that there is something in the synonym file (a non-ASCII character, for example) that is causing the synonym filter to do something weird, like not pass any tokens?

Could it be that the synonym filter is now expanding practically everything, so that no document is considered relevant enough? (I tried setting defaultOperator="OR"; it made no difference.)


My text field is defined in the schema as:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Thanks for any help,
Matt


Re: Synonyms list breaks solr

matt connolly
I discovered that moving the synonym expansion to index time rather than query time works just fine with my synonym list.

I'd still like to know why expanding at query time doesn't work, though... :(
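
For readers following along, here is a minimal sketch of what that change might look like, based on the field type posted at the top of the thread. The revised schema was not posted, so the exact position of the synonym filter in the index chain is an assumption; it simply mirrors where the filter sat in the original query analyzer.

```xml
<!-- Sketch only: the field type from the first post, with synonym expansion
     moved from the query analyzer to the index analyzer. The position of the
     synonym filter in the index chain is an assumption. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <!-- Synonyms are now expanded once, when documents are indexed -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <!-- No SynonymFilterFactory here: query text is no longer expanded -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that a change to the index analyzer only affects documents indexed afterwards, so existing documents would need to be reindexed before the synonyms take effect.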

Re: Synonyms list breaks solr

Grant Ingersoll-2
In reply to this post by matt connolly
Are there any errors in your logs?  Have you tried looking at the  
admin analysis page to see how text gets treated on that field?

Are you sure the large synonym file is formatted correctly?

-Grant

On Jul 11, 2008, at 7:23 AM, matt connolly wrote:

>
> I'm setting up Solr to run on a web site I'm working on.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Re: Synonyms list breaks solr

matt connolly
There are no errors in my log, just a list of GET, HEAD and POST entries; it looks just like an Apache access log.

There are a few entries in the log file that have " " and "-" in them, but as far as I can see that isn't a problem.

Is there a way to make Solr's logging a bit more verbose to help debug this?

-Matt


Re: Synonyms list breaks solr

matt connolly
In reply to this post by Grant Ingersoll-2
Hmmmm... The Analyzer shows me *almost* what I am expecting to see. When I show it being verbose with debug info, I can see exactly what is going on, which is great. Thanks for the tip.

What's happening (for most of my test cases) is that some of the synonyms are multiple words (and it's a big synonym list), and then the word delimiter is creating even more terms. The analyzer finds a match on individual words (highlighted words), but the query engine builds a more complex query.

Consider:

a document with the text "the quick brown fox jumps over the lazy dog" in a "body" field of type "text", as in the schema mentioned above.

a synonym list like:

dog,canine,mut,domestic dog,barker
wretch,dog
hound,dog,pooch,doggy

and query for the word "dog"

The analyzer creates two terms, like this:

Term position 1: dog,canin,mut,domest,barker,wretch,hound,pooch,doggi
Term position 2: dog

(here, the synonym "domestic dog" for "dog" creates two tokens: "domestic" and "dog")

And highlights the word dog in the query. So the analyzer can find it.

The query is parsed into: MultiPhraseQuery(text:"(dog canin mut domest barker wretch hound pooch doggi) dog")

This only matches a document containing "dog dog", "canine dog", "domestic dog", etc. If these words are separated, e.g. "a canine is a kind of dog", then we get no match! :(

Why does a two word synonym require a two word match for all synonyms?

I was also hoping that the synonym list might be one-way, i.e. dog expands to hound but not wretch in the example above. Is there a way to do this too? (That might be a story for another thread.)
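
On the one-way question, the Solr synonym file format also accepts explicit mappings written with "=>", in addition to the comma-separated equivalence lists used above. The fragment below is a rough sketch rather than a tested configuration: the rule in the comment is a hypothetical line for synonyms.txt, and the filter declaration is copied from the schema in the first post.

```xml
<!-- Hypothetical synonyms.txt rule (plain text, shown here as a comment):

       dog => dog,hound,pooch,doggy

     The "=>" form is an explicit mapping: a token matching the left-hand
     side is replaced by everything on the right-hand side. "dog" is
     repeated on the right so that the original term is kept. With only
     this rule in the file, "hound" is left unchanged and "wretch" is
     never produced. -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
```

The expand attribute only affects comma-separated equivalence lists, so it has no effect on a rule written with "=>".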

Thanks,
Matt



Re: Synonyms list breaks solr

hossman
In reply to this post by matt connolly

: I discovered that moving the synonym expansion to at index time rather than
: query time works just fine with my synonym list.
:
: I'd still like to know why it doesn't work expanding at query time
: though.... :(

did you read the comments in the wiki about SynonymFilterFactory?  
particularly the part after the example where it says "Keep in mind..."

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter



-Hoss


Re: Synonyms list breaks solr

Chris Harris-2
In reply to this post by matt connolly
Matt,

If I understand you correctly, then the log you mention is what your
servlet container / web server is logging, not what Solr is logging.
Solr logging needs to be configured separately. See

http://wiki.apache.org/solr/FAQ?highlight=(logging)#head-ffe035452f21ffdb4e4658c2f87777f6553bd6ca

If you are using the default jetty setup (example/start.jar), you
could also consult

http://wiki.apache.org/solr/LoggingInDefaultJettySetup

Hope this helps.

Chris

On Fri, Jul 11, 2008 at 6:55 AM, matt connolly <[hidden email]> wrote:

>
> There's no errors in my log, just a list of GET HEAD and POST entries, it
> looks just like an Apache access log.
>
> There are a few entries in the log file that have " " and "-" in them, but
> as far as I can see that isn't a problem.
>
> Is there a way to make Solr's logging a bit more verbose to help debug this?
>
> -Matt