offsets issues with multiword synonyms since LUCENE_33

Marc Sturlese
Has anyone noticed this problem and solved it somehow (without setting LUCENE_33 in solrconfig.xml)?
https://issues.apache.org/jira/browse/LUCENE-3668

Thanks in advance
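
For readers unfamiliar with the workaround being ruled out above: it presumably refers to the match-version setting in solrconfig.xml. A minimal sketch, assuming an otherwise stock config; the note about the fallback to the older filter is an assumption, not something stated in this thread:

```xml
<!-- solrconfig.xml (sketch): pins analysis components to Lucene 3.3 behaviour.
     Assumption: with a match version below 3.4, SynonymFilterFactory
     falls back to the previous synonym filter implementation. -->
<config>
  <luceneMatchVersion>LUCENE_33</luceneMatchVersion>
</config>
```

The drawback is that this setting applies to every version-sensitive analysis component, not just the synonym filter, which is likely why an alternative is being asked for.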

Re: offsets issues with multiword synonyms since LUCENE_33

Jack Krupansky-2
What is your specific example? There are lots of issues and "gotchas" with
synonyms. Is your example exactly identical to the referenced Jira, or
merely roughly similar? The exact example is needed to analyze these types
of issues.

And please be specific about which term in the sequence has an incorrect
offset, including the actual offset vs. what you expected. Unless, of
course, your example is the exact one listed in that Jira. Sometimes bug
fixes do get lost.

-- Jack Krupansky


Re: offsets issues with multiword synonyms since LUCENE_33

Marc Sturlese
Well, an example would be:
synonyms.txt:
huge,big size

Then I have the docs:
1- The huge fox attacks first
2- The big size fox attacks first

Then if I query for huge, the highlights for each document are (note the spurious highlight on "fox" in the first one):

1- The <strong>huge</strong> <strong>fox</strong> attacks first
2- The <strong>big size</strong> fox attacks first

The analyzer looks like this:
<fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
  </analyzer>
</fieldType>

This worked with a previous version of Solr, but I couldn't make it work with 3.6, 4-alpha or 4-beta.

Re: offsets issues with multiword synonyms since LUCENE_33

Michael McCandless-2
In reply to this post by Marc Sturlese
See also SOLR-3390.

Some cases have been addressed. E.g., if you match domain name system
-> dns, then dns will have correct offsets spanning the full phrase
"domain name system" in the input. (However, QueryParser won't work
because a query for "domain name system" is pre-split on whitespace, so
the synonym never matches.)
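
To make this first case concrete, here is a minimal sketch of an index-time-only contraction. The field type name `text_contract` and the file name `synonyms_contract.txt` are made up for illustration, and the rule shown in the comment is assumed to be the only entry in that file:

```xml
<!-- Hypothetical field type: a multiword phrase is contracted to one token at index time.
     Assumed content of synonyms_contract.txt (hypothetical file name):
       domain name system => dns
-->
<fieldType name="text_contract" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The filter sees the whole token stream here, so the three-word phrase can match. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_contract.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <!-- No synonym filter: the query parser pre-splits on whitespace,
         so a multiword rule would not match here anyway. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup like this, a query for dns should hit documents containing the full phrase. Whether that trade-off is acceptable depends on the application, since the explicit mapping replaces the original words in the index.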

But for the reverse case, which I call "expanding" (i.e., match dns ->
domain name system), the results are not "correct" (or at least
different from the previous SynFilter impl): the three tokens are
overlapped onto subsequent tokens, resulting in highlighting the wrong
tokens. However, QueryParser will work "correctly" for the query
"domain name system"...

But, I'd like to ask: why do apps want to "expand" (replace a match
with more than one input token, i.e., the dns -> domain name system
case)? Is it ONLY because of QueryParser's limitation (that it
pre-splits on whitespace)? Or are there other realistic use cases?

Mike McCandless

http://blog.mikemccandless.com


Re: offsets issues with multiword synonyms since LUCENE_33

Konrad Lötzsch
In reply to this post by Marc Sturlese
I don't know whether this was discussed previously,
but unless you tell the SynonymFilter not to break up your synonyms
(breaking them up might be the default), the parts of each synonym get new
word positions. So you could use a KeywordTokenizer to avoid that behaviour:

         <filter class="solr.SynonymFilterFactory"
             synonyms="Synonyms.txt"
             ignoreCase="true"
             expand="false"
             tokenizerFactory="solr.KeywordTokenizerFactory"
         />

with regards,
konrad.
