Multi words query time synonyms


Dominique Bejean
Hi,

I am trying multi-word query-time synonyms with Solr 6.6.2 and the
SynonymGraphFilterFactory filter, as explained in this article:
https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/

My field type is:

<fieldType name="textSyn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
      <filter class="solr.FrenchMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
      <filter class="solr.FrenchMinimalStemFilterFactory"/>
    </analyzer>
  </fieldType>


synonyms.txt contains the line

om, olympique de marseille


The order of the words in my query affects the query that edismax
generates:

q={!edismax qf='name_text_gp' v=$qq}
&sow=false
&qq=...
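
To reproduce the request, a minimal Python sketch that builds the select URL with debugQuery=true, so that the response carries parsedquery_toString in its debug section. Only q, sow and qq come from the request above; the host, port, core name (mycore), rows=0 and wt=json are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical base URL: adjust host, port and core name to your setup.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def build_debug_url(qq, base=SOLR_SELECT):
    """Return a select URL that runs the edismax query with debug output."""
    params = {
        "q": "{!edismax qf='name_text_gp' v=$qq}",
        "sow": "false",
        "qq": qq,
        "debugQuery": "true",  # asks Solr to include the debug section
        "rows": "0",           # assumption: only the parsed query matters here
        "wt": "json",          # assumption: JSON response format
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    for qq in ["om maillot", "maillot om"]:
        print(build_debug_url(qq))
```

Fetching each URL and reading debug.parsedquery_toString from the JSON response should give strings in the same format as those quoted in this post.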

With "qq=om maillot" or "qq=olympique de marseille maillot", I can see the
synonym expansion. It works as expected.

"parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil
+name_text_gp:maillot) name_text_gp:om))",
"parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu
+name_text_gp:marseil +name_text_gp:maillot)))",


With "qq=maillot om" or "qq=maillot olympique de marseille", I get the
same generated query for both:

"parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
"parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
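
The difference between the two word orders can be made explicit with a small Python sketch that extracts the terms from each parsedquery_toString quoted above and reports which ones are missing. The expected term set (om, olympiqu, marseil, maillot) is an assumption about what full expansion should produce, and the helper name terms is made up for this example.

```python
import re


def terms(parsed_query, field="name_text_gp"):
    """Return the set of terms the parsed query searches in `field`."""
    return set(re.findall(re.escape(field) + r":(\w+)", parsed_query))


# parsedquery_toString values copied from the debug output quoted above
parsed = {
    "om maillot":
        "+(((+name_text_gp:olympiqu +name_text_gp:marseil "
        "+name_text_gp:maillot) name_text_gp:om))",
    "olympique de marseille maillot":
        "+((name_text_gp:om (+name_text_gp:olympiqu "
        "+name_text_gp:marseil +name_text_gp:maillot)))",
    "maillot om":
        "+((name_text_gp:maillot) (name_text_gp:om))",
    "maillot olympique de marseille":
        "+((name_text_gp:maillot) (name_text_gp:om))",
}

# Assumption: the terms a fully expanded query should contain.
expected = {"om", "olympiqu", "marseil", "maillot"}

for qq, pq in parsed.items():
    missing = expected - terms(pq)
    print(qq, "-> missing:", sorted(missing))
```

For the first two queries nothing is missing; for the two queries that start with "maillot", olympiqu and marseil are absent from the parsed query.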

I don't understand these generated queries. In the first one, the synonym
expansion appears to be ignored. In the second one, it is not ignored, but
only the single-term synonym is used.


What is wrong with the way I am doing this?

Regards

Dominique

--
Dominique Béjean
06 08 46 12 43

Re: Multi words query time synonyms

Dominique Bejean
Hi,

More info.

When I test the analysis for the field type, the synonyms are correctly
expanded for all of the following expressions:

om maillot
maillot om
olympique de marseille maillot
maillot olympique de marseille

The resulting output always includes the following terms (obviously not
always in the same order):

olympiqu om marseil maillot


So I suspect an issue with the edismax query parser.

Regards.

Dominique



Re: Multi words query time synonyms

sarowe
Hi Dominique,

Looks like it’s a bug, not sure where exactly though.  Can you please create a JIRA?

I can see the same behavior on master too, not just on the releases/lucene-solr/6.6.2 tag.

One interesting thing I found is that if I remove the stop filter from the query analyzer, I get the following for qq=“maillot om”:

+((name_text_gp:maillot) (((+name_text_gp:olympiqu +name_text_gp:de +name_text_gp:marseil) name_text_gp:om)))

(btw my stop list only has “de” on it)

Thanks,

--
Steve
www.lucidworks.com


Re: Multi words query time synonyms

Dominique Bejean
Hi Steve,

Thank you for your response.
I created the JIRA issue: SOLR-11968

Feel free to add your comments there.

Regards.

Dominique



Re: Multi words query time synonyms

Dominique Bejean
Steve,

Following your comment, I ran this test:

1/ Put the SynonymGraphFilterFactory after the StopFilterFactory in the
query-time analysis chain:

    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="gosport_synonyms.txt"
            ignoreCase="true" expand="true"/>
      <filter class="solr.FrenchMinimalStemFilterFactory"/>
    </analyzer>

2/ Remove the stop word from the synonyms file:

om, olympique marseille
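
For a larger synonyms file, step 2 could be scripted. The following is a minimal Python sketch, assuming simple comma-separated synonym lines and a stop list that is already loaded into memory; the function name strip_stop_words is made up for this example, and comment lines or explicit "=>" mappings are not given any special handling.

```python
def strip_stop_words(synonym_line, stop_words):
    """Remove stop words from every comma-separated entry of a synonyms line."""
    stop = {w.lower() for w in stop_words}
    entries = []
    for entry in synonym_line.split(","):
        kept = [tok for tok in entry.split() if tok.lower() not in stop]
        if kept:  # drop entries that consisted only of stop words
            entries.append(" ".join(kept))
    return ", ".join(entries)


print(strip_stop_words("om, olympique de marseille", {"de"}))
# prints: om, olympique marseille
```

The comparison is case-insensitive to match ignoreCase="true" on the stop filter.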


The parsed query strings are:

for "om maillot"
"parsedquery_toString":"+(((((+name_text_gp:olympiqu +name_text_gp:marseil)
name_text_gp:om)) (name_text_gp:maillot))~1)",

for "olympique de marseille maillot"
"parsedquery_toString":"+((((name_text_gp:om (+name_text_gp:olympiqu
+name_text_gp:marseil))) (name_text_gp:maillot))~1)",

for "maillot om"
"parsedquery_toString":"+(((name_text_gp:maillot) (((+name_text_gp:olympiqu
+name_text_gp:marseil) name_text_gp:om)))~1)",

for "maillot olympique de marseille"
"parsedquery_toString":"+(((name_text_gp:maillot) ((name_text_gp:om
(+name_text_gp:olympiqu +name_text_gp:marseil))))~1)",


The query results are the same for all four queries.

This looks like it could be an acceptable workaround.

Thank you

Dominique


