Query generation is different for search terms with and without "-"


Samuel Gutierrez
I am troubleshooting a ranking issue for search terms that contain a "-"
versus the same query without it, e.g. "high-tech" vs "high tech". The
field I am querying uses the standard tokenizer, so I would expect the
underlying Lucene query to be the same for both versions. However, the
debug output shows that they are generated differently. I know "-" must be
escaped because it has special meaning in Lucene, but escaping does not fix
the problem. With the "-" present, the pf2 edismax parameter appears to be
ignored and is omitted from the final query. We use sow=false because we
have multi-term synonyms and need to ensure they are included in the final
Lucene query. My expectation is that the final Lucene query should be based
on the output of the field analyzer. After briefly looking at the code for
ExtendedDismaxQParser, though, it appears that some string processing
happens outside the analysis step, and that causes the unexpected Lucene
query.


Solr Debug for "high tech":

parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
parsedquery_toString: "+(((Name_enUS:high)~0.4
(Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
(Name_enUS:"high tech"~4)~0.4",


Solr Debug for "high-tech":

parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
tech"~5)~0.4)",
parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
(Name_enUS:"high tech"~5)~0.4"

SolrConfig:

  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="omitHeader">true</str>
      <str name="indent">true</str>
      <str name="wt">json</str>
      <str name="mm">3&lt;75%</str>
      <str name="qf">Name_enUS</str>
      <str name="pf">Name_enUS</str>
      <str name="ps">5</str>
      <str name="pf2">Name_enUS</str>
      <str name="ps2">4</str>
      <str name="qs">3</str>
      <str name="tie">0.4</str>
      <str name="echoParams">explicit</str>
      <int name="rows">100</int>
      <str name="sow">false</str>
    </lst>
    <lst name="invariants">
      <str name="defType">edismax</str>
    </lst>
  </requestHandler>

Schema:

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
      </analyzer>
  </fieldType>


Using Solr 8.6.3

--
*The information contained in this message is the sole and exclusive
property of ***iHerb Inc.*** and may be privileged and confidential. It may
not be disseminated or distributed to persons or entities other than the
ones intended without the written authority of ***iHerb Inc.** *If you have
received this e-mail in error or are not the intended recipient, you may
not use, copy, disseminate or distribute it. Do not open any attachments.
Please delete it immediately from your system and notify the sender
promptly by e-mail that you have done so.*

Re: Query generation is different for search terms with and without "-"

Erick Erickson
This is a common point of confusion. There are two phases in creating a query:
query _parsing_ comes first, then the analysis chain runs on the parsed result.

So what e-dismax sees in the two cases is:

Name_enUS:“high tech” -> two tokens. Since there are two of them, pf2 comes into play.

Name_enUS:“high-tech” -> only one token, so pf2 doesn’t apply. Splitting it on the hyphen comes later.

It’s especially confusing because the field analysis then breaks up “high-tech” into two tokens that
look the same as “high tech” in the debug response, just without the pf2 phrase query:

Name_enUS:high
Name_enUS:tech
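
A minimal Python sketch of the two phases described above. It is a simplified model and not Solr's actual code: the function names are hypothetical, the regex is only a rough stand-in for the standard tokenizer, and stemming is left out. It shows why the word-pair phrase exists for one input and not the other, even though both end up with the same terms.

```python
import re

def split_clauses(user_query):
    """Phase 1 (parsing): split the raw query string on whitespace only."""
    return user_query.split()

def pf2_phrases(clauses):
    """One word-pair phrase per adjacent pair of parsed clauses."""
    return [f"{a} {b}" for a, b in zip(clauses, clauses[1:])]

def analyze(clause):
    """Phase 2 (analysis): rough stand-in for tokenizing and lowercasing."""
    return re.findall(r"\w+", clause.lower())

for q in ("high tech", "high-tech"):
    clauses = split_clauses(q)
    terms = [t for c in clauses for t in analyze(c)]
    print(q, "-> clauses:", clauses, "pf2:", pf2_phrases(clauses), "terms:", terms)
```

In this model "high-tech" is a single clause at parse time, so there is no adjacent pair to build a pf2 phrase from, while analysis still yields the terms high and tech.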

Best,
Erick

> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez <[hidden email]> wrote:

Re: Query generation is different for search terms with and without "-"

matthew sporleder
Is the normal/standard solution here to regex remove the '-'s and
combine them into a single token?
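
For illustration, a client-side rewrite along these lines could look like the Python sketch below. The function name and the default replacement are assumptions, and the thread does not settle whether this is the standard solution. Replacing the hyphen with a space makes the query string identical to "high tech"; passing an empty replacement joins the parts into one token instead.

```python
import re

# A hyphen between two word characters, e.g. the "-" in "high-tech".
# A leading "-" (the NOT operator, as in "-tech") is left untouched.
INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")

def normalize_hyphens(user_query, replacement=" "):
    """Rewrite intra-word hyphens before the query is sent to Solr."""
    return INTRA_WORD_HYPHEN.sub(replacement, user_query)

print(normalize_hyphens("high-tech"))       # high tech
print(normalize_hyphens("high-tech", ""))   # hightech
print(normalize_hyphens("-tech gadget"))    # -tech gadget
```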

On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson <[hidden email]> wrote:


Re: Query generation is different for search terms with and without "-"

Samuel Gutierrez
Are there any good workarounds/parameters we can use to fix this so it
doesn't have to be solved client side?

On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder <[hidden email]>
wrote:


Re: Query generation is different for search terms with and without "-"

Erick Erickson
Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:

*FilterFactory classes are _not_ what you want in this case; they are applied to individual tokens after parsing.

*CharFilterFactory classes are invoked on the entire input to the field, although I can’t say for certain that even that’s early enough.

There are two other options to consider:
StatelessScriptUpdateProcessorFactory
FieldMutatingUpdateProcessorFactory

Stateless... is probably easiest…
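
To make the char filter versus token filter distinction concrete, here is a small Python model of an analysis chain. It is only a sketch: the function names are hypothetical, the regex is a rough stand-in for the tokenizer, and it says nothing about whether a char filter runs early enough to affect pf2, which is the open question above.

```python
import re

def analyze(text, char_filters=(), token_filters=()):
    # Char filters see the entire field input as one string.
    for cf in char_filters:
        text = cf(text)
    # Tokenizer: rough stand-in that splits on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Token filters see one token at a time, after tokenization.
    for tf in token_filters:
        tokens = [tf(t) for t in tokens]
    return tokens

def strip_hyphens(s):
    return s.replace("-", "")

# As a char filter, the hyphen is removed before tokenization.
print(analyze("high-tech", char_filters=[strip_hyphens]))   # ['hightech']
# As a token filter, the tokenizer has already split on the hyphen.
print(analyze("high-tech", token_filters=[strip_hyphens]))  # ['high', 'tech']
```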

Best,
Erick

> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez <[hidden email]> wrote:

Re: Query generation is different for search terms with and without "-"

Walter Underwood
Ages ago at Netflix, I fixed this with a few hundred synonyms. If you are working with
a fixed vocabulary (movie titles, product names), that can work just fine.

babysitter, baby-sitter, baby sitter
fullmetal, full-metal, full metal
manhunter, man-hunter, man hunter
spiderman, spider-man, spider man
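
With a fixed vocabulary, lines in this shape can be generated rather than typed by hand. The Python sketch below assumes each compound is known as a pair of parts; the helper name is hypothetical, and how the resulting lines are wired into Solr is not covered here.

```python
def synonym_line(first, second):
    """Return the joined, hyphenated, and spaced variants of one compound."""
    return f"{first}{second}, {first}-{second}, {first} {second}"

compounds = [("baby", "sitter"), ("full", "metal"),
             ("man", "hunter"), ("spider", "man")]

for first, second in compounds:
    print(synonym_line(first, second))
```

Run as written, this prints the same four lines as the list above.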

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Nov 25, 2020, at 9:26 AM, Erick Erickson <[hidden email]> wrote: