wildcards match end-of-word?

wildcards match end-of-word?

Fischer, Stephen
Hi,

I am a Solr newbie. I was surprised to discover that a search for kinase* returned fewer results than a search for kinase.

Then I read the wildcard documentation (https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-WildcardSearches) and thought I saw why: kinase* will not match the word "kinase".

Our end users won't expect this behavior. Presumably the solution would be for them (actually us, on their behalf) to use kinase* OR kinase.

But that is kind of a hack.

Is there a way we can configure Solr to have wildcards match on end-of-word?

Thanks,
Steve

Re: wildcards match end-of-word?

Walter Underwood
“kinase*” does match “kinase”. The page you linked to defines “*” as matching “Multiple characters (matches zero or more sequential characters)”.

If it is not matching, you may be using a stemmer on that field or doing some other processing that changes the tokens.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)



RE: [External] Re: wildcards match end-of-word?

Fischer, Stephen
Thanks *very much* for replying.  (You're right, I missed the "zero or more," having focused only on the examples in the doc.  Oops).

New discovery.  kin*ase returns 0 hits.   Below I show the debug output and the pertinent parts of the schema.   Maybe you can spot my problem?

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"kin*ase",
      "defType":"edismax",
      "debug":"all",
      "qf":"TEXT__gene_product",
      "fl":"id,document-type,TEXT__gene_product,score",
      "stopwords":"true"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  },
  "debug":{
    "rawquerystring":"kin*ase",
    "querystring":"kin*ase",
    "parsedquery":"+DisjunctionMaxQuery((TEXT__gene_product:kin*ase))",
    "parsedquery_toString":"+(TEXT__gene_product:kin*ase)",
    "explain":{},
    "QParser":"ExtendedDismaxQParser",
    "altquerystring":null,
    "boost_queries":null,
    "parsed_boost_queries":[],
    "boostfuncs":null,
    "timing":{
      "time":2.0,
      "prepare":{
        "time":1.0,
        "query":{
          "time":1.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}},
      "process":{
        "time":1.0,
        "query":{
          "time":0.0},
        "facet":{
          "time":0.0},
        "facet_module":{
          "time":0.0},
        "mlt":{
          "time":0.0},
        "highlight":{
          "time":0.0},
        "stats":{
          "time":0.0},
        "expand":{
          "time":0.0},
        "terms":{
          "time":0.0},
        "debug":{
          "time":0.0}}}}}

    <dynamicField name="TEXT__*" type="text_en_splitting" indexed="true" stored="true" storeOffsetsWithPositions="true" termVectors="true"/>

    <fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
             possible with WordDelimiterGraphFilter in conjunction with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.FlattenGraphFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
             possible with WordDelimiterGraphFilter in conjunction with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>




Re: wildcards match end-of-word?

Erick Erickson
In reply to this post by Fischer, Stephen
Steve:

You _really_ want to get acquainted with the admin UI/Analysis page ;). Choose a core/collection and you should see the choice. It shows you exactly what transformations your data goes through. If you hover over the light gray pairs of letters, you’ll get a tooltip showing you what part of your analysis chain is responsible for a particular change. I un-check the “verbose” box 95% of the time BTW.

The critical bit is that what comes out of the end of the analysis pipe are the tokens that are actually _in_ the index. From there, problems like this make more sense.

My bet is that, as Walter says, you have a stemmer in the analysis chain and the actual token in the index is “kinas” so of course “kinase*” won’t be found. By adding OR kinase to the query, that token is stemmed to “kinas” and matches.

Also, adding &debug=query to your URL will show you what the query looks like after parsing and analysis, also a major tool for figuring out what’s really happening.
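As a rough illustration of the debug parameter, a request URL could be assembled like this. The host, port and collection name (`localhost:8983`, `mycollection`) are placeholders and not taken from this thread; `q`, `defType` and `qf` are borrowed from Steve's debug output.

```python
from urllib.parse import urlencode

# Placeholder endpoint: host, port and collection name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def build_debug_url(query: str, field: str) -> str:
    """Return a select URL that asks Solr to report how the query was parsed."""
    params = {
        "q": query,
        "defType": "edismax",  # same parser as in Steve's debug output
        "qf": field,
        "debug": "query",      # only the query-parsing part of the debug info
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_debug_url("kinase*", "TEXT__gene_product")
print(url)
```

The `parsedquery` entry in the response is the part to compare against the tokens shown on the Analysis page.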

Wildcards are not stemmed, which can lead to surprising results. There’s no perfect answer here. Suppose wildcards _were_ stemmed. Then you’d have to try to explain why “running*” returned a doc with only “run” or “runner” or “runs” or... in it, but searching for “runnin*” did not, due to the stemmer not recognizing it as a stemmable word.
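The mismatch can be sketched outside Solr with a toy model. Everything here is an assumption made for illustration: `toy_stem` is a made-up suffix stripper (not the algorithm of any Solr stemmer), and it is chosen only so that “kinase” becomes “kinas”, as in the bet above.

```python
from fnmatch import fnmatchcase


def toy_stem(token: str) -> str:
    """Hypothetical stemmer: strips one trailing 'es', 'e' or 's'."""
    token = token.lower()
    for suffix in ("es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Terms as they would sit in the index after index-time stemming.
indexed_terms = {toy_stem(w) for w in ["kinase", "kinases", "protein"]}


def term_query(word: str) -> bool:
    """A plain term goes through the same stemmer before lookup."""
    return toy_stem(word) in indexed_terms


def wildcard_query(pattern: str) -> list:
    """A wildcard pattern is matched against the raw index terms, unstemmed."""
    return sorted(t for t in indexed_terms if fnmatchcase(t, pattern.lower()))


print(term_query("kinase"))       # plain term: stemmed to the indexed form
print(wildcard_query("kinase*"))  # pattern is longer than any indexed term
print(wildcard_query("kinas*"))   # pattern happens to fit the stem
```

In this model the plain term finds the document while `kinase*` finds nothing, which is the same shape as the behavior reported at the top of the thread.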

Finally, one of my personal hot buttons is wildcards in general. They’re very often over-used because people are used to simple search capabilities. Something about “if your only tool is a hammer, every problem looks like a nail”. That gets into training users too though...

Best,
Erick



Re: wildcards match end-of-word?

Sotiris Fragkiskos
Hi Erick,
thanks very much for this information. It was immensely useful; I always
had the same question!
I'm now looking at the Analysis page, and finally I don't have to rely on an
external online stemmer to see what Solr *probably* stemmed the term to!
But I still can't make the asterisk and question mark work inside the term,
even in the earlier parts of it.
e.g. tr?ining
I would expect it to match train. But it doesn't.
PSF at the end just shows t | ain
every line before that actually shows t | aining (ST,SF,SF,LCF,EPF,SKMF)
Am I doing something very wrong??

thanks again!
Sotiri


RE: [External] Re: wildcards match end-of-word?

Fischer, Stephen
Folks,

I am seeing very strange (bad) wildcard behavior (Solr 8).

"kinase" finds hits as expected.

"kin*ase" and "kin*se" find 0 results. "kinase*" matches only values like "kinase," and "kinase-" but not "kinase".

I have done the analysis as Erick suggested (thanks!) but it is not helping me understand why we'd have this problem.

I have put together 12 screenshots from the Solr web UI that show in detail:
- the queries I ran to get the results above
- various analyses trying to understand why
- the schema for the fieldType in question

https://docs.google.com/presentation/d/10fIAesqkTnvmJBFaerEhnqWhSiaEvVW7u9jE1nX564Q/edit?usp=sharing

thanks,
steve


Re: [External] Re: wildcards match end-of-word?

Sotiris Fragkiskos
Hi,
I could be wrong, but I'm starting to think that it has to do with the
fieldType. In our case, wildcards don't seem to work at all with text_en
types, but they do work with string types.


RE: [External] Re: wildcards match end-of-word?

Fischer, Stephen
In reply to this post by Fischer, Stephen
Also, if helpful, here is our solrconfig.xml:
https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf/solrconfig.xml

Thanks again, from a Solr Newbie,
steve


Re: [External] wildcards match end-of-word?

Jan Høydahl / Cominvent
In reply to this post by Fischer, Stephen
Be aware that if you search a field with stemming, the index will only contain the stems, i.e. «cars» and «caring» may both be indexed as «car». When you do a wildcard search, most analysis, including stemming, is skipped, so you are only targeting the exact tokens that happen to be in that field. Thus a search for «ca*s» or «c*ing» or «cars*» will not match, but «car*» and even «c*r» will match both these words, which would be surprising, right?

So if wildcard search is a key feature, you had better provide a copyField with a fieldType in your schema that does not do stemming, probably only StandardTokenizer and LowerCaseFilter. Then use that field for your wildcard queries instead of the generic stemmed field.
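A small simulation of the two-field idea, under stated assumptions: the field names `text_stemmed` and `text_exact`, the `STEMS` lookup table and the two sample documents are all hypothetical, and the routing in `search` stands in for whatever the application does to pick a field. It is a sketch of the principle, not of Solr's internals.

```python
from fnmatch import fnmatchcase

# Hypothetical stem table standing in for a real stemmer.
STEMS = {"cars": "car", "caring": "car"}

docs = {1: "Cars", 2: "caring"}


def analyze_stemmed(text: str) -> set:
    """Lowercase, split on whitespace, then stem (the generic search field)."""
    return {STEMS.get(w, w) for w in text.lower().split()}


def analyze_exact(text: str) -> set:
    """Lowercase and split only (the unstemmed copy of the field)."""
    return set(text.lower().split())


# Two views of the same source text, like a field and its copyField target.
index = {
    "text_stemmed": {d: analyze_stemmed(t) for d, t in docs.items()},
    "text_exact": {d: analyze_exact(t) for d, t in docs.items()},
}


def wildcard(field: str, pattern: str) -> list:
    """Match the pattern against the raw terms stored in one field."""
    p = pattern.lower()
    return sorted(d for d, terms in index[field].items()
                  if any(fnmatchcase(t, p) for t in terms))


def search(query: str) -> list:
    """Route wildcard queries to the exact field, plain terms to the stemmed one."""
    if "*" in query or "?" in query:
        return wildcard("text_exact", query)
    stem = STEMS.get(query.lower(), query.lower())
    return sorted(d for d, terms in index["text_stemmed"].items() if stem in terms)


print(wildcard("text_stemmed", "ca*s"))  # nothing: only the stem is indexed
print(search("ca*s"))                    # routed to the unstemmed field
print(search("caring"))                  # plain term still benefits from stemming
```

Keeping both fields costs index space, but it lets plain-term searches keep their stemming while wildcard patterns see the words as users typed them.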

Jan



Re: wildcards match end-of-word?

Walter Underwood
In reply to this post by Sotiris Fragkiskos
Remove the stopword and stemmer filters from your schema and reindex.

With a stopword filter in place, stopwords are dropped from the index, which means you can never match “vitamin a”.

Stemming interferes with wildcard matches. Either stem or do wildcards on a field, not both.

Also, what do your users expect to get with wildcard matches? Those are a slow and imprecise way to search. There is almost always a better way.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)
