Complexphrase treats wildcards differently than other query parsers


Bjarke Buur Mortensen
Hi list,

I'm trying to search for the term funktionsnedsättning*.
In my analyzer chain I use a MappingCharFilterFactory to change ä to a,
so I would expect funktionsnedsättning* to be translated to
funktionsnedsattning*.

If I use e.g. the lucene query parser, this is indeed what happens:
...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives me
"rawquerystring":"funktionsnedsättning*", "querystring":
"funktionsnedsättning*", "parsedquery":"content_ol:funktionsnedsattning*"
and 15 documents returned.

Trying the same with complexphrase gives me:
...debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning* gives me
"rawquerystring":"funktionsnedsättning*", "querystring":
"funktionsnedsättning*", "parsedquery":"content_ol:funktionsnedsättning*"
and 0 documents. Notice how ä has not been changed to a.
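
To make the difference between the two parsedquery values concrete, here is a minimal Java sketch of the normalization that the lucene parser output suggests: decode the URL-encoded q parameter, then apply a character mapping to the prefix while leaving the trailing * alone. The names PrefixTermSketch, decodeQ and normalizePrefix, and the one-entry mapping table, are assumptions made for illustration; this is not how Solr or MappingCharFilterFactory is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a stand-in for "char mapping applied to a prefix query".
class PrefixTermSketch {

    // Decode the q parameter as it appears in the request URL.
    static String decodeQ(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Apply the mapping to every character except a trailing '*'.
    static String normalizePrefix(String term, Map<Character, String> mapping) {
        boolean isPrefix = term.endsWith("*");
        String text = isPrefix ? term.substring(0, term.length() - 1) : term;
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String mapped = mapping.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return isPrefix ? out.append('*').toString() : out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> mapping = new HashMap<>();
        mapping.put('\u00e4', "a"); // a-umlaut => a, the mapping described in the post

        String raw = decodeQ("funktionsneds%C3%A4ttning*");
        System.out.println(raw);                           // the rawquerystring
        System.out.println(normalizePrefix(raw, mapping)); // the mapped prefix term
    }
}
```

With the single mapping above, the decoded term funktionsnedsättning* comes out as funktionsnedsattning*, which matches the parsedquery reported for defType=lucene. The complexphrase output corresponds to the mapping step being skipped.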

How can this be? Is complexphrase somehow skipping the analysis chain for
multiterms, even though the components, and in particular
MappingCharFilterFactory, are multi-term aware?

Are there any configuration gotchas that I'm not aware of?

Thanks for the help,
Bjarke Buur Mortensen
Senior Software Engineer, Eluence A/S

Re: Complexphrase treats wildcards differently than other query parsers

Emir Arnautović
Hi Bjarke,
It is not multiterm that is causing the query parser to skip the analysis chain, but the wildcard. The majority of query parsers do not analyse the query string if it contains wildcards.

HTH
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen <[hidden email]> wrote:
> [...]

Re: Complexphrase treats wildcards differently than other query parsers

Bjarke Buur Mortensen
Well, according to
https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
multiterm means

wildcard
range
prefix

so that is the sense in which I'm using the word. The same article explains
that analysis is performed on wildcard terms if the analyzer components are
multi-term aware.
Furthermore, both lucene and dismax do the correct analysis, so I don't
think your statement about the majority of query parsers skipping
analysis for wildcards is right.
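
The idea of a multi-term aware chain can be sketched in plain Java: for a wildcard, prefix or range term, only the components flagged as multi-term aware are applied, so the term gets normalized without being stemmed or otherwise rewritten. The types below (ChainSketch, Component) and the three toy components are hypothetical stand-ins chosen for illustration, not Lucene or Solr classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical model of an analysis chain; not the Lucene/Solr API.
class ChainSketch {

    static final class Component {
        final String name;
        final boolean multiTermAware;
        final UnaryOperator<String> fn;

        Component(String name, boolean multiTermAware, UnaryOperator<String> fn) {
            this.name = name;
            this.multiTermAware = multiTermAware;
            this.fn = fn;
        }
    }

    private final List<Component> components = new ArrayList<>();

    ChainSketch add(String name, boolean multiTermAware, UnaryOperator<String> fn) {
        components.add(new Component(name, multiTermAware, fn));
        return this;
    }

    // Full analysis: every component runs (ordinary term queries).
    String analyze(String text) {
        String out = text;
        for (Component c : components) {
            out = c.fn.apply(out);
        }
        return out;
    }

    // Multi-term analysis: only the multi-term aware components run.
    String analyzeMultiTerm(String text) {
        String out = text;
        for (Component c : components) {
            if (c.multiTermAware) {
                out = c.fn.apply(out);
            }
        }
        return out;
    }

    static ChainSketch example() {
        return new ChainSketch()
            .add("charMapping", true, s -> s.replace('\u00e4', 'a'))
            .add("lowercase", true, s -> s.toLowerCase(Locale.ROOT))
            // A crude "stemmer" that should not touch partial terms.
            .add("stripPluralS", false, s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s);
    }

    public static void main(String[] args) {
        ChainSketch chain = example();
        System.out.println(chain.analyze("Cats"));          // cat
        System.out.println(chain.analyzeMultiTerm("Cats")); // cats
    }
}
```

In this model the wildcard characters are assumed to be stripped before the text is passed in and re-attached afterwards; only the term text goes through the chain.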

So I'm still confused as to why complexphrase does things differently.

Thanks,
/Bjarke

2017-10-05 10:16 GMT+02:00 Emir Arnautović <[hidden email]>:

> [...]

Re: Complexphrase treats wildcards differently than other query parsers

Emir Arnautović
Hi Bjarke,
You are right - I jumped to a wrong (outdated) conclusion as the simplest answer to your question. I guess looking at the code could give you an answer.

Thanks,
Emir

> On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen <[hidden email]> wrote:
> [...]

Re: Complexphrase treats wildcards differently than other query parsers

Bjarke Buur Mortensen
2017-10-05 11:29 GMT+02:00 Emir Arnautović <[hidden email]>:

> Hi Bjarke,
> You are right - I jumped into wrong/old conclusion as the simplest answer
> to your question.

No problem :-)

> I guess looking at the code could give you an answer.

This is what I would like to avoid out of fear that my head would explode
;-)



RE: Complexphrase treats wildcards differently than other query parsers

Allison, Timothy B.
What version of Solr are you using?

I thought this had been fixed fairly recently, but I can't quickly find the JIRA.  Let me take a look.

Best,

             Tim

This was one of the initial reasons for my SpanQueryParser (LUCENE-5205) [1] [2], which handles analysis of multiterms even in phrases.

[1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
[2] https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205/6.6-0.1 

-----Original Message-----
From: Bjarke Buur Mortensen [mailto:[hidden email]]
Sent: Thursday, October 5, 2017 6:28 AM
To: [hidden email]
Subject: Re: Complexphrase treats wildcards differently than other query parsers

[...]

RE: Complexphrase treats wildcards differently than other query parsers

Allison, Timothy B.
There's every chance that I'm missing something at the Solr level, but at the Lucene level it _looks_ like ComplexPhraseQueryParser is still not applying analysis to multiterms.

When I call this on 7.0.0:

    QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName, analyzer);
    return qp.parse(qString);

where the analyzer is a mock "uppercase vowel" analyzer [1] and the qString is:

"the* quick~" the* quick~ the quick

I get this:

"the* quick~" name:the* name:quick~2 name:thE name:qUIck


[1] https://github.com/tballison/lucene-addons/blob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Thursday, October 5, 2017 8:02 AM
To: [hidden email]
Subject: RE: Complexphrase treats wildcards differently than other query parsers

[...]

Re: Complexphrase treats wildcards differently than other query parsers

Bjarke Buur Mortensen
Thanks Tim,
that might be what I'm experiencing. I'm actually quite certain of it :-)

Do you remember any reason that multi term analysis is not happening in
ComplexPhraseQueryParser?

I'm on 6.6.1, so latest on the 6.x branch.

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. <[hidden email]>:

> [...]

RE: Complexphrase treats wildcards differently than other query parsers

Allison, Timothy B.
Probably the usual reasons: no one has submitted a patch yet, or it could be a regression after LUCENE-7355.

See also:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201407.mbox/%3C1D06A081892ADF4589BD83EE24B9DC302597113E@...%3E

I'll take a look.


-----Original Message-----
From: Bjarke Buur Mortensen [mailto:[hidden email]]
Sent: Thursday, October 5, 2017 8:52 AM
To: [hidden email]
Subject: Re: Complexphrase treats wildcards differently than other query parsers

[...]

RE: Complexphrase treats wildcards differently than other query parsers

Allison, Timothy B.
After some more digging, I'm wrong even at the Lucene level.

When I use the CustomAnalyzer and make my UC vowel mock filter MultitermAware, I get this with Lucene in trunk:

"the* quick~" name:thE* name:qUIck~2 name:thE name:qUIck

So, there's room for improvement with phrases, but the regular multiterms should be ok.

Still no answer for you...
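
For anyone reading the parsed-query output above: the mock filter itself is not shown in this thread, so the following is only a minimal sketch that assumes it upper-cases the vowels a, e, i, o and u. The class and method names are made up for illustration. It shows why name:thE* indicates that the wildcard term went through analysis, whereas an unanalysed term would stay as the*.

```java
/**
 * Hypothetical stand-in for the mock "uppercase vowel" filter mentioned in
 * this thread. Assumption: it upper-cases a, e, i, o, u and leaves every
 * other character (including the wildcard and fuzzy operators) unchanged.
 */
final class UpperCaseVowelSketch {

    static String upperCaseVowels(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            // Only lower-case vowels are changed; '*' and '~' pass through.
            out.append("aeiou".indexOf(c) >= 0 ? Character.toUpperCase(c) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An analysed multiterm shows upper-case vowels; an unanalysed one does not.
        System.out.println(upperCaseVowels("the*"));
        System.out.println(upperCaseVowels("quick~"));
    }
}
```

So a parsed query that still shows lower-case vowels for a wildcard or fuzzy term is a sign that the term bypassed the analyzer.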

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. <[hidden email]>:

> There's every chance that I'm missing something at the Solr level, but
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still
> not applying analysis to multiterms.
>
> When I call this on 7.0.0:
>    QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
>     return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck


Re: Complexphrase treats wildcards differently than other query parsers

Bjarke Buur Mortensen
Thanks a lot for your effort, Tim.

Looking at it from the Solr side, I see that some local classes are used. The
snippet below in particular caught my eye (in
solr/core/src/java/org/apache/solr/search/ComplexPhraseQParserPlugin.java).
The instance of ComplexPhraseQueryParser is not the plain one from Lucene,
but an anonymous subclass that overrides some methods. If any of those
overrides interfere with the analysis logic, that might explain it.

What do you make of it?

  lparser = new ComplexPhraseQueryParser(defaultField, getReq().getSchema().getQueryAnalyzer()) {
    protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
      try {
        org.apache.lucene.search.Query wildcardQuery =
            reverseAwareParser.getWildcardQuery(t.field(), t.text());
        setRewriteMethod(wildcardQuery);
        return wildcardQuery;
      } catch (SyntaxError e) {
        throw new RuntimeException(e);
      }
    }

    private Query setRewriteMethod(org.apache.lucene.search.Query query) {
      if (query instanceof MultiTermQuery) {
        ((MultiTermQuery) query).setRewriteMethod(
            org.apache.lucene.search.MultiTermQuery.SCORING_BOOLEAN_REWRITE);
      }
      return query;
    }

    protected Query newRangeQuery(String field, String part1, String part2,
                                  boolean startInclusive, boolean endInclusive) {
      boolean reverse = reverseAwareParser.isRangeShouldBeProtectedFromReverse(field, part1);
      return super.newRangeQuery(field,
          reverse ? reverseAwareParser.getLowerBoundForReverse() : part1,
          part2,
          startInclusive || reverse,
          endInclusive);
    }
  };
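
To make the suspicion concrete, here is a toy model of the general mechanism. It is only a sketch: it uses no Solr or Lucene classes, ToyParser and OverrideBypassSketch are hypothetical names, and it does not claim that this is what actually happens in ComplexPhraseQParserPlugin. It only shows how an overridden factory method that builds the query by other means can silently skip a normalization step that the base class would have applied.

```java
import java.util.function.UnaryOperator;

/** Toy parser (hypothetical, not Lucene's): normalizes wildcard terms in a factory method. */
class ToyParser {
    private final UnaryOperator<String> normalizer;

    ToyParser(UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
    }

    /** Entry point: builds a printable "query" for a wildcard term. */
    String parseWildcard(String field, String term) {
        return newWildcardQuery(field, term);
    }

    /** Default factory method: this is where normalization happens. */
    protected String newWildcardQuery(String field, String term) {
        return field + ":" + normalizer.apply(term);
    }
}

final class OverrideBypassSketch {
    // Assumed normalization for illustration: map ä to a.
    static final UnaryOperator<String> MAP_A_UMLAUT = t -> t.replace("\u00E4", "a");

    static ToyParser plainParser() {
        return new ToyParser(MAP_A_UMLAUT);
    }

    static ToyParser customizedParser() {
        return new ToyParser(MAP_A_UMLAUT) {
            @Override
            protected String newWildcardQuery(String field, String term) {
                // Builds the query itself and never calls super, so the
                // normalizer configured on the base class is skipped.
                return field + ":" + term;
            }
        };
    }

    public static void main(String[] args) {
        String term = "funktionsneds\u00E4ttning*";
        System.out.println(plainParser().parseWildcard("content_ol", term));
        System.out.println(customizedParser().parseWildcard("content_ol", term));
    }
}
```

Whether the real override behaves like this depends on what reverseAwareParser.getWildcardQuery does with the term, which the snippet alone does not show.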

Thanks,
Bjarke


RE: Complexphrase treats wildcards differently than other query parsers

Allison, Timothy B.
That could be it.  I'm not able to reproduce this with trunk.  More next week.

In trunk, if I add this to schema15.xml:
  <fieldType name="text_iso_latin1_mapping" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.MockTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="iso-latin1" type="text_iso_latin1_mapping" indexed="true" stored="true"/>

This test passes.

  @Test
  public void testCharFilter() {
    assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
    assertU(commit());
    assertU(optimize());

    assertQ(req("q", "{!complexphrase} iso-latin1:craezy")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:traen")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:caezy~1")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:crae*")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:*aezy")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:crae*y")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:\"craezy traen\"")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:\"caezy~1 traen\"")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:\"craez* traen\"")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:\"*aezy traen\"")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "{!complexphrase} iso-latin1:\"crae*y traen\"")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );
  }




Re: Complexphrase treats wildcards differently than other query parsers

Bjarke Buur Mortensen
Thanks again, Tim,
following your recipe, I was able to write a failing test:

    assertQ(req("q", "{!complexphrase} iso-latin1:cr\u00E6zy*")
    , "//result[@numFound='1']"
    , "//doc[./str[@name='id']='1']"
    );

Note that the query term is cr\u00E6zy*, i.e. the unmapped character
followed by a wildcard. This mimics the behaviour I originally reported:
CPQP (ComplexPhraseQueryParser) does not analyse the term because of the
wildcard, so the char filter is never applied on the query side.
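
Until the parser analyses such terms itself, one possible workaround is to apply the same character mapping on the client before sending the query. The following is only a sketch under assumptions: WildcardTermNormalizer is a hypothetical helper, not part of Solr or Lucene, and the two mappings are the ones that appear in this thread (ä to a, æ to ae). A real setup would have to mirror the complete mapping file used by the field's char filter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical client-side helper (not part of Solr or Lucene): applies a
 * character mapping to a wildcard term before it is sent to the parser.
 */
final class WildcardTermNormalizer {

    // Assumed mappings for illustration, taken from the examples in this thread.
    private static final Map<Character, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put('\u00E4', "a");   // ä -> a
        MAPPING.put('\u00E6', "ae");  // æ -> ae
    }

    static String normalize(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String mapped = MAPPING.get(c);
            // Characters without a mapping, including * and ?, are copied unchanged.
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("cr\u00E6zy*"));
        System.out.println(normalize("funktionsneds\u00E4ttning*"));
    }
}
```

With this, cr\u00E6zy* would be sent as craezy*, which is the form that the passing assertions in the earlier test use.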



RE: Complexphrase treats wildcards differently than other query parsers

Allison, Timothy B.
<face_palm/>  Right.  Sorry.

Despite appearances to the contrary, I'm not a bot designed to lead you down the garden path of debugging for yourself with the goal of increasing the size of the Solr contributor pool...

I confirmed the failure in 6.x, but all seems to work in 7.x and trunk.  I opened SOLR-11450 and attached a unit test based on your correction of mine. 😊

Thank you, again!

