Behaviour of punctuation marks in phrase queries


Behaviour of punctuation marks in phrase queries

Doris Peter
Hello,

We use Solr 7.6.0 to build our index, and I have a question about
phrase queries.

We use the following configuration in schema.xml:
   
    <!-- Text Standard -->
    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="1000" sortMissingLast="true"
               autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-FoldToASCII.txt"/>
        <filter class="solr.CJKBigramFilterFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                protected="protectedword.txt"
                preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
                catenateWords="1" catenateNumbers="1" catenateAll="1"
                generateWordParts="1" generateNumberParts="1"
                stemEnglishPossessive="1"
                types="wdfftypes.txt"/>
        <filter class="solr.LengthFilterFactory" min="1" max="2147483647"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-FoldToASCII.txt"/>
        <filter class="solr.CJKBigramFilterFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                protected="protectedword.txt"
                preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
                catenateWords="1" catenateNumbers="1" catenateAll="1"
                generateWordParts="1" generateNumberParts="1"
                stemEnglishPossessive="1"
                types="wdfftypes.txt"/>
        <filter class="solr.LengthFilterFactory" min="1" max="2147483647"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


If we search for a phrase like "Moosburg a.d. Isar", we don't get a
match, though it's definitely in our index.
If we search for "Moosburg a. d. Isar", with a blank between "a."
and "d.", we get a match.

This also happens with other non-word characters, such as ' or ,.

The strange thing is that the Solr Analysis tool reports a match for
the first version, but when we send a Solr query, we get no result
documents.

Does anyone have an idea what could be causing this?

Thank you very much in advance,

Doris Peter

Re: Behaviour of punctuation marks in phrase queries

Erick Erickson
Three things:

1> WordDelimiterGraphFilterFactory requires FlattenGraphFilterFactory after it in the index config.

2> It is usually unnecessary to have the exact same parameters at both query and index time for WDGFF. If you’ve split the parts up at index time and also mashed them all back together, you usually only need to split them up at query time.

3> Try adding &debug=query to the query and see what the results show for the parsed query. That usually gives you a clue as to what is really happening vs. what you think is happening.
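
As a minimal sketch of point 3, the snippet below builds a select URL with debug=query appended. The host, port, and core name (`localhost:8983`, `mycore`) are placeholders, and the field name `all_places_txt` is only an assumed example, so substitute your own values. The helper `debug_query_url` is hypothetical; it only constructs the URL and leaves sending the request to whatever HTTP client you prefer.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, core, query):
    """Build a /select URL that asks Solr to report the parsed query.

    base_url and core are placeholders for your own installation.
    """
    params = {"q": query, "debug": "query", "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field name, for illustration only.
url = debug_query_url(
    "http://localhost:8983",
    "mycore",
    'all_places_txt:"Moosburg a.d. Isar"',
)
print(url)
```

In the JSON response, the "parsedquery" entry under "debug" shows which query class the parser actually produced.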

Best,
Erick

> On May 17, 2019, at 12:59 AM, Doris Peter <[hidden email]> wrote:
> [...]


Antw: Re: Behaviour of punctuation marks in phrase queries

Doris Peter
Thanks a lot! I tried the debug parameter, which shows interesting differences:

"debug": {
    "rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
    "querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
    "parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
    "parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
    "QParser": "LuceneQParser"
}

"debug": {
    "rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
    "querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
    "parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 0, true)]), all_places_txt:donau], 0, true))",
    "parsedquery_toString": "spanNear([all_places_txt:neuburg, spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 0, true)]), all_places_txt:donau], 0, true)",
    "QParser": "LuceneQParser"
}

Something seems to go wrong here, as the parsedquery contains a SpanNearQuery instead of a PhraseQuery.

>>> Erick Erickson <[hidden email]> 5/17/2019 4:27 PM >>>
[...]


Re: Antw: Re: Behaviour of punctuation marks in phrase queries

Erick Erickson
I’ll leave that explanation to someone who understands query parsers ;)

> On May 17, 2019, at 7:57 AM, Doris Peter <[hidden email]> wrote:
> [...]


Re: Antw: Re: Behaviour of punctuation marks in phrase queries

Michael Gibney
The SpanNearQuery in association with "a.b." input and WDGF is
expected behavior, since WDGF causes the query to search ("ab")|("a"
"b"), as 1 or 2 tokens, respectively. The "a. b." input
(whitespace-separated) is tokenized simply as "a" "b" (2 tokens) so
sticks with the more straightforward PhraseQuery implementation.
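
To make the two token paths concrete, here is a rough sketch that enumerates the sequences a query string can expand to. It is a simplified stand-in for WhitespaceTokenizer plus WDGF (with catenateAll=1) plus LowerCaseFilter, not Solr's implementation: the hypothetical helper `token_paths` only splits on non-alphanumeric ASCII characters and ignores the other WDGF options.

```python
import re


def token_paths(text):
    """Enumerate token sequences for a whitespace-tokenized query.

    Each whitespace token is split on non-alphanumeric characters.
    If that yields several parts, the concatenated form is offered
    as an alternative path, as catenateAll=1 would do.
    """
    paths = [[]]
    for raw in text.split():
        parts = [p.lower() for p in re.split(r"[^0-9A-Za-z]+", raw) if p]
        if not parts:
            continue
        alternatives = [parts]
        if len(parts) > 1:
            # Concatenated variant first, e.g. "a.d." -> "ad"
            alternatives.insert(0, ["".join(parts)])
        paths = [path + alt for path in paths for alt in alternatives]
    return paths


# Two paths, so the parser needs a graph-aware (span) query.
print(token_paths("Neuburg a.d. Donau"))
# One path, so a plain PhraseQuery is enough.
print(token_paths("Neuburg a. d. Donau"))
```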

That said, the problem you're encountering is related to a couple of issues:
https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312

For this case specifically, the problem is that NearSpansOrdered
lazily returns one match per position *for the first subclause*. The
or clause ("ab"|"a" "b"), because positionLength is not indexed, will
always return "ab" first (implicit positionLength of 1). Again because
"ab"'s actual positionLength of 2 from index-time WDGF is not stored
in the index, the implicit positionLength of 1 at query-time gives the
impression of a gap between "ab" and "isar", violating the "slop=0"
constraint.

Because NearSpansOrdered.nextStartPosition() always advances by
calling nextStartPosition() on the first subclause (without exploring
for variant matches in other subclauses), the top-level
NearSpansOrdered advances after one attempt at matching, and the valid
match is missed.
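
The missed match can be illustrated with a toy model. This is not Lucene's code; it only mimics the behaviour described above, under the assumption that "Moosburg a.d. Isar" ends up in the index with "ad" and "a" both at position 1 and no stored positionLength. All function names are hypothetical.

```python
from itertools import product

# Assumed index positions for "Moosburg a.d. Isar". positionLength is
# not stored, so "ad" appears to cover a single position.
INDEX = {"moosburg": [0], "ad": [1], "a": [1], "d": [2], "isar": [3]}


def term_spans(term):
    """(start, end) spans for a single term; every term has length 1."""
    return [(pos, pos + 1) for pos in INDEX.get(term, [])]


def phrase_spans(terms):
    """Spans where the terms occur consecutively and in order."""
    spans = []
    for start in INDEX.get(terms[0], []):
        if all(start + i in INDEX.get(t, []) for i, t in enumerate(terms)):
            spans.append((start, start + len(terms)))
    return spans


def or_spans(*alternatives):
    """Merge alternatives, ordered by start and then by end."""
    return sorted(span for alt in alternatives for span in alt)


def lazy_ordered_match(clauses):
    """Slop-0 ordered match that tries only the first candidate of
    each later clause before advancing the first clause."""
    for first in clauses[0]:
        end = first[1]
        matched = True
        for spans in clauses[1:]:
            candidate = next((s for s in spans if s[0] >= end), None)
            if candidate is None or candidate[0] != end:
                matched = False
                break
            end = candidate[1]
        if matched:
            return True
    return False


def exhaustive_ordered_match(clauses):
    """Slop-0 ordered match that explores every combination."""
    for combo in product(*clauses):
        if all(a[1] == b[0] for a, b in zip(combo, combo[1:])):
            return True
    return False


clauses = [
    term_spans("moosburg"),
    or_spans(term_spans("ad"), phrase_spans(["a", "d"])),
    term_spans("isar"),
]
print(lazy_ordered_match(clauses))        # picks "ad", sees a gap before "isar"
print(exhaustive_ordered_match(clauses))  # also tries "a" "d", which lines up
```

In this model the lazy matcher settles on the span for "ad", finds "isar" one position too far away, and gives up, while the exhaustive matcher finds the "a" "d" variant that fits.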

Pending fixes to address the underlying issue (there is a candidate
patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312),
you could mitigate the problem to some extent in one of two ways. You
could force slop>0 (which, as of 7.6, will be expanded into a
MultiPhraseQuery -- see
https://issues.apache.org/jira/browse/LUCENE-8531). Or you could set
preserveOriginal=true on both index-time and query-time WDGF and
upgrade to 8.1 (which would prevent the extreme case of an *exact*
character-for-character matching query turning up no results -- see
https://issues.apache.org/jira/browse/LUCENE-8730).
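
For the slop>0 option, the standard query syntax is a `~N` suffix on the quoted phrase. The hypothetical helper below just formats such a query string; the field name is the one from the debug output earlier in this thread. Note that any slop above 0 also loosens the match, so whether that trade-off is acceptable depends on your data.

```python
def phrase_with_slop(field, phrase, slop):
    """Format a phrase query with a proximity (slop) suffix."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # Escape backslashes and double quotes inside the phrase.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"~{slop}'


print(phrase_with_slop("all_places_txt", "Moosburg a.d. Isar", 1))
```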

On Fri, May 17, 2019 at 11:47 AM Erick Erickson <[hidden email]> wrote:

> [...]

Re: Antw: Re: Behaviour of punctuation marks in phrase queries

Michael Gibney
After further reflection, I think that upgrading to 8.1 (LUCENE-8730)
would actually not help in this case. It doesn't matter whether "a.b."
or "ab" would be indexed or evaluated first; they'd both have implied
positionLength 1 (as read from the index at query time), and would
both be evaluated before ("a" "b"), leaving the impression of a gap
between tokens, causing the match to be missed.

On Fri, May 17, 2019 at 12:29 PM Michael Gibney
<[hidden email]> wrote:

> [...]