FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

Eric Buss
Hi all,

I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
tokens produced by WordDelimiterGraphFilter (WDGF). As a result, searches that
contain the contraction "can't" do not match.

This is on Solr version 7.7.1.

The field in question is defined as follows:

<field name="myField" type="text_general" indexed="true" stored="true"/>

And the relevant fieldType "text_general":

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>

Finally, the relevant entries in synonyms.txt are:

can,cans
cants,cant

Using the Analysis screen of the Solr admin console with "can't" as the Field
Value, the following tokens are produced (the verbose output is at the bottom of
this email):

Index
ST    | can't
SF    | can't
WDGF  | cant | can't | can | t
FGF   | cant | can't | can | t
SGF   | cants | cant | can't | | cans | can | t
FGF   | cants | cant | can't | | t
ICUFF | cants | cant | can't | | t

Query
ST    | can't
SF    | can't
WDGF  | can | t
SF    | can | t
ICUFF | can | t

As you can see, after the second FGF the tokens "can" and "cans" are pruned, so
the query does not match. Is there a reasonable way to preserve these tokens?

My key concern is that I want the "fix" for this to have as little impact on
other queries as possible.

Some things I have checked/tried:

Searching for similar problems, I found this thread:
https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
It suggests that FGF is not necessary, without giving any supporting evidence.
That directly contradicts the documentation, which states "If you use
[the SynonymGraphFilter] during indexing, you must follow it with a Flatten
Graph Filter":
https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
Despite this warning, I tried removing the FGF on a local cluster. It still
runs and this search now works, but I am paranoid that this will break far more
things than it fixes.

I have tried adding the FGF as a filter to the query analyzer. This does not
eliminate the "can" term in the query analysis.

I have tested other contractions. Some have this issue and others do not:
"haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all preserve
their tokens, but "won't" does not. I believe the pattern is that the problem
manifests whenever part of the contraction has synonyms.

Eliminating WDGF is not viable as we rely on this functionality for other uses
of delimiters (such as wi-fi -> wi fi).

Performing WDGF after synonyms is also not viable: if the data contains
"historical-text", we want it to match the search "history text".

The hacky solution I have found is to use the PatternReplaceFilterFactory to
replace "can't" with "cant". Though this technically solves the issue, I hope it
is obvious why this does not feel like an ideal solution.
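
For illustration, a minimal sketch of what such a workaround might look like.
The placement (ahead of WordDelimiterGraphFilterFactory, presumably in both the
index and query analyzers) and the exact pattern are assumptions, not the
configuration actually used:

```xml
<!-- Hypothetical sketch: collapse "can't" into "cant" before WDGF sees it.
     The pattern is case-insensitive and keeps the original casing via groups. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="(?i)^(can)'(t)$"
        replacement="$1$2"
        replace="all"/>
```

With a filter like this, WDGF no longer has an apostrophe to split on, so both
analyzers would emit the single token "cant". The cost is one hand-written rule
per affected contraction.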

Has anyone encountered this type of issue before? Any advice on how the filter
use here could be improved to handle this case?

Thanks,
Eric Buss


PS. The verbose output from Analysis of "can't"

Index

ST    | text          | can't            |
      | raw_bytes     | [63 61 6e 27 74] |
      | start         | 0                |
      | end           | 5                |
      | positionLength| 1                |
      | type          | <ALPHANUM>       |
      | termFrequency | 1                |
      | position      | 1                |
SF    | text          | can't            |
      | raw_bytes     | [63 61 6e 27 74] |
      | start         | 0                |
      | end           | 5                |
      | positionLength| 1                |
      | type          | <ALPHANUM>       |
      | termFrequency | 1                |
      | position      | 1                |
WDGF  | text          | cant          | can't            | can        | t          |
      | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
      | start         | 0             | 0                | 0          | 4          |
      | end           | 5             | 5                | 3          | 5          |
      | positionLength| 2             | 2                | 1          | 1          |
      | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1             | 1                | 1          | 1          |
      | position      | 1             | 1                | 1          | 2          |
      | keyword       | false         | false            | false      | false      |
FGF   | text          | cant          | can't            | can        | t          |
      | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
      | start         | 0             | 0                | 0          | 4          |
      | end           | 5             | 5                | 3          | 5          |
      | positionLength| 2             | 2                | 1          | 1          |
      | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1             | 1                | 1          | 1          |
      | position      | 1             | 1                | 1          | 2          |
      | keyword       | false         | false            | false      | false      |
SGF   | text          | cants            | cant          | can't            | cans          | can        | t          |
      | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
      | start         | 0                | 0             | 0                | 0             | 0          | 4          |
      | end           | 5                | 5             | 5                | 3             | 3          | 5          |
      | positionLength| 1                | 1             | 2                | 1             | 1          | 1          |
      | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1                | 1             | 1                | 1             | 1          | 1          |
      | position      | 1                | 1             | 1                | 3             | 3          | 4          |
      | keyword       | false            | false         | false            | false         | false      | false      |
FGF   | text          | cants            | cant          | can't            | t          |
      | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
      | start         | 0                | 0             | 0                | 4          |
      | end           | 5                | 5             | 5                | 5          |
      | positionLength| 1                | 1             | 1                | 1          |
      | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
      | termFrequency | 1                | 1             | 1                | 1          |
      | position      | 1                | 1             | 1                | 3          |
      | keyword       | false            | false         | false            | false      |
ICUFF | text          | cants            | cant          | can't            | t          |
      | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
      | start         | 0                | 0             | 0                | 4          |
      | end           | 5                | 5             | 5                | 5          |
      | positionLength| 1                | 1             | 1                | 1          |
      | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
      | termFrequency | 1                | 1             | 1                | 1          |
      | position      | 1                | 1             | 1                | 3          |
      | keyword       | false            | false         | false            | false      |

Query

ST    | text          | can't            |
      | raw_bytes     | [63 61 6e 27 74] |
      | start         | 0                |
      | end           | 5                |
      | positionLength| 1                |
      | type          | <ALPHANUM>       |
      | termFrequency | 1                |
      | position      | 1                |
SF    | text          | can't            |
      | raw_bytes     | [63 61 6e 27 74] |
      | start         | 0                |
      | end           | 5                |
      | positionLength| 1                |
      | type          | <ALPHANUM>       |
      | termFrequency | 1                |
      | position      | 1                |
WDGF  | text          | can        | t          |
      | raw_bytes     | [63 61 6e] | [74]       |
      | start         | 0          | 4          |
      | end           | 3          | 5          |
      | positionLength| 1          | 1          |
      | type          | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1          | 1          |
      | position      | 1          | 2          |
      | keyword       | false      | false      |
SF    | text          | can        | t          |
      | raw_bytes     | [63 61 6e] | [74]       |
      | start         | 0          | 4          |
      | end           | 3          | 5          |
      | positionLength| 1          | 1          |
      | type          | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1          | 1          |
      | position      | 1          | 2          |
      | keyword       | false      | false      |
ICUFF | text          | can        | t          |
      | raw_bytes     | [63 61 6e] | [74]       |
      | start         | 0          | 4          |
      | end           | 3          | 5          |
      | positionLength| 1          | 1          |
      | type          | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1          | 1          |
      | position      | 1          | 2          |
      | keyword       | false      | false      |

Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

Michael Gibney
I wonder if this might be similar/related to the underlying problem
that is intended to be addressed by
https://issues.apache.org/jira/browse/LUCENE-8985?

btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). ...though that's probably *not* what's causing the
problem (based on the fact that the extra FGF doesn't seem to modify
any attributes).
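
A sketch of the index analyzer with a single FlattenGraphFilter, as suggested
above. It reuses the filters and attributes from the original fieldType, is
untested, and (as noted) is probably not a fix for the dropped tokens. One
caveat to keep in mind: the Lucene javadoc for SynonymGraphFilter cautions that
it cannot correctly consume an incoming graph, so feeding it unflattened WDGF
output may have its own side effects.

```xml
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    <!-- The FlattenGraphFilterFactory that used to sit here is removed. -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Single flatten step, after the last filter that emits a graph. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```

Any change along these lines would need a reindex and a fresh look at the
Analysis output for "can't" before drawing conclusions.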



On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <[hidden email]> wrote:

> [...]
Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

Eric Buss
Thanks for the reply,

I wouldn't be surprised if the issue you linked is related. I also found another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723

You are absolutely right that the FlattenGraphFilter should only be used once but, as you noted, the issue I am experiencing seems unrelated to that.

On 2019-12-05, 10:23 AM, "Michael Gibney" <[hidden email]> wrote:

    [...]
    >       | keyword       | false            | false         | false            | false      |
    > ICUFF | text          | cants            | cant          | can't            | t          |
    >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
    >       | start         | 0                | 0             | 0                | 4          |
    >       | end           | 5                | 5             | 5                | 5          |
    >       | positionLength| 1                | 1             | 1                | 1          |
    >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
    >       | termFrequency | 1                | 1             | 1                | 1          |
    >       | position      | 1                | 1             | 1                | 3          |
    >       | keyword       | false            | false         | false            | false      |
    >
    > Query
    >
    > ST    | text          | can't            |
    >       | raw_bytes     | [63 61 6e 27 74] |
    >       | start         | 0                |
    >       | end           | 5                |
    >       | positionLength| 1                |
    >       | type          | <ALPHANUM>       |
    >       | termFrequency | 1                |
    >       | position      | 1                |
    > SF    | text          | can't            |
    >       | raw_bytes     | [63 61 6e 27 74] |
    >       | start         | 0                |
    >       | end           | 5                |
    >       | positionLength| 1                |
    >       | type          | <ALPHANUM>       |
    >       | termFrequency | 1                |
    >       | position      | 1                |
    > WDGF  | text          | can        | t          |
    >       | raw_bytes     | [63 61 6e] | [74]       |
    >       | start         | 0          | 4          |
    >       | end           | 3          | 5          |
    >       | positionLength| 1          | 1          |
    >       | type          | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1          | 1          |
    >       | position      | 1          | 2          |
    >       | keyword       | false      | false      |
    > SF    | text          | can        | t          |
    >       | raw_bytes     | [63 61 6e] | [74]       |
    >       | start         | 0          | 4          |
    >       | end           | 3          | 5          |
    >       | positionLength| 1          | 1          |
    >       | type          | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1          | 1          |
    >       | position      | 1          | 2          |
    >       | keyword       | false      | false      |
    > ICUFF | text          | can        | t          |
    >       | raw_bytes     | [63 61 6e] | [74]       |
    >       | start         | 0          | 4          |
    >       | end           | 3          | 5          |
    >       | positionLength| 1          | 1          |
    >       | type          | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1          | 1          |
    >       | position      | 1          | 2          |
    >       | keyword       | false      | false      |
    >
   


Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

Paras Lehana
Hi Michael,

> I think you only want to use FlattenGraphFilter *once* in the indexing
> analysis chain


I had been doing this for a long time before I finally switched to using FGF
after every graph filter factory. Although I don't know much about it at the
code level, are you sure that all the subsequent filters will be able to
consume a graph if we don't use FGF after a graph filter?
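
To check that I am reading the suggestion correctly, I assume the index
analyzer would then look roughly like this. This is an untested sketch based
on Eric's text_general definition with only the first FGF removed; the
comments are mine:

```xml
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    <!-- no FlattenGraphFilterFactory here: the synonym filter sees the WDGF graph as-is -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- single flatten, placed after all filters that emit graphs -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```

Here SynonymGraphFilter receives the WDGF output directly, without a flatten
in between, which is the part I am unsure about.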

On Fri, 6 Dec 2019 at 01:22, Eric Buss <[hidden email]> wrote:

> Thanks for the reply,
>
> I wouldn't be surprised if the issue you linked is related, I also found
> another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723
>
> You are absolutely right that the FlattenGraphFilter should only be used
> once, but as you noted the issue I am experiencing seems unrelated.
>
> On 2019-12-05, 10:23 AM, "Michael Gibney" <[hidden email]> wrote:
>
>     I wonder if this might be similar/related to the underlying problem
>     that is intended to be addressed by
>     https://issues.apache.org/jira/browse/LUCENE-8985?
>
>     btw, I think you only want to use FlattenGraphFilter *once* in the
>     indexing analysis chain, towards the end (after all components that
>     emit graphs). ...though that's probably *not* what's causing the
>     problem (based on the fact that the extra FGF doesn't seem to modify
>     any attributes).
>
>
>
>     On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <[hidden email]> wrote:
>     >
>     > Hi all,
>     >
>     > I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
>     > tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches that
>     > contain the contraction "can't" do not match.
>     >
>     > This is on Solr version 7.7.1.
>     >
>     > The field in question is defined as follows:
>     >
>     > <field name="myField" type="text_general" indexed="true" stored="true"/>
>     >
>     > And the relevant fieldType "text_general":
>     >
>     > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>     >     <analyzer type="index">
>     >         <tokenizer class="solr.StandardTokenizerFactory"/>
>     >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     >         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
>     >         <filter class="solr.FlattenGraphFilterFactory"/>
>     >         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     >         <filter class="solr.FlattenGraphFilterFactory"/>
>     >         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     >     </analyzer>
>     >     <analyzer type="query">
>     >         <tokenizer class="solr.StandardTokenizerFactory"/>
>     >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     >         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
>     >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     >         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     >     </analyzer>
>     > </fieldType>
>     >
>     > Finally, the relevant entries in synonyms.txt are:
>     >
>     > can,cans
>     > cants,cant
>     >
>     > Using the Solr console Analysis and "can't" as the Field Value, the following
>     > tokens are produced (find the verbose output at the bottom of this email):
>     >
>     > Index
>     > ST    | can't
>     > SF    | can't
>     > WDGF  | cant | can't | can | t
>     > FGF   | cant | can't | can | t
>     > SGF   | cants | cant | can't | | cans | can | t
>     > ICUFF | cants | cant | can't | | cans | can | t
>     > FGF   | cants | cant | can't | | t
>     >
>     > Query
>     > ST    | can't
>     > SF    | can't
>     > WDGF  | can | t
>     > SF    | can | t
>     > ICUFF | can | t
>     >
>     > As you can see after the FGF the tokens "can" and "cans" are pruned so the query
>     > does not match. Is there a reasonable way to preserve these tokens?
>     >
>     > My key concern is that I want the "fix" for this to have as little impact on
>     > other queries as possible.
>     >
>     > Some things I have checked/tried:
>     >
>     > Searching for similar problems I found this thread:
>     > https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
>     > Here it is suggested that FGF is not necessary (without any supporting
>     > evidence). This goes directly against the documentation that states "If you use
>     > [the SynonymGraphFilter] during indexing, you must follow it with a Flatten
>     > Graph Filter":
>     > https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
>     > Despite this warning I tried out removing the FGF on a local
>     > cluster and indeed it still runs and this search now works, however I am
>     > paranoid that this will break far more things than it fixes.
>     >
>     > I have tried adding the FGF as a filter to the query. This does not eliminate
>     > the "can" term in the query analysis.
>     >
>     > I have tested other contracted words. Some have this issue as well - others do
>     > not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
>     > preserve their tokens "won't" does not. I believe the pattern here is that
>     > whenever part of the contraction has synonyms this problem manifests.
>     >
>     > Eliminating WDGF is not viable as we rely on this functionality for other uses
>     > of delimiters (such as wi-fi -> wi fi).
>     >
>     > Performing WDGF after synonyms is also not viable as in the case that we have
>     > the data "historical-text" we want this to match the search "history text".
>     >
>     > The hacky solution I have found is to use the PatternReplaceFilterFactory to
>     > replace "can't" with "cant". Though this technically solves the issue, I hope it
>     > is obvious why this does not feel like an ideal solution.
>     >
>     > Has anyone encountered this type of issue before? Any advice on how the filter
>     > use here could be improved to handle this case?
>     >
>     > Thanks,
>     > Eric Buss
>     >
>     >
>
>
>

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
