Questions for SynonymGraphFilter and WordDelimiterGraphFilter

weiwang19
Hello,

We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and
WordDelimiterFilter have been deprecated. The Solr documentation recommends
using SynonymGraphFilter and WordDelimiterGraphFilter instead. In our current
schema, the text field type is defined as

<fieldType name="text_syn" class="solr.TextField" omitPositions="true"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

In the index phase we have both SynonymFilter and WordDelimiterFilter
configured:

    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="0" preserveOriginal="1"/>

The Solr documentation states that the graph filters produce correct token
graphs but cannot consume an input token graph correctly, and that when one
of these graph filters is used during indexing, it must be followed by a
FlattenGraphFilter. I am confused about how to replace our filters with the
new SynonymGraphFilter and WordDelimiterGraphFilter. A few questions:

1. Regarding the FlattenGraphFilter, should it be used only once, or
after each graph filter? Can we configure the chain like this?

    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>

2. Is it possible to have two graph filters, i.e. both
SynonymGraphFilter and WordDelimiterGraphFilter, in the same analysis
chain? If not, what is the best option to replace our current config?

3. With the StopFilterFactory in between SynonymGraphFilter and
WordDelimiterGraphFilter, I get a few indexing errors:

Exception writing document id XXXXXX to the index; possible analysis error

Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

But if I move the StopFilter before the SynonymGraphFilter, the errors are gone.

I guess the StopFilter messes up the SynonymGraphFilter output? I am not sure
whether this is a Solr defect or whether there is a guideline that a
StopFilter should not be placed after graph filters.
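
For reference, the reordering described in point 3 would look roughly like the index analyzer below. This is only a sketch assembled from the filters and attributes already quoted in this post, not a verified fix. Whether a FlattenGraphFilter belongs after each graph filter or only once is still the open question from point 1.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- StopFilter moved ahead of the graph filters, as described in point 3 -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.SynonymGraphFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- Placement and count of FlattenGraphFilter follow the proposal in point 1 -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
          catenateWords="1" catenateNumbers="1" catenateAll="1"
          splitOnCaseChange="0" preserveOriginal="1"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

One side effect of this ordering: stop words are removed before synonym matching, so multi-word entries in synonyms.txt that contain a stop word may no longer match.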

Thanks in advance for your input.


Thanks,

Wei

Re: Questions for SynonymGraphFilter and WordDelimiterGraphFilter

Thomas Aglassinger
Hi Wei,

Here's a fairly simple field type we currently use in a project; it seems to do the job with graph synonyms. Maybe it helps as a starting point for you:

        <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de" />
                <filter class="solr.WordDelimiterGraphFilterFactory"  preserveOriginal="1"
                        generateWordParts="1" generateNumberParts="1" catenateWords="1"
                        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.GermanStemFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>

As you can see, we use the same filters for both indexing and querying. This might have some impact on positional queries, but so far it seems negligible for the short synonyms we use in practice. Also, there is no need for the FlattenGraphFilter.

The WhitespaceTokenizerFactory ensures that you can define synonyms with hyphens like mac-book -> macbook.

Best regards, Thomas.


On 05.01.19, 02:11, "Wei" <[hidden email]> wrote:


Re: Questions for SynonymGraphFilter and WordDelimiterGraphFilter

weiwang19
Thanks, Thomas. You mentioned "Also there is no need for the
FlattenGraphFilter". That is quite interesting, because the Solr
documentation says it is mandatory for indexing:
https://lucene.apache.org/solr/guide/7_6/filter-descriptions.html. Is there
any further explanation for this?
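
For context, the index/query split that the linked filter descriptions page describes looks roughly like the sketch below: the graph filter appears in both analyzers, and FlattenGraphFilterFactory follows it only on the index side. The field type name text_graph_example is made up for illustration; the whitespace tokenizer and synonyms.txt are borrowed from the configs earlier in this thread.

```xml
<!-- Hypothetical field type name, for illustration only -->
<fieldType name="text_graph_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Index side only: flattens the token graph before it reaches the indexer -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query side: the graph is left intact, so no FlattenGraphFilter here -->
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

A single analyzer element without a type attribute, as in the text_de example, applies the same chain to both indexing and querying, which is why the question of flattening comes up there.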

Best regards,
Wei


On Mon, Jan 7, 2019 at 7:56 AM Thomas Aglassinger <[hidden email]> wrote:


Re: Questions for SynonymGraphFilter and WordDelimiterGraphFilter

weiwang19
bump..

On Mon, Jan 7, 2019 at 11:53 AM Wei <[hidden email]> wrote:
