SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Александр Шестак

Hi, I'm confused about using SynonymGraphFilterFactory and WordDelimiterGraphFilterFactory together. Can they be combined?

My Solr field type is configured as follows:

<fieldtype name="fulltext_en" class="solr.TextField" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" protected="protwords_en.txt"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1" protected="protwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldtype>

So at query time, SynonymGraphFilterFactory runs after WordDelimiterGraphFilterFactory.
The synonyms are configured as follows:
b=>b,boron
2=>ii,2

The Solr analysis tool shows that the terms after SGF have positions 3 and 4. Is that correct? I expected them to be at positions 1 and 2.

When I run the query "my_field:b2", the parsedQuery is "my_field:b2 Synonym(my_field:2 my_field:ii)".
But for the query "my_field:2b", the parsedQuery is
  "my_field:2b Synonym(my_field:b my_field:boron)"

Synonyms are applied only to the last part of my query.

Am I doing something wrong? Or are these filters incompatible?






Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Shawn Heisey-2
On 2/5/2018 3:55 AM, Александр Шестак wrote:
>
> Hi, I'm confused about using SynonymGraphFilterFactory and
> WordDelimiterGraphFilterFactory together. Can they be combined?
>

There should be no problem with using them together.  But it is always
possible that the behavior will surprise you, while working 100% as
designed.

> My Solr field type is configured as follows:
>
> <fieldtype name="fulltext_en" class="solr.TextField" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
>             catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" protected="protwords_en.txt"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
>             catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1" protected="protwords_en.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory"
>             synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
>   </analyzer>
> </fieldtype>
>
> So at query time, SynonymGraphFilterFactory runs after
> WordDelimiterGraphFilterFactory.
> The synonyms are configured as follows:
> b=>b,boron
> 2=>ii,2
>
> The Solr analysis tool shows that the terms after SGF have positions
> 3 and 4. Is that correct? I expected them to be at positions 1 and 2.
>

What matters is the *relative* positions.  The exact position number
doesn't matter much.  Something new that the Graph implementations use
is the position length.  That feature is necessary for multi-term
synonyms to function correctly in phrase queries.

In your analysis screenshot, WDGF creates three tokens.  The two tokens
created by splitting the input are at positions 1 and 2, which I think
is 100% as expected.  It also sets the positionLength of the first term
to 2, probably because it has split that term into 2 additional terms.

Then the SGF takes those last two terms and expands them.  Each of the
synonyms is at the same position as the original term, and the relative
positions of the two synonym pairs have not changed -- the second one is
still one higher than the first.  I think the reason that SGF moves the
positions two higher is because the positionLength on the "b2" term is
2, previously set by WDGF.  Someone with more knowledge about the Graph
implementations may have to speak up as to whether this behavior is correct.

Because the relative positions of the split terms don't change when SGF
runs, I think this is probably working as designed.
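
For readers following along, the positions and positionLength described above for the WDGF output can be modelled in a few lines of Python. This is only a sketch, not Lucene code: the `Token` class and `absolute_positions` helper are hypothetical names, and the increments assigned to the three tokens are assumptions reconstructed from the description of the analysis screen (1-based positions, "b2" spanning two positions).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    term: str
    position_increment: int   # distance from the previous token's position
    position_length: int = 1  # number of positions the token spans


def absolute_positions(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Return (term, start, end) tuples, with 1-based start positions."""
    result = []
    position = 0
    for token in tokens:
        # A token's position is the running sum of the increments so far.
        position += token.position_increment
        result.append((token.term, position, position + token.position_length))
    return result


# Assumed WDGF output for the input "b2" with preserveOriginal="1".
wdgf_output = [
    Token("b2", position_increment=1, position_length=2),  # original, spans both parts
    Token("b", position_increment=0),                      # same start as "b2"
    Token("2", position_increment=1),                      # one position later
]

for term, start, end in absolute_positions(wdgf_output):
    print(f"{term}: starts at {start}, ends at {end}")
```

In this model "b2" and "b" both start at position 1 and "2" starts at position 2, which matches the description above. A position length greater than 1 is what turns the stream into a graph rather than a flat sequence.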

Thanks,
Shawn


Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

sarowe
Hi Александр,

> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <[hidden email]> wrote:
>
> There should be no problem with using them together.

I believe Shawn is wrong.

From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:

> NOTE: this cannot consume an incoming graph; results will be undefined.

Unfortunately, the ref guide entry for Synonym Graph Filter <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter> doesn’t include a warning about this, but it should, like the warning on Word Delimiter Graph Filter <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:

> Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.

(I’ve just committed a change to the ref guide source to add this also on the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be included in the ref guide for Solr 7.3.)

In short, the combination of the two filters is not supported, because WDGF produces a token graph, which SGF cannot correctly interpret.

Other filters also have this issue, see e.g. <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter; this issue has gotten some attention recently, and hopefully it will inspire fixes elsewhere.

Patches welcome!

--
Steve
www.lucidworks.com



Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Shawn Heisey-2
On 2/5/2018 9:27 AM, Steve Rowe wrote:
> I believe Shawn is wrong.

Not happy to be wrong, but I'm glad for the assist and for making sure
that the right information is provided.

Sounds like it's just not possible to use multiple graph-aware filters
together.

Thanks,
Shawn


Re[2]: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Александр Шестак
In reply to this post by sarowe

Hi, thank you for the explanation.
I have one more question related to this topic.
I changed my schema as follows (replacing SynonymGraphFilterFactory with SynonymFilterFactory):
<fieldtype name="fulltext_en" class="solr.TextField" autoGeneratePhraseQueries="true">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" protected="protwords_en.txt"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1" protected="protwords_en.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
   </analyzer>
</fieldtype>
Now I have another strange issue.
With the synonyms configured as
b=>b,boron
2=>ii,2
the parsedQuery for "my_field:b2" is "my_field:b2 Synonym(my_field:2 my_field:ii)".
But when I change the synonyms to
b,boron
ii,2
the parsedQuery for "my_field:b2" is "my_field:b2 my_field:\"b 2\" my_field:\"b ii\" my_field:\"boron 2\" my_field:\"boron ii\")"
The second query is correct (it applies synonyms to both parts produced by the word split).
Can somebody explain why synonym behavior depends on the kind of synonym mapping?
And in general, is it correct to use SynonymFilterFactory after WordDelimiterGraphFilterFactory? If two graph filters can't be used together, am I forced to use the deprecated SynonymFilterFactory instead?



Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

sarowe
Hi,

> On Feb 5, 2018, at 10:31 PM, Александр Шестак <[hidden email]> wrote:
>
> Now I have another strange issue.
> With the synonyms configured as
> b=>b,boron
> 2=>ii,2
> the parsedQuery for "my_field:b2" is "my_field:b2 Synonym(my_field:2 my_field:ii)".
> But when I change the synonyms to
> b,boron
> ii,2
> the parsedQuery for "my_field:b2" is "my_field:b2 my_field:\"b 2\" my_field:\"b ii\" my_field:\"boron 2\" my_field:\"boron ii\")"
> The second query is correct (it applies synonyms to both parts produced by the word split).
> Can somebody explain why synonym behavior depends on the kind of synonym mapping?

Sorry, I don’t know why the two behave differently.  It seems like they should be the same, since you have expand="true".  Would you please create a JIRA?
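
To make the comparison concrete, here is a rough Python model of how the two rule styles are documented to expand. It is a simplified sketch, not Solr's parser: `parse_rules` is a hypothetical helper, and it assumes single-word terms only, with no escaping and no `ignoreCase` handling.

```python
from typing import Dict, List


def parse_rules(lines: List[str], expand: bool = True) -> Dict[str, List[str]]:
    """Map each source term to its replacement terms (single-word terms only)."""
    mapping: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent list: with expand, every term maps to the whole list;
            # without it, every term maps to the first term only.
            terms = [t.strip() for t in line.split(",")]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            bucket = mapping.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


explicit = parse_rules(["b=>b,boron", "2=>ii,2"])
equivalent = parse_rules(["b,boron", "ii,2"])

for term in ("b", "2"):
    print(term, explicit[term], equivalent[term])
```

Under this model the tokens "b" and "2" get the same replacements from either file. The equivalent-list form additionally maps "boron" and "ii", but neither appears in the query "my_field:b2". So the rule syntax alone would not be expected to account for the different parsed queries.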

> And in general, is it correct to use SynonymFilterFactory after WordDelimiterGraphFilterFactory? If two graph filters can't be used together, am I forced to use the deprecated SynonymFilterFactory instead?

Since most (all?) filters do not correctly interpret input token graphs, you can only expect correct positions if a graph token filter is not followed by any other filter.

--
Steve
www.lucidworks.com


Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

WebsterHomer
In reply to this post by sarowe
I noticed that in some of the current example schemas that are shipped with
Solr, there is a fieldtype, text_en_splitting, that feeds the output
of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
this isn't supported, the example should probably be updated or removed.


Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

sarowe
Thanks Webster,

I created https://issues.apache.org/jira/browse/SOLR-11955 to work on this.

--
Steve
www.lucidworks.com


Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Jay Potharaju-2
I am upgrading to Solr 6.6.3, and one of my fields uses text_en_splitting.
Are there any recommendations on how to adjust the fieldtype definition for
these fields?

Thanks,
Jay Potharaju


On Wed, Feb 7, 2018 at 5:09 AM, Steve Rowe <[hidden email]> wrote:

> Thanks Webster,
>
> I created https://issues.apache.org/jira/browse/SOLR-11955 to work on
> this.
>
> --
> Steve
> www.lucidworks.com
>
> > On Feb 6, 2018, at 2:47 PM, Webster Homer <[hidden email]> wrote:
> >
> > I noticed that in some of the current example schemas that are shipped with
> > Solr, there is a fieldtype, text_en_splitting, that feeds the output
> > of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
> > this isn't supported, the example should probably be updated or removed.
> >
> > On Mon, Feb 5, 2018 at 10:27 AM, Steve Rowe <[hidden email]> wrote:
> >
> >> Hi Александр,
> >>
> >>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <[hidden email]> wrote:
> >>>
> >>> There should be no problem with using them together.
> >>
> >> I believe Shawn is wrong.
> >>
> >> From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:
> >>
> >>> NOTE: this cannot consume an incoming graph; results will be undefined.
> >>
> >> Unfortunately, the ref guide entry for Synonym Graph Filter
> >> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
> >> doesn’t include a warning about this, but it should, like the warning on
> >> Word Delimiter Graph Filter
> >> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
> >>
> >>> Note: although this filter produces correct token graphs, it cannot
> >>> consume an input token graph correctly.
> >>
> >> (I’ve just committed a change to the ref guide source to add this also on
> >> the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be
> >> included in the ref guide for Solr 7.3.)
> >>
> >> In short, the combination of the two filters is not supported, because
> >> WDGF produces a token graph, which SGF cannot correctly interpret.
> >>
> >> Other filters also have this issue, see e.g.
> >> <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter; this
> >> issue has gotten some attention recently, and hopefully it will inspire
> >> fixes elsewhere.
> >>
> >> Patches welcome!
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>
> >>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <[hidden email]> wrote:
> >>>
> >>> On 2/5/2018 3:55 AM, Александр Шестак wrote:
> >>>>
> >>>> Hi, I have misunderstanding about usage of SynonymGraphFilterFactory
> >>>> and  WordDelimiterGraphFilterFactory. Can they be used together?
> >>>>
> >>>
> >>> There should be no problem with using them together.  But it is always
> >>> possible that the behavior will surprise you, while working 100% as
> >>> designed.
> >>>
> >>>> I have solr type configured in next way
> >>>>
> >>>> <fieldtype name="fulltext_en" class="solr.TextField" autoGeneratePhraseQueries="true">
> >>>>  <analyzer type="index">
> >>>>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>    <filter class="solr.WordDelimiterGraphFilterFactory"
> >>>>            generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
> >>>>            catenateWords="1" catenateNumbers="1" catenateAll="0"
> >>>>            preserveOriginal="1" protected="protwords_en.txt"/>
> >>>>    <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>  </analyzer>
> >>>>  <analyzer type="query">
> >>>>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>    <filter class="solr.WordDelimiterGraphFilterFactory"
> >>>>            generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
> >>>>            catenateWords="0" catenateNumbers="0" catenateAll="0"
> >>>>            preserveOriginal="1" protected="protwords_en.txt"/>
> >>>>    <filter class="solr.LowerCaseFilterFactory"/>
> >>>>    <filter class="solr.SynonymGraphFilterFactory"
> >>>>            synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
> >>>>  </analyzer>
> >>>> </fieldtype>
> >>>>
> >>>> So on query time it uses SynonymGraphFilterFactory after
> >>>> WordDelimiterGraphFilterFactory.
> >>>> Synonyms are configured in next way:
> >>>> b=>b,boron
> >>>> 2=>ii,2
> >>>>
> >>>> Query in solr analysis tool looks so. It is shown that terms after SGF
> >>>> have positions 3 and 4. Is it correct? I thought that they should had
> >>>> 1 and 2 positions.
> >>>>
> >>>
> >>> What matters is the *relative* positions.  The exact position number
> >>> doesn't matter much.  Something new that the Graph implementations use
> >>> is the position length.  That feature is necessary for multi-term
> >>> synonyms to function correctly in phrase queries.
> >>>
> >>> In your analysis screenshot, WDGF creates three tokens.  The two tokens
> >>> created by splitting the input are at positions 1 and 2, which I think
> >>> is 100% as expected.  It also sets the positionLength of the first term
> >>> to 2, probably because it has split that term into 2 additional terms.
> >>>
> >>> Then the SGF takes those last two terms and expands them.  Each of the
> >>> synonyms is at the same position as the original term, and the relative
> >>> positions of the two synonym pairs have not changed -- the second one is
> >>> still one higher than the first.  I think the reason that SGF moves the
> >>> positions two higher is because the positionLength on the "b2" term is
> >>> 2, previously set by WDGF.  Someone with more knowledge about the Graph
> >>> implementations may have to speak up as to whether this behavior is
> >>> correct.
> >>>
> >>> Because the relative positions of the split terms don't change when SGF
> >>> runs, I think this is probably working as designed.
> >>>
> >>> Thanks,
> >>> Shawn
> >>
> >>
> >
>
>

Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Rick Leir-2
Jay
Did you try using text_en_splitting copied out of another release?
Though if someone went to the trouble of removing it from the example, there could be something broken in it.
Cheers -- Rick
--
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Jay Potharaju-2
Thanks for the response, Rick! I checked 6.6.2 and it has the same issue.
The only workaround I have for now is to comment out the
SynonymGraphFilterFactory, as we are not using synonyms yet. But I would
like to know how to address this issue once we start using synonyms down the line.
Thanks
J

Thanks
Jay Potharaju


On Wed, Mar 14, 2018 at 1:02 PM, Rick Leir <[hidden email]> wrote:

> Jay
> Did you try using text_en_splitting copied out of another release?
> Though if someone went to the trouble of removing it from the example,
> there could be something broken in it.
> Cheers -- Rick
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
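
One possible way to approach Jay's question, given Steve's point that neither filter can correctly consume an incoming token graph, is to keep the two graph filters out of the same analyzer chain. The sketch below only illustrates that idea and is not the change tracked in SOLR-11955. The field type names text_en_wdgf and text_en_syn are hypothetical, the file names synonyms_en.txt and protwords_en.txt are borrowed from the original post, and whether splitting the analysis across two fields gives acceptable matching would need to be verified in the Solr analysis tool.

```xml
<!-- Hypothetical field type: word-delimiter splitting only, no synonyms.
     WordDelimiterGraphFilterFactory is the only graph-producing filter here. -->
<fieldType name="text_en_wdgf" class="solr.TextField" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            preserveOriginal="1" protected="protwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Graphs cannot be indexed directly, so flatten as the last index-time step. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1" protected="protwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: query-time synonyms only, no word-delimiter splitting.
     SynonymGraphFilterFactory receives a plain token stream, not a graph. -->
<fieldType name="text_en_syn" class="solr.TextField" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Under this assumption, each field type would back its own field (populated, for example, with copyField), and a query would search both fields together. The trade-off is that synonyms are no longer applied to the parts produced by word-delimiter splitting, so a query such as "b2" would not expand "b" to "boron" as the original post intended.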