SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull

Hey folks,


I wanted to open up a discussion about a change to the usage of SynonymQuery. The goal here is to have a broader library of queries that can address other cases where related terms occupy the same position but don't have the same meaning (such as hypernyms, hyponyms, meronyms, ambiguous terms, and other query expansion situations).  


I bring this up because we've noticed (as I'm sure many of you have) the pattern of clients jamming any related term into a synonyms file and being surprised by odd results. I like the idea of enforcing that "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to tell a client and to set up simple patterns around. So for synonyms, I think leaving SynonymQuery in place works great.


But I feel that if that's the rule, we need to open up discussion of other methods of scoring conceptual 'related term' relationships that usually come up in the context of query expansion. This paper (https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys the current thinking for scoring various query expansion scenarios like those we deal with in the messy, ambiguous uses of synonyms in prod systems (khakis aren't trousers, they're a kind-of trouser).


The cool thing is that many of the ideas in this paper seem doable with existing Lucene index stats. So one might imagine a 'related terms' token filter that injects some scoring based on how related each injected term really is to the original query term, using Jaccard, Dice, or other methods called out in this paper.
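
As a rough illustration of the Jaccard and Dice idea, here is a minimal Python sketch. It assumes a toy in-memory postings map (term to set of document ids) standing in for real index statistics; the `postings` data and the function names are hypothetical. In a real index the two document frequencies are ordinary term statistics, while the co-occurrence count would have to come from something like counting the documents that match both terms.

```python
def jaccard(df_a: int, df_b: int, df_both: int) -> float:
    """Jaccard coefficient from document frequencies: |A and B| / |A or B|."""
    union = df_a + df_b - df_both
    return df_both / union if union else 0.0


def dice(df_a: int, df_b: int, df_both: int) -> float:
    """Dice coefficient from document frequencies: 2|A and B| / (|A| + |B|)."""
    total = df_a + df_b
    return 2 * df_both / total if total else 0.0


# Toy postings (assumption): term -> ids of the documents containing it.
postings = {
    "trousers": {1, 2, 3, 4},
    "khakis": {3, 4, 5},
}


def relatedness(postings, a, b):
    """Return (jaccard, dice) for two terms, based on document co-occurrence."""
    docs_a = postings.get(a, set())
    docs_b = postings.get(b, set())
    both = len(docs_a & docs_b)
    return (
        jaccard(len(docs_a), len(docs_b), both),
        dice(len(docs_a), len(docs_b), both),
    )


j, d = relatedness(postings, "trousers", "khakis")
print(f"jaccard={j:.3f} dice={d:.3f}")  # jaccard=0.400 dice=0.571
```

Either value could then be used as a weight on the injected term, so that a loosely related term contributes less than a true synonym.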


Another insightful set of research is this article on concept scoring (https://usabilityetc.com/articles/information-retrieval-concept-matching/), which prioritizes related terms by connectedness and other factors.


Needless to say, how two terms that someone has asserted are related to a query term 'should be' scored is an open area. It's one of those things that will likely forever depend on a number of domain- and application-specific factors. It's possibly a big opportunity for improvement in Lucene, but it is likely about putting the right framework in place to allow for a good default set of query-expansion scoring scenarios with options for customization.


What I'm proposing is:


  • Submit a small patch that restricts SynonymQuery to tokens of type "SYNONYM" in the same position, which allows some short-term work to be done with the current Lucene QueryBuilder. Any additional non-synonym terms would be appended as a boolean query for now.

  • Begin work on alternate 'related-term' scoring systems that also key off the token type in QueryBuilder to create custom scoring using built-in term stats. The possibilities here are endless, up to weighted related terms (i.e., Alessandro's patch), feeding back Rocchio relevance feedback, etc.
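
The first bullet can be sketched roughly as follows. This is a plain-Python model, not Lucene code: the `Token` class, the `"HYPONYM"` type, and the tuple-based query representation are all assumptions made for illustration. Only tokens typed `"SYNONYM"` are grouped with the original term; anything else stacked at the same position becomes its own optional clause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    type: str = "word"  # assumed default type for ordinary tokens


def build_position_query(tokens):
    """Build a query for all tokens stacked at ONE position.

    The first token is taken to be the original query term (an assumption).
    Tokens typed "SYNONYM" are merged with it into a single synonym clause;
    every other stacked token is appended as a separate 'should' clause.
    """
    original, *stacked = tokens
    synonyms = [original.term] + [t.term for t in stacked if t.type == "SYNONYM"]
    others = [t.term for t in stacked if t.type != "SYNONYM"]

    if len(synonyms) > 1:
        syn_clause = ("synonym", tuple(synonyms))
    else:
        syn_clause = ("term", original.term)

    if not others:
        return syn_clause
    return ("bool_should", (syn_clause,) + tuple(("term", t) for t in others))


tokens = [
    Token("pants"),
    Token("trousers", "SYNONYM"),
    Token("khakis", "HYPONYM"),  # hypothetical non-synonym type
]
print(build_position_query(tokens))
# ('bool_should', (('synonym', ('pants', 'trousers')), ('term', 'khakis')))
```

The second bullet would replace the plain `("term", ...)` clauses for non-synonym types with whatever type-specific scoring is chosen later.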


I'm curious what folks would think of a patch for bullet one followed by other patches down the road for additional functionality?


(related to the discussion in this Elasticsearch PR: https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)


--
CTO, OpenSource Connections
Author, Relevant Search

Re: SynonymQuery / Query Expansion Strategies Discussion

J. Delgado
What about the use of word embeddings (see

On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <[hidden email]> wrote:


Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
Yes, that is another good area (there are many), although of course embeddings have their own challenges and complexities (they often capture shared context, but not shared meaning).

It's a data point, though, for something we'd want to include in such a framework; I'm not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <[hidden email]> wrote:

Re: SynonymQuery / Query Expansion Strategies Discussion

jim ferenczi
You can easily customize the query that is used for synonyms in a custom QueryBuilder. The javadoc of newSynonymQuery says "This is intended for subclasses that wish to customize the generated queries.", so I don't think we need to do anything there. I agree that it is sometimes better to use something other than SynonymQuery, but in the general case it works as expected and can be combined with other terms naturally. The kind of customization you want to achieve could be done in a plugin (or in Solr or ES) that extends QueryBuilder; you can also use custom token filters and alter the query the way you want. My point here is that QueryBuilder should remain simple; you can add the complexity you want in a subclass.
However, I think there is another area we need to fix: the scoring of multi-term synonyms is broken (compared to SynonymQuery) and could be improved, so we need something similar to SynonymQuery that handles multiple phrases.
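
The "keep the base builder simple, customize in a subclass" pattern can be sketched in a few lines of Python. This is only a stand-in for the idea: `SimpleQueryBuilder`, `DisMaxExpansionBuilder`, and the tuple query representation are hypothetical and do not mirror Lucene's actual classes or signatures.

```python
class SimpleQueryBuilder:
    """Toy base builder: one hook decides how stacked terms are combined."""

    def build(self, positions):
        # positions: one list of terms per query position
        clauses = tuple(
            ("term", terms[0]) if len(terms) == 1 else self.new_synonym_query(terms)
            for terms in positions
        )
        return clauses[0] if len(clauses) == 1 else ("bool_should", clauses)

    def new_synonym_query(self, terms):
        # Default behaviour: treat all stacked terms as exact synonyms.
        return ("synonym", tuple(terms))


class DisMaxExpansionBuilder(SimpleQueryBuilder):
    """Subclass that swaps in a different query for stacked terms."""

    def new_synonym_query(self, terms):
        # Alternative (assumed) strategy: score by the best-matching term only.
        return ("dismax", tuple(("term", t) for t in terms))


positions = [["khaki", "tan"], ["pants"]]
print(SimpleQueryBuilder().build(positions))
# ('bool_should', (('synonym', ('khaki', 'tan')), ('term', 'pants')))
print(DisMaxExpansionBuilder().build(positions))
# ('bool_should', (('dismax', (('term', 'khaki'), ('term', 'tan'))), ('term', 'pants')))
```

The base class never needs to know which expansion strategies exist; each strategy lives entirely in its own subclass.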


On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <[hidden email]> wrote:

Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
Thanks, Jim

Yeah, now that I think about it, I agree that perhaps the simplest option would be to create alternate query builders. I think there are a couple of enhancements to the base class that would be nice, such as:
- Some additional token attributes passed to newSynonymQuery, such as the type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the generated synonym terms
- Consistent support for phrases 

I think part of my goal too is to help people without the use of plugins, as at OpenSource Connections we are often in scenarios where people won't be able to use a plugin. In this case, alternate expansions around hypernyms/hyponyms/?... are a pretty frequent gap that search teams have using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is to implement additional QueryBuilder implementations that provide such functionality?

Thanks
-Doug

On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[hidden email]> wrote:

Re: SynonymQuery / Query Expansion Strategies Discussion

jim ferenczi
Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

I am not sure. I mentioned Solr and ES because I thought it was about adding taxonomies and complex expansion mechanisms to query builders, but I wonder if we can have a simple mechanism to just (de)boost stacked tokens in the QueryBuilder. It could be a new attribute that token filters would use when they produce stacked tokens and that the QueryBuilder checks when it builds the SynonymQuery. We already have a TermFrequencyAttribute to alter the frequency of a term when indexing, so we could have the same mechanism for query term boosting?
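
One way to picture the (de)boost idea is the minimal Python sketch below. It assumes a hypothetical per-token `boost` value set by a token filter, and it assumes that the stacked terms are scored as a single pseudo-term whose frequency is the boost-weighted sum of the individual term frequencies. The names and the weighting scheme are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackedToken:
    term: str
    boost: float = 1.0  # hypothetical per-token boost attribute


def blended_term_frequency(tokens, term_freqs):
    """Boost-weighted sum of the per-document frequencies of stacked tokens.

    term_freqs maps term -> frequency within one document (toy data).
    """
    return sum(tok.boost * term_freqs.get(tok.term, 0) for tok in tokens)


# "khakis" is stacked on "pants" but deboosted to half weight.
stacked = [StackedToken("pants"), StackedToken("khakis", boost=0.5)]

doc_with_original = {"pants": 2}
doc_with_related = {"khakis": 2}

print(blended_term_frequency(stacked, doc_with_original))  # 2.0
print(blended_term_frequency(stacked, doc_with_related))   # 1.0
```

With equal raw frequencies, the document matching the original term ends up with the higher blended frequency, which is the effect a deboosted stacked token is meant to have.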

On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <[hidden email]> wrote:

Re: SynonymQuery / Query Expansion Strategies Discussion

david.w.smiley@gmail.com
+1 great idea Jim!

On Tue, Nov 20, 2018 at 2:19 PM jim ferenczi <[hidden email]> wrote:
Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

I am not sure, I mentioned Solr and ES because I thought it was about adding taxonomies and complex expansion mechanisms to query builders but I wonder if we can have a simple
mechanism to just (de)boost stacked tokens in the QueryBuilder. It could be a new attribute that token filters would use when they produce stacked tokens and that the QueryBuilder checks when he builds the SynonymQuery. We already have a TermFrequencyAttribute to alter the frequency of a term when indexing so we could have the same mechanism for query term boosting ?

Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <[hidden email]> a écrit :
Thanks Jim

Yeah, now that I think about it - I agree that perhaps the simplest option would to create alternate query builders. I think there's a couple of enhancement to the base class that would be nice, such as
- Some additional token attributes passed to newSynonymQuery, such as the type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the generated synonym terms
- Consistent support for phrases 

I think part of my goal too is to help people without the use of plugins. As we often are in scenarios at OpenSource Connections where people won't be able to use a plugin. In this case alternate expansions around hypernyms/hyponyms/?... are a pretty frequent gap that search teams have using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

Thanks
-Doug

On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[hidden email]> wrote:
You can easily customize the query that is used for synonyms in a custom QueryBuilder. The javadocs of the newSynonymQuery says "This is intended for subclasses that wish to customize the generated queries." so I don't think we need to do anything there. I agree that it is sometimes better to use something different than the SynonymQuery but in the general case it works as expected and can be combined with other terms naturally. The kind of customization you want to achieve could be done in a plugin (or in Solr or ES) that extends the QueryBuilder, you can also use custom token filters and alter the query the way you want. My point here is that the QueryBuilder should remain simple, you can add the complexity you want in a subclass.
However I think there is another area we need to fix, the scoring of multi-terms synonyms is broken (compared to the SynonymQuery) and could be improved so we need something similar than the SynonymQuery that handles multi phrases. 


Le sam. 17 nov. 2018 à 07:19, Doug Turnbull <[hidden email]> a écrit :
Yes that is another good area (there are many). Although of course embeddings have their own challenges and complexities. (they often capture shared context, but not shared meaning).

It's a data point though of something we'd want to include in such a framework, though not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <[hidden email]> wrote:
What about the use of word embeddings (see

--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker

Re: SynonymQuery / Query Expansion Strategies Discussion

Michael Sokolov-4
In reply to this post by jim ferenczi
This is a great idea. It would also be compelling to modify the term frequency using this deboosting so that stacked indexed terms can be weighted according to their closeness to the original term.

On Tue, Nov 20, 2018, 2:19 PM jim ferenczi <[hidden email]> wrote:
Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

I am not sure. I mentioned Solr and ES because I thought this was about adding taxonomies and complex expansion mechanisms to query builders, but I wonder if we can have a simple mechanism to just (de)boost stacked tokens in the QueryBuilder. It could be a new attribute that token filters would set when they produce stacked tokens and that the QueryBuilder checks when it builds the SynonymQuery. We already have a TermFrequencyAttribute to alter the frequency of a term when indexing, so we could have the same mechanism for query term boosting?
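
To make the (de)boost idea concrete, here is a minimal sketch in plain Python rather than Lucene code. The `StackedToken` class and its `boost` field are hypothetical stand-ins for the new attribute being suggested; nothing here is an existing Lucene API. It only shows how a query builder could group stacked tokens by position and carry each token's boost along.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackedToken:
    term: str
    position: int
    boost: float = 1.0  # hypothetical per-token query boost set by a token filter


def build_clauses(tokens):
    """Group tokens by position; each group is a list of (term, boost) pairs."""
    groups = {}
    for tok in tokens:
        groups.setdefault(tok.position, []).append((tok.term, tok.boost))
    return [groups[pos] for pos in sorted(groups)]


# Query "blue khakis", where a filter stacked a de-boosted "trousers" on "khakis".
tokens = [
    StackedToken("blue", 0),
    StackedToken("khakis", 1),
    StackedToken("trousers", 1, boost=0.6),  # illustrative boost value
]
clauses = build_clauses(tokens)
```

Each inner list corresponds to one position; a builder could turn a multi-term group into a single weighted synonym-style query and a single-term group into a plain term query.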

On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <[hidden email]> wrote:
Thanks Jim

Yeah, now that I think about it, I agree that perhaps the simplest option would be to create alternate query builders. I think there are a couple of enhancements to the base class that would be nice, such as:
- Some additional token attributes passed to newSynonymQuery, such as the type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the generated synonym terms
- Consistent support for phrases 
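
As a rough illustration of why passing the token type through would help, the sketch below (plain Python, not Lucene code) splits the tokens stacked at one position into a same-meaning group and a related-terms group. The type names "word" and "SYNONYM" are the ones Lucene's analysis chain uses for ordinary and synonym tokens; "HYPONYM" and the `split_by_type` helper are assumptions made up for this example.

```python
def split_by_type(tokens, synonym_types=("word", "SYNONYM")):
    """Split (term, type) pairs into a same-meaning group and other related terms."""
    synonym_group, related = [], []
    for term, tok_type in tokens:
        if tok_type in synonym_types:
            synonym_group.append(term)
        else:
            related.append(term)
    return synonym_group, related


# Tokens stacked at a single query position; "HYPONYM" is a hypothetical type.
position_tokens = [
    ("trousers", "word"),
    ("pants", "SYNONYM"),
    ("khakis", "HYPONYM"),
]
same_meaning, related = split_by_type(position_tokens)
```

The first group could go into a SynonymQuery, while the second could be appended as separate optional clauses with their own scoring.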

I think part of my goal too is to help people without the use of plugins, as we are often in scenarios at OpenSource Connections where people won't be able to use a plugin. In this case, alternate expansions around hypernyms/hyponyms/?... are a pretty frequent gap for search teams using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

Thanks
-Doug


Re: SynonymQuery / Query Expansion Strategies Discussion

Alessandro Benedetti
Hi all,
last Sunday we spent some time on this topic; our considerations follow.

N.B. We didn't check the state of the art (thanks, Doug, for the nice survey you shared; I will definitely take a look later on).
So we just wanted to figure out an initial improvement that can later be refined with state-of-the-art formulas.
It is somewhat related to Jim's idea.
This was the output of our brainstorming:

Introduction
Currently in Apache Solr (and Elasticsearch) there is no supported way to manage true synonyms, hypernyms and hyponyms at query time.
A first attempt to add support for that was made by Doug Turnbull with the approach in the following pull requests [1].
We think that approach was a good starting point, but we do believe it could be improved.

Weaknesses of Current Approach
In our opinion the current approach has the following weaknesses:
- it tries to guess the hypernym/hyponym/synonym relation from the DF of the terms
- it doesn't necessarily favour the original query term
- it favours rarer hypernyms/hyponyms/synonyms and doesn't differentiate between them.

Proposed Improvements
  • Onym Class Priority Order
  • Onyms within a Class Ranked by Popularity

1 - Onym Class Priority Order
We believe it should be possible to give different priorities to different classes of onyms (hypernym/hyponym/synonym).
Specifically, we believe it should be possible to model this priority in scoring:

Original Query Term > True Synonym > Hyponym > Hypernym .

An additional benefit could be gained if this ordering could be customised based on user requirements,
e.g.
by adding different shades of onyms and a slightly different ordering:
Original Query Term > True Synonym > Hyponym > 2nd-level Hyponym > Hypernym .

2 - Onyms within a Class Ranked by Popularity

Within the same class we believe we need to favour the most popular (highest document frequency) onyms,
i.e. within true synonyms we'll favour the most popular one.
The same applies within hyponyms or hypernyms.
Generally, within an onym class we want terms with a higher document frequency to rank higher.

Proposed Solution
The proposed solution is to score the different onyms in this way:

Original Query Term -> IDFQueryTerm
True Synonym (boost: 1.0) ->  IDFQueryTerm * 1/(1+IDFSynonym)
Hyponym (boost<1.0)->  IDFQueryTerm * 1/(1+IDFHyponym)
Hypernym (boost<1.0) ->  IDFQueryTerm * 1/(1+IDFHypernym)

You may have noticed the introduction of the boost factor.
This is the key point of the onym classification.
All the onyms with the same boost belong to the same class.
This gives users the flexibility to rank the different onym classes based on their preference.
The boost solves problem 1 (Onym Class Priority Order).
Multiplying the original term's IDF by the second part of the formula solves problem 2 (Onyms within a Class Ranked by Popularity) and guarantees that the original term wins anyway.
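
A small worked example may help when checking these formulas. The sketch below is plain Python, not Lucene code. It assumes the boost is applied as a simple multiplier (the formulas above do not spell this out), uses the BM25-style IDF log(1 + (N - df + 0.5) / (df + 0.5)), and uses made-up document counts.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style inverse document frequency; always positive."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def onym_score(idf_query, idf_onym, boost=1.0):
    """Assumed reading of the proposal: boost * IDFQueryTerm * 1 / (1 + IDFOnym)."""
    return boost * idf_query / (1.0 + idf_onym)


N = 10_000                 # illustrative index size
idf_q = bm25_idf(100, N)   # original query term appears in 100 docs

scores = {
    "original": idf_q,
    "synonym_popular": onym_score(idf_q, bm25_idf(500, N), boost=1.0),
    "synonym_rare": onym_score(idf_q, bm25_idf(20, N), boost=1.0),
    "hyponym": onym_score(idf_q, bm25_idf(500, N), boost=0.8),
}
```

With these numbers the original term scores highest, and within the synonym class the more popular synonym beats the rarer one, as intended. Under this multiplicative reading, though, the popular hyponym still outscores the rare synonym, so a strict class order would depend on how far apart the boosts are relative to the IDF spread. That is an observation about this sketch only, not about the final implementation.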

Implementation
The suggested implementation will cover different areas:
- implement the scoring logic through blended DF / proxy term stats / proxy similarity (the best path to implement the designed scoring must be investigated)
- give the user a configuration file to model the onyms. A first version is already available through [2]. A first improvement could be to implement support for taxonomies such as:
/big cats/lion-panthera leo/simba-kimba.
A final solution would allow integration with custom knowledge bases, WordNet, etc.
- what about performance? You could add a configuration parameter that cuts the query expansion based on a boost threshold. We can think of the boost as the distance from the original concept, so the user should be able to cut down the expanded terms to favour performance.
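
The boost-threshold cut in the last point could look roughly like this. It is a plain-Python sketch; the `prune_expansions` helper, the `min_boost` parameter name and the boost values are all assumptions for illustration.

```python
def prune_expansions(expansions, min_boost):
    """Keep only expanded terms whose boost is at or above the threshold."""
    return [(term, boost) for term, boost in expansions if boost >= min_boost]


# Illustrative expansions; a lower boost means further from the original concept.
expansions = [
    ("lion", 1.0),
    ("panthera leo", 1.0),
    ("simba", 0.7),
    ("big cats", 0.4),
]
kept = prune_expansions(expansions, min_boost=0.5)
```

Raising the threshold shrinks the expanded query (fewer clauses to evaluate) at the cost of recall on the more distant terms.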

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
In reply to this post by Michael Sokolov-4
Great thoughts Jim - +1 to your idea

One brainstorm I had is that taxonomies have a kind of 'ideal scoring' that I think would lead to a different blending strategy for taxonomies than for synonyms.

If you have a taxonomy: 

\shoes\dress_shoes\oxfords
\shoes\dress_shoes\wingtips
\shoes\lazy_shoes\loafers
\shoes\lazy_shoes\sketchers

This taxonomy states - if a document mentions 'oxfords', it's also discussing the concept of dress shoes. If it only mentions 'wingtips' it also is discussing dress shoes.

Thus, ideally, the true document frequency of the parent concept 'dress shoes' is the combined document frequency of its children: the number of documents that mention any of them, and therefore discuss this concept.

You can repeat this for grandparent concepts. The number of documents with 'shoes' really is all the documents mentioning oxfords, wingtips, loafers, sketchers, and the like...
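
Here is a toy illustration of that 'true document frequency' idea in plain Python, not Lucene code. The posting lists are invented doc ids, and the concept's document set is taken to be the union of its own postings and those of all its descendants.

```python
# Taxonomy from the example above, as parent -> children.
TAXONOMY = {
    "shoes": ["dress_shoes", "lazy_shoes"],
    "dress_shoes": ["oxfords", "wingtips"],
    "lazy_shoes": ["loafers", "sketchers"],
}

# Toy postings: term -> set of doc ids (made-up data).
POSTINGS = {
    "oxfords": {1, 2},
    "wingtips": {2, 3},
    "loafers": {4},
    "sketchers": {5, 6},
    "dress_shoes": {7},
    "shoes": {1, 8},
}


def descendants(concept):
    """All terms below a concept in the taxonomy."""
    out = []
    for child in TAXONOMY.get(concept, []):
        out.append(child)
        out.extend(descendants(child))
    return out


def concept_doc_freq(concept):
    """Number of docs mentioning the concept itself or any descendant."""
    docs = set(POSTINGS.get(concept, set()))
    for term in descendants(concept):
        docs |= POSTINGS.get(term, set())
    return len(docs)
```

With this data, 'wingtips' has a document frequency of 2, 'dress_shoes' of 4 and 'shoes' of 8, so the concept gets more common (and its IDF lower) the higher you go in the hierarchy.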

We have implemented this idea at index time, with index-time semantic expansion to inject the parent concepts (manually putting dress_shoes into documents that just mention wingtips). This is described in this blog post: https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/ and this conference talk: https://www.youtube.com/watch?v=90F30PS-884. It is annoying and requires reindexing, though it's the most accurate.

BUT I think a blended query-time query would capture the same semantics. You basically want to score a taxonomy like the following. For a user query of 'wingtips', you could imagine 3 SHOULD clauses that blend at different levels:

- Search for the term 'wingtips' (lowest doc freq, smallest set) 
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes, highest df)

text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers, text:loafers, ...)

Right now this can be accomplished by just issuing 3 SHOULD queries with 3 different query-time analyzers each with different synonym expansions (exact user term, child => parent/sibling, child => parent, grandparent, etc...). And maybe it should stay that way.
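
The three expansion levels can be derived mechanically from the taxonomy. The following plain-Python sketch does that and renders them in the informal `Blended(...)` notation used above. `Blended` is just this thread's shorthand, not a real query parser syntax, and the helper names are made up.

```python
# Taxonomy from the example above, as child -> parent.
PARENT = {
    "oxfords": "dress_shoes",
    "wingtips": "dress_shoes",
    "loafers": "lazy_shoes",
    "sketchers": "lazy_shoes",
    "dress_shoes": "shoes",
    "lazy_shoes": "shoes",
}


def subtree(concept):
    """The concept plus every term below it."""
    terms = [concept]
    for child, parent in PARENT.items():
        if parent == concept:
            terms.extend(subtree(child))
    return terms


def blend_levels(term):
    """Level 0 is the term itself; each further level is an ancestor's whole subtree."""
    levels = [[term]]
    node = term
    while node in PARENT:
        node = PARENT[node]
        levels.append(sorted(subtree(node)))
    return levels


def to_query(levels, field="text"):
    """Render levels as 'text:a OR Blended(text:a, text:b, ...)'."""
    parts = []
    for terms in levels:
        clauses = ", ".join(f"{field}:{t}" for t in terms)
        parts.append(clauses if len(terms) == 1 else f"Blended({clauses})")
    return " OR ".join(parts)


levels = blend_levels("wingtips")
query = to_query(levels)
```

For 'wingtips' this yields the exact term, then the dress-shoes set, then the whole shoes set, which matches the three clauses sketched above (with terms sorted alphabetically inside each blend).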

But this is why I think it's a 'yes, AND': yes, I think it would be a great addition to have synonym weighting, AND I think there are blending strategies that are specific to the use case.

-Doug



--
CTO, OpenSource Connections
Author, Relevant Search

Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
Alessandro, reading your post, I realized I made a mistake: you'd need to go both up and down the hierarchy when blending. When a user searches for 'dress shoes', going down a level (or two) is just as important, so you also need the hyponym terms.

This works out if you do an index-time expansion (child terms get parent terms injected) but doesn't work out if you want 100% query-time blending.

In this case, I think I would revise my blending idea to:

- Search for the term 'wingtips' (lowest doc freq, smallest set) 
- Search for the term 'wingtips' blended with all child terms
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes, highest df)

In this case, I don't *think* you need any special weighting, as the true doc freq of each concept recreates the priority ordering you guys came up with. That's pretty neat!
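
A quick way to sanity-check that claim is to compute the blended document frequency at each level on toy data. The sketch below is plain Python with invented postings and hypothetical helper names. It builds the revised levels (term, term plus descendants, then each ancestor's subtree) for the query 'dress_shoes'.

```python
# Taxonomy as child -> parent, plus toy postings (made-up doc ids).
PARENT = {
    "oxfords": "dress_shoes",
    "wingtips": "dress_shoes",
    "loafers": "lazy_shoes",
    "sketchers": "lazy_shoes",
    "dress_shoes": "shoes",
    "lazy_shoes": "shoes",
}
POSTINGS = {
    "oxfords": {1, 2},
    "wingtips": {2, 3},
    "loafers": {4},
    "sketchers": {5, 6},
    "dress_shoes": {7},
    "shoes": {1, 8},
}


def subtree(concept):
    """The concept plus every term below it."""
    terms = [concept]
    for child, parent in PARENT.items():
        if parent == concept:
            terms.extend(subtree(child))
    return terms


def revised_levels(term):
    """Exact term, term blended with its descendants, then each ancestor's subtree."""
    levels = [[term], subtree(term)]
    node = term
    while node in PARENT:
        node = PARENT[node]
        levels.append(subtree(node))
    return levels


def blended_doc_freq(terms):
    """Docs matching any term in the blend."""
    docs = set()
    for t in terms:
        docs |= POSTINGS.get(t, set())
    return len(docs)


levels = revised_levels("dress_shoes")
doc_freqs = [blended_doc_freq(level) for level in levels]
```

On this data the blended document frequency never decreases as the levels widen, so an IDF-based score would naturally favour the exact term, then its children, then the broader concepts, without extra weights. Real indexes may of course behave less tidily than this toy example.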

-Doug

On Wed, Nov 21, 2018 at 7:20 AM Doug Turnbull <[hidden email]> wrote:
Great thoughts Jim - +1 to your idea

One brainstorm I had, is taxonomies have a kind of 'ideal scoring' that I think would lead to a different blending strategy for taxonomies than synonyms. 

If you have a taxonomy: 

\shoes\dress_shoes\oxfords
\shoes\dress_shoes\wingtips
\shoes\lazy_shoes\loafers
\shoes\lazy_shoes\sketchers

This taxonomy states - if a document mentions 'oxfords', it's also discussing the concept of dress shoes. If it only mentions 'wingtips' it also is discussing dress shoes.

Thus ideally, the true document frequency of the parent concept 'dress shoes' is the combination of the children. This is the number of documents that discuss this concept.

You can repeat this for grandparent concepts. The number of documents with 'shoes' really is all the documents mentioning oxfords, wingtips, loafers, sketchers, and the like...

We have implemented this idea at index time, with index-time semantic expansion to inject the parent concepts (manually put dress_shoes into documents that just mention wingtips). This is covered in this blog post https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/ and this conference talk https://www.youtube.com/watch?v=90F30PS-884. It is annoying and requires reindexing, though it's the most accurate.

BUT I think a blended query-time query would capture the same semantics. You basically want to score a taxonomy like the following. For a user query of 'wingtips', you could imagine 3 SHOULD clauses that blend at different levels:

- Search for the term 'wingtips' (lowest doc freq, smallest set) 
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes, highest df)

text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers, text:loafers, ...)

Right now this can be accomplished by just issuing 3 SHOULD queries with 3 different query-time analyzers each with different synonym expansions (exact user term, child => parent/sibling, child => parent, grandparent, etc...). And maybe it should stay that way.
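
Here is a hedged Python sketch of how those three clauses might score, assuming each blended clause is weighted by an idf computed from its blended (union) doc freq. The toy postings, the collection size, and the BM25-style idf are all assumptions for illustration; this is not how any existing Lucene query behaves out of the box.

```python
import math

# Hypothetical toy index: term -> ids of documents that mention it.
postings = {
    "oxfords": {1, 2},
    "wingtips": {2, 3},
    "loafers": {4},
    "sketchers": {5, 6},
}
NUM_DOCS = 10  # assumed collection size

# The three SHOULD clauses for a user query of 'wingtips'.
clauses = [
    ["wingtips"],
    ["wingtips", "oxfords", "dress_shoes"],
    ["wingtips", "oxfords", "dress_shoes", "loafers", "sketchers", "lazy_shoes", "shoes"],
]

def blended_docs(terms):
    """Documents matching any term of a blended clause."""
    docs = set()
    for term in terms:
        docs |= postings.get(term, set())
    return docs

def idf(doc_freq, num_docs=NUM_DOCS):
    """BM25-style idf (an assumed weighting, chosen only for the sketch)."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def score(doc_id):
    """Sum the idf of every blended clause the document matches."""
    total = 0.0
    for terms in clauses:
        docs = blended_docs(terms)
        if doc_id in docs:
            total += idf(len(docs))
    return total
```

With these numbers a wingtips document matches all three clauses, an oxfords-only document matches the two broader ones, and a loafers document matches only the broadest, so the ordering falls out of the doc freqs without any hand-tuned boosts.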

But this is why I think it's a 'yes, AND': yes, I think synonym weighting would be a great addition, AND I think there are blending strategies that are specific to the use case.

-Doug



On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov <[hidden email]> wrote:
This is a great idea. It would also be compelling to modify the term frequency using this deboosting so that stacked indexed terms can be weighted according to their closeness to the original term.

On Tue, Nov 20, 2018, 2:19 PM jim ferenczi <[hidden email]> wrote:
Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

I am not sure. I mentioned Solr and ES because I thought it was about adding taxonomies and complex expansion mechanisms to query builders, but I wonder if we can have a simple mechanism to just (de)boost stacked tokens in the QueryBuilder. It could be a new attribute that token filters would use when they produce stacked tokens and that the QueryBuilder checks when it builds the SynonymQuery. We already have a TermFrequencyAttribute to alter the frequency of a term when indexing, so we could have the same mechanism for query term boosting?

Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <[hidden email]> a écrit :
Thanks Jim

Yeah, now that I think about it - I agree that perhaps the simplest option would be to create alternate query builders. I think there are a couple of enhancements to the base class that would be nice, such as:
- Some additional token attributes passed to newSynonymQuery, such as the type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the generated synonym terms
- Consistent support for phrases 

I think part of my goal too is to help people without the use of plugins, as we are often in scenarios at OpenSource Connections where people won't be able to use a plugin. In this case, alternate expansions around hypernyms/hyponyms/?... are a pretty frequent gap that search teams have using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is we could implement additional QueryBuilder implementations that provide such functionality?

Thanks
-Doug

On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[hidden email]> wrote:
You can easily customize the query that is used for synonyms in a custom QueryBuilder. The javadoc of newSynonymQuery says "This is intended for subclasses that wish to customize the generated queries." so I don't think we need to do anything there. I agree that it is sometimes better to use something other than the SynonymQuery, but in the general case it works as expected and can be combined with other terms naturally. The kind of customization you want to achieve could be done in a plugin (or in Solr or ES) that extends the QueryBuilder; you can also use custom token filters and alter the query the way you want. My point here is that the QueryBuilder should remain simple; you can add the complexity you want in a subclass.
However, I think there is another area we need to fix: the scoring of multi-term synonyms is broken (compared to the SynonymQuery) and could be improved, so we need something similar to the SynonymQuery that handles multiple phrases.


Le sam. 17 nov. 2018 à 07:19, Doug Turnbull <[hidden email]> a écrit :
Yes that is another good area (there are many). Although of course embeddings have their own challenges and complexities. (they often capture shared context, but not shared meaning).

It's a data point though of something we'd want to include in such a framework, though not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <[hidden email]> wrote:
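What about the use of word embeddings (see https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) to compute word similarity?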

On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <[hidden email]> wrote:

Hey folks,


I wanted to open up a discussion about a change to the usage of SynonymQuery. The goal here is to have a broader library of queries that can address other cases where related terms occupy the same position but don't have the same meaning (such as hypernyms, hyponyms, meronyms, ambiguous terms, and other query expansion situations).  


I bring this up because we've noticed (as I'm sure many of you have) the pattern of clients jamming any related term into a synonyms file and being surprised with odd results. I like the idea of enforcing "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to tell a client and setup simple patterns. So for synonyms, I think leaving SynonymQuery in place works great.


But I feel if that's the rule, we need to open up discussion of other methods of scoring conceptual 'related term' relationships that usually comes up in the context of query expansion. This paper (https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys the current thinking for scoring various query expansion scenarios like those we deal with in the messy, ambiguous uses of synonyms in prod systems (khakis aren't trousers, they're a kind-of trouser).


The cool thing is many of the ideas in this paper seem doable with existing Lucene index stats. So one might imagine a 'related terms' token filter that injected some scoring based on how related it really is to the original query term using Jaccard, Dice, or other methods called out in this paper.


Another insightful set of research is this article on concept scoring (https://usabilityetc.com/articles/information-retrieval-concept-matching/), which prioritizes related terms by connectedness and other factors.


Needless to say, it's an open area how two terms someone has asserted are related to a query term 'should be' scored. It's one of those things that likely will forever depend on a number of domain and application specific factors. It's possibly a big opportunity of improvement for Lucene - but likely is about putting the right framework in place to allow for good default set of query-expansion scoring scenarios with options for customization.


What I'm proposing is:


  • Submit a small patch that restricts SynonymQuery to tokens of type "SYNONYM" in the same posn, which allows some short term work to be done with the current Lucene QueryBuilder. Any additional non-synonym terms would be appended as a boolean query for now

  • Begin work on alternate 'related-term' scoring systems that also key off the token type in QueryBuilder to create custom scoring using built-in term stats. The possibilities here are endless, up to weighted related terms (ie Alessandro's patch), feeding back Rocchio relevance feedback, etc


I'm curious what folks would think of a patch for bullet one followed by other patches down the road for additional functionality?


(related to discussion in this Elasticsearch PR

https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)


--
CTO, OpenSource Connections
Author, Relevant Search

Re: SynonymQuery / Query Expansion Strategies Discussion

Robert Muir
In reply to this post by jim ferenczi
I don't think we should put scoring stuff into the analysis chain like
this. It already has a laundry list of responsibilities.

The analysis chain can tell you the term is stacked or it's a certain type
or occurs a certain number of times, but it shouldn't be supplying
things such as floating point boosts. That kind of scoring
manipulation really needs to happen in query parsing/somewhere else.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
I agree there is a tension between analysis and query parser responsibilities (or whatever sits outside of how queries are constructed). I wonder what you'd think of making QueryBuilder easier to subclass by passing more term metadata to newSynonymQuery (such as types etc). This would let you select an alternate strategy (such as some of the scoring systems used in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf), or do something with a term labeled a hyponym/hypernym in a QueryBuilder subclass.
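
As a rough illustration only, here is what that could look like in Python. `SketchQueryBuilder` is a hypothetical stand-in, not Lucene's QueryBuilder, and the token type labels ("word", "SYNONYM", "HYPERNYM") and the tuple-based query representation are assumptions made up for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedTerm:
    text: str
    type: str  # assumed labels, e.g. "word", "SYNONYM", "HYPONYM", "HYPERNYM"

class SketchQueryBuilder:
    """Hypothetical stand-in for a query builder that hands term types to its hook."""

    def new_synonym_query(self, terms):
        # Default: treat every term at this position as an exact synonym.
        return ("synonym", tuple(t.text for t in terms))

class TaxonomyAwareBuilder(SketchQueryBuilder):
    """Subclass that keeps true synonyms together and splits out related terms."""

    EXACT_TYPES = ("word", "SYNONYM")

    def new_synonym_query(self, terms):
        same = [t for t in terms if t.type in self.EXACT_TYPES]
        related = [t for t in terms if t.type not in self.EXACT_TYPES]
        query = super().new_synonym_query(same)
        if not related:
            return query
        # Related terms become separate optional clauses, so a different
        # scoring strategy can be chosen per type.
        return ("should", (query,) + tuple(("term", t.text, t.type) for t in related))
```

The point is just that the decision lives in the query-building side: analysis only labels the tokens, and the subclass picks the strategy.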

-Doug

--
CTO, OpenSource Connections
Author, Relevant Search

Re: SynonymQuery / Query Expansion Strategies Discussion

Michael Gibney
On the analysis chain side, could the desired functionality be scoped to: providing a framework (Attribute?) to express information about the relationship between a derived token and its corresponding input? For example, one might include information about:
1. corresponding input token (i.e., input token text?)
2. relationship between derived token and input (e.g., synonym, hyponym, hypernym ... but perhaps not limited to these)
3. degree of confidence/weight in the derived token? This would represent a concept distinct from "weight" for the purpose of scoring, and could thus be appropriate to the analysis chain.
4. source/reason of token derivation relationship (e.g., specific ontology, taxonomy, etc...)
5. ....

This could provide all the information necessary to support custom indexing strategies and/or query strategies, while remaining strictly focused on analysis per se. This type of approach (if relationship info were recorded in index, e.g. via Payload) could also support explicitly navigable facets that are ontology-aware, or other potentially interesting things ...
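
A minimal Python sketch of the kind of record described above, purely as an illustration. `DerivedTokenInfo` and `group_by_relationship` are hypothetical names, the 0 to 1 confidence range is an assumption, and this is not an existing Lucene Attribute.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DerivedTokenInfo:
    """Hypothetical record of how a derived token relates to its input token."""
    token: str
    input_token: str               # 1. corresponding input token text
    relationship: str              # 2. e.g. "synonym", "hyponym", "hypernym"
    confidence: float = 1.0        # 3. confidence in the derivation, not a scoring boost
    source: Optional[str] = None   # 4. e.g. the name of an ontology or taxonomy

    def __post_init__(self):
        # Assumed convention: confidence is a value between 0 and 1.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

def group_by_relationship(infos):
    """Group derived tokens by relationship so a consumer can pick a strategy per group."""
    grouped = {}
    for info in infos:
        grouped.setdefault(info.relationship, []).append(info.token)
    return grouped
```

An indexing or query strategy could then consume the grouped tokens however it likes, which keeps the analysis side descriptive rather than deciding scores.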

Michael


On Wed, Nov 21, 2018 at 9:24 AM Doug Turnbull <[hidden email]> wrote:
I agree there is a tension between analysis and query parser responsibilities (or external to how queries are constructed). I wonder what you'd think of making QueryBuilder more easily subclassible by passing more term metadata to newSynonymQuery (such as types etc). This would let you select an alt strategy (such as some of the scoring systems used in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing something with a term labeled a hyponym/hypernym in a QueryBuilder subclass..

-Doug

On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <[hidden email]> wrote:
I don't think we should put scoring stuff into the analysis chain like
this. It already has a laundry list of responsibilities.

Analysis chain can tell you the term is stacked or its a certain type
or occurs a certain number of times, but it shouldn't be supplying
things such as floating point boosts. That kind of scoring
manipulation needs to really happen in query parsing/somewhere else.

On 11/20/18, jim ferenczi <[hidden email]> wrote:
> Sorry for the late reply,
>
>> So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> I am not sure, I mentioned Solr and ES because I thought it was about
> adding taxonomies and complex expansion mechanisms to query builders but I
> wonder if we can have a simple
> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could be
> a new attribute that token filters would use when they produce stacked
> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
> already have a TermFrequencyAttribute to alter the frequency of a term when
> indexing so we could have the same mechanism for query term boosting ?
>
> Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
> [hidden email]> a écrit :
>
>> Thanks Jim
>>
>> Yeah, now that I think about it - I agree that perhaps the simplest
>> option
>> would to create alternate query builders. I think there's a couple of
>> enhancement to the base class that would be nice, such as
>> - Some additional token attributes passed to newSynonymQuery, such as the
>> type (was this a synonym or hyponym or something else...)
>> - The ability to differentiate between the original query term and the
>> generated synonym terms

Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
In reply to this post by Doug Turnbull
There are a lot of different topics and ideas here, so we captured the use cases we see being discussed in this Google doc

Basically, we've seen 5 high-level use cases discussed:
- Alt Labels (what SynonymQuery does well now)
- Synonyms (looser synonyms with close meaning that need to be scored somehow - `notebook,laptop`)
- Taxonomies (hierarchies of concepts/terms `dress shoes\oxfords`)
- Ontologies / Knowledge Graphs (networks of concepts)
- Embeddings (distributed representations of a term)

It's a doc in progress; embeddings need more work and are probably the hardest thing on the list. There's possible other

The goal isn't so much to make Lucene implement all of these (shoving it all in would create a lot of maintenance headaches). Some of it is just defining practices, patterns, and tools that enable these things in Lucene-based search. Some may require no work, and some may require supporting functionality.
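
To make the "need to be scored somehow" case concrete, here is a minimal sketch of the Jaccard and Dice co-occurrence measures mentioned in the opening post, computed from the sets of documents that contain each term. The `postings` data and the `expansion_weights` helper are made up for illustration; this is plain Python, not a Lucene API, and a real implementation would read these statistics from the index.

```python
def jaccard(docs_a, docs_b):
    """Shared documents divided by documents containing either term."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def dice(docs_a, docs_b):
    """Twice the shared documents divided by the sum of both document counts."""
    total = len(docs_a) + len(docs_b)
    return 2 * len(docs_a & docs_b) / total if total else 0.0


# Hypothetical toy postings: term -> ids of the documents containing it.
postings = {
    "notebook": {1, 2, 3, 4},
    "laptop": {2, 3, 4, 5},
    "khakis": {6, 7},
    "trousers": {6, 7, 8, 9, 10, 11},
}


def expansion_weights(original, candidates, measure=jaccard):
    """Weight each candidate expansion by how often it co-occurs with the original term."""
    return {c: measure(postings[original], postings[c]) for c in candidates}
```

With this toy data, `notebook`/`laptop` overlap heavily and get a high weight, while `khakis`/`trousers` get a lower one because most documents about trousers never mention khakis. Whether such a weight should become a boost on the expanded term is exactly the open question in this thread.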

-Doug

On Wed, Nov 21, 2018 at 9:23 AM Doug Turnbull <[hidden email]> wrote:
I agree there is a tension between analysis and query parser responsibilities (or external to how queries are constructed). I wonder what you'd think of making QueryBuilder more easily subclassable by passing more term metadata to newSynonymQuery (such as types etc.). This would let you select an alternate strategy (such as some of the scoring systems used in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf), or do something with a term labeled a hyponym/hypernym in a QueryBuilder subclass.

-Doug

On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <[hidden email]> wrote:
I don't think we should put scoring stuff into the analysis chain like
this. It already has a laundry list of responsibilities.

The analysis chain can tell you the term is stacked, or that it's a certain
type, or occurs a certain number of times, but it shouldn't be supplying
things such as floating point boosts. That kind of scoring
manipulation really needs to happen in query parsing or somewhere else.

On 11/20/18, jim ferenczi <[hidden email]> wrote:
> Sorry for the late reply,
>
>> So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>
> I am not sure. I mentioned Solr and ES because I thought it was about
> adding taxonomies and complex expansion mechanisms to query builders, but I
> wonder if we can have a simple mechanism to just (de)boost stacked tokens
> in the QueryBuilder. It could be a new attribute that token filters would
> use when they produce stacked tokens and that the QueryBuilder checks when
> it builds the SynonymQuery. We already have a TermFrequencyAttribute to
> alter the frequency of a term when indexing, so we could have the same
> mechanism for query term boosting?

Re: SynonymQuery / Query Expansion Strategies Discussion

Robert Muir
In reply to this post by Doug Turnbull
QueryBuilder already has analyzeBoolean/analyzeMultiBoolean, which you can
use for this. You can look at any attribute on the tokenstream you
want. I don't see any need to add any more API.
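
As a rough illustration of what "look at any attribute on the tokenstream" could mean in practice, the sketch below walks a list of tokens, groups stacked tokens (position increment 0) with the token they sit on, and splits each position into exact synonyms and other related terms by token type. The `Token` tuple is a hypothetical stand-in for the term, position-increment, and type attributes, the `HYPERNYM` type is invented for the example, and the dict output is only a placeholder for real query objects. It is not the analyzeBoolean implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for the attributes carried by one token in a stream.
Token = namedtuple("Token", ["term", "pos_inc", "type"])

# Types treated as "means exactly the same"; anything else is a related term.
EXACT_TYPES = ("word", "SYNONYM")


def group_by_position(tokens):
    """Group tokens so that those with pos_inc == 0 stack on the previous position."""
    positions = []
    for tok in tokens:
        if tok.pos_inc > 0 or not positions:
            positions.append([])
        positions[-1].append(tok)
    return positions


def build_query(tokens):
    """Build one clause per position, keeping exact synonyms apart from related terms."""
    clauses = []
    for stacked in group_by_position(tokens):
        clause = {"synonym": [t.term for t in stacked if t.type in EXACT_TYPES]}
        related = [t.term for t in stacked if t.type not in EXACT_TYPES]
        if related:
            clause["related"] = related
        clauses.append(clause)
    return clauses
```

For a stream like `khakis` (type `word`), `trousers` (stacked, type `HYPERNYM`), `sale` (type `word`), this yields two clauses: the first keeps `khakis` as the synonym group and lists `trousers` separately as a related term, the second holds only `sale`.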


Re: SynonymQuery / Query Expansion Strategies Discussion

jim ferenczi
My proposal was to tweak the boosting directly in the token filters through a single Attribute, but if we feel that is too much to add to the analysis chain, I agree that we don't need to add any API. If you rely on abstract attributes (type, ...), it should be easy to subclass the query builder to access them and implement the logic you want there.
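
A minimal sketch of that subclassing idea, assuming the analysis chain only labels tokens with a type and the builder decides what the label is worth. Both classes are toy stand-ins written for this example (they are not Lucene's QueryBuilder), and the type names and boost values are invented.

```python
class SimpleQueryBuilder:
    """Toy builder: every term stacked at a position is treated as an exact synonym."""

    def new_synonym_query(self, stacked):
        # stacked is a list of (term, type) pairs sharing one position.
        return {"synonym": [term for term, _type in stacked]}

    def build(self, positions):
        return [self.new_synonym_query(stacked) for stacked in positions]


class TypeAwareQueryBuilder(SimpleQueryBuilder):
    """Subclass that overrides the hook and picks a boost from the token type."""

    # Hypothetical per-type boosts; real values would be tuned per application.
    TYPE_BOOSTS = {"word": 1.0, "SYNONYM": 1.0, "HYPONYM": 0.7, "HYPERNYM": 0.4}
    DEFAULT_BOOST = 0.5

    def new_synonym_query(self, stacked):
        # The boost is chosen here, at query-building time, not in the token filter.
        return {
            "should": [
                (term, self.TYPE_BOOSTS.get(typ, self.DEFAULT_BOOST))
                for term, typ in stacked
            ]
        }
```

The base class stays simple and the scoring policy lives entirely in the subclass, so the token filters never have to carry floating point boosts.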

Le jeu. 22 nov. 2018 à 13:18, Robert Muir <[hidden email]> a écrit :
There is already analyzeBoolean/analyzeMultiBoolean there that you can
use for this. You can look at any attribute on the tokenstream you
want. I don't see any need to add any more API.

On 11/21/18, Doug Turnbull <[hidden email]> wrote:
> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassible by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alt strategy (such as some of the scoring systems used in the
> query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass..
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <[hidden email]> wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or its a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation needs to really happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi <[hidden email]> wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into
>> >> Lucene
>> > is we could implement additional QueryBuilder implementations that
>> provide
>> > such functionality?
>> >
>> > I am not sure, I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders
>> > but
>> I
>> > wonder if we can have a simple
>> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
>> > could
>> be
>> > a new attribute that token filters would use when they produce stacked
>> > tokens and that the QueryBuilder checks when he builds the
>> > SynonymQuery.
>> We
>> > already have a TermFrequencyAttribute to alter the frequency of a term
>> when
>> > indexing so we could have the same mechanism for query term boosting ?
>> >
>> > Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
>> > [hidden email]> a écrit :
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option
>> >> would to create alternate query builders. I think there's a couple of
>> >> enhancement to the base class that would be nice, such as
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> the
>> >> type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> plugins.
>> >> As we often are in scenarios at OpenSource Connections where people
>> won't
>> >> be able to use a plugin. In this case alternate expansions around
>> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>> >> have
>> >> using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into
>> >> Lucene
>> >> is
>> >> we could implement additional QueryBuilder implementations that
>> >> provide
>> >> such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[hidden email]>
>> >> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> custom
>> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>> >>> intended for subclasses that wish to customize the generated
>> >>> queries."
>> so
>> >>> I
>> >>> don't think we need to do anything there. I agree that it is
>> >>> sometimes
>> >>> better to use something different than the SynonymQuery but in the
>> >>> general
>> >>> case it works as expected and can be combined with other terms
>> >>> naturally.
>> >>> The kind of customization you want to achieve could be done in a
>> >>> plugin
>> >>> (or
>> >>> in Solr or ES) that extends the QueryBuilder, you can also use custom
>> >>> token
>> >>> filters and alter the query the way you want. My point here is that
>> >>> the
>> >>> QueryBuilder should remain simple, you can add the complexity you
>> >>> want
>> in
>> >>> a
>> >>> subclass.
>> >>> However I think there is another area we need to fix, the scoring of
>> >>> multi-terms synonyms is broken (compared to the SynonymQuery) and
>> >>> could
>> >>> be
>> >>> improved so we need something similar than the SynonymQuery that
>> handles
>> >>> multi phrases.


Re: SynonymQuery / Query Expansion Strategies Discussion

Michael Gibney
I think the objection to "boosting" in token filters isn't that it
is "too much", but rather that it breaks the abstraction of the
analysis chain by directly targeting scoring (as implied by
characterizing it as "boosting").

That said, I'm sympathetic to an approach that would establish an
Attribute to expose the kind of information that would be useful in
the context of synonyms (or other sorts of derived tokens discussed
here, where it could be useful to express information about token
derivation). Such an Attribute would not be directly related to
scoring/boosting, but would be related to analysis per se (e.g.,
source token text, thesaurus, degree of confidence); support
could be selectively implemented by TokenFilters, and optionally
leveraged by query builders (e.g., translated to boosts) or even
recorded to index Payloads by a final custom analysis component.

"You can look at any attribute on the tokenstream you want", "rely on
abstract attributes (type, ...) then it should be easy to sub-class
the query builder to access them".  Obviously that works iff analysis
components record the relevant information in attributes on the
tokenstream, which I think they currently don't (for much of the
information that has been discussed here) ... and I know of no
standard way to express the relevant information on the tokenstream.

I can see that such an Attribute would be out of place (too
specialized) in the context of the Attributes in lucene/core; but
there are lots of more specialized Attributes in the various
submodules under lucene/analysis/* (SynonymGraphFilter lives in
analysis-common, FWIW). Again, this doesn't strike me as terribly
specialized, if one thinks of it more generally as a
"derivation/relationship" Attribute.

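A minimal sketch of the "derivation/relationship" idea, assuming plain Java stand-ins rather than real Lucene classes: `Derivation`, `DerivedToken` and `BoostingQuerySketch` are hypothetical names, and the rule "boost = confidence" is only an assumed policy for illustration. The analysis side records where a term came from; the query side is the only place where that information turns into a boost.

```java
import java.util.List;

/** Hypothetical derivation metadata; a stand-in, not an existing Lucene Attribute. */
final class Derivation {
    final String sourceText;  // text of the token this term was derived from
    final String thesaurus;   // resource that asserted the relationship
    final float confidence;   // analysis-side confidence in [0, 1]; not a boost

    Derivation(String sourceText, String thesaurus, float confidence) {
        this.sourceText = sourceText;
        this.thesaurus = thesaurus;
        this.confidence = confidence;
    }
}

/** A term plus optional derivation; derivation is null for original tokens. */
final class DerivedToken {
    final String term;
    final Derivation derivation;

    DerivedToken(String term, Derivation derivation) {
        this.term = term;
        this.derivation = derivation;
    }
}

/** Query-side consumer: the only place where confidence becomes a boost. */
final class BoostingQuerySketch {
    /** Renders a query-string-like form, e.g. term^boost for derived terms. */
    static String render(List<DerivedToken> tokens) {
        StringBuilder sb = new StringBuilder();
        for (DerivedToken t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.term);
            if (t.derivation != null) {
                // Assumed policy: use the confidence directly as the boost.
                sb.append('^').append(t.derivation.confidence);
            }
        }
        return sb.toString();
    }
}
```

Because the derivation data says nothing about scoring, a different consumer could ignore the confidence entirely, or write it into a payload at index time, without any change on the analysis side.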

Re: SynonymQuery / Query Expansion Strategies Discussion

Alan Woodward-3
I think we can expose this information now with a small tweak to the SynonymGraphFilter, using the already-existing TypeAttribute.

SGF is hard-coded to set the type attribute to “SYNONYM” on all tokens that it inserts into the stream.  It should be simple to add another constructor parameter allowing users to change this; then you can chain synonym filters, one for each type of expansion you want: synonym, hyponym, hypernym, whatever, each setting the type attribute differently.
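
To illustrate the chaining idea, here is a rough sketch in plain Java that models the token stream as a list; it does not use Lucene. `TypedToken` and `TypedExpander` are hypothetical names, and the configurable type label plays the role of the proposed extra constructor parameter. As a simplification (an assumption of this sketch, not a statement about how SynonymGraphFilter behaves), each expander only expands original "word" tokens, so terms inserted by an earlier stage are passed through untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical token model: term text, type label, and position. */
final class TypedToken {
    final String term;
    final String type;
    final int position;

    TypedToken(String term, String type, int position) {
        this.term = term;
        this.type = type;
        this.position = position;
    }
}

/** One expansion stage; every term it inserts carries this stage's type label. */
final class TypedExpander {
    static final String WORD = "word"; // label used here for original tokens

    private final String type;
    private final Map<String, List<String>> related;

    TypedExpander(String type, Map<String, List<String>> related) {
        this.type = type;
        this.related = related;
    }

    List<TypedToken> apply(List<TypedToken> input) {
        List<TypedToken> out = new ArrayList<>();
        for (TypedToken t : input) {
            out.add(t);
            if (!WORD.equals(t.type)) {
                continue; // simplification: never expand already-inserted terms
            }
            List<String> expansions = related.get(t.term);
            if (expansions != null) {
                for (String e : expansions) {
                    // Inserted terms share the position of the original token.
                    out.add(new TypedToken(e, type, t.position));
                }
            }
        }
        return out;
    }
}
```

Chaining one expander labelled "SYNONYM" and another labelled "HYPONYM" leaves every inserted term tagged with the kind of relationship that produced it, which a downstream query builder can then inspect.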




Re: SynonymQuery / Query Expansion Strategies Discussion

Doug Turnbull
I like that idea, Alan. The trick is that for QueryBuilder's 'newSynonymQuery' to be useful in that context, you need to pass terms with metadata down to the subclass. This is what I started working on a few weeks ago:

https://github.com/o19s/lucene-solr/commit/0fc3930671ef002cfbb5e3d52b6f8edc3715bf14

I don't think it's as simple as overriding analyzeBoolean/analyzeMultiBoolean as Rob suggests, as there's also analyzeGraphBoolean and the  that would also need to collect this metadata. I wouldn't want to copy paste all this code into a subclass just to add one token attribute. 

-Doug
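
As a rough illustration of passing terms with metadata down to a subclass, the following plain-Java sketch models the hook with strings instead of Lucene queries. It is not the code from the linked commit and not Lucene's actual QueryBuilder API: `TermWithType`, `SimpleBuilderSketch` and `TypeAwareBuilderSketch` are hypothetical names, and the hook signature taking typed terms is an assumption. The subclass keeps only "word" and "SYNONYM" terms in the synonym clause and appends everything else as separate optional clauses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical carrier for a term plus the token type seen during analysis. */
final class TermWithType {
    final String term;
    final String type;

    TermWithType(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

/** Stand-in for a base query builder; queries are modeled as strings. */
class SimpleBuilderSketch {
    /** Hook receiving all terms at one position; default treats them all as synonyms. */
    protected String newSynonymQuery(List<TermWithType> terms) {
        StringBuilder sb = new StringBuilder("Synonym(");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(terms.get(i).term);
        }
        return sb.append(')').toString();
    }
}

/** Subclass that only needs to override the hook, not the analysis loops. */
class TypeAwareBuilderSketch extends SimpleBuilderSketch {
    @Override
    protected String newSynonymQuery(List<TermWithType> terms) {
        List<TermWithType> exact = new ArrayList<>();
        List<TermWithType> others = new ArrayList<>();
        for (TermWithType t : terms) {
            if ("word".equals(t.type) || "SYNONYM".equals(t.type)) {
                exact.add(t);
            } else {
                others.add(t);
            }
        }
        // True synonyms share one clause; other related terms become extra clauses.
        StringBuilder sb = new StringBuilder(super.newSynonymQuery(exact));
        for (TermWithType t : others) {
            sb.append(" OR ").append(t.term);
        }
        return sb.toString();
    }
}
```

The point of the design is that the traversal of the token stream stays in the base class, and a subclass only decides how typed terms at one position are combined.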






--
CTO, OpenSource Connections
Author, Relevant Search