Proposal: Full support for multi-word synonyms at query time


Jack Krupansky-2
One of the ongoing potholes of Solr and Lucene is the lack of full support for multi-word synonyms at query time. The root of the problem is twofold: individual terms are presented for analysis, which precludes recognition of multi-term synonyms, and the output stream from the analysis process is a single, linear stream without regard to any graph/lattice structure for multiple synonyms.
 
I intend to file a Jira, but wanted to get some wide attention and feedback on whether people are ready to finally tackle this ongoing thorn in the side of an otherwise fantastic enterprise search tool.
 
My proposed solution is fourfold:
 
1. Add an attribute, call it “path” for now, to the analysis process so that tokens coming out of the analysis in a linear stream can be easily reconstituted into the graph/lattice for multiple synonyms (single or multi-term) at the same position in a token sequence. There could be multiple paths at a position and paths can be nested, possibly using a dot notation such as “1.3.2”. There may be better ways to do this – this is just an initial proposal to get the ball rolling.
2. Add a utility class and method for analysis for query parsers to present a sequence of adjacent terms, rather than a single term at a time, so that multiword synonyms can be recognized. Query parsers would be expected to present a “term sequence” – sequence of adjacent terms without intervening operators – at one time.
3. Add a Query generation class and method that can take the graph/lattice for a token sequence containing nested synonym alternatives and generate the appropriate Query structure with BooleanQuery SHOULD or SpanOrQuery to implement synonym alternatives at a given position.
4. Modify the most popular query parsers to use the new analysis/generation.
 
Obviously there are lots of fine details to resolve.
 
What I wanted to do right now is see if there is general support for pushing forward with such a radical change, say for Lucene and Solr 5.0, or I suppose some 4.x > 4.0.
 
If I get enough support, I’ll file the Jira. Otherwise, I’ll just wait a year and then try again.
 
I’m not personally committing to do the actual work, but simply to get the ball rolling and keep it rolling. I’ll do work to the extent that nobody else is jumping in first. And I certainly don’t want to propose some giant patch that never gets approved and has to be constantly updated as the rest of Lucene/Solr changes. I would hope that pieces of this large task could be carved off and committed incrementally to avoid having a monster patch at the end.
 
So, the questions (primarily for committers) for now are:
 
1. Do people want to see this go forward now (reasonably near future as opposed to more than a year away)?
2. Does the overall approach seem feasible and low enough risk?
3. Will this approach provide people with search results they expect?
4. Is this a high enough value feature change to justify the effort?
 
As far as support for multi-word synonyms at index time... uhhhhh... that’s another story. I think the two (query vs. index) can be separated. The basic problem at index time is that if you index “heart attack” and “myocardial infarction” at the same positions, queries of “heart infarction” and “myocardial attack” will have false matches. And if the synonyms in the list have varying lengths, the position of the next term will be off for phrase queries. In any case, I am proposing moving forward with a full solution at query time only, for now.
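
The index-time false match described above can be reproduced with a toy positional index. This is only a sketch in plain Java, not Lucene code: the class `StackedSynonymDemo`, the map-based index, and the method names are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy positional index (hypothetical, not Lucene): position -> terms stored there.
final class StackedSynonymDemo {

    // "heart attack" with "myocardial infarction" stacked on the same two positions.
    static Map<Integer, Set<String>> stackedIndex() {
        Map<Integer, Set<String>> index = new HashMap<>();
        index.put(0, new HashSet<>(Arrays.asList("heart", "myocardial")));
        index.put(1, new HashSet<>(Arrays.asList("attack", "infarction")));
        return index;
    }

    // Exact phrase match: every phrase term must sit at consecutive positions.
    static boolean phraseMatches(Map<Integer, Set<String>> index, String phrase) {
        String[] terms = phrase.split(" ");
        for (int start : index.keySet()) {
            boolean ok = true;
            for (int i = 0; i < terms.length; i++) {
                Set<String> atPos = index.get(start + i);
                if (atPos == null || !atPos.contains(terms[i])) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With this index, `phraseMatches(stackedIndex(), "heart infarction")` returns true even though that phrase was never indexed, because the positions alone cannot tell which stacked terms belong to the same synonym.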

-- Jack Krupansky

Re: Proposal: Full support for multi-word synonyms at query time

Robert Muir
On Fri, Aug 10, 2012 at 1:36 PM, Jack Krupansky <[hidden email]> wrote:
> One of the ongoing potholes of Solr and Lucene is lack of full support for
> multi-word synonyms at query time. The root of the problem is twofold:
> individual terms are presented for analysis which precludes recognition of
> multi-term synonyms, and the output stream from the analysis process is a
> single, linear stream without regard to any graph/lattice structure for
> multiple synonyms.

But this is not true. PositionLengthAttribute was already added, which
makes it a graph.

--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Proposal: Full support for multi-word synonyms at query time

Jack Krupansky-2
The Javadoc says "positionLength determines how many positions this token
spans". It's not obvious from the documentation how the full graph structure
for nested multi-word synonyms can be expressed merely using that attribute.
Is this detailed anywhere? (Maybe in Jira... but it is still down.) I mean,
a multi-word synonym is multiple tokens. How does any of the "tokens" span
more than one position?

-- Jack Krupansky


Re: Proposal: Full support for multi-word synonyms at query time

Yonik Seeley-2-2
On Fri, Aug 10, 2012 at 2:10 PM, Jack Krupansky <[hidden email]> wrote:
> The Javadoc says "positionLength determines how many positions this token
> spans". It's not obvious from the documentation how the full graph structure
> for nested multi-word synonyms can be expressed merely using that attribute.
> Is this detailed anywhere? (Maybe in Jira... but it is still down.) I mean,
> a multi-word synonym is multiple tokens. How does any of the "tokens" span
> more than one position?

You sort of do it in reverse I think... make the small token take up a
bigger amount of space.

so for
 (US | united states) gold medals

"US" would have a length of 2 so it would skip ahead to "gold", while
"united" and "states" would both have normal values of 1.

-Yonik
http://lucidworks.com
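
To make the encoding above concrete, here is a small sketch that rebuilds the alternative paths from a flat token stream in which each token carries a position increment and a position length. It is plain Java rather than Lucene's attribute API; `TokenGraphDemo`, `Tok`, and the specific increment values (including an increment of 0 for the stacked "united") are assumptions chosen to match the example in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a token stream; not Lucene's API.
final class TokenGraphDemo {

    static final class Tok {
        final String term;
        final int posInc; // distance from the previous token's start position
        final int posLen; // number of positions this token spans

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Flat stream for "(US | united states) gold medals".
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("US", 1, 2),      // spans two positions, so it skips ahead to "gold"
            new Tok("united", 0, 1),  // stacked at the same start position as "US"
            new Tok("states", 1, 1),
            new Tok("gold", 1, 1),
            new Tok("medals", 1, 1));
    }

    // Enumerate every path through the graph implied by the stream.
    static List<String> paths(List<Tok> stream) {
        int[] start = new int[stream.size()];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < stream.size(); i++) {
            Tok t = stream.get(i);
            pos += t.posInc;
            start[i] = pos;
            end = Math.max(end, pos + t.posLen);
        }
        List<String> out = new ArrayList<>();
        walk(stream, start, 0, end, "", out);
        return out;
    }

    // Depth-first walk: a token starting at `at` leads to position at + posLen.
    private static void walk(List<Tok> stream, int[] start, int at, int end,
                             String prefix, List<String> out) {
        if (at == end) {
            out.add(prefix);
            return;
        }
        for (int i = 0; i < stream.size(); i++) {
            if (start[i] == at) {
                Tok t = stream.get(i);
                String next = prefix.isEmpty() ? t.term : prefix + " " + t.term;
                walk(stream, start, at + t.posLen, end, next, out);
            }
        }
    }
}
```

For the example stream, `paths(example())` yields the two readings "US gold medals" and "united states gold medals", which is the graph the position lengths are meant to express.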


Re: Proposal: Full support for multi-word synonyms at query time

Jack Krupansky-2
In reply to this post by Robert Muir
I just noticed this in SynonymFilter in trunk:

// TODO: we should set PositionLengthAttr too...

It looks like the code does in fact set the PositionLengthAttribute, so
maybe it is just a dead TODO.

And, I see the following comment (which I had seen before and was the basis
for my assertion that arbitrary graphs were not supported):

* <p><b>NOTE</b>: when a match occurs, the output tokens
* associated with the matching rule are "stacked" on top of
* the input stream (if the rule had
* <code>keepOrig=true</code>) and also on top of another
* matched rule's output tokens.  This is not a correct
* solution, as really the output should be an arbitrary
* graph/lattice.  For example, with the above match, you
* would expect an exact <code>PhraseQuery</code> <code>"y b
* c"</code> to match the parsed tokens, but it will fail to
* do so.  This limitation is necessary because Lucene's
* TokenStream (and index) cannot yet represent an arbitrary
* graph.</p>

Granted, some of that is specific to index-time support for synonyms, which
I am avoiding, but it is a source for some confusion. If full graphs are
somehow supported at query time (or in the TokenStream in general), that
should be stated more clearly.

-- Jack Krupansky


Re: Proposal: Full support for multi-word synonyms at query time

Lance Norskog-2
I would do the query parser part first, without the graph part. This
would allow two words without quotes to match a two-word synonym. This
would be a great improvement on the current behavior. Suggested
behavior:

one two three
- "one two", "two three" and "one two three" will be checked against synonyms
one two "three"
- "one two" can be a synonym
one two OR three
- "one two" can be a synonym
one OR two OR three
- no multi-word synonyms

This would be clear, intuitive behavior. I'm sure there are other use
cases that may not make sense, but these are the common use cases.
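
The suggested behavior can be sketched as follows: split the query into runs of adjacent bare terms, breaking a run at any operator or quoted phrase, and offer every sub-sequence of two or more terms as a synonym candidate. This is a plain-Java illustration only; the class `SynonymCandidates`, the whitespace tokenization, and the operator set `AND`/`OR`/`NOT` are assumptions, not how any existing query parser works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of Lucene or Solr.
final class SynonymCandidates {

    private static final Set<String> OPERATORS =
        new HashSet<>(Arrays.asList("AND", "OR", "NOT"));

    // Return every multi-word candidate to check against the synonym list.
    static List<String> candidates(String query) {
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>();
        boolean inQuote = false;
        for (String tok : query.trim().split("\\s+")) {
            boolean hasQuote = tok.indexOf('"') >= 0;
            boolean bare = !inQuote && !hasQuote && !OPERATORS.contains(tok);
            if (bare) {
                run.add(tok);
            } else {
                // An operator or quoted text ends the current term sequence.
                flush(run, out);
                run.clear();
            }
            for (char c : tok.toCharArray()) {
                if (c == '"') {
                    inQuote = !inQuote;
                }
            }
        }
        flush(run, out);
        return out;
    }

    // Emit all contiguous sub-sequences of length >= 2, shortest first.
    private static void flush(List<String> run, List<String> out) {
        for (int len = 2; len <= run.size(); len++) {
            for (int start = 0; start + len <= run.size(); start++) {
                out.add(String.join(" ", run.subList(start, start + len)));
            }
        }
    }
}
```

Under these assumptions, `candidates("one two OR three")` offers only "one two", and `candidates("one OR two OR three")` offers nothing, matching the four cases listed above.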

--
Lance Norskog
[hidden email]


Re: Proposal: Full support for multi-word synonyms at query time

mimimimi
In reply to this post by Jack Krupansky-2
Solr fails to handle multi-word synonyms at query time for two reasons:

    First, the Lucene query parser tokenizes the user query on whitespace, so a multi-word term is split into separate terms before it reaches the synonym filter, and the synonym filter cannot recognize the multi-word term to expand it.
    Second, if the synonym filter expands a term into several alternatives that include a multi-word synonym, SolrQueryParserBase currently uses MultiPhraseQuery to handle the synonyms, but MultiPhraseQuery does not work when the alternatives have different numbers of words.
For the first, we can quote every multi-word synonym in the user query so that the Lucene query parser does not split it. There is a related Jira issue: https://issues.apache.org/jira/browse/LUCENE-2605.

For the second, we can replace the MultiPhraseQuery with a BooleanQuery of SHOULD clauses, each one a PhraseQuery, whenever the token stream contains a multi-word synonym.
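
As a rough illustration of the second fix, the sketch below turns a list of synonym alternatives into one OR-ed group in classic Lucene query syntax, quoting each multi-word alternative as a phrase. It builds a query string instead of real `BooleanQuery`/`PhraseQuery` objects so that it stays self-contained; the class `SynonymQueryDemo` and method `anyOf` are hypothetical, and special characters are not escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that emits query-syntax text; not Lucene's query API.
final class SynonymQueryDemo {

    // One optional (OR) clause per alternative; multi-word alternatives become phrases.
    static String anyOf(List<String> alternatives) {
        List<String> clauses = new ArrayList<>();
        for (String alt : alternatives) {
            String a = alt.trim();
            clauses.add(a.contains(" ") ? "\"" + a + "\"" : a);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }
}
```

Because each alternative is its own clause, alternatives with different numbers of words can sit side by side, which is the case the MultiPhraseQuery approach is said not to handle.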
