Practical usages of arbitrary Shingles when using a query parser?


Chris Hostetter-3

Although I've been aware of Shingles and some of their useful applications for
a long time, today is the first time I really sat down and tried to do
something non-trivial with them myself.

My objective seems relatively straightforward: given a corpus of text and
some analyzer (for the sake of discussion let's assume simple whitespace
tokenization w/lowercasing) I want to be able to say "I am happy to trade
index time/size for faster queries of shorter phrases".

So instead of just indexing "the quick brown fox jumped over the lazy dog"
as a field with 9 terms, I might want to add ShingleFilterFactory to the
end of my analyzer using [[minShingleSize="2" maxShingleSize="2"
outputUnigrams="true"]] and now I have a field w/17 terms (9 unigrams plus
8 bi-shingles), but if I get a query for a "phrase" of 2 words/terms, I
should in theory be able to just use a TermQuery under the covers -- making
it just as "fast" as a query for a single word/term.  But meanwhile longer
phrases should still "just work" as if I didn't have any shingles.
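
To make the term count concrete, here is a minimal Python sketch of the index-time setup described above. It is not Lucene code: the `shingles` helper is a hypothetical stand-in that only mimics the documented effect of minShingleSize, maxShingleSize and outputUnigrams on a whitespace-tokenized, lowercased string.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Return (position, term) pairs, roughly mimicking a shingle filter.

    Hypothetical helper for illustration only; it ignores offsets,
    position lengths and filler tokens.
    """
    out = []
    for pos in range(len(tokens)):
        if output_unigrams:
            out.append((pos, tokens[pos]))
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough tokens remain to fill it.
            if pos + size <= len(tokens):
                out.append((pos, " ".join(tokens[pos:pos + size])))
    return out


text = "the quick brown fox jumped over the lazy dog"
tokens = text.lower().split()   # whitespace tokenization w/lowercasing
terms = shingles(tokens)

print(len(tokens))  # 9 unigrams
print(len(terms))   # 17 terms: 9 unigrams + 8 bi-shingles
```

The extra 8 terms are the index-size cost being traded for cheaper two-word phrase lookups.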

So far so good...

If I actually index a corpus as described above, and then at query time I
use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2"
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the
expected TermQuery for either a single word input or two-word input ...
for input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one
composed of bi-shingles instead of individual unigrams, but AFAICT the
position info is set correctly so that it will only match the documents
that would have been matched w/o any shingles (and IIUC the term stats
for the shingles seem like they should probably result in subjectively
"better" scores? not certain on this bit, but also not overly concerned
about it).
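
The query-time behaviour described above can be sketched the same way. This is an illustrative Python model, not the QueryParser itself: `query_shingles` and `query_shape` are hypothetical names, and the "one term means TermQuery, several consecutive terms mean PhraseQuery" rule is a simplifying assumption that only holds when min and max shingle size are equal.

```python
def query_shingles(tokens, min_size=2, max_size=2):
    """Mimic outputUnigrams=false with outputUnigramsIfNoShingles=true."""
    out = []
    for pos in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if pos + size <= len(tokens):
                out.append((pos, " ".join(tokens[pos:pos + size])))
    if not out:
        # Input too short to build any shingle: fall back to unigrams.
        out = list(enumerate(tokens))
    return out


def query_shape(tokens):
    """Assumed mapping from analyzed terms to a query type (min == max only)."""
    terms = query_shingles(tokens)
    return "TermQuery" if len(terms) == 1 else "PhraseQuery"


for phrase in ("fox", "quick brown", "quick brown fox"):
    toks = phrase.split()
    print(phrase, "->", query_shape(toks), query_shingles(toks))
```

For the three-word input the two bi-shingles sit at consecutive positions 0 and 1, which is why a plain phrase over shingles is enough here.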

The problem is that (unless I'm missing something) this doesn't really
work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.

If I change my index time ShingleFilterFactory to use [[minShingleSize="2"
maxShingleSize="N" outputUnigrams="true"]] the equivalent change to the
query time analyzer would be [[minShingleSize="2" maxShingleSize="N"
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while
that does seem to cause "phrase" input of all sizes to be converted by the
analyzer+QueryParser into a query that (AFAICT) will match the correct
documents (compared to using no shingles) it's only "optimized" as a
TermQuery for one & two word phrases.  For input phrases longer than 2
terms it generates a SpanOrQuery wrapping multiple SpanNearQueries,
I believe because of the overlapping positions of the bi/tri/quad-etc..
shingles.
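
The overlap is easy to see in a small Python sketch (again a hypothetical model of the filter output, not Lucene code; it tracks start positions only and ignores position length). With min=2 and max=3, a three-word input yields two different shingles starting at the same position:

```python
from collections import defaultdict


def query_shingles(tokens, min_size, max_size):
    """Mimic outputUnigrams=false with outputUnigramsIfNoShingles=true."""
    out = []
    for pos in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if pos + size <= len(tokens):
                out.append((pos, " ".join(tokens[pos:pos + size])))
    if not out:
        out = list(enumerate(tokens))
    return out


def by_position(terms):
    """Group terms by the position at which they start."""
    grouped = defaultdict(list)
    for pos, term in terms:
        grouped[pos].append(term)
    return dict(grouped)


terms = query_shingles("quick brown fox".split(), min_size=2, max_size=3)
print(by_position(terms))
# {0: ['quick brown', 'quick brown fox'], 1: ['brown fox']}
```

Position 0 carries both a bi-shingle and a tri-shingle, so a parser looking at this stream no longer sees a single term or a simple chain of one term per position.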

There just doesn't seem to be any good/generic way to leverage a field
built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]]
(where X != Y) at query time using a QueryParser configured with
out-of-the-box analyzer components.

It seems like what's missing is a ShingleFilter(Factory) configuration
that means "output the maximum possible shingle size between MIN and
MAX based on the size of the input stream" ... but that doesn't seem to
exist.
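
As a rough illustration of that missing behaviour, the following Python sketch emits only shingles of the largest size that fits the input. `largest_shingles` is a hypothetical function written for this example; no such option exists in ShingleFilter(Factory) according to the message above.

```python
def largest_shingles(tokens, min_size, max_size):
    """Emit only shingles of the largest size that fits between MIN and MAX.

    Hypothetical behaviour: falls back to unigrams when the input is
    shorter than min_size.
    """
    size = min(len(tokens), max_size)
    if size < min_size:
        return list(enumerate(tokens))
    return [(pos, " ".join(tokens[pos:pos + size]))
            for pos in range(len(tokens) - size + 1)]


words = "the quick brown fox jumped".split()
for n in range(1, len(words) + 1):
    print(n, largest_shingles(words[:n], min_size=2, max_size=4))
```

With min=2 and max=4, inputs of one to four words each yield a single token (a TermQuery candidate), and a five-word input yields two 4-shingles at consecutive positions, with exactly one term per position.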

Does anyone have any advice/suggestions on how to approach this type of
problem based on their own experiences?  Does anyone have first hand
experience using maxShingleSize > 2 with a QueryParser (and w/o any
preconceived assumptions about the length of the input)?

-Hoss
http://www.lucidworks.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Practical usages of arbitrary Shingles when using a query parser?

Adrien Grand
Hi Hoss,

The query parser is confused by these overlapping positions indeed, which
it interprets as synonyms. I was going to write that you should set the
same min and max shingle sizes at query time, but while writing that I
realized that you probably wanted to keep outputting shorter shingles so
that a phrase query on 2 terms with a max shingle size of 3 would still use
shingles? Maybe 'outputUnigramsIfNoShingles' should really be something
like 'outputShinglesOfTheMaximumSizeOnly'?

For the record, in addition to the problems that you mentioned,
ShingleFilter proved very hard to fix so that it works correctly on
top of synonyms when X != Y[1], which encouraged Alan to work on a new
FixedShingleFilter[2] that deals with index-time synonyms (i.e. ignores
position length) just fine but only allows X == Y. Also, instead of feeding
an analyzer with shingles to the query parser, we found it more
user-friendly to add an option to text fields in order to index 2-shingles
into a separate field and redirect phrase queries to it.[3] We did
something similar for edge-ngrams[4] to optimize prefix queries, because of
the same problem: you need more than appending an EdgeNGramTokenFilter
to your analysis chain to make prefix queries efficient. In the end we might
remove the ability to set shingle or ngram filters in analyzers and just
make them implementation details of the aforementioned options.

[1] https://issues.apache.org/jira/browse/LUCENE-3475
[2] https://issues.apache.org/jira/browse/LUCENE-8202
[3] https://github.com/elastic/elasticsearch/pull/30450
[4] https://github.com/elastic/elasticsearch/pull/28290

(In reply to Chris Hostetter <[hidden email]>, Tue, 31 Jul 2018 at 00:46)

Re: Practical usages of arbitrary Shingles when using a query parser?

Chris Hostetter-3

: The query parser is confused by these overlapping positions indeed, which
: it interprets as synonyms. I was going to write that you should set the

Sure -- I'm not blaming the QueryParser, what it does with the
Shingles output makes sense (and actually works! .. just not as efficiently
as possible).  I'm trying to figure out how to make the ShingleFilter
output more useful in the query time analyzer use case.

: it interprets as synonyms. I was going to write that you should set the
: same min and max shingle sizes at query time, but while writing that I
: realized that you probably wanted to keep outputting shorter shingles so
: that a phrase query on 2 terms with a max shingle size of 3 would still use

Yes exactly ... if at index time you output both unigrams and shingles of
sizes 2-5, and at query time you have a "phrase" of only 2 words, ideally
the filter should output a single Token so you can make a single TermQuery
-- likewise if you have a phrase of 3 words, or 4 words, or 5 words,
those should ideally all produce single tokens.

Your suggestion of "same min & max at query time" where min=max=X is
something I briefly considered, but that means you're only optimizing the
"phrases" of length "X", all shorter phrases just use unigrams, and in
fact there is no point in building shingles of any size other than X at
index time.

: shingles? Maybe 'outputUnigramsIfNoShingles' should really be something
: like 'outputShinglesOfTheMaximumSizeOnly'?

That's what I was thinking -- but I haven't dug into the code enough to
understand how complex that would be. (I was starting with "Am I missing
something about how/why this shouldn't/doesn't already exist?")

: For the record, in addition to the problems that you mentioned,
: ShingleFilter proved very hard to fix so that it works correctly on
: top of synonyms when X != Y[1], which encouraged Alan to work on a new
: FixedShingleFilter[2] that deals with index-time synonyms (i.e. ignores

Yeah ... I can't even imagine the complexity of dealing with "graph" based
synonyms and shingles (didn't read your link for fear of my own sanity).

: position length) just fine but only allows X == Y. Also instead of feeding
: an analyzer with shingles to the query parser, we found it more
: user-friendly to add an option to text fields in order to index 2-shingles
: into a separate field and redirect phrase queries to it.[3] We did

Right ... I'm actually looking at a system now that puts uni-shingles,
bi-shingles, and tri-shingles in 3 different fields, and then pre-parses the
input to figure out how long it is to decide which field to query ... I'm
trying to simplify that.
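
For readers unfamiliar with that kind of setup, a minimal Python sketch of length-based field routing might look like the following. The field names (`text_uni`, `text_bi`, `text_tri`) and the fallback to the tri-shingle field for longer input are assumptions made for illustration; the message does not say how the actual system names its fields or handles phrases longer than three words.

```python
# Hypothetical field names, one per shingle size.
FIELDS = {1: "text_uni", 2: "text_bi", 3: "text_tri"}


def pick_field(phrase):
    """Pre-parse the input and choose a field based on its word count."""
    n = len(phrase.lower().split())
    if n == 0:
        raise ValueError("empty phrase")
    # Assumption: anything longer than the largest shingle size is sent
    # to the field with the largest shingles.
    return FIELDS[min(n, max(FIELDS))]


for phrase in ("fox", "quick brown", "quick brown fox", "the quick brown fox"):
    print(repr(phrase), "->", pick_field(phrase))
```

The routing logic lives outside the analyzer and query parser, which is the extra moving part that a single shingled field would avoid.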

Ideally what I'd like to be able to say is "give me a phrase, if the
field is configured w/o any shingles at all it will work fine (via
PhraseQuery), but if the analyzer is configured with shingles it will be
even faster (via TermQuery) if/when the query phrase is 'shorter' than
the max shingle length".


-Hoss
http://www.lucidworks.com/
