Improving performance for use-case where large (200) number of phrase queries are used?

Improving performance for use-case where large (200) number of phrase queries are used?

Aaron Daubman
Greetings,

We have a Solr instance in use that gets some perhaps atypical queries
and suffers from poor (>2 second) QTimes.

Documents (~2,350,000) in this instance consist mainly of various
"descriptive fields", such as multi-word (phrase) tags - an average
document contains 200-400 such phrases across several different
multi-valued field types.

A custom QueryComponent has been built that functions somewhat like a
very specific MoreLikeThis. A seed document is specified via the
incoming query; its terms are retrieved, boosted both by query
parameters and by fields within the document that specify term
weighting, and sorted by this custom boost. A second query is then
crafted from the top 200 resulting field values, paired with their
fields, and used to search for documents matching those 200 values.

For many searches, 25-50% of the documents match the query of 200
terms (600,000 to 1,200,000 documents).

After doing some profiling, it seems that the majority of the QTime
comes from dealing with phrases and the resulting term positions,
since most of the search terms are actually multi-word tokenized
phrases (processing is dominated by ExactPhraseScorer on down,
particularly SegmentTermPositions and readVInt).

I have thought of a few ways to improve performance for this use case
and am looking for feedback on which seems best, as well as any
insight into other ways to approach this problem that I haven't
considered (or things to look into to better understand the slow
QTimes):

1) Shard the index - since there is no key that would route queries to
a particular shard, this would only be of benefit if scoring is done
in parallel. Is there documentation I have so far missed that
describes distributed searching for this case? (I haven't found
anything that really describes the differences in scoring for
distributed vs. non-distributed indices, aside from the warnings that
distributed IDF doesn't work - which I don't think we really care
about.)

2) Implement "Common Grams" as described here:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
It's not clear how many of the individual words in the phrases being
used are, in fact, common, but given that 25-50% of the documents in
the index match many queries, it seems this may be of value.
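
My understanding is that this would look something like the following
in schema.xml (an untested sketch - the type name is arbitrary, and
commonwords.txt is a hypothetical file listing whatever high-frequency
words we identify):

<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index word-pair tokens like "of_the" alongside the single words -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- at query time, emit the pair tokens where possible, so common
         words are never looked up on their own -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>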

3) Try to make mm (minimum "should" clauses that must match) work for
the custom query. I haven't been able to figure out exactly how this
parameter works, but my thinking is along the lines of "if only 2 of
those 200 terms match a document, it doesn't need to get scored". What
I don't currently understand is at what point failing the mm
requirement short-circuits - e.g. does the doc still get scored? If it
does short-circuit prior to scoring, this may help somewhat, although
it's not clear this would prevent the many, many reads of term
positions that are still killing QTime.
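
(For what it's worth, mm is a parameter of the dismax/edismax query
parsers, so our custom component would presumably have to apply the
equivalent itself via BooleanQuery.setMinimumNumberShouldMatch(). With
stock edismax it would look something like this in solrconfig.xml -
the handler name and the "2<75%" value are made up for illustration;
that spec requires all clauses when there are at most 2, and 75% of
them above that:)

<requestHandler name="/mlt-phrases" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- "<" must be XML-escaped inside solrconfig.xml -->
    <str name="mm">2&lt;75%</str>
  </lst>
</requestHandler>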

4) Set a dynamic number of terms (rather than the currently fixed 200)
based on the custom boosting/weighting value - e.g. only use terms
whose calculated value is above some threshold. I'm not keen on this,
since some documents may be dominated by many weak terms without
having any great ones, and it might break for those (finding the
"sweet spot" cutoff would not be straightforward).

5) *This is my current favorite*: stop tokenizing/analyzing these
terms and just use KeywordTokenizer. Most of these phrases are
pre-vetted, and it may be possible to clean/process any others before
creating the docs. My main worry here is that, currently, if I
understand correctly, a document with the phrase "brazilian pop" would
still be returned as a match for a seed document containing only the
phrase "brazilian" (not the other way around, but that is not
necessary); with KeywordTokenizer, this would no longer be the case.
If I switched from the current dubious tokenize/stem/etc. chain and
just used KeywordTokenizer, would this allow queries like "this used
to be a long phrase query" to match documents that have "this used to
be a long phrase query" as one of the multi-valued values in the field
without having to pull term positions (and thus significantly speed up
performance)?
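
For concreteness, the field type I have in mind would be something
like this (an untested sketch - the type name is arbitrary, and I'm
assuming only lowercasing and trimming are worth keeping from the
current analysis chain):

<fieldType name="tag_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- each multi-valued value becomes a single token, so
         "brazilian pop" is indexed as one term -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Each of the 200 clauses would then be a single-term lookup, with no
position data read at all.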

Thanks,
     Aaron

Re: Improving performance for use-case where large (200) number of phrase queries are used?

Robert Muir
On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman <[hidden email]> wrote:

> Greetings,
>
> We have a Solr instance in use that gets some perhaps atypical queries
> and suffers from poor (>2 second) QTimes.
> [...]

a few more ideas:
* use shingles, e.g. to turn two-word phrases into single terms (how
long is your average phrase?) - see the sketch below.
* in addition to the above, maybe for phrases with > 2 terms, consider
just a boolean conjunction of the shingled phrases instead of a "real"
phrase query: e.g. "more like this" -> (more_like AND like_this). This
would have some false positives.
* use a more aggressive stopwords list for your "MorePhrasesLikeThis".
* reduce this number (200), and instead work harder to prune out which
phrases are the "most descriptive" from the seed document, e.g. based
on some heuristics like their frequency or location within that seed
document, so your query isn't so massive.
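
for the shingle idea, an untested schema.xml sketch (the type name is
made up, and tokenSeparator="_" only to match the more_like example;
set outputUnigrams="false" if you want just the pairs):

<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- join adjacent words into single tokens:
         "more like this" -> more_like, like_this -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator="_"/>
  </analyzer>
</fieldType>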

Re: Improving performance for use-case where large (200) number of phrase queries are used?

Peter Keegan
Could you index your 'phrase tags' as single tokens? Then your phrase
queries become simple TermQuerys.

On Wed, Oct 24, 2012 at 12:26 PM, Robert Muir <[hidden email]> wrote:

> a few more ideas:
> [...]

Re: Improving performance for use-case where large (200) number of phrase queries are used?

Aaron Daubman
In reply to this post by Robert Muir
Thanks for the ideas - some follow-up questions inline below:


> * use shingles e.g. to turn two-word phrases into single terms (how
> long is your average phrase?).

Would this be different from what I was calling "common grams" (other
than shingling every pair of words, rather than just the common ones)?


> * in addition to the above, maybe for phrases with > 2 terms, consider
> just a boolean conjunction of the shingled phrases instead of a "real"
> phrase query: e.g. "more like this" -> (more_like AND like_this). This
> would have some false positives.

This would definitely help, but, IIRC, we moved to phrase queries
because of too many false positives. It would be an interesting
experiment to see how many false positives remain when shingling and
then just doing conjunctive queries.


> * use a more aggressive stopwords list for your "MorePhrasesLikeThis".
> * reduce this number 200, and instead work harder to prune out which
> phrases are the "most descriptive" from the seed document, e.g. based
> on some heuristics like their frequency or location within that seed
> document, so your query isnt so massive.

This is something I've been asking for (perform some sort of PCA /
feature selection on the actual terms used), but it is of questionable
value and hard to do "right", so it hasn't happened yet. (It's not
clear that there will be terms that are very common but not also very
descriptive, so the extent to which this would help is unknown.)

Thanks again for the ideas!
     Aaron

Re: Improving performance for use-case where large (200) number of phrase queries are used?

Aaron Daubman
In reply to this post by Peter Keegan
Hi Peter,

Thanks for the recommendation - I believe we are thinking along the
same lines, but I wanted to check to make sure. Are you suggesting
something different from my #5 (below), or are we essentially
suggesting the same thing?

On Wed, Oct 24, 2012 at 1:20 PM, Peter Keegan <[hidden email]> wrote:
> Could you index your 'phrase tags' as single tokens? Then your phrase
> queries become simple TermQuerys.

>>
>> 5) *This is my current favorite*: stop tokenizing/analyzing these
>> terms and just use KeywordTokenizer. [...]
>>

Thanks again,
     Aaron

Re: Improving performance for use-case where large (200) number of phrase queries are used?

Peter Keegan
Yes, #5 is the same thing (sorry, I didn't read them all thoroughly).
Your description of the phrases as 'tags' suggests that you don't need
term positions for matching, and, as you noted, you would get unwanted
partial matches. And the TermQuerys would be much faster.

Peter


On Wed, Oct 24, 2012 at 8:33 PM, Aaron Daubman <[hidden email]> wrote:

> Hi Peter,
>
> Thanks for the recommendation - I believe we are thinking along the
> same lines, but I wanted to check to make sure. [...]