Performance problems with extremely common terms in collection (Solr 7.4)

Ash Ramesh
Hi everybody,

We have a corpus of 50+ million documents in our collection. I've noticed
that some queries with specific keywords tend to be extremely slow, e.g.
q='photography' or q='background'. After digging into the raw documents,
I could see that each of these two terms appears in more than 90% of all
documents, which means Solr has to score each of those documents.

Is there a best practice for dealing with these sorts of queries? Should
Solr normally be able to handle these queries quickly (we have 8 shards)?
The average, reproducible response time for these queries is between
1.5 and 2.5 seconds.

Please let me know if more information is required.

Regards.

Ash

--
P.S. We've launched a new blog to share the latest ideas and case studies
from our team. Check it out here: product.canva.com
<https://product.canva.com/>.
<https://www.canva.com/> Empowering the world to design
Also, we're hiring. Apply here! <https://about.canva.com/careers/>
<https://twitter.com/canva> <https://facebook.com/canva>
<https://au.linkedin.com/company/canva> <https://instagram.com/canva>

Re: Performance problems with extremely common terms in collection (Solr 7.4)

Toke Eskildsen-2
On Mon, 2019-04-08 at 09:58 +1000, Ash Ramesh wrote:
> We have a corpus of 50+ million documents in our collection. I've
> noticed that some queries with specific keywords tend to be extremely
> slow.
> E.g. the q='photography' or q='background'. After digging into the
> raw documents, I could see that these two terms appear in greater
> than 90% of all documents, which means solr has to score each of
> those documents.

That is known behaviour, which can be remedied somewhat. Stop words are
a common approach, but your examples do not seem to fit well with
that. Instead you can look at Common Grams, where your high-frequency
words get concatenated with surrounding words. This only works with
phrases, though. There's a nice article at

https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

- Toke Eskildsen, Royal Danish Library
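
To make the Common Grams idea above concrete, here is a minimal Python sketch of the token transformation. It is an illustration only, not Solr's implementation (in Solr this is done by an analysis filter such as solr.CommonGramsFilterFactory), and the function name and sample word list are assumptions made for the example.

```python
def common_grams(tokens, common_words):
    """Return the original tokens plus a joined bigram for every adjacent
    pair in which at least one token is a common word.

    Illustrative sketch only: the "_" separator mirrors the usual
    common-grams convention, but this is not Solr's code.
    """
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)  # always keep the plain unigram
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(f"{tok}_{nxt}")  # rarer, more selective term
    return out


# Hypothetical word list based on the terms mentioned in this thread.
COMMON = {"photography", "background"}
grams = common_grams(["stock", "photography", "background"], COMMON)
```

A phrase query can then match the much rarer term stock_photography instead of intersecting the huge postings list for photography. A bare single-term query still hits the plain unigram, which is why this only helps with phrases.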


Re: Performance problems with extremely common terms in collection (Solr 7.4)

Ash Ramesh
Hi Toke,

Thanks for the prompt reply. I'm glad to hear that this is a common
problem. With regard to stop words, I've been thinking about trying that
out. In our business case, most of these terms are keywords related to
stock photography, so it's natural for 'photography' or 'background'
to appear commonly in a document's keyword list. It seems unlikely that we
can use the Common Grams solution with our business case.

Regards,

Ash

Re: Performance problems with extremely common terms in collection (Solr 7.4)

Michael Gibney
In addition to Toke's suggestions (and those in the linked article), some
more ideas:

If single-term, bare queries are slow, it might be productive to check the
configuration and performance of your queryResultCache (I realize this
doesn't directly address the concern of slow queries, but it might
nonetheless be helpful in practice).

If multi-term queries that include these terms are slow, maybe check your
mm (minimum-should-match) config to make sure it's not more inclusive than
necessary for your use case (scoring over the union of docSets/clauses). If
multi-term queries get faster when you disable pf (phrase fields), you could
try disabling main-query pf and invoking implicit phrase search (pseudo-pf)
using ReRankQParser.

If you're able to share your configs (built queries, indexing/fieldType
config (positions, payloads?), etc.), that might enable more specific
advice.

I'm assuming the query times posted are for queries that isolate the
performance of the main query only (i.e., no other components, such as
facets)?
Michael
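
The pseudo-pf suggestion above could look roughly like the following Python sketch, which builds request parameters for an edismax query with no pf and a phrase re-rank over the top hits. The field name keywords, the mm value, and the re-rank numbers are assumptions for illustration and are not taken from the poster's configuration. It also assumes the request handler defaults do not set pf themselves.

```python
from urllib.parse import urlencode


def build_rerank_params(user_query, field="keywords",
                        rerank_docs=200, rerank_weight=3.0):
    """Build Solr request parameters: a main edismax query without pf,
    plus a phrase query applied only to the top `rerank_docs` hits.

    `field` and the numeric defaults are hypothetical example values.
    """
    phrase = user_query.replace('"', " ").strip()  # avoid breaking the quoted phrase
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": field,
        "mm": "100%",  # example value: require all clauses, keeping the scored set small
        # pf is deliberately not set; phrase matching happens in the re-rank step
        "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={rerank_docs} "
              f"reRankWeight={rerank_weight}}}",
        "rqq": f'{field}:"{phrase}"',
    }


params = build_rerank_params("stock photography background")
query_string = urlencode(params)  # append to .../select? on your collection
```

The point of this design is that the expensive phrase (positional) matching runs over a few hundred documents per shard instead of every document that contains the common terms.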


Re: Performance problems with extremely common terms in collection (Solr 7.4)

Diego Ceccarelli
Another way to make queries faster is, if you can, to identify a subset of
documents that are generally relevant for your users (most recent, most
browsed, etc.), index those documents into a separate collection, and then
query the small collection first, falling back to the full one if the small
one doesn't return enough documents. (Caveat: the small collection could
affect the ranking, because all term statistics will be different.)

Cheers,
Diego
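
A minimal sketch of the two-tier lookup described above, in Python. The two search callables stand in for whatever client code queries the small and the full collection. They, the function name, and the min_hits threshold are hypothetical names introduced for this example.

```python
from typing import Callable, List


def tiered_search(query: str,
                  search_small: Callable[[str], List[str]],
                  search_full: Callable[[str], List[str]],
                  min_hits: int = 10) -> List[str]:
    """Query the small "hot" collection first and fall back to the full
    collection only when the small one returns too few documents."""
    hits = search_small(query)
    if len(hits) >= min_hits:
        return hits
    # Not enough results: pay the cost of the full collection.
    return search_full(query)


# Stand-in backends for demonstration; real ones would call Solr.
def fake_small(q: str) -> List[str]:
    return ["s1", "s2", "s3"] if q == "photography" else []


def fake_full(q: str) -> List[str]:
    return ["f1", "f2", "f3", "f4"]


common = tiered_search("photography", fake_small, fake_full, min_hits=3)
rare = tiered_search("axolotl", fake_small, fake_full, min_hits=3)
```

As noted in the caveat, scores from the two collections are not directly comparable, so this sketch returns results from one tier or the other rather than merging them.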
