CommonTerms & slow queries

CommonTerms & slow queries

Erie Data Systems
Using Solr 8.0.0, single instance, single core, 50m records (38 GB index)
on one SSD, 96 GB RAM, 16 CPU cores.

Most queries run very fast (<1 sec); however, we have noticed that queries
containing "common" words are quite slow, sometimes 10+ sec. We are currently
using edismax with 2 text_general fields in qf and pf, with qs=0 and ps=0.

I came across these which describe the issue.
https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

https://lucene.apache.org/core/5_5_3/queries/org/apache/lucene/queries/CommonTermsQuery.html

Test queries with issues:
1. things to do in seattle with eric
2. year of the cat
3. time of my life
4. when will i be loved
5. once upon a time in the west

Stopwords are not an option: in the case of #2, removing "of" and "the"
essentially destroys relevance. Is there a commonly suggested solution to
what would seem to be a common issue, other than adding stopwords?

Thank you.
Craig Stadler

Re: CommonTerms & slow queries

Michael Gibney
Can you post the query that's actually built for some of these inputs
("parsedquery" or "parsedquery_toString" output included for requests with
"debug=query" parameter)? What is performance like if you turn off pf
(i.e., no implicit phrase searching)?
Michael
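
A minimal sketch of the comparison being asked for here: build the same edismax request twice, once with pf and once without, both with debug=query so that the response includes "parsedquery". The base URL, the core name "mycore", and the pf value are assumptions for illustration; the field names are taken from the request posted later in this thread.

```python
from urllib.parse import urlencode

# Hypothetical base URL and core name; adjust for the real deployment.
BASE_URL = "http://localhost:8983/solr/mycore"


def build_select_url(q, use_pf=True):
    """Build an edismax request URL that asks Solr for query debug output."""
    params = {
        "defType": "edismax",
        "q": q,
        "qf": "title description",
        "debug": "query",  # response will include parsedquery / parsedquery_toString
        "wt": "json",
    }
    if use_pf:
        # Assumption: pf uses the same fields as qf, with ps=0 as in the original post.
        params["pf"] = "title description"
        params["ps"] = "0"
    return f"{BASE_URL}/select?{urlencode(params)}"


with_pf = build_select_url("year of the cat")
without_pf = build_select_url("year of the cat", use_pf=False)
print(with_pf)
print(without_pf)
```

Requesting both URLs and comparing QTime in the response header gives a rough idea of how much of the cost comes from the implicit phrase query.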

On Fri, Mar 29, 2019 at 11:53 AM Erie Data Systems <[hidden email]>
wrote:

> [...]

Re: CommonTerms & slow queries

Erie Data Systems
Michael,

select/?&rows=12&qf=title+description&q=once+upon+a+time+in+the+west&fl=*&hl=true&hl.field=desc&hl.fragsize=250&hl.maxAnalyzedChars=200000&ps=1&qs=1&df=title&mm=2&defType=edismax&debugQuery=off&indent=on&wt=json&debug=true
    "rawquerystring":"once upon a time in the west",
    "querystring":"once upon a time in the west",
    "parsedquery":"+(DisjunctionMaxQuery((description:once | title:once))
DisjunctionMaxQuery((description:upon | title:upon))
DisjunctionMaxQuery((description:a | title:a))
DisjunctionMaxQuery((description:time | title:time))
DisjunctionMaxQuery((description:in | title:in))
DisjunctionMaxQuery((description:the | title:the))
DisjunctionMaxQuery((description:west | title:west)))~2",
    "parsedquery_toString":"+(((description:once | title:once)
(description:upon | title:upon) (description:a | title:a) (description:time
| title:time) (description:in | title:in) (description:the | title:the)
(description:west | title:west))~2)"

Removing pf cuts the time almost in half, but it's still 5+ sec.

Thank you for your help; I'm more than happy to include more output.
-Craig


On Fri, Mar 29, 2019 at 12:24 PM Michael Gibney <[hidden email]>
wrote:

> [...]

Re: CommonTerms & slow queries

Michael Gibney
You might take a look at CommonGramsFilter (
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-CommonGramsFilter),
especially if you're either not using pf, or if ps=0. An absolute setting
of mm=2 strikes me as unusual (though quite possibly appropriate for your
use case). mm=2 forces scoring of every doc in which >=2 terms match,
which, for any query containing words like "a" and "the", could easily be
the majority of the index.
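
To make the mm=2 point concrete, here is a toy sketch (not Solr code) that counts which documents in a tiny, made-up corpus contain at least mm of the query terms. The corpus and the helper name are hypothetical, and the sketch ignores fields, analysis, and scoring; it only shows how common words inflate the candidate set.

```python
def candidates(docs, query_terms, mm):
    """Return ids of docs containing at least `mm` distinct query terms."""
    terms = set(query_terms)
    return [
        doc_id
        for doc_id, text in docs.items()
        if len(terms & set(text.lower().split())) >= mm
    ]


# Hypothetical five-document corpus, for illustration only.
DOCS = {
    1: "once upon a time in the west",
    2: "the cat sat in a box",
    3: "a guide to the city",
    4: "seattle weather report",
    5: "west coast travel",
}

query = "once upon a time in the west".split()
rare_only = [t for t in query if t not in {"a", "in", "the"}]

print(candidates(DOCS, query, mm=2))      # [1, 2, 3]
print(candidates(DOCS, rare_only, mm=2))  # [1]
```

Docs 2 and 3 qualify only because they contain two of "a", "in", and "the"; on a 50m-record index the same effect would make a large share of the index eligible for scoring.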
Another thought, re: single-core: sharding would allow you to effectively
parallelize query processing to a certain extent, which I expect might
speed things up for your use case.
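
The CommonGramsFilter suggestion above can be illustrated with a simplified model of what the filter pair does. This is a sketch of the idea, not Lucene's implementation: at index time, every token is kept and a bigram is added wherever a token or its neighbour is a common word; at query time, only the bigrams are kept, plus any single words that are not part of a bigram. The common-word list and function names are assumptions for illustration.

```python
COMMON = {"a", "in", "of", "the"}  # hypothetical common-word list


def common_grams(tokens, common=COMMON):
    """Index-time view: unigrams plus bigrams that involve a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out


def common_grams_query(tokens, common=COMMON):
    """Query-time view: bigrams, plus single words not covered by any bigram."""
    out = []
    for i, tok in enumerate(tokens):
        in_prev = i > 0 and (tokens[i - 1] in common or tok in common)
        in_next = i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common)
        if in_next:
            out.append(f"{tok}_{tokens[i + 1]}")
        elif not in_prev:
            out.append(tok)
    return out


tokens = "year of the cat".split()
print(common_grams(tokens))
# ['year', 'year_of', 'of', 'of_the', 'the', 'the_cat', 'cat']
print(common_grams_query(tokens))
# ['year_of', 'of_the', 'the_cat']
```

The query-side terms such as "year_of" and "the_cat" are far rarer than "of" or "the" on their own, so a phrase match has much shorter postings lists to walk. The trade-off is a larger index, since the bigrams are stored alongside the single terms.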

On Fri, Mar 29, 2019 at 1:13 PM Erie Data Systems <[hidden email]>
wrote:

> [...]

Re: CommonTerms & slow queries

Erie Data Systems
All great advice, thanks Michael. Have an excellent weekend! Testing the
common grams.

-Craig