Boolean Query search performance

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Boolean Query search performance

Beard, Brian

I'm using lucene 2.2.0.

I'm in the process of re-writing some queries to build BooleanQueries
instead of using query parser.

Bypassing query parser provides almost an order of magnitude improvement
for very large queries, but then the search performance takes 20-30%
longer. I'm adding boost values as a way to sort (long story, no
workaround I can see for now).

If I do a query.toString(), both queries give different results, which
is probably a clue (additional paren's with the BooleanQuery)

Query.toString the old way using queryParser:
    +(id:1^2.0 id:2 ... ) +type:CORE

Query.toString the new way using BooleanQuery:
    +((id:1^2.0) (id:2) ... ) +type:CORE

Does anyone have any ideas as to why the performance difference in
searching. I would think I could get the search time to be the same,
since after the query is formed everything remains the same. The same
number of hits are going through the hitCollector that gets called after
this, so all variables seem constant. QueryParser must be doing some
optimization I'm not taking advantage of. Code snippet follows.....

The previous way using queryParser was:

  String queryStr = (id:1^2 OR id:2 .... id:n) AND type:CORE
  Query query = parser.parse(queryStr);


The new way using BooleanQuery is...

  BooleanQuery totalBooleanQuery = new BooleanQuery();
  BooleanClause docTypeClause =
    new BooleanClause( new TermQuery(
    new Term("type", "CORE")), BooleanClause.Occur.MUST);

  BooleanQuery idBooleanQuery = new BooleanQuery();
  for (String uid : uniqueIdMap.keySet()) {
        TermQuery termQuery =
         new TermQuery(new Term("id", uid));
        termQuery.setBoost(loopBoostValue);
        idBooleanQuery.add(
         new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
  }

  totalBooleanQuery.add(docTypeClause);
  totalBooleanQuery.add(idBooleanQuery, Occur.MUST);




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Boolean Query search performance

Karl Wettin
Beard, Brian skrev:
> I'm using lucene 2.2.0.
>
> I'm in the process of re-writing some queries to build BooleanQueries
> instead of using query parser.
>
> Bypassing query parser provides almost an order of magnitude improvement
> for very large queries, but then the search performance takes 20-30%
> longer.

Have you benchmarked the time spent creating the query and the time
spent searching using the query seperatly? QueryParser can be sort of
expensive. It also seems to me that you are creating a very large query
given you add one criteria per entity id.

     karl



> I'm adding boost values as a way to sort (long story, no
> workaround I can see for now).
>
> If I do a query.toString(), both queries give different results, which
> is probably a clue (additional paren's with the BooleanQuery)
>
> Query.toString the old way using queryParser:
>     +(id:1^2.0 id:2 ... ) +type:CORE
>
> Query.toString the new way using BooleanQuery:
>     +((id:1^2.0) (id:2) ... ) +type:CORE
>
> Does anyone have any ideas as to why the performance difference in
> searching. I would think I could get the search time to be the same,
> since after the query is formed everything remains the same. The same
> number of hits are going through the hitCollector that gets called after
> this, so all variables seem constant. QueryParser must be doing some
> optimization I'm not taking advantage of. Code snippet follows.....
>
> The previous way using queryParser was:
>
>   String queryStr = (id:1^2 OR id:2 .... id:n) AND type:CORE
>   Query query = parser.parse(queryStr);
>
>
> The new way using BooleanQuery is...
>
>   BooleanQuery totalBooleanQuery = new BooleanQuery();
>   BooleanClause docTypeClause =
>     new BooleanClause( new TermQuery(
>     new Term("type", "CORE")), BooleanClause.Occur.MUST);
>
>   BooleanQuery idBooleanQuery = new BooleanQuery();
>   for (String uid : uniqueIdMap.keySet()) {
> TermQuery termQuery =
>          new TermQuery(new Term("id", uid));
> termQuery.setBoost(loopBoostValue);
> idBooleanQuery.add(
>          new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
>   }
>
>   totalBooleanQuery.add(docTypeClause);
>   totalBooleanQuery.add(idBooleanQuery, Occur.MUST);
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Boolean Query search performance

hossman
In reply to this post by Beard, Brian

: If I do a query.toString(), both queries give different results, which
: is probably a clue (additional paren's with the BooleanQuery)
:
: Query.toString the old way using queryParser:
:     +(id:1^2.0 id:2 ... ) +type:CORE
:
: Query.toString the new way using BooleanQuery:
:     +((id:1^2.0) (id:2) ... ) +type:CORE

i didn't look too closely at the psuedo code you posted, but the
additional parens normally indicates that you are actually creating an
extra layer of BooleanQueries (ie: a BooleanQuery with only one clause for
each term) ... but the rewrite method should optimize those away (even
back in lucene 2.2) ... if you look at query.rewrite(reader).toString()
then the queries *really* should be the same, if they aren't, then that
may be your culprit.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Boolean Query search performance

Eric Th
2008/3/6, Chris Hostetter <[hidden email]>:

>
>
> : If I do a query.toString(), both queries give different results, which
>
> : is probably a clue (additional paren's with the BooleanQuery)
> :
> : Query.toString the old way using queryParser:
> :     +(id:1^2.0 id:2 ... ) +type:CORE
> :
> : Query.toString the new way using BooleanQuery:
> :     +((id:1^2.0) (id:2) ... ) +type:CORE
>
>
> i didn't look too closely at the psuedo code you posted, but the
> additional parens normally indicates that you are actually creating an
> extra layer of BooleanQueries (ie: a BooleanQuery with only one clause for
> each term) ... but the rewrite method should optimize those away (even
> back in lucene 2.2) ... if you look at query.rewrite(reader).toString()
> then the queries *really* should be the same, if they aren't, then that
> may be your culprit.


look here,
parens will also be add is each term has a boost value larger than 1.0.

public String toString(String field) {
    StringBuffer buffer = new StringBuffer();
    boolean needParens=(getBoost() != 1.0) ||
(getMinimumNumberShouldMatch()>0) ;
    if (needParens) {
      buffer.append("(");
    }

    for (int i = 0 ; i < clauses.size(); i++) {
      BooleanClause c = (BooleanClause)clauses.get(i);
      if (c.isProhibited())
        buffer.append("-");
      else if (c.isRequired())
        buffer.append("+");

      Query subQuery = c.getQuery();
      if (subQuery instanceof BooleanQuery) {      // wrap sub-bools in
parens
        buffer.append("(");
        buffer.append(c.getQuery().toString(field));
        buffer.append(")");
      } else
        buffer.append(c.getQuery().toString(field));

      if (i != clauses.size()-1)
        buffer.append(" ");
    }

    if (needParens) {
      buffer.append(")");
    }

    if (getMinimumNumberShouldMatch()>0) {
      buffer.append('~');
      buffer.append(getMinimumNumberShouldMatch());
    }

    if (getBoost() != 1.0f)
    {
      buffer.append(ToStringUtils.boost(getBoost()));
    }

    return buffer.toString();
  }




-Hoss




---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Boolean Query search performance

Beard, Brian
In reply to this post by Beard, Brian
Thanks for all replies.

Today when I printed out the query that's generated it does not have the
extra paren's. And query.rewrite(reader).toString() now gives the same
result as query.toString(). All I can figure is I must have changed
something between starting the email and sending it out. The other
oddity is the performance degradation is not as apparent. I'm wondering
if part of the problem is generating consistent data for comparing
search performance. I do a warmup before actually running the test, but
maybe it's not a good enough way to test.

One additional thing - from an earlier suggestion - is it possible to
add multiple terms per BooleanClause? I tried using TermQuery.combine()
to add in an array of them into one query and making a clause from that,
but there was no difference in performance.

Brian



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Boolean Query search performance

hossman
In reply to this post by Eric Th

: > additional parens normally indicates that you are actually creating an
: > extra layer of BooleanQueries (ie: a BooleanQuery with only one clause for

: look here,
: parens will also be add is each term has a boost value larger than 1.0.

i think you are missreading that code.  the "needParens" variable adds
parens arroudn the *entire* "this" BooleanQuery if "this" has a boost or a
non 0 minShouldMatch value ... BooleanQuery.toString only adds parens
arround individual clauses if those clauses are themselves boolean
queries ... which is the point i was making.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Boolean Query search performance

Beard, Brian
In reply to this post by Beard, Brian
AHA! That is consistent with what is happening now, and explains the
discrepancy.

The original post of parens around each term was because I was adding
them as separate boolean queries, but now with using just the clause the
parens is around the entire clause with the boost.

-----Original Message-----
From: Chris Hostetter [mailto:[hidden email]]
Sent: Friday, March 07, 2008 3:23 PM
To: [hidden email]
Subject: Re: Boolean Query search performance


: > additional parens normally indicates that you are actually creating
an
: > extra layer of BooleanQueries (ie: a BooleanQuery with only one
clause for

: look here,
: parens will also be add is each term has a boost value larger than
1.0.

i think you are missreading that code.  the "needParens" variable adds
parens arroudn the *entire* "this" BooleanQuery if "this" has a boost or
a
non 0 minShouldMatch value ... BooleanQuery.toString only adds parens
arround individual clauses if those clauses are themselves boolean
queries ... which is the point i was making.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]