Quantcast

DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

Bugzilla from bugzilla@apache.org
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG?
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=31841>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND?
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841





------- Additional Comments From [hidden email]  2005-04-27 17:15 -------
Wolf's revisions to my changes to Query.combine() look fine.  The single-query
optimization is good -- my oversight to have not included it originally.  I
don't believe either of the other two changes is necessary, but they are correct:
  1.  Using a flag instead of the labelled loop is a matter of style as Wolf
says, and it's a little less efficent (the biggest effect could be remedied by
one more if (splittable) to avoid unnecessarily copying the clauses of a
BooleanQuery where coord is not disabled).
  2.  Changing BooleanQuery equality to be independent of clause order is
semantically correct, although again it is a little less efficient.  It's only
purpose is to stop a false-negative in the new tests.

Regarding additional test cases, it would be helpful to add the cases I was
concerned about, which are situations where a query can rewrite into different
kinds of fundamental queries depending on the reader.  I believe the only case
where this occurs with the built-in queries is with MultiTermQuery's and
RangeQuery's (where the rewrite depends on how many query clauses are generated
by each reader), and we have covered those cases.  The coord testing in
Query.combine() is designed to handle the case where some query rewrites into a
different kind of BooleanQuery (e.g., an AND), in some readers and not others.
Nothing tests this at present.  A single-term BooleanQuery OR could rewrite into
a BooleanQuery AND, but this would be independent of reader.

Many additional optimizations could be added.  It seems redundant to have
optimizations here and in the rewrite mechanism.  Since we are down to just
Query.combine(), only called from one place, I think a better fix is to change
MultiSearcher to pass the reader as well.  Then Query.combine() could construct
the straightforward BooleanQuery and rewrite it.  All the optimizations would
then go into a single place, the rewrite methods.  Wolf, what do you think of
that approach?


--
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

Wolf Siberski
> ------- Additional Comments From [hidden email]  2005-04-27 17:15 -------
> Wolf's revisions to my changes to Query.combine() look fine.  The single-query
> optimization is good -- my oversight to have not included it originally.  I
> don't believe either of the other two changes is necessary, but they are correct:
>   1.  Using a flag instead of the labelled loop is a matter of style as Wolf
> says, and it's a little less efficent (the biggest effect could be remedied by
> one more if (splittable) to avoid unnecessarily copying the clauses of a
> BooleanQuery where coord is not disabled).
Yep, the additional if... should be added.

>   2.  Changing BooleanQuery equality to be independent of clause order is
> semantically correct, although again it is a little less efficient.  It's only
> purpose is to stop a false-negative in the new tests.
Here I don't agree. The previous implementation was incorrect, and the new
tests did discover that bug. I also considered to correct this by ensuring
a defined order of clauses, or by replacing the vector with a set. That
would have been a bit more performant, but would have needed much more
effort and may have caused unwanted side effects.

In general, IMHO query processing performance is nearly always dominated by
index accesses, and in the few cases where query preparation takes a significant
share, the whole processing will be fast enough anyway. So I don't see a
need to squeeze out the last few processing cycles from query preparation.

> Many additional optimizations could be added.  It seems redundant to have
> optimizations here and in the rewrite mechanism.  Since we are down to just
> Query.combine(), only called from one place, I think a better fix is to change
> MultiSearcher to pass the reader as well.  Then Query.combine() could construct
> the straightforward BooleanQuery and rewrite it.  All the optimizations would
> then go into a single place, the rewrite methods.  Wolf, what do you think of
> that approach?
Yes, there is a problem of code duplication. But I don't yet understand your
proposal. Which reader could the MultiSearcher pass? We only have Searchables
inside of MultiSearcher which don't (and probably shouldn't) expose their readers.

Another way to approach the problem would be to split the rewriting process
into two phases: in the first phase the query is rewritten into a combination
of term queries, and in the second phase this combination is optimized.
The second phase doesn't need the reader anymore. Then the MultiSearcher could
delegate the first phase to its Searchables (as before), combine the resulting
queries by just joining them, and then call the optimization method on the
combined query. If there are no objections I could try if that works.

--Wolf



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

Chuck Williams
Wolf Siberski wrote:

>> ------- Additional Comments From [hidden email]  2005-04-27 17:15
>> -------
>> Wolf's revisions to my changes to Query.combine() look fine.  The
>> single-query
>> optimization is good -- my oversight to have not included it
>> originally.  I
>> don't believe either of the other two changes is necessary, but they
>> are correct:
>>   1.  Using a flag instead of the labelled loop is a matter of style
>> as Wolf
>> says, and it's a little less efficent (the biggest effect could be
>> remedied by
>> one more if (splittable) to avoid unnecessarily copying the clauses of a
>> BooleanQuery where coord is not disabled).
>
> Yep, the additional if... should be added.
>
>>   2.  Changing BooleanQuery equality to be independent of clause
>> order is
>> semantically correct, although again it is a little less efficient.  
>> It's only
>> purpose is to stop a false-negative in the new tests.
>
> Here I don't agree. The previous implementation was incorrect, and the
> new
> tests did discover that bug.

I agree that defining BooleanQuery equality as set equality rather than
sequence equality on the clauses is the semantically superior
definition.  It has the pracitcal benefit of simplifying optimization
and/or making a simple optimization implementation as we have here more
complete.  However, I don't agree with the test's requirement that
MultiSearcher rewritten queries be equal to single-index rewritten
queries when the collections being searched are the same (with the only
difference being that the the collection is distributed across indices
in the MultiSearcher case).  This is a desirable property, but it is not
part of the definition of "correctness".  Correctness should require
that the results be the same (same hits, rankings and scores).  In any
case where the optimizations are not complete, the rewritten queries
could be different.  If we go with this more stringent definition of
correctness then the current implementation is still not correct as I
can construct queries where the rewriting will not be the same due to
missing optimizations.  I think this argument is  academic however and
shouldn't hold us up.

> I also considered to correct this by ensuring
> a defined order of clauses, or by replacing the vector with a set. That
> would have been a bit more performant, but would have needed much more
> effort and may have caused unwanted side effects.
>
> In general, IMHO query processing performance is nearly always
> dominated by
> index accesses, and in the few cases where query preparation takes a
> significant
> share, the whole processing will be fast enough anyway. So I don't see a
> need to squeeze out the last few processing cycles from query
> preparation.
>
>> Many additional optimizations could be added.  It seems redundant to
>> have
>> optimizations here and in the rewrite mechanism.  Since we are down
>> to just
>> Query.combine(), only called from one place, I think a better fix is
>> to change
>> MultiSearcher to pass the reader as well.  Then Query.combine() could
>> construct
>> the straightforward BooleanQuery and rewrite it.  All the
>> optimizations would
>> then go into a single place, the rewrite methods.  Wolf, what do you
>> think of
>> that approach?
>
> Yes, there is a problem of code duplication. But I don't yet
> understand your
> proposal. Which reader could the MultiSearcher pass? We only have
> Searchables
> inside of MultiSearcher which don't (and probably shouldn't) expose
> their readers.
>
> Another way to approach the problem would be to split the rewriting
> process
> into two phases: in the first phase the query is rewritten into a
> combination
> of term queries, and in the second phase this combination is optimized.
> The second phase doesn't need the reader anymore. Then the
> MultiSearcher could
> delegate the first phase to its Searchables (as before), combine the
> resulting
> queries by just joining them, and then call the optimization method on
> the
> combined query. If there are no objections I could try if that works.

This is a good idea.  I think it is a better factoring of the code,
separating the logically distinct processes of query expansion into
primitive queries and query optimization.  It is important to maintain
backward compatibility, so the current rewrite methods would need to
maintain their current semantics, calling new methods to implement each
phase of processing.  To work with the new MultiSearcher, user written
Query subclasses would have to refactor their rewrite methods, but I
think this is acceptable as there are similar requirements already
(e.g., user-written primitive Query subclasses have to now provide an
extractTerms() method).  Things could be made a little simpler by
defining default implementations of the two new methods in Query:  query
expansion would call rewrite() and query optimization would be a no-op.  
Then things would work with MultiSearcher -- it would simply be an
optimization to factor out the optimization parts of rewriting for more
efficient combined queries.

What do the developers think of Wolf's proposal?

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Loading...