debugging query execution plan

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

debugging query execution plan

Michael Sokolov-4
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Tim Casey

Generally you go over the least frequent to most frequent required terms as the pivot.  As you get more and more complicated queries, the ordering of the query results tends to dominate.  This also means there are two types of measurement.  One for the running query and one for query + results in order.

tim

On Thu, May 6, 2021 at 12:20 PM Michael Sokolov <[hidden email]> wrote:
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Adrien Grand
In reply to this post by Michael Sokolov-4
We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.

It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?

For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.

Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Michael Sokolov-4
Thanks Adrien, that is something like what I had in mind. If you are
able to share, that could be very helpful. And -- deleted docs is not
something I had considered, it's possibly a problem here. I'd have to
go check - I think these "filter" Queries were implemented in the
second part of the two-phase iteration.

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:

>
> We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>
> It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>
> For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>
> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
>>
>> Do we have a way to understand how BooleanQuery (and other composite
>> queries) are advancing their child queries? For example, a simple
>> conjunction of two queries advances the more restrictive (lower
>> cost()) query first, enabling the more costly query to skip over more
>> documents. But we may not be making the best choice in every case, and
>> I would like to know, for some query, how we are doing. For example,
>> we could execute in a debugging mode, interposing something that wraps
>> or observes the Scorers in some way, gathering statistics about how
>> many documents are visited by each Scorer, which can be aggregated for
>> later analysis.
>>
>> This is motivated by a use case we have in which we currently
>> post-filter our query results in a custom collector using some filters
>> that we know to be expensive (they must be evaluated on every
>> document), but we would rather express these post-filters as Queries
>> and have them advanced during the main Query execution. However when
>> we tried to do that, we saw some slowdowns (in spite of marking these
>> Queries as high-cost) and I suspect it is due to the iteration order,
>> but I'm not sure how to debug.
>>
>> Suggestions welcome!
>>
>> -Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

David Smiley
Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?

~ David Smiley
Apache Lucene/Solr Search Developer


On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[hidden email]> wrote:
Thanks Adrien, that is something like what I had in mind. If you are
able to share, that could be very helpful. And -- deleted docs is not
something I had considered, it's possibly a problem here. I'd have to
go check - I think these "filter" Queries were implemented in the
second part of the two-phase iteration.

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
>
> We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>
> It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>
> For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>
> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
>>
>> Do we have a way to understand how BooleanQuery (and other composite
>> queries) are advancing their child queries? For example, a simple
>> conjunction of two queries advances the more restrictive (lower
>> cost()) query first, enabling the more costly query to skip over more
>> documents. But we may not be making the best choice in every case, and
>> I would like to know, for some query, how we are doing. For example,
>> we could execute in a debugging mode, interposing something that wraps
>> or observes the Scorers in some way, gathering statistics about how
>> many documents are visited by each Scorer, which can be aggregated for
>> later analysis.
>>
>> This is motivated by a use case we have in which we currently
>> post-filter our query results in a custom collector using some filters
>> that we know to be expensive (they must be evaluated on every
>> document), but we would rather express these post-filters as Queries
>> and have them advanced during the main Query execution. However when
>> we tried to do that, we saw some slowdowns (in spite of marking these
>> Queries as high-cost) and I suspect it is due to the iteration order,
>> but I'm not sure how to debug.
>>
>> Suggestions welcome!
>>
>> -Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Greg Miller
Just chiming in here to answer David's question since I have some familiarity:

In this specific case, the logic was implemented inside a Collector
and we tried to move it into a Query abstraction using a
TwoPhaseIterator with a high matchCost. The first-phase would match on
all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
second phase would do the costly check. The matchCost was advertised
as reader.maxDoc(). ("reader" in this example is from the
LeafReaderContext).

Moving the logic behind a Query abstraction caused performance
regressions. So one theory is that it was somehow leading iteration
with an expensive "match all docs" DISI, but we don't actually know if
that's true right now.

Cheers,
-Greg

On Fri, May 7, 2021 at 8:41 AM David Smiley <[hidden email]> wrote:

>
> Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[hidden email]> wrote:
>>
>> Thanks Adrien, that is something like what I had in mind. If you are
>> able to share, that could be very helpful. And -- deleted docs is not
>> something I had considered, it's possibly a problem here. I'd have to
>> go check - I think these "filter" Queries were implemented in the
>> second part of the two-phase iteration.
>>
>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
>> >
>> > We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>> >
>> > It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>> >
>> > For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>> >
>> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
>> >>
>> >> Do we have a way to understand how BooleanQuery (and other composite
>> >> queries) are advancing their child queries? For example, a simple
>> >> conjunction of two queries advances the more restrictive (lower
>> >> cost()) query first, enabling the more costly query to skip over more
>> >> documents. But we may not be making the best choice in every case, and
>> >> I would like to know, for some query, how we are doing. For example,
>> >> we could execute in a debugging mode, interposing something that wraps
>> >> or observes the Scorers in some way, gathering statistics about how
>> >> many documents are visited by each Scorer, which can be aggregated for
>> >> later analysis.
>> >>
>> >> This is motivated by a use case we have in which we currently
>> >> post-filter our query results in a custom collector using some filters
>> >> that we know to be expensive (they must be evaluated on every
>> >> document), but we would rather express these post-filters as Queries
>> >> and have them advanced during the main Query execution. However when
>> >> we tried to do that, we saw some slowdowns (in spite of marking these
>> >> Queries as high-cost) and I suspect it is due to the iteration order,
>> >> but I'm not sure how to debug.
>> >>
>> >> Suggestions welcome!
>> >>
>> >> -Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

David Smiley
Thanks for the clarification Greg. I've been looking into this recently and filed https://issues.apache.org/jira/browse/LUCENE-9938 based on a hunch that these DocIdSetIterator.all(maxDoc) iterators have a non-negligible cost inside ConjunctionDISI.  Ultimately I closed the issue because the TPI design seems to prohibit removing them  :-(.  Feel free to comment there nonetheless if you have any thoughts on the matter.  For my part, I have some benchmarking to do in Solr for a related matter that would move certain queries that work at the collector stage to be TPIs -- SOLR-14164.

~ David Smiley
Apache Lucene/Solr Search Developer


On Fri, May 7, 2021 at 6:06 PM Greg Miller <[hidden email]> wrote:
Just chiming in here to answer David's question since I have some familiarity:

In this specific case, the logic was implemented inside a Collector
and we tried to move it into a Query abstraction using a
TwoPhaseIterator with a high matchCost. The first-phase would match on
all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
second phase would do the costly check. The matchCost was advertised
as reader.maxDoc(). ("reader" in this example is from the
LeafReaderContext).

Moving the logic behind a Query abstraction caused performance
regressions. So one theory is that it was somehow leading iteration
with an expensive "match all docs" DISI, but we don't actually know if
that's true right now.

Cheers,
-Greg

On Fri, May 7, 2021 at 8:41 AM David Smiley <[hidden email]> wrote:
>
> Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[hidden email]> wrote:
>>
>> Thanks Adrien, that is something like what I had in mind. If you are
>> able to share, that could be very helpful. And -- deleted docs is not
>> something I had considered, it's possibly a problem here. I'd have to
>> go check - I think these "filter" Queries were implemented in the
>> second part of the two-phase iteration.
>>
>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
>> >
>> > We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>> >
>> > It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>> >
>> > For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>> >
>> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
>> >>
>> >> Do we have a way to understand how BooleanQuery (and other composite
>> >> queries) are advancing their child queries? For example, a simple
>> >> conjunction of two queries advances the more restrictive (lower
>> >> cost()) query first, enabling the more costly query to skip over more
>> >> documents. But we may not be making the best choice in every case, and
>> >> I would like to know, for some query, how we are doing. For example,
>> >> we could execute in a debugging mode, interposing something that wraps
>> >> or observes the Scorers in some way, gathering statistics about how
>> >> many documents are visited by each Scorer, which can be aggregated for
>> >> later analysis.
>> >>
>> >> This is motivated by a use case we have in which we currently
>> >> post-filter our query results in a custom collector using some filters
>> >> that we know to be expensive (they must be evaluated on every
>> >> document), but we would rather express these post-filters as Queries
>> >> and have them advanced during the main Query execution. However when
>> >> we tried to do that, we saw some slowdowns (in spite of marking these
>> >> Queries as high-cost) and I suspect it is due to the iteration order,
>> >> but I'm not sure how to debug.
>> >>
>> >> Suggestions welcome!
>> >>
>> >> -Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Greg Miller
Thanks David, I took a look at the Jira and see the problem with
trying to do away with the "all docs" approximation. We'll be
revisiting our performance regression on our side soon, so hopefully
we'll get to the bottom of what's going on, particularly around
whether-or-not we're somehow leading iteration with this "all docs"
approximation. If we find something interesting, I'll report back.
Thanks again!

Cheers,
-Greg

On Fri, May 7, 2021 at 6:33 PM David Smiley <[hidden email]> wrote:

>
> Thanks for the clarification Greg. I've been looking into this recently and filed https://issues.apache.org/jira/browse/LUCENE-9938 based on a hunch that these DocIdSetIterator.all(maxDoc) iterators have a non-negligible cost inside ConjunctionDISI.  Ultimately I closed the issue because the TPI design seems to prohibit removing them  :-(.  Feel free to comment there nonetheless if you have any thoughts on the matter.  For my part, I have some benchmarking to do in Solr for a related matter that would move certain queries that work at the collector stage to be TPIs -- SOLR-14164.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, May 7, 2021 at 6:06 PM Greg Miller <[hidden email]> wrote:
>>
>> Just chiming in here to answer David's question since I have some familiarity:
>>
>> In this specific case, the logic was implemented inside a Collector
>> and we tried to move it into a Query abstraction using a
>> TwoPhaseIterator with a high matchCost. The first-phase would match on
>> all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
>> second phase would do the costly check. The matchCost was advertised
>> as reader.maxDoc(). ("reader" in this example is from the
>> LeafReaderContext).
>>
>> Moving the logic behind a Query abstraction caused performance
>> regressions. So one theory is that it was somehow leading iteration
>> with an expensive "match all docs" DISI, but we don't actually know if
>> that's true right now.
>>
>> Cheers,
>> -Greg
>>
>> On Fri, May 7, 2021 at 8:41 AM David Smiley <[hidden email]> wrote:
>> >
>> > Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?
>> >
>> > ~ David Smiley
>> > Apache Lucene/Solr Search Developer
>> > http://www.linkedin.com/in/davidwsmiley
>> >
>> >
>> > On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[hidden email]> wrote:
>> >>
>> >> Thanks Adrien, that is something like what I had in mind. If you are
>> >> able to share, that could be very helpful. And -- deleted docs is not
>> >> something I had considered, it's possibly a problem here. I'd have to
>> >> go check - I think these "filter" Queries were implemented in the
>> >> second part of the two-phase iteration.
>> >>
>> >> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
>> >> >
>> >> > We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>> >> >
>> >> > It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>> >> >
>> >> > For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>> >> >
>> >> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
>> >> >>
>> >> >> Do we have a way to understand how BooleanQuery (and other composite
>> >> >> queries) are advancing their child queries? For example, a simple
>> >> >> conjunction of two queries advances the more restrictive (lower
>> >> >> cost()) query first, enabling the more costly query to skip over more
>> >> >> documents. But we may not be making the best choice in every case, and
>> >> >> I would like to know, for some query, how we are doing. For example,
>> >> >> we could execute in a debugging mode, interposing something that wraps
>> >> >> or observes the Scorers in some way, gathering statistics about how
>> >> >> many documents are visited by each Scorer, which can be aggregated for
>> >> >> later analysis.
>> >> >>
>> >> >> This is motivated by a use case we have in which we currently
>> >> >> post-filter our query results in a custom collector using some filters
>> >> >> that we know to be expensive (they must be evaluated on every
>> >> >> document), but we would rather express these post-filters as Queries
>> >> >> and have them advanced during the main Query execution. However when
>> >> >> we tried to do that, we saw some slowdowns (in spite of marking these
>> >> >> Queries as high-cost) and I suspect it is due to the iteration order,
>> >> >> but I'm not sure how to debug.
>> >> >>
>> >> >> Suggestions welcome!
>> >> >>
>> >> >> -Mike
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: [hidden email]
>> >> >> For additional commands, e-mail: [hidden email]
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Michael McCandless-2
In reply to this post by Adrien Grand
+1 to start from the Elasticsearch implementation for low-level query execution tracing, which I think is from (pre-7.10) ASL2 licensed code?

That sounds helpful, even with the Heisenberg caveats.

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.

It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?

For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.

Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Adrien Grand
Indeed this is code is ASL2 pre-7.10, but I wouldn't have expected any concerns regardless. Jack volunteered to bring this code to Lucene by removing the Elasticsearch-specific bits.

On Mon, May 10, 2021 at 4:55 PM Michael McCandless <[hidden email]> wrote:
+1 to start from the Elasticsearch implementation for low-level query execution tracing, which I think is from (pre-7.10) ASL2 licensed code?

That sounds helpful, even with the Heisenberg caveats.

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.

It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?

For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.

Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]



--
Adrien
Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Adrien Grand
FYI this got just checked in: https://issues.apache.org/jira/browse/LUCENE-9965.

I'd be curious to know if it helps with your problem, Mike.

On Wed, May 12, 2021 at 1:54 PM Adrien Grand <[hidden email]> wrote:
Indeed this is code is ASL2 pre-7.10, but I wouldn't have expected any concerns regardless. Jack volunteered to bring this code to Lucene by removing the Elasticsearch-specific bits.

On Mon, May 10, 2021 at 4:55 PM Michael McCandless <[hidden email]> wrote:
+1 to start from the Elasticsearch implementation for low-level query execution tracing, which I think is from (pre-7.10) ASL2 licensed code?

That sounds helpful, even with the Heisenberg caveats.

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.

It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?

For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.

Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]



--
Adrien


--
Adrien
Reply | Threaded
Open this post in threaded view
|

Re: debugging query execution plan

Michael Sokolov-4
Yes, I'm looking forward to checking it out, and really appreciate the
work to bring it here! I'm going to be out in the woods for a few
days, but it's on my list to try this out when I get back

On Wed, Jun 9, 2021 at 5:41 PM Adrien Grand <[hidden email]> wrote:

>
> FYI this got just checked in: https://issues.apache.org/jira/browse/LUCENE-9965.
>
> I'd be curious to know if it helps with your problem, Mike.
>
> On Wed, May 12, 2021 at 1:54 PM Adrien Grand <[hidden email]> wrote:
>>
>> Indeed this is code is ASL2 pre-7.10, but I wouldn't have expected any concerns regardless. Jack volunteered to bring this code to Lucene by removing the Elasticsearch-specific bits.
>>
>> On Mon, May 10, 2021 at 4:55 PM Michael McCandless <[hidden email]> wrote:
>>>
>>> +1 to start from the Elasticsearch implementation for low-level query execution tracing, which I think is from (pre-7.10) ASL2 licensed code?
>>>
>>> That sounds helpful, even with the Heisenberg caveats.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[hidden email]> wrote:
>>>>
>>>> We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>>>>
>>>> It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>>>>
>>>> For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>>>>
>>>> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[hidden email]> a écrit :
>>>>>
>>>>> Do we have a way to understand how BooleanQuery (and other composite
>>>>> queries) are advancing their child queries? For example, a simple
>>>>> conjunction of two queries advances the more restrictive (lower
>>>>> cost()) query first, enabling the more costly query to skip over more
>>>>> documents. But we may not be making the best choice in every case, and
>>>>> I would like to know, for some query, how we are doing. For example,
>>>>> we could execute in a debugging mode, interposing something that wraps
>>>>> or observes the Scorers in some way, gathering statistics about how
>>>>> many documents are visited by each Scorer, which can be aggregated for
>>>>> later analysis.
>>>>>
>>>>> This is motivated by a use case we have in which we currently
>>>>> post-filter our query results in a custom collector using some filters
>>>>> that we know to be expensive (they must be evaluated on every
>>>>> document), but we would rather express these post-filters as Queries
>>>>> and have them advanced during the main Query execution. However when
>>>>> we tried to do that, we saw some slowdowns (in spite of marking these
>>>>> Queries as high-cost) and I suspect it is due to the iteration order,
>>>>> but I'm not sure how to debug.
>>>>>
>>>>> Suggestions welcome!
>>>>>
>>>>> -Mike
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [hidden email]
>>>>> For additional commands, e-mail: [hidden email]
>>>>>
>>
>>
>> --
>> Adrien
>
>
>
> --
> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]