CompiledAutomaton performance issue

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

CompiledAutomaton performance issue

Poppe, Thomas (IP&Science)
Hello,

We're using the automaton package as part of Elasticsearch for doing regexp queries.  Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem):

        (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d

With a large enough value of maxDeterminizedStates, this works.  The problem we're having is that the conversion of this regular expression to a CompiledAutomaton takes very long.  Almost all of the time goes into determining the common suffix for the Automaton (which is "d" in this example) - calculated with a call to Operations.getCommonSuffixBytesRef.  

If my understanding is correct, this suffix is only used as an optimization (is this correct?).  Skipping the calculation of this suffix allows us to process these kinds of queries.

So here are my questions:
- Would it be possible to introduce a way to skip the calculation of this common suffix (ideally something we control from within our query to Elasticsearch)?
- Or would it be possible to take a look at this getCommonSuffixBytesRef operation, to see if it can be optimized?  Most of the time goes to determinizing the reversed automaton - maybe this can be avoided somehow?
- Does anyone have any other suggestions?  We've tried to reduce the complexity of the query, but it is something we would really like to support.

Thanks,
Thomas Poppe


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: CompiledAutomaton performance issue

Michael McCandless-2
This is just an optimization; maybe we should expose an option to disable
it?

Or maybe we can find the common suffix on an NFA instead, to avoid
determinization?

Can you open a Jira issue so we can discuss options?

Thanks,

Mike McCandless

http://blog.mikemccandless.com

On Fri, Dec 15, 2017 at 5:38 AM, Poppe, Thomas (IP&Science) <
[hidden email]> wrote:

> Hello,
>
> We're using the automaton package as part of Elasticsearch for doing
> regexp queries.  Our business requires us to process rather complex regular
> expressions, for example (we have more complex examples, but this one
> illustrates the problem):
>
>         (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d
>
> With a large enough value of maxDeterminizedStates, this works.  The
> problem we're having is that the conversion of this regular expression to a
> CompiledAutomaton takes very long.  Almost all of the time goes into
> determining the common suffix for the Automaton (which is "d" in this
> example) - calculated with a call to Operations.getCommonSuffixBytesRef.
>
> If my understanding is correct, this suffix is only used as an
> optimization (is this correct?).  Skipping the calculation of this suffix
> allows us to process these kinds of queries.
>
> So here are my questions:
> - Would it be possible to introduce a way to skip the calculation of this
> common suffix (ideally something we control from within our query to
> Elasticsearch)?
> - Or would it be possible to take a look at this getCommonSuffixBytesRef
> operation, to see if it can be optimized?  Most of the time goes to
> determinizing the reversed automaton - maybe this can be avoided somehow?
> - Does anyone have any other suggestions?  We've tried to reduce the
> complexity of the query, but it is something we would really like to
> support.
>
> Thanks,
> Thomas Poppe
>
>
Reply | Threaded
Open this post in threaded view
|

Re: CompiledAutomaton performance issue

Poppe, Thomas (IP&Science)
Hi Mike,

Thanks for your response.  I opened Jira issue 8102 for this (https://issues.apache.org/jira/browse/LUCENE-8102).  I included your suggestions below in the description of the issue.

Regards,
Thomas

From: Michael McCandless <[hidden email]>
Date: Sunday, 17 December 2017 at 14:19
To: Lucene Users <[hidden email]>, "Poppe, Thomas (IP&Science)" <[hidden email]>
Subject: Re: CompiledAutomaton performance issue

This is just an optimization; maybe we should expose an option to disable it?

Or maybe we can find the common suffix on an NFA instead, to avoid determinization?

Can you open a Jira issue so we can discuss options?

Thanks,

Mike McCandless

http://blog.mikemccandless.com

On Fri, Dec 15, 2017 at 5:38 AM, Poppe, Thomas (IP&Science) <[hidden email]<mailto:[hidden email]>> wrote:
Hello,

We're using the automaton package as part of Elasticsearch for doing regexp queries.  Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem):

        (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d

With a large enough value of maxDeterminizedStates, this works.  The problem we're having is that the conversion of this regular expression to a CompiledAutomaton takes very long.  Almost all of the time goes into determining the common suffix for the Automaton (which is "d" in this example) - calculated with a call to Operations.getCommonSuffixBytesRef.

If my understanding is correct, this suffix is only used as an optimization (is this correct?).  Skipping the calculation of this suffix allows us to process these kinds of queries.

So here are my questions:
- Would it be possible to introduce a way to skip the calculation of this common suffix (ideally something we control from within our query to Elasticsearch)?
- Or would it be possible to take a look at this getCommonSuffixBytesRef operation, to see if it can be optimized?  Most of the time goes to determinizing the reversed automaton - maybe this can be avoided somehow?
- Does anyone have any other suggestions?  We've tried to reduce the complexity of the query, but it is something we would really like to support.

Thanks,
Thomas Poppe