Quantcast

Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Oliver Mannion
Hi there,

I'd like to construct an Automaton to prefix match against a large set of
strings. I gather a RunAutomation is immutable, thread safe and faster than
Automaton. Are there any other differences between the three Automaton
classes, for example, in memory usage?

Would the general approach for such a problem be to use
DaciukMihovAutomatonBuilder to create an Automaton from the sorted list of
my string set, set all the states to accept (to enable prefix matching),
then pass the Automaton into the constructor of a CharacterRunAutomaton,
and use the run method on the CharacterRunAutomaton to match any queries?

As it seems like I'm building up the Automaton at least three times, and
keeping a reference to the Automaton in the CharacterRunAutomaton, is this
the most memory efficient way of building such an Automaton?

Thanks in advance,

Oliver
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Michael McCandless-2
On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]> wrote:

> I'd like to construct an Automaton to prefix match against a large set of
> strings. I gather a RunAutomation is immutable, thread safe and faster than
> Automaton.

That's correct.

> Are there any other differences between the three Automaton
> classes, for example, in memory usage?

CompiledAutomaton is just a thin wrapper to hold onto both the
original Automaton and the RunAutomaton, plus some other corner-casey
things that are likely not interesting for your usage.

> Would the general approach for such a problem be to use
> DaciukMihovAutomatonBuilder to create an Automaton from the sorted list of
> my string set, set all the states to accept (to enable prefix matching),
> then pass the Automaton into the constructor of a CharacterRunAutomaton,
> and use the run method on the CharacterRunAutomaton to match any queries?

That sounds right.

You could also try doing everything in UTF8 space instead, and use
ByteRunAutomaton: it will likely be faster since it will do faster
lookups on each transition.  It should still be safe to set all states
as accept, even though some of those states will be inside a single
Unicode character, as long as the strings you test against are whole
UTF-8 sequences?

> As it seems like I'm building up the Automaton at least three times, and
> keeping a reference to the Automaton in the CharacterRunAutomaton, is this
> the most memory efficient way of building such an Automaton?

Yeah, it is.  The RunAutomaton will likely require substantial heap in
your case, probably more than the original automaton.

I suppose you don't actually need to keep the Automaton around once
the RunAutomaton is built, but Lucene doesn't make this possible
today, since the RunAutomaton holds onto the Automaton...

> Thanks in advance,

You're welcome!

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Oliver Mannion
Thanks Mike for getting back to me, sounds like I'm on the right track.

I'm building the automaton from around 1.7million strings, and it ends up
with about 3.8million states and it turns out building a
CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
suprised!), with negligible performance difference at run time. At your
suggestion I tried the ByteRunAutomaton and it was similar to the
CharacterRunAutomaton
in terms of heap and run time.  So for now I'm going to just stick to an
Automaton.

On 14 February 2017 at 00:41, Michael McCandless <[hidden email]>
wrote:

> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
> wrote:
>
> > I'd like to construct an Automaton to prefix match against a large set of
> > strings. I gather a RunAutomation is immutable, thread safe and faster
> than
> > Automaton.
>
> That's correct.
>
> > Are there any other differences between the three Automaton
> > classes, for example, in memory usage?
>
> CompiledAutomaton is just a thin wrapper to hold onto both the
> original Automaton and the RunAutomaton, plus some other corner-casey
> things that are likely not interesting for your usage.
>
> > Would the general approach for such a problem be to use
> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list
> of
> > my string set, set all the states to accept (to enable prefix matching),
> > then pass the Automaton into the constructor of a CharacterRunAutomaton,
> > and use the run method on the CharacterRunAutomaton to match any queries?
>
> That sounds right.
>
> You could also try doing everything in UTF8 space instead, and use
> ByteRunAutomaton: it will likely be faster since it will do faster
> lookups on each transition.  It should still be safe to set all states
> as accept, even though some of those states will be inside a single
> Unicode character, as long as the strings you test against are whole
> UTF-8 sequences?
>
> > As it seems like I'm building up the Automaton at least three times, and
> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
> this
> > the most memory efficient way of building such an Automaton?
>
> Yeah, it is.  The RunAutomaton will likely require substantial heap in
> your case, probably more than the original automaton.
>
> I suppose you don't actually need to keep the Automaton around once
> the RunAutomaton is built, but Lucene doesn't make this possible
> today, since the RunAutomaton holds onto the Automaton...
>
> > Thanks in advance,
>
> You're welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Michael McCandless-2
Wow, 2G heap, that's horrible!

How much heap does the automaton itself take?

You can use the automaton's step method to transition from a state
given the next input character to another state (or -1 if that state
doesn't accept that character); it will be slower than the 2 GB run
automaton, but perhaps fast enough?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]> wrote:

> Thanks Mike for getting back to me, sounds like I'm on the right track.
>
> I'm building the automaton from around 1.7million strings, and it ends up
> with about 3.8million states and it turns out building a
> CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
> suprised!), with negligible performance difference at run time. At your
> suggestion I tried the ByteRunAutomaton and it was similar to the
> CharacterRunAutomaton
> in terms of heap and run time.  So for now I'm going to just stick to an
> Automaton.
>
> On 14 February 2017 at 00:41, Michael McCandless <[hidden email]>
> wrote:
>
>> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
>> wrote:
>>
>> > I'd like to construct an Automaton to prefix match against a large set of
>> > strings. I gather a RunAutomation is immutable, thread safe and faster
>> than
>> > Automaton.
>>
>> That's correct.
>>
>> > Are there any other differences between the three Automaton
>> > classes, for example, in memory usage?
>>
>> CompiledAutomaton is just a thin wrapper to hold onto both the
>> original Automaton and the RunAutomaton, plus some other corner-casey
>> things that are likely not interesting for your usage.
>>
>> > Would the general approach for such a problem be to use
>> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list
>> of
>> > my string set, set all the states to accept (to enable prefix matching),
>> > then pass the Automaton into the constructor of a CharacterRunAutomaton,
>> > and use the run method on the CharacterRunAutomaton to match any queries?
>>
>> That sounds right.
>>
>> You could also try doing everything in UTF8 space instead, and use
>> ByteRunAutomaton: it will likely be faster since it will do faster
>> lookups on each transition.  It should still be safe to set all states
>> as accept, even though some of those states will be inside a single
>> Unicode character, as long as the strings you test against are whole
>> UTF-8 sequences?
>>
>> > As it seems like I'm building up the Automaton at least three times, and
>> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>> this
>> > the most memory efficient way of building such an Automaton?
>>
>> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>> your case, probably more than the original automaton.
>>
>> I suppose you don't actually need to keep the Automaton around once
>> the RunAutomaton is built, but Lucene doesn't make this possible
>> today, since the RunAutomaton holds onto the Automaton...
>>
>> > Thanks in advance,
>>
>> You're welcome!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Oliver Mannion
Hi Mike,

Thanks for the suggestion, I've tried Operations.run on a Automaton and
it's fast enough for my use case.

However, the real problem I have is in building the Automaton
via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
of garbage produced during build (DaciukMihovAutomatonBuilder.States I
think). I'm using this in an app that also serves realtime requests, which
timeout during building of the Automaton. It's an inherited application
design and not the best (ideally I'd be building the Automaton offline in
another process) and I don't expect it's a use case considered for
DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
into other data structures for doing prefix matching. The dk.brics
Automaton has a similar performance profile, while the PatriciaTrie from
Apache Commons Collections seems to consume less CPU and produce less
garbage during build, although it has a less than ideal interface (it's a
Map).

Any further suggestion though would be most welcome!

Regards,

Oliver

On 14 February 2017 at 22:56, Michael McCandless <[hidden email]>
wrote:

> Wow, 2G heap, that's horrible!
>
> How much heap does the automaton itself take?
>
> You can use the automaton's step method to transition from a state
> given the next input character to another state (or -1 if that state
> doesn't accept that character); it will be slower than the 2 GB run
> automaton, but perhaps fast enough?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]>
> wrote:
> > Thanks Mike for getting back to me, sounds like I'm on the right track.
> >
> > I'm building the automaton from around 1.7million strings, and it ends up
> > with about 3.8million states and it turns out building a
> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
> > suprised!), with negligible performance difference at run time. At your
> > suggestion I tried the ByteRunAutomaton and it was similar to the
> > CharacterRunAutomaton
> > in terms of heap and run time.  So for now I'm going to just stick to an
> > Automaton.
> >
> > On 14 February 2017 at 00:41, Michael McCandless <
> [hidden email]>
> > wrote:
> >
> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
> >> wrote:
> >>
> >> > I'd like to construct an Automaton to prefix match against a large
> set of
> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
> >> than
> >> > Automaton.
> >>
> >> That's correct.
> >>
> >> > Are there any other differences between the three Automaton
> >> > classes, for example, in memory usage?
> >>
> >> CompiledAutomaton is just a thin wrapper to hold onto both the
> >> original Automaton and the RunAutomaton, plus some other corner-casey
> >> things that are likely not interesting for your usage.
> >>
> >> > Would the general approach for such a problem be to use
> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
> list
> >> of
> >> > my string set, set all the states to accept (to enable prefix
> matching),
> >> > then pass the Automaton into the constructor of a
> CharacterRunAutomaton,
> >> > and use the run method on the CharacterRunAutomaton to match any
> queries?
> >>
> >> That sounds right.
> >>
> >> You could also try doing everything in UTF8 space instead, and use
> >> ByteRunAutomaton: it will likely be faster since it will do faster
> >> lookups on each transition.  It should still be safe to set all states
> >> as accept, even though some of those states will be inside a single
> >> Unicode character, as long as the strings you test against are whole
> >> UTF-8 sequences?
> >>
> >> > As it seems like I'm building up the Automaton at least three times,
> and
> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
> >> this
> >> > the most memory efficient way of building such an Automaton?
> >>
> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
> >> your case, probably more than the original automaton.
> >>
> >> I suppose you don't actually need to keep the Automaton around once
> >> the RunAutomaton is built, but Lucene doesn't make this possible
> >> today, since the RunAutomaton holds onto the Automaton...
> >>
> >> > Thanks in advance,
> >>
> >> You're welcome!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Dawid Weiss-2
You could try using morfologik's byte-based implementation:

https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java

I can't guarantee it'll be fast enough -- you need to sort those input
sequences and even this may take a while. The construction of the
automaton after that is fairly fast. What are the time limits you have
with respect to input data sizes? Perhaps it's just unrealistic to
assume everything is performed as part of a single request?

Dawid

On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <[hidden email]> wrote:

> Hi Mike,
>
> Thanks for the suggestion, I've tried Operations.run on a Automaton and
> it's fast enough for my use case.
>
> However, the real problem I have is in building the Automaton
> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
> think). I'm using this in an app that also serves realtime requests, which
> timeout during building of the Automaton. It's an inherited application
> design and not the best (ideally I'd be building the Automaton offline in
> another process) and I don't expect it's a use case considered for
> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
> into other data structures for doing prefix matching. The dk.brics
> Automaton has a similar performance profile, while the PatriciaTrie from
> Apache Commons Collections seems to consume less CPU and produce less
> garbage during build, although it has a less than ideal interface (it's a
> Map).
>
> Any further suggestion though would be most welcome!
>
> Regards,
>
> Oliver
>
> On 14 February 2017 at 22:56, Michael McCandless <[hidden email]>
> wrote:
>
>> Wow, 2G heap, that's horrible!
>>
>> How much heap does the automaton itself take?
>>
>> You can use the automaton's step method to transition from a state
>> given the next input character to another state (or -1 if that state
>> doesn't accept that character); it will be slower than the 2 GB run
>> automaton, but perhaps fast enough?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]>
>> wrote:
>> > Thanks Mike for getting back to me, sounds like I'm on the right track.
>> >
>> > I'm building the automaton from around 1.7million strings, and it ends up
>> > with about 3.8million states and it turns out building a
>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
>> > suprised!), with negligible performance difference at run time. At your
>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>> > CharacterRunAutomaton
>> > in terms of heap and run time.  So for now I'm going to just stick to an
>> > Automaton.
>> >
>> > On 14 February 2017 at 00:41, Michael McCandless <
>> [hidden email]>
>> > wrote:
>> >
>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
>> >> wrote:
>> >>
>> >> > I'd like to construct an Automaton to prefix match against a large
>> set of
>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
>> >> than
>> >> > Automaton.
>> >>
>> >> That's correct.
>> >>
>> >> > Are there any other differences between the three Automaton
>> >> > classes, for example, in memory usage?
>> >>
>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>> >> things that are likely not interesting for your usage.
>> >>
>> >> > Would the general approach for such a problem be to use
>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
>> list
>> >> of
>> >> > my string set, set all the states to accept (to enable prefix
>> matching),
>> >> > then pass the Automaton into the constructor of a
>> CharacterRunAutomaton,
>> >> > and use the run method on the CharacterRunAutomaton to match any
>> queries?
>> >>
>> >> That sounds right.
>> >>
>> >> You could also try doing everything in UTF8 space instead, and use
>> >> ByteRunAutomaton: it will likely be faster since it will do faster
>> >> lookups on each transition.  It should still be safe to set all states
>> >> as accept, even though some of those states will be inside a single
>> >> Unicode character, as long as the strings you test against are whole
>> >> UTF-8 sequences?
>> >>
>> >> > As it seems like I'm building up the Automaton at least three times,
>> and
>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>> >> this
>> >> > the most memory efficient way of building such an Automaton?
>> >>
>> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>> >> your case, probably more than the original automaton.
>> >>
>> >> I suppose you don't actually need to keep the Automaton around once
>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>> >> today, since the RunAutomaton holds onto the Automaton...
>> >>
>> >> > Thanks in advance,
>> >>
>> >> You're welcome!
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>> >>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Michael McCandless-2
We may be able to make DaciukMihovAutomatonBuilder's state registry
more ram efficient too ... I think it's essentially the same thing as
the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
automata.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <[hidden email]> wrote:

> You could try using morfologik's byte-based implementation:
>
> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>
> I can't guarantee it'll be fast enough -- you need to sort those input
> sequences and even this may take a while. The construction of the
> automaton after that is fairly fast. What are the time limits you have
> with respect to input data sizes? Perhaps it's just unrealistic to
> assume everything is performed as part of a single request?
>
> Dawid
>
> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <[hidden email]> wrote:
>> Hi Mike,
>>
>> Thanks for the suggestion, I've tried Operations.run on a Automaton and
>> it's fast enough for my use case.
>>
>> However, the real problem I have is in building the Automaton
>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
>> think). I'm using this in an app that also serves realtime requests, which
>> timeout during building of the Automaton. It's an inherited application
>> design and not the best (ideally I'd be building the Automaton offline in
>> another process) and I don't expect it's a use case considered for
>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
>> into other data structures for doing prefix matching. The dk.brics
>> Automaton has a similar performance profile, while the PatriciaTrie from
>> Apache Commons Collections seems to consume less CPU and produce less
>> garbage during build, although it has a less than ideal interface (it's a
>> Map).
>>
>> Any further suggestion though would be most welcome!
>>
>> Regards,
>>
>> Oliver
>>
>> On 14 February 2017 at 22:56, Michael McCandless <[hidden email]>
>> wrote:
>>
>>> Wow, 2G heap, that's horrible!
>>>
>>> How much heap does the automaton itself take?
>>>
>>> You can use the automaton's step method to transition from a state
>>> given the next input character to another state (or -1 if that state
>>> doesn't accept that character); it will be slower than the 2 GB run
>>> automaton, but perhaps fast enough?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]>
>>> wrote:
>>> > Thanks Mike for getting back to me, sounds like I'm on the right track.
>>> >
>>> > I'm building the automaton from around 1.7million strings, and it ends up
>>> > with about 3.8million states and it turns out building a
>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
>>> > suprised!), with negligible performance difference at run time. At your
>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>>> > CharacterRunAutomaton
>>> > in terms of heap and run time.  So for now I'm going to just stick to an
>>> > Automaton.
>>> >
>>> > On 14 February 2017 at 00:41, Michael McCandless <
>>> [hidden email]>
>>> > wrote:
>>> >
>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
>>> >> wrote:
>>> >>
>>> >> > I'd like to construct an Automaton to prefix match against a large
>>> set of
>>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
>>> >> than
>>> >> > Automaton.
>>> >>
>>> >> That's correct.
>>> >>
>>> >> > Are there any other differences between the three Automaton
>>> >> > classes, for example, in memory usage?
>>> >>
>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>>> >> things that are likely not interesting for your usage.
>>> >>
>>> >> > Would the general approach for such a problem be to use
>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
>>> list
>>> >> of
>>> >> > my string set, set all the states to accept (to enable prefix
>>> matching),
>>> >> > then pass the Automaton into the constructor of a
>>> CharacterRunAutomaton,
>>> >> > and use the run method on the CharacterRunAutomaton to match any
>>> queries?
>>> >>
>>> >> That sounds right.
>>> >>
>>> >> You could also try doing everything in UTF8 space instead, and use
>>> >> ByteRunAutomaton: it will likely be faster since it will do faster
>>> >> lookups on each transition.  It should still be safe to set all states
>>> >> as accept, even though some of those states will be inside a single
>>> >> Unicode character, as long as the strings you test against are whole
>>> >> UTF-8 sequences?
>>> >>
>>> >> > As it seems like I'm building up the Automaton at least three times,
>>> and
>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>>> >> this
>>> >> > the most memory efficient way of building such an Automaton?
>>> >>
>>> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>>> >> your case, probably more than the original automaton.
>>> >>
>>> >> I suppose you don't actually need to keep the Automaton around once
>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>>> >> today, since the RunAutomaton holds onto the Automaton...
>>> >>
>>> >> > Thanks in advance,
>>> >>
>>> >> You're welcome!
>>> >>
>>> >> Mike McCandless
>>> >>
>>> >> http://blog.mikemccandless.com
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: [hidden email]
>>> >> For additional commands, e-mail: [hidden email]
>>> >>
>>> >>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Dawid Weiss-2
Yep, true. I just wonder whether it's worth complicating the code...
Could be easier to build an FST<Void> and then recreate a RunAutomaton
from that directly... :)

Dawid

On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless
<[hidden email]> wrote:

> We may be able to make DaciukMihovAutomatonBuilder's state registry
> more ram efficient too ... I think it's essentially the same thing as
> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> automata.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <[hidden email]> wrote:
>> You could try using morfologik's byte-based implementation:
>>
>> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>>
>> I can't guarantee it'll be fast enough -- you need to sort those input
>> sequences and even this may take a while. The construction of the
>> automaton after that is fairly fast. What are the time limits you have
>> with respect to input data sizes? Perhaps it's just unrealistic to
>> assume everything is performed as part of a single request?
>>
>> Dawid
>>
>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <[hidden email]> wrote:
>>> Hi Mike,
>>>
>>> Thanks for the suggestion, I've tried Operations.run on a Automaton and
>>> it's fast enough for my use case.
>>>
>>> However, the real problem I have is in building the Automaton
>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
>>> think). I'm using this in an app that also serves realtime requests, which
>>> timeout during building of the Automaton. It's an inherited application
>>> design and not the best (ideally I'd be building the Automaton offline in
>>> another process) and I don't expect it's a use case considered for
>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
>>> into other data structures for doing prefix matching. The dk.brics
>>> Automaton has a similar performance profile, while the PatriciaTrie from
>>> Apache Commons Collections seems to consume less CPU and produce less
>>> garbage during build, although it has a less than ideal interface (it's a
>>> Map).
>>>
>>> Any further suggestion though would be most welcome!
>>>
>>> Regards,
>>>
>>> Oliver
>>>
>>> On 14 February 2017 at 22:56, Michael McCandless <[hidden email]>
>>> wrote:
>>>
>>>> Wow, 2G heap, that's horrible!
>>>>
>>>> How much heap does the automaton itself take?
>>>>
>>>> You can use the automaton's step method to transition from a state
>>>> given the next input character to another state (or -1 if that state
>>>> doesn't accept that character); it will be slower than the 2 GB run
>>>> automaton, but perhaps fast enough?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]>
>>>> wrote:
>>>> > Thanks Mike for getting back to me, sounds like I'm on the right track.
>>>> >
>>>> > I'm building the automaton from around 1.7million strings, and it ends up
>>>> > with about 3.8million states and it turns out building a
>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
>>>> > suprised!), with negligible performance difference at run time. At your
>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>>>> > CharacterRunAutomaton
>>>> > in terms of heap and run time.  So for now I'm going to just stick to an
>>>> > Automaton.
>>>> >
>>>> > On 14 February 2017 at 00:41, Michael McCandless <
>>>> [hidden email]>
>>>> > wrote:
>>>> >
>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
>>>> >> wrote:
>>>> >>
>>>> >> > I'd like to construct an Automaton to prefix match against a large
>>>> set of
>>>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
>>>> >> than
>>>> >> > Automaton.
>>>> >>
>>>> >> That's correct.
>>>> >>
>>>> >> > Are there any other differences between the three Automaton
>>>> >> > classes, for example, in memory usage?
>>>> >>
>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>>>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>>>> >> things that are likely not interesting for your usage.
>>>> >>
>>>> >> > Would the general approach for such a problem be to use
>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
>>>> list
>>>> >> of
>>>> >> > my string set, set all the states to accept (to enable prefix
>>>> matching),
>>>> >> > then pass the Automaton into the constructor of a
>>>> CharacterRunAutomaton,
>>>> >> > and use the run method on the CharacterRunAutomaton to match any
>>>> queries?
>>>> >>
>>>> >> That sounds right.
>>>> >>
>>>> >> You could also try doing everything in UTF8 space instead, and use
>>>> >> ByteRunAutomaton: it will likely be faster since it will do faster
>>>> >> lookups on each transition.  It should still be safe to set all states
>>>> >> as accept, even though some of those states will be inside a single
>>>> >> Unicode character, as long as the strings you test against are whole
>>>> >> UTF-8 sequences?
>>>> >>
>>>> >> > As it seems like I'm building up the Automaton at least three times,
>>>> and
>>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>>>> >> this
>>>> >> > the most memory efficient way of building such an Automaton?
>>>> >>
>>>> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>>>> >> your case, probably more than the original automaton.
>>>> >>
>>>> >> I suppose you don't actually need to keep the Automaton around once
>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>>>> >> today, since the RunAutomaton holds onto the Automaton...
>>>> >>
>>>> >> > Thanks in advance,
>>>> >>
>>>> >> You're welcome!
>>>> >>
>>>> >> Mike McCandless
>>>> >>
>>>> >> http://blog.mikemccandless.com
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: [hidden email]
>>>> >> For additional commands, e-mail: [hidden email]
>>>> >>
>>>> >>
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Michael McCandless-2
Actually, that's a great idea to try (Oliver).  It would be a
relatively simple conversion... maybe Lucene could add some sugar on
top, e.g. to convert an FST<Void> to an automaton.  Hmm, maybe it even
exists somewhere already...

But even the FST Builder's NodeHash can be non-trivial in its heap
usage, but hopefully less than DaciukMihovAutomatonBuilder.

(And yes I do love how simple DaciukMihovAutomatonBuilder is).

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss <[hidden email]> wrote:

> Yep, true. I just wonder whether it's worth complicating the code...
> Could be easier to build an FST<Void> and then recreate a RunAutomaton
> from that directly... :)
>
> Dawid
>
> On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless
> <[hidden email]> wrote:
>> We may be able to make DaciukMihovAutomatonBuilder's state registry
>> more ram efficient too ... I think it's essentially the same thing as
>> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
>> automata.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <[hidden email]> wrote:
>>> You could try using morfologik's byte-based implementation:
>>>
>>> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>>>
>>> I can't guarantee it'll be fast enough -- you need to sort those input
>>> sequences and even this may take a while. The construction of the
>>> automaton after that is fairly fast. What are the time limits you have
>>> with respect to input data sizes? Perhaps it's just unrealistic to
>>> assume everything is performed as part of a single request?
>>>
>>> Dawid
>>>
>>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <[hidden email]> wrote:
>>>> Hi Mike,
>>>>
>>>> Thanks for the suggestion, I've tried Operations.run on a Automaton and
>>>> it's fast enough for my use case.
>>>>
>>>> However, the real problem I have is in building the Automaton
>>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
>>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
>>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
>>>> think). I'm using this in an app that also serves realtime requests, which
>>>> timeout during building of the Automaton. It's an inherited application
>>>> design and not the best (ideally I'd be building the Automaton offline in
>>>> another process) and I don't expect it's a use case considered for
>>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
>>>> into other data structures for doing prefix matching. The dk.brics
>>>> Automaton has a similar performance profile, while the PatriciaTrie from
>>>> Apache Commons Collections seems to consume less CPU and produce less
>>>> garbage during build, although it has a less than ideal interface (it's a
>>>> Map).
>>>>
>>>> Any further suggestion though would be most welcome!
>>>>
>>>> Regards,
>>>>
>>>> Oliver
>>>>
>>>> On 14 February 2017 at 22:56, Michael McCandless <[hidden email]>
>>>> wrote:
>>>>
>>>>> Wow, 2G heap, that's horrible!
>>>>>
>>>>> How much heap does the automaton itself take?
>>>>>
>>>>> You can use the automaton's step method to transition from a state
>>>>> given the next input character to another state (or -1 if that state
>>>>> doesn't accept that character); it will be slower than the 2 GB run
>>>>> automaton, but perhaps fast enough?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]>
>>>>> wrote:
>>>>> > Thanks Mike for getting back to me, sounds like I'm on the right track.
>>>>> >
>>>>> > I'm building the automaton from around 1.7million strings, and it ends up
>>>>> > with about 3.8million states and it turns out building a
>>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
>>>>> > suprised!), with negligible performance difference at run time. At your
>>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>>>>> > CharacterRunAutomaton
>>>>> > in terms of heap and run time.  So for now I'm going to just stick to an
>>>>> > Automaton.
>>>>> >
>>>>> > On 14 February 2017 at 00:41, Michael McCandless <
>>>>> [hidden email]>
>>>>> > wrote:
>>>>> >
>>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[hidden email]>
>>>>> >> wrote:
>>>>> >>
>>>>> >> > I'd like to construct an Automaton to prefix match against a large
>>>>> set of
>>>>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
>>>>> >> than
>>>>> >> > Automaton.
>>>>> >>
>>>>> >> That's correct.
>>>>> >>
>>>>> >> > Are there any other differences between the three Automaton
>>>>> >> > classes, for example, in memory usage?
>>>>> >>
>>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>>>>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>>>>> >> things that are likely not interesting for your usage.
>>>>> >>
>>>>> >> > Would the general approach for such a problem be to use
>>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
>>>>> list
>>>>> >> of
>>>>> >> > my string set, set all the states to accept (to enable prefix
>>>>> matching),
>>>>> >> > then pass the Automaton into the constructor of a
>>>>> CharacterRunAutomaton,
>>>>> >> > and use the run method on the CharacterRunAutomaton to match any
>>>>> queries?
>>>>> >>
>>>>> >> That sounds right.
>>>>> >>
>>>>> >> You could also try doing everything in UTF8 space instead, and use
>>>>> >> ByteRunAutomaton: it will likely be faster since it will do faster
>>>>> >> lookups on each transition.  It should still be safe to set all states
>>>>> >> as accept, even though some of those states will be inside a single
>>>>> >> Unicode character, as long as the strings you test against are whole
>>>>> >> UTF-8 sequences?
>>>>> >>
>>>>> >> > As it seems like I'm building up the Automaton at least three times,
>>>>> and
>>>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>>>>> >> this
>>>>> >> > the most memory efficient way of building such an Automaton?
>>>>> >>
>>>>> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>>>>> >> your case, probably more than the original automaton.
>>>>> >>
>>>>> >> I suppose you don't actually need to keep the Automaton around once
>>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>>>>> >> today, since the RunAutomaton holds onto the Automaton...
>>>>> >>
>>>>> >> > Thanks in advance,
>>>>> >>
>>>>> >> You're welcome!
>>>>> >>
>>>>> >> Mike McCandless
>>>>> >>
>>>>> >> http://blog.mikemccandless.com
>>>>> >>
>>>>> >> ---------------------------------------------------------------------
>>>>> >> To unsubscribe, e-mail: [hidden email]
>>>>> >> For additional commands, e-mail: [hidden email]
>>>>> >>
>>>>> >>
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Oliver Mannion
Hi David & Mike,

Thanks for the suggestions!

Regarding my requirements for building the FSA, it happens in a background
process, not as part of a request, although the app is still serving
requests at the same time. So the time taken to build isn't really the
issue, it's just that it shouldn't starve the other requests of CPU.

I've tried building an FST<Object> with NoOutputs, and morfologik's FSA,
and both are much better in terms of build time, CPU and RAM usage compared
to Automaton. They are similar to, and in some dimensions even better than,
the PatriciaTrie. In particular building an FST with doShareSuffix = false
is the fastest of any option, and morfologik's fsa has the lowest heap
usage of all (both in terms of garbage and the final size of the FSA).
Either will meet my requirements. I quite like the API provided by
morfologik (no need to work with BytesRef) and it supports both prefix
(MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching
(MatchResult.EXACT_MATCH).

Given the efficiencies and functionality similarity of the FST vs
Automation in Lucene, I'm kind of curious as to why you would ever use
Automaton?

Thanks very much for your help!
Oliver

On 15 February 2017 at 21:49, Michael McCandless <[hidden email]>
wrote:

> Actually, that's a great idea to try (Oliver).  It would be a
> relatively simple conversion... maybe Lucene could add some sugar on
> top, e.g. to convert an FST<Void> to an automaton.  Hmm, maybe it even
> exists somewhere already...
>
> But even the FST Builder's NodeHash can be non-trivial in its heap
> usage, but hopefully less than DaciukMihovAutomatonBuilder.
>
> (And yes I do love how simple DaciukMihovAutomatonBuilder is).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss <[hidden email]>
> wrote:
> > Yep, true. I just wonder whether it's worth complicating the code...
> > Could be easier to build an FST<Void> and then recreate a RunAutomaton
> > from that directly... :)
> >
> > Dawid
> >
> > On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless
> > <[hidden email]> wrote:
> >> We may be able to make DaciukMihovAutomatonBuilder's state registry
> >> more ram efficient too ... I think it's essentially the same thing as
> >> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> >> automata.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <[hidden email]>
> wrote:
> >>> You could try using morfologik's byte-based implementation:
> >>>
> >>> https://github.com/morfologik/morfologik-stemming/blob/
> master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/
> FSABuilderTest.java
> >>>
> >>> I can't guarantee it'll be fast enough -- you need to sort those input
> >>> sequences and even this may take a while. The construction of the
> >>> automaton after that is fairly fast. What are the time limits you have
> >>> with respect to input data sizes? Perhaps it's just unrealistic to
> >>> assume everything is performed as part of a single request?
> >>>
> >>> Dawid
> >>>
> >>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <[hidden email]>
> wrote:
> >>>> Hi Mike,
> >>>>
> >>>> Thanks for the suggestion, I've tried Operations.run on a Automaton
> and
> >>>> it's fast enough for my use case.
> >>>>
> >>>> However, the real problem I have is in building the Automaton
> >>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes
> quite a
> >>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about
> 800mb
> >>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States
> I
> >>>> think). I'm using this in an app that also serves realtime requests,
> which
> >>>> timeout during building of the Automaton. It's an inherited
> application
> >>>> design and not the best (ideally I'd be building the Automaton
> offline in
> >>>> another process) and I don't expect it's a use case considered for
> >>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started
> looking
> >>>> into other data structures for doing prefix matching. The dk.brics
> >>>> Automaton has a similar performance profile, while the PatriciaTrie
> from
> >>>> Apache Commons Collections seems to consume less CPU and produce less
> >>>> garbage during build, although it has a less than ideal interface
> (it's a
> >>>> Map).
> >>>>
> >>>> Any further suggestion though would be most welcome!
> >>>>
> >>>> Regards,
> >>>>
> >>>> Oliver
> >>>>
> >>>> On 14 February 2017 at 22:56, Michael McCandless <
> [hidden email]>
> >>>> wrote:
> >>>>
> >>>>> Wow, 2G heap, that's horrible!
> >>>>>
> >>>>> How much heap does the automaton itself take?
> >>>>>
> >>>>> You can use the automaton's step method to transition from a state
> >>>>> given the next input character to another state (or -1 if that state
> >>>>> doesn't accept that character); it will be slower than the 2 GB run
> >>>>> automaton, but perhaps fast enough?
> >>>>>
> >>>>> Mike McCandless
> >>>>>
> >>>>> http://blog.mikemccandless.com
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[hidden email]
> >
> >>>>> wrote:
> >>>>> > Thanks Mike for getting back to me, sounds like I'm on the right
> track.
> >>>>> >
> >>>>> > I'm building the automaton from around 1.7million strings, and it
> ends up
> >>>>> > with about 3.8million states and it turns out building a
> >>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was
> quite
> >>>>> > suprised!), with negligible performance difference at run time. At
> your
> >>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
> >>>>> > CharacterRunAutomaton
> >>>>> > in terms of heap and run time.  So for now I'm going to just stick
> to an
> >>>>> > Automaton.
> >>>>> >
> >>>>> > On 14 February 2017 at 00:41, Michael McCandless <
> >>>>> [hidden email]>
> >>>>> > wrote:
> >>>>> >
> >>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <
> [hidden email]>
> >>>>> >> wrote:
> >>>>> >>
> >>>>> >> > I'd like to construct an Automaton to prefix match against a
> large
> >>>>> set of
> >>>>> >> > strings. I gather a RunAutomation is immutable, thread safe and
> faster
> >>>>> >> than
> >>>>> >> > Automaton.
> >>>>> >>
> >>>>> >> That's correct.
> >>>>> >>
> >>>>> >> > Are there any other differences between the three Automaton
> >>>>> >> > classes, for example, in memory usage?
> >>>>> >>
> >>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
> >>>>> >> original Automaton and the RunAutomaton, plus some other
> corner-casey
> >>>>> >> things that are likely not interesting for your usage.
> >>>>> >>
> >>>>> >> > Would the general approach for such a problem be to use
> >>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the
> sorted
> >>>>> list
> >>>>> >> of
> >>>>> >> > my string set, set all the states to accept (to enable prefix
> >>>>> matching),
> >>>>> >> > then pass the Automaton into the constructor of a
> >>>>> CharacterRunAutomaton,
> >>>>> >> > and use the run method on the CharacterRunAutomaton to match any
> >>>>> queries?
> >>>>> >>
> >>>>> >> That sounds right.
> >>>>> >>
> >>>>> >> You could also try doing everything in UTF8 space instead, and use
> >>>>> >> ByteRunAutomaton: it will likely be faster since it will do faster
> >>>>> >> lookups on each transition.  It should still be safe to set all
> states
> >>>>> >> as accept, even though some of those states will be inside a
> single
> >>>>> >> Unicode character, as long as the strings you test against are
> whole
> >>>>> >> UTF-8 sequences?
> >>>>> >>
> >>>>> >> > As it seems like I'm building up the Automaton at least three
> times,
> >>>>> and
> >>>>> >> > keeping a reference to the Automaton in the
> CharacterRunAutomaton, is
> >>>>> >> this
> >>>>> >> > the most memory efficient way of building such an Automaton?
> >>>>> >>
> >>>>> >> Yeah, it is.  The RunAutomaton will likely require substantial
> heap in
> >>>>> >> your case, probably more than the original automaton.
> >>>>> >>
> >>>>> >> I suppose you don't actually need to keep the Automaton around
> once
> >>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
> >>>>> >> today, since the RunAutomaton holds onto the Automaton...
> >>>>> >>
> >>>>> >> > Thanks in advance,
> >>>>> >>
> >>>>> >> You're welcome!
> >>>>> >>
> >>>>> >> Mike McCandless
> >>>>> >>
> >>>>> >> http://blog.mikemccandless.com
> >>>>> >>
> >>>>> >> ------------------------------------------------------------
> ---------
> >>>>> >> To unsubscribe, e-mail: [hidden email]
> >>>>> >> For additional commands, e-mail: [hidden email]
> >>>>> >>
> >>>>> >>
> >>>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Dawid Weiss-2
> PatriciaTrie. In particular building an FST with doShareSuffix = false is
> the fastest of any option,

If you don't share the suffix then you are building a kind of Patricia
trie... But suffix sharing is cheap and can give you a memory saving
(and resulting cache locality sometimes) that is non-trivial to
neglect.

> and morfologik's fsa has the lowest heap usage of
> all (both in terms of garbage and the final size of the FSA).

I think these would be minuscule differences, really. Something not
worth the effort of maintaining a compatibility with two libraries,
for example. Lucene's FST are transducers, so they do have an
additional benefit in that you can store something extra with each
entry (this may be handy sometimes -- for example frequency
information attached to each term, etc.).

> to work with BytesRef) and it supports both prefix
> (MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching
> (MatchResult.EXACT_MATCH).

So does Lucene's FST, although at a lower level (you'd need to descend
manually).

> Given the efficiencies and functionality similarity of the FST vs Automation
> in Lucene, I'm kind of curious as to why you would ever use Automaton?

The Automaton class was ported from Brics  library and it supports a
much more generic class of automata, including optimizations from
NFA->DFA, etc. The FST class is built directly from one specific kind
of data (unique sorted keys on input) and it's heavily optimized
towards this case.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Oliver Mannion
Hi Dawid,

Yes, the performance differences between Lucene's FST,  morfologik's fsa
and the PatriciaTrie are rather small.

I've managed to get something working well for my use-case. Thanks Dawid
and Michael for all your insight!

On 20 February 2017 at 19:21, Dawid Weiss <[hidden email]> wrote:

> > PatriciaTrie. In particular building an FST with doShareSuffix = false is
> > the fastest of any option,
>
> If you don't share the suffix then you are building a kind of Patricia
> trie... But suffix sharing is cheap and can give you a memory saving
> (and resulting cache locality sometimes) that is non-trivial to
> neglect.
>
> > and morfologik's fsa has the lowest heap usage of
> > all (both in terms of garbage and the final size of the FSA).
>
> I think these would be minuscule differences, really. Something not
> worth the effort of maintaining a compatibility with two libraries,
> for example. Lucene's FST are transducers, so they do have an
> additional benefit in that you can store something extra with each
> entry (this may be handy sometimes -- for example frequency
> information attached to each term, etc.).
>
> > to work with BytesRef) and it supports both prefix
> > (MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching
> > (MatchResult.EXACT_MATCH).
>
> So does Lucene's FST, although at a lower level (you'd need to descend
> manually).
>
> > Given the efficiencies and functionality similarity of the FST vs
> Automation
> > in Lucene, I'm kind of curious as to why you would ever use Automaton?
>
> The Automaton class was ported from Brics  library and it supports a
> much more generic class of automata, including optimizations from
> NFA->DFA, etc. The FST class is built directly from one specific kind
> of data (unique sorted keys on input) and it's heavily optimized
> towards this case.
>
> Dawid
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Michael McCandless-2
Wonderful, glad to hear you got something working, and thanks for bringing
closure.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Mar 4, 2017 at 1:32 AM, Oliver Mannion <[hidden email]> wrote:

> Hi Dawid,
>
> Yes, the performance differences between Lucene's FST,  morfologik's fsa
> and the PatriciaTrie are rather small.
>
> I've managed to get something working well for my use-case. Thanks Dawid
> and Michael for all your insight!
>
> On 20 February 2017 at 19:21, Dawid Weiss <[hidden email]> wrote:
>
>> > PatriciaTrie. In particular building an FST with doShareSuffix = false
>> is
>> > the fastest of any option,
>>
>> If you don't share the suffix then you are building a kind of Patricia
>> trie... But suffix sharing is cheap and can give you a memory saving
>> (and resulting cache locality sometimes) that is non-trivial to
>> neglect.
>>
>> > and morfologik's fsa has the lowest heap usage of
>> > all (both in terms of garbage and the final size of the FSA).
>>
>> I think these would be minuscule differences, really. Something not
>> worth the effort of maintaining a compatibility with two libraries,
>> for example. Lucene's FST are transducers, so they do have an
>> additional benefit in that you can store something extra with each
>> entry (this may be handy sometimes -- for example frequency
>> information attached to each term, etc.).
>>
>> > to work with BytesRef) and it supports both prefix
>> > (MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching
>> > (MatchResult.EXACT_MATCH).
>>
>> So does Lucene's FST, although at a lower level (you'd need to descend
>> manually).
>>
>> > Given the efficiencies and functionality similarity of the FST vs
>> Automation
>> > in Lucene, I'm kind of curious as to why you would ever use Automaton?
>>
>> The Automaton class was ported from Brics  library and it supports a
>> much more generic class of automata, including optimizations from
>> NFA->DFA, etc. The FST class is built directly from one specific kind
>> of data (unique sorted keys on input) and it's heavily optimized
>> towards this case.
>>
>> Dawid
>>
>
>
Loading...