
SynonymFilter, FST, and Aho-Corasick algorithm


SynonymFilter, FST, and Aho-Corasick algorithm

David Smiley
Hello.
  I'm embarking on developing code similar to the SynonymFilter, but one that merely needs to record, out of band from the analysis, where the input tokens match the corpus in the FST.  I'm calling this a "keyword tagger": I shove text through it, and when it's done it tells me at which offsets there is a match against a corpus of keywords & phrases, and exactly which keywords/phrases matched.  It doesn't have to inject or modify the token stream, because the results of this are going elsewhere.  Although it would be a fine approach to only emit the "tags", as I call them, as a way of consuming the results, I'm not indexing them, so it doesn't matter.

  I noticed the following TODOs at the start:

// TODO: maybe we should resolve token -> wordID then run
// FST on wordIDs, for better perf?

I intend to do this, since my matching keywords/phrases are often more than one word, and I expect this will save memory and be faster.
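
For illustration, the token → wordID idea might look like this outside of Lucene: intern every word that appears in the pattern vocabulary to a dense int, then run the matcher over int sequences instead of character sequences. This is a minimal sketch with a plain `HashMap` standing in for Lucene's word dictionary; the class and method names here are my own illustration, not Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

// Interns words from the pattern vocabulary to dense integer IDs so a
// multi-word matcher can compare single ints instead of whole strings.
public class WordInterner {
    private final Map<String, Integer> wordToId = new HashMap<>();
    public static final int UNKNOWN = -1; // token not in any pattern

    // Called at build time for every word appearing in a pattern.
    public int intern(String word) {
        return wordToId.computeIfAbsent(word, w -> wordToId.size());
    }

    // Called at match time: unknown words can never start or extend a match.
    public int idOf(String word) {
        Integer id = wordToId.get(word);
        return id == null ? UNKNOWN : id;
    }

    public static void main(String[] args) {
        WordInterner dict = new WordInterner();
        int a = dict.intern("a"), b = dict.intern("b");
        System.out.println(a + " " + b + " " + dict.idOf("zzz")); // prints: 0 1 -1
    }
}
```

A nice property of this interning step is that any input token not in the dictionary immediately kills all candidate matches, before the FST is consulted at all.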

// TODO: a more efficient approach would be Aho/Corasick's
// algorithm
// http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
// It improves over the current approach here
// because it does not fully re-start matching at every
// token.  For example if one pattern is "a b c x"
// and another is "b c d" and the input is "a b c d", on
// trying to parse "a b c x" but failing when you got to x,
// rather than starting over again you really should
// immediately recognize that "b c d" matches at the next
// input.  I suspect this won't matter that much in
// practice, but it's possible on some set of synonyms it
// will.  We'd have to modify Aho/Corasick to enforce our
// conflict resolving (eg greedy matching) because that algo
// finds all matches.  This really amounts to adding a .*
// closure to the FST and then determinizing it.

Could someone please clarify how the problem in the example above is to be fixed?  The last sentence states how to solve it, but I don't know how to do that, and I'm not sure whether there is more to it; after all, if it were as easy as that last sentence makes it sound, it would have been done already ;-)
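
For what it's worth, the failure-transition idea the TODO alludes to can be shown with a from-scratch, token-level Aho-Corasick over plain collections (this is not Lucene's FST API; all names here are illustrative). The failure links are what let the "a b c x" / "b c d" example recover without restarting:

```java
import java.util.*;

// Token-level Aho-Corasick: a trie over words plus "failure" links, so a
// failed match continues from the longest suffix that is also a pattern
// prefix, instead of restarting at the next token.
public class TokenAhoCorasick {
    static final class Node {
        final Map<String, Node> next = new HashMap<>();
        Node fail;                                      // longest proper suffix that is also a prefix
        final List<String> outputs = new ArrayList<>(); // patterns ending at this node
    }

    private final Node root = new Node();

    public void add(String[] pattern) {
        Node n = root;
        for (String w : pattern) n = n.next.computeIfAbsent(w, k -> new Node());
        n.outputs.add(String.join(" ", pattern));
    }

    // Standard BFS construction of the failure links.
    public void build() {
        ArrayDeque<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            for (Map.Entry<String, Node> e : n.next.entrySet()) {
                Node f = n.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node child = e.getValue();
                child.fail = (f.next.containsKey(e.getKey()) && f.next.get(e.getKey()) != child)
                        ? f.next.get(e.getKey()) : root;
                child.outputs.addAll(child.fail.outputs); // inherit matches ending in a suffix
                queue.add(child);
            }
        }
    }

    // Returns every pattern occurrence, scanning the input exactly once.
    public List<String> matches(String[] tokens) {
        List<String> found = new ArrayList<>();
        Node state = root;
        for (String w : tokens) {
            while (state != root && !state.next.containsKey(w)) state = state.fail;
            state = state.next.getOrDefault(w, root);
            found.addAll(state.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        TokenAhoCorasick ac = new TokenAhoCorasick();
        ac.add(new String[]{"a", "b", "c", "x"});
        ac.add(new String[]{"b", "c", "d"});
        ac.build();
        // Failing "a b c x" at 'd' falls back via the suffix "b c" and still finds "b c d".
        System.out.println(ac.matches(new String[]{"a", "b", "c", "d"})); // prints: [b c d]
    }
}
```

On input "a b c d", the matcher reaches the "a b c" state, finds no arc for "d", follows the failure link to the "b c" state, and accepts "d" there, which is exactly the behavior the TODO says a plain trie walk misses. Note this sketch emits all matches; the TODO's point stands that greedy/longest-match conflict resolution would need extra work on top.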

This code is intense!  I wish FSTs were better documented.  For example, there are no javadocs on public members of FST.Arc like "output" and "nextFinalOutput", which are pertinent since SynonymFilter accesses them directly.  IMO the state of FSTs is such that those who wrote them know how they work (Robert, McCandless, Weiss), and seemingly everyone else, like me, doesn't touch them because we don't know how.

~ David Smiley
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss-2

Knowing how FSTs work and being comfortable with the API that evolved through a series of exploratory patches are two different things. I like my FSA API much better, and there was an effort to do something similar for Lucene, but I gave up at some point because the speed of development killed me.

Can you describe what you're trying to achieve in more detail? I've used FSTs for pattern matching (sequences of arbitrary length) and my experience is that simple state trackers work very well (even if they may seem to do lots of spurious tracking).



Re: SynonymFilter, FST, and Aho-Corasick algorithm

David Smiley

On Jul 12, 2012, at 11:51 AM, Dawid Weiss wrote:

> Knowing how FSTs work and being comfortable with the API that evolved through a series of exploratory patches are two different things. I like my FSA API much better, and there was an effort to do something similar for Lucene, but I gave up at some point because the speed of development killed me.

Do you mean it was slow to coordinate / get consensus or…?  Just curious.

> Can you describe what you're trying to achieve in more detail? I've used FSTs for pattern matching (sequences of arbitrary length) and my experience is that simple state trackers work very well (even if they may seem to do lots of spurious tracking).

I rather like Wikipedia's definition:  http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm

The number of names I want to handle is in the millions, so the use of Lucene's FST is essential as I see it.

~ David





Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss-2

The development was too fast for me to keep up. And by the time I had some concept of the API, Mike had written about a million lines of code that would have to be rewritten ;)

The current API isn't bad. It's fast.

I asked for an example of what you're trying to do because then I'd be able to tell you if what I used would work. The number of entries does not matter. I did use FSTs, but simple FSTs, nothing special.




Re: SynonymFilter, FST, and Aho-Corasick algorithm

Michael McCandless-2
On Thu, Jul 12, 2012 at 12:10 PM, Dawid Weiss <[hidden email]> wrote:
> The development was too fast for me to keep up. And by the time i had some
> concept of the api mike wrote about million lines of code that would have to
> be rewritten ;)

Mike is very happy to help rewrite that code for a better FST API :)

We can and should also make incremental improvements.

I do agree it's horrible to have code that only a small set of people
understand: such code is effectively dead.

Mike McCandless

http://blog.mikemccandless.com



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss-2

The API shouldn't be the goal. Those initial FST changes were driven by real needs and optimizations, and are still great. The API will eventually take better shape and form based on those use cases, I'm sure.

That patch of mine tried to extract a common notion of an output and its grammar. It was half-baked anyway, so no big loss.



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Robert Muir
In reply to this post by Michael McCandless-2

I agree, and think we should make improvements to the API whenever we can.

But a fast, efficient, and self-documenting FST API is probably going to be elusive.

I think a lot of this could be fixed with examples and docs, which
we've been working at too, e.g.:
http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/util/fst/package-summary.html#package_description

The biggest problem I have with documentation here is when it becomes out-of-date. We've made a lot of progress on that: we have a javadocs-lint task that runs in Hudson and checks all of our links, failing if any are dead, etc.

But this does us no good for code samples. I think we need to seriously revisit/develop a plan for code samples in documentation. All the samples we have in various docs (e.g. package documentation) are very fragile, and that totally discourages me from adding any advanced examples, or any more than are minimally necessary to get started, because I'm afraid of the manual maintenance cost.

Instead I think we should set up a proper examples infrastructure, where these examples are actually compiled and checked. We can still link to them in javadocs.
Have a look at this example from the demo/ module:
http://lucene.apache.org/core/4_0_0-ALPHA/demo/overview-summary.html#Location_of_the_source

I think we should have more than just SearchFiles and IndexFiles, and also move our examples here rather than inlining them in the javadocs text. This way they are compile-time checked, and we can link to them from anywhere (it's safe, and we have link-checkers that prove it).

I'm open to any other ideas, though: this is just the best one I have now.

--
lucidimagination.com



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss
> I think a lot of this could be fixed with examples and docs, which

We use a simple ANT task that extracts snippets of code from Java sources
(very often unit tests) and includes these in the javadocs, unfortunately
by post-processing. See the example here:

http://download.carrot2.org/stable/javadoc/

and the sources (linked) are here:

https://github.com/carrot2/carrot2/blob/df49d66087d0da9e87043e13a400ac148952a41c/applications/carrot2-examples/examples/org/carrot2/examples/clustering/ClusteringDocumentList.java

As you can see there are simple tags of the form:

[[[start:clustering-document-list-intro]]]
...
[[[end:clustering-document-list-intro]]]

These get extracted to separate files which are then included in the
javadocs. Crude, but it works. I'm sure it could be improved, and probably
other folks have come up with a similar idea; I just don't know of
such attempts.
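
A marker-based extractor like the one described can be sketched in a few lines. The `[[[start:NAME]]]` / `[[[end:NAME]]]` tag syntax is taken from the message above; everything else, including the class name, is my own illustration, not the actual Carrot2 tool:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans source lines for [[[start:NAME]]] ... [[[end:NAME]]] markers and
// collects each tagged region so it can be written out for javadoc inclusion.
public class SnippetExtractor {
    private static final Pattern START = Pattern.compile("\\[\\[\\[start:([\\w-]+)\\]\\]\\]");
    private static final Pattern END = Pattern.compile("\\[\\[\\[end:([\\w-]+)\\]\\]\\]");

    public static Map<String, String> extract(List<String> lines) {
        Map<String, String> snippets = new LinkedHashMap<>();
        String current = null;
        StringBuilder body = new StringBuilder();
        for (String line : lines) {
            Matcher start = START.matcher(line);
            Matcher end = END.matcher(line);
            if (start.find()) {
                current = start.group(1);   // a new tagged region begins
                body.setLength(0);
            } else if (end.find() && end.group(1).equals(current)) {
                snippets.put(current, body.toString()); // region complete
                current = null;
            } else if (current != null) {
                body.append(line).append('\n');
            }
        }
        return snippets;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        extract(lines).forEach((name, code) ->
            System.out.println("=== " + name + " ===\n" + code));
    }
}
```

The extracted files would then be spliced into the javadocs by whatever post-processing step the build uses (ANT copy/filter, XSL, etc.).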

Dawid



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Robert Muir

That's more sophisticated than what we do with the javadocs
"linksource" option in the demo.
The key requirement there is that we never want to link to any
trunk/development code, as it may not exist.
So with linksource (like my example), an HTML-ized
version of your code is included in the javadocs.

--
lucidimagination.com



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Robert Muir

But your "view source" button seems to also do this? So we would just
want to omit the "github" link at the bottom, I think?

I like this solution... how much code is it... can we have it?

--
lucidimagination.com



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss
In reply to this post by Robert Muir
> Thats more sophisticated than what we do with the javadocs
> "linksource" option in the demo.

Sometimes we don't even include the full sources of these snippets. They
are demonstrational, and we just want to make sure they compile/run
cleanly (that's why they're typically part of tests, not core
sources).

> But your "view source" button seems to also do this? So we would just

I believe that's JS/CSS magic, but I'm not sure.

> I like this solution... how much code is it... can we have it?

Sure. Staszek put it together; it's really nothing fancy. We only have
a binary in the c2 repository, but I'm sure we can make it available --
I'll ask Staszek to put it on GitHub. One thing I forgot: we
generate that overview from an XML/XSL file (using XIncludes), and this
is probably overkill for Lucene. It'd be much faster/easier to just
HTML-ize those snippets and include them directly with replacement
patterns (even an ANT copy/filter would do here).

Dawid



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss
bq. I rather like Wikipedia's definition:
http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm

I did a similar thing but:

1) it was based on entire words as individual "tokens" (instead of letter-by-letter),
2) all words present in the input patterns can be encoded in a separate
data structure which maps each to a unique integer,
3) the matcher is essentially tracking the following:

Match {
  automaton_arc toNextNode; // arc pointing at the node to try to extend from next
  final int matchStart;     // input position where this candidate match began
  int matchLength;          // number of words matched so far
}

You then process the input word by word and advance each Match if
there is an arc leaving "toNextNode" that matches the current word. If
the toNextNode arc is final, then you've hit a match and need to record
it (it may not be the longest match, so if you only care about the
longest matches then additional processing is required).

You create new Match objects and discard mismatched existing Matches
as you process the input. Essentially, it's as if you tried to walk
down the automaton starting at every single position in the input.
This may seem costly, but in reality matches are infrequent
compared to the input text, and they are rarely very long (which is
what would create lots of states). I used the above approach for
entity matching and it worked super-fast.

All this said, an Aho-Corasick transition graph would of course be
more efficient. The question is how much more efficient, and how much
code/work you'll need to put in to make it work :)
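
The tracker described above can be compressed into a sketch like this, with a plain word-trie standing in for the FST (my simplification; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A per-position state tracker in the spirit described above: for each input
// word, try to start a new candidate match and to extend every live candidate.
public class StateTrackerMatcher {
    static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean isFinal; // a whole pattern ends at this node
    }

    private final Node root = new Node();

    public void addPattern(String... words) {
        Node n = root;
        for (String w : words) n = n.next.computeIfAbsent(w, k -> new Node());
        n.isFinal = true;
    }

    /** Returns "start=S len=L" for every pattern occurrence (all matches, not just longest). */
    public List<String> match(String[] input) {
        List<String> hits = new ArrayList<>();
        List<Node> live = new ArrayList<>();      // candidate states
        List<Integer> starts = new ArrayList<>(); // start position of each candidate
        for (int pos = 0; pos < input.length; pos++) {
            live.add(root);                       // try starting a new match here
            starts.add(pos);
            List<Node> kept = new ArrayList<>();
            List<Integer> keptStarts = new ArrayList<>();
            for (int i = 0; i < live.size(); i++) {
                Node advanced = live.get(i).next.get(input[pos]);
                if (advanced == null) continue;   // discard mismatched candidates
                if (advanced.isFinal)
                    hits.add("start=" + starts.get(i) + " len=" + (pos - starts.get(i) + 1));
                kept.add(advanced);               // keep for further extension
                keptStarts.add(starts.get(i));
            }
            live = kept;
            starts = keptStarts;
        }
        return hits;
    }

    public static void main(String[] args) {
        StateTrackerMatcher m = new StateTrackerMatcher();
        m.addPattern("new", "york");
        m.addPattern("new", "york", "city");
        System.out.println(m.match(new String[]{"in", "new", "york", "city"}));
        // prints: [start=1 len=2, start=1 len=3]
    }
}
```

As described, the live-candidate list stays tiny in practice because most positions fail to start a match at all.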

Dawid



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss
bq. You create new Match objects and discard mismatched existing Matches

I didn't say it explicitly, but obviously you don't need to create
new objects when you're doing this. The pool of match states only needs
to be as big as the longest pattern, so you can pool and reuse them.
Zero allocation cost.
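
The pooling idea can be sketched like so (illustrative names, not from any real codebase): recycled match-state objects are handed out per input position instead of being allocated fresh.

```java
import java.util.ArrayDeque;

// Match state as described earlier in the thread (trimmed to plain fields).
class MatchState {
    int matchStart;
    int matchLength;
}

// A tiny object pool: the free list never grows past the peak number of
// simultaneously live candidates, which is bounded by the longest pattern.
public class MatchPool {
    private final ArrayDeque<MatchState> free = new ArrayDeque<>();

    public MatchState acquire(int start) {
        MatchState m = free.isEmpty() ? new MatchState() : free.poll();
        m.matchStart = start;   // reset state for the new candidate
        m.matchLength = 0;
        return m;
    }

    public void release(MatchState m) {
        free.push(m);           // ready for reuse on a later position
    }

    public static void main(String[] args) {
        MatchPool pool = new MatchPool();
        MatchState m = pool.acquire(0);
        pool.release(m);
        System.out.println(pool.acquire(5) == m); // prints: true (reused, not reallocated)
    }
}
```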

Dawid




Re: SynonymFilter, FST, and Aho-Corasick algorithm

David Smiley
In reply to this post by Dawid Weiss
Thanks for your explanation; I already had a very rough idea of the approach.  Can Aho-Corasick be implemented with Lucene's FST?  Again, the SynonymFilterFactory said this re: Aho-Corasick:
// This really amounts to adding a .*
// closure to the FST and then determinizing it.

You didn't mention the FST once, and that's the API I'm having trouble grokking.

~ David



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss
That comment was probably made against the Brics library's
implementation. Adding a ".*" is not possible with the algorithm
that builds the FST incrementally, because it accepts only
fixed strings (and builds an already determinized automaton).

That's part of the reason my runtime solution was suboptimal -- the
tradeoff is that you can construct the FST very efficiently from
millions of input entries, as long as you don't need to manipulate it
afterwards. Brics will probably bail out with an OOM if you try to
manipulate large graphs.

I'd gladly share my code, because it was Lucene-based... but I can't --
paid consulting job, sorry. It shouldn't be too hard to rewrite from
scratch, though, really.

Dawid

On Thu, Jul 12, 2012 at 10:07 PM, Smiley, David W. <[hidden email]> wrote:

> Thanks for your explanation, I already had a very rough idea of the
> approach.  Can Aho-Corasick be implemented with Lucene's FST?  Again, the
> SynonymFilterFactory said this RE Aho-Corasick:
> // This really amounts to adding a .*
> // closure to the FST and then determinizing it.
>
> You didn't mention FST once and that's the API I'm having trouble groking.
>
> ~ David
>
>
> On Jul 12, 2012, at 3:51 PM, Dawid Weiss [via Lucene] wrote:
>
> bq. I rather like Wikipedia's definition:
> http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm
>
> I did a similar thing but:
>
> 1) based on entire words as individual "tokens" (instead of
> letter-by-letter),
> 2) all words present in input patterns can be encoded as a separate
> data structure which maps to an unique integer
> 3) the matcher is essentially tracking the following:
>
> Match {
>   automaton_arc toNextNode;
>   final int matchStart;
>   int matchLength;
> }
>
> you then process the input word-by-word and advance each Match if
> there is an arc leaving "toNextNode" and matching the current word. If
> the toNextNode arc is final then you've hit a match and need to record
> it (it may not be the longest match so if you only care about the
> longest matches then additional processing is required).
>
> You create new Match objects and discard mismatched existing Matches
> as you process the input. Essentially, it's as if you tried to walk
> down in the automaton starting on every single position in the input.
> This may seem costly but in reality the matches are infrequent
> compared to the input text and they are rarely very, very long (to
> create lots of states). I used the above approach for entity matching
> and it worked super-fast.
>
> All this said, an Aho-Corasick transition graph would of course be
> more efficient. The question is how much more efficient and how much
> code/ work you'll need to put into it to make it work :)
>
> Dawid
>
>
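Dawid's Match-tracking scheme quoted above can be sketched roughly as
follows. This is a minimal illustration, not his actual code (which he
notes he can't share): a plain word-keyed trie stands in for Lucene's
FST, and all names are invented for the example.

```java
import java.util.*;

// Sketch of concurrent match tracking over a word-keyed trie (a stand-in
// for the FST). A new candidate match is started at every token position;
// mismatched candidates are discarded as the input is consumed.
public class ConcurrentMatcher {
    static final class Node {
        final Map<String, Node> arcs = new HashMap<>();
        boolean isFinal; // a full pattern ends at this node
    }

    final Node root = new Node();

    public void addPattern(String pattern) {
        Node n = root;
        for (String w : pattern.split(" ")) {
            n = n.arcs.computeIfAbsent(w, k -> new Node());
        }
        n.isFinal = true;
    }

    // One in-flight match: the position it started at plus the node reached.
    static final class Match {
        final int start;
        Node node;
        Match(int start, Node node) { this.start = start; this.node = node; }
    }

    // Returns {start, endExclusive} token offsets for every pattern hit.
    public List<int[]> match(String[] tokens) {
        List<int[]> hits = new ArrayList<>();
        List<Match> live = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            live.add(new Match(i, root)); // try to start a match at every position
            List<Match> survivors = new ArrayList<>();
            for (Match m : live) {
                Node to = m.node.arcs.get(tokens[i]);
                if (to == null) continue;            // mismatch: discard
                if (to.isFinal) hits.add(new int[]{m.start, i + 1});
                m.node = to;
                survivors.add(m);                    // keep advancing this match
            }
            live = survivors;
        }
        return hits;
    }
}
```

Running the TODO comment's example through this -- patterns "a b c x" and
"b c d", input "a b c d" -- the candidate started at position 1 survives
and reports the "b c d" hit, even though the candidate started at
position 0 dies on "d". As Dawid says, the cost is bounded because live
candidates can never outlive the longest pattern.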

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: SynonymFilter, FST, and Aho-Corasick algorithm

Michael McCandless-2
In reply to this post by David Smiley
Some responses below:

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jul 12, 2012 at 11:08 AM, Smiley, David W. <[hidden email]> wrote:
> Hello.
>   I'm embarking on developing code similar to the SynonymFilter but which merely needs to record out of band to the analysis where there is matching text in the input tokens to the corpus in the FST.  I'm calling this a "keyword tagger" in which I shove text through it and when it's done it tells me at what offsets there is a match to a corpus of keyword & phrases, and to what keywords/phrases they were exactly.  It doesn't have to inject or modify the token stream because the results of this are going elsewhere.  Although, it would be a fine approach to only omit the "tags" as I call them as a way of consuming the results, but I'm not indexing them so it doesn't matter.
>
>   I noticed the following TODOs at the start:
>
> // TODO: maybe we should resolve token -> wordID then run
> // FST on wordIDs, for better perf?
>
> I intend on doing this since my matching keyword/phrases are often more than one word, and I expect this will save memory and be faster.

Be sure to test this is really faster: you'll need to add a step to
resolve word -> id (eg via hashmap) which may net/net add cost because
the FST can incrementally (quickly) determine a word doesn't exist
with a given prefix.  FST can also do better sharing (less RAM) of
shared prefixes/suffixes.
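The extra resolution step Mike describes amounts to something like the
sketch below (hypothetical names, not Lucene API; a HashMap stands in
for whatever dictionary the tagger would compile from its patterns):

```java
import java.util.*;

// Sketch of the token -> wordID step: every input token pays one hash
// lookup before any FST work happens -- the cost Mike warns about.
public class WordIdDict {
    private final Map<String, Integer> ids = new HashMap<>();

    // Compile time: each distinct word across all patterns gets a dense id.
    public int intern(String word) {
        return ids.computeIfAbsent(word, k -> ids.size());
    }

    // Match time: an unknown word resolves to -1. No pattern can match
    // through it, so all in-flight matches can be discarded immediately,
    // without consulting the FST at all.
    public int lookup(String word) {
        return ids.getOrDefault(word, -1);
    }
}
```

Whether the per-token lookup pays off depends on the data, which is
exactly the benchmarking Mike suggests.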

> // TODO: a more efficient approach would be Aho/Corasick's
> // algorithm
> // http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
> // It improves over the current approach here
> // because it does not fully re-start matching at every
> // token.  For example if one pattern is "a b c x"
> // and another is "b c d" and the input is "a b c d", on
> // trying to parse "a b c x" but failing when you got to x,
> // rather than starting over again your really should
> // immediately recognize that "b c d" matches at the next
> // input.  I suspect this won't matter that much in
> // practice, but it's possible on some set of synonyms it
> // will.  We'd have to modify Aho/Corasick to enforce our
> // conflict resolving (eg greedy matching) because that algo
> // finds all matches.  This really amounts to adding a .*
> // closure to the FST and then determinizing it.
>
> Could someone please clarify how the problem in the example above is to be fixed?  At the end it states how to solve it, but I don't know how to do that and I'm not sure if there is anything more to it since after all if it's as easy as that last sentence sounds then it would have been done already ;-)

The FSTs we create are not "malleable" so implementing what that crazy
comment says would not be easy.

However, there is a cool paper that Robert found:

    http://www.cis.uni-muenchen.de/people/Schulz/Pub/dictle5.ps

That I think does not require heavily modifying the minimal FST (just
augmenting it w/ additional arcs that you follow on failure to match).
 I think it's basically Aho Corasick, done as an FST (which eg you can
then compose with other FSTs to compile a chain of replacements into a
single FST ... at least this was my quick understanding).

Still, I would first try the obvious approach (use FST the way
SynFilter does) and see if it's fast enough.  I think Aho Corasick
only really matters if your patterns have high overlap after shifting
(eg aaaab and aaaaab).
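For reference, the behavior the TODO comment asks for looks roughly like
this word-level Aho-Corasick sketch (again a plain trie, not Lucene's
FST, and all names are illustrative): failure links let matching resume
from "b c" after "a b c x" fails on "d", instead of restarting.

```java
import java.util.*;

// Word-level Aho-Corasick sketch. Note it reports ALL matches; the greedy
// conflict resolution SynonymFilter needs would be an extra step.
public class AhoCorasick {
    static final class Node {
        final Map<String, Node> arcs = new HashMap<>();
        Node fail;
        final List<Integer> patternLengths = new ArrayList<>(); // patterns ending here
    }

    final Node root = new Node();

    public void add(String pattern) {
        String[] words = pattern.split(" ");
        Node n = root;
        for (String w : words) n = n.arcs.computeIfAbsent(w, k -> new Node());
        n.patternLengths.add(words.length);
    }

    // Wire failure links breadth-first; merge outputs from failure targets.
    public void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.arcs.values()) { child.fail = root; queue.add(child); }
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            for (Map.Entry<String, Node> e : n.arcs.entrySet()) {
                Node f = n.fail;
                while (f != root && !f.arcs.containsKey(e.getKey())) f = f.fail;
                Node target = f.arcs.get(e.getKey());
                Node child = e.getValue();
                child.fail = (target != null) ? target : root;
                child.patternLengths.addAll(child.fail.patternLengths);
                queue.add(child);
            }
        }
    }

    // Single pass over the input; never restarts from scratch on a mismatch.
    public List<int[]> match(String[] tokens) {
        List<int[]> hits = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < tokens.length; i++) {
            while (n != root && !n.arcs.containsKey(tokens[i])) n = n.fail;
            n = n.arcs.getOrDefault(tokens[i], root);
            for (int len : n.patternLengths) hits.add(new int[]{i + 1 - len, i + 1});
        }
        return hits;
    }
}
```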

Mike McCandless

http://blog.mikemccandless.com



Re: SynonymFilter, FST, and Aho-Corasick algorithm

Dawid Weiss
Some follow-ups to Mike's comments.

> Be sure to test this is really faster: you'll need to add a step to
> resolve word -> id (eg via hashmap) which may net/net add cost because
> the FST can incrementally (quickly) determine a word doesn't exist
> with a given prefix.  FST can also do better sharing (less RAM) of

This is a tradeoff that you need to make for your particular type of
data. The number of unique words in your patterns and the fanout of
initial letter prefixes will matter here. For me by-word ID was more
effective because I had a relatively small (in-memory) term dictionary
and wanted to limit the number of states tracked at once. With the
letter-by-letter approach the number of states to track will grow.

> However, there is a cool paper that Robert found:
>     http://www.cis.uni-muenchen.de/people/Schulz/Pub/dictle5.ps

Yeah... If we had a full FST library and lots of memory, pattern
matching (and rewriting) is essentially a finite, determinized regexp.
So it can definitely be made very efficient using FSTs; the question is
how much processing needs to be done to construct an automaton for this.

That paper looks cool but I didn't have time to look deeply into it --
the simple approach runs way (orders of magnitude) under the speed
threshold set by the client. Again -- even with degenerate cases the
processing time is bounded by the length of the longest search pattern
(there is no backtracking, only concurrent state tracking and possibly
overlap resolution afterwards).

> Still, I would first try the obvious approach (use FST the way
> SynFilter does) and see if it's fast enough.

+1.

D.

