Merge multiple FSTs to build suggesters


Karthik zorfy
Hi,

I'm working on an application that uses the fuzzy suggester to provide autocomplete with fuzzy matching. I need to rebuild the suggesters periodically so that the latest data is reflected in the suggest results. As the index grows, I frequently run into OutOfMemory errors when building the suggesters, and each time someone has to step in manually and increase the JVM heap size.

I'm thinking about the following approach to overcome this issue.

Split the search index (the search documents) into multiple segments, build a suggester per segment, and finally merge the per-segment results (FSTs).

Has anyone solved a similar use case, or does anyone have suggestions?


Best,
Karthic
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Merge multiple FSTs to build suggesters

Michael McCandless-2
Hi Karthic,

There are known algorithms to take the union of FSTs, but unfortunately they are not yet implemented in Lucene -- patches welcome!

The fun OpenFST library does implement it: http://www.openfst.org/twiki/bin/view/FST/UnionDoc  Maybe that could be used for inspiration/poaching?  It is also Apache licensed.
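To make "union" concrete, here is a minimal sketch of my own, assuming the suggestions can be modeled as a plain weighted trie. The class WeightedTrie and its methods are hypothetical; this is neither Lucene's FST nor OpenFST's union algorithm. A real FST union also has to share suffixes and redistribute outputs along arcs, and a trie like this will not save any memory. It only shows what the merged structure must accept: every key from either input, with the higher weight kept when a key appears in both.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not Lucene's FST and not OpenFST's algorithm. */
final class WeightedTrie {
    private final Map<Character, WeightedTrie> children = new TreeMap<>();
    private boolean isFinal;   // true if a key ends at this node
    private long weight;       // meaningful only when isFinal is true

    /** Adds a key; if the key already exists, the larger weight is kept. */
    void put(String key, long w) {
        WeightedTrie node = this;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new WeightedTrie());
        }
        node.weight = node.isFinal ? Math.max(node.weight, w) : w;
        node.isFinal = true;
    }

    /** Returns the weight stored for key, or null if the key is absent. */
    Long get(String key) {
        WeightedTrie node = this;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.get(key.charAt(i));
            if (node == null) {
                return null;
            }
        }
        if (!node.isFinal) {
            return null;
        }
        return Long.valueOf(node.weight);
    }

    /** Builds a new trie that accepts every key of a or b (max weight on collisions). */
    static WeightedTrie union(WeightedTrie a, WeightedTrie b) {
        WeightedTrie out = new WeightedTrie();
        mergeInto(out, a);
        mergeInto(out, b);
        return out;
    }

    /** Recursively copies src into dst, walking both structures in lockstep. */
    private static void mergeInto(WeightedTrie dst, WeightedTrie src) {
        if (src.isFinal) {
            dst.weight = dst.isFinal ? Math.max(dst.weight, src.weight) : src.weight;
            dst.isFinal = true;
        }
        for (Map.Entry<Character, WeightedTrie> e : src.children.entrySet()) {
            WeightedTrie child = dst.children.computeIfAbsent(e.getKey(), c -> new WeightedTrie());
            mergeInto(child, e.getValue());
        }
    }
}
```

The inputs are left untouched and the result is a fresh structure, which is the same contract a union over immutable per-segment FSTs would need.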

Unfortunately, building the FST is memory-intensive. There are a few expert parameters on the FST.Builder that you could tweak to use less memory during the build (at the cost of producing a somewhat larger FST in the end).

Elasticsearch works around this limitation by writing an FST per segment; at suggest time, it pulls the best suggestions from each segment and then does a partial/merge sort at the end to get the overall best.  This lets the suggester remain near-real-time...
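The merge step at suggest time can be sketched roughly as follows. This is my own illustration, not Elasticsearch's code: SegmentSuggestMerger, Suggestion and mergeTopK are hypothetical names, and the sketch assumes each segment has already returned its own best suggestions sorted by descending weight. A heap holds one cursor per segment, so only as many entries as needed are examined, and a suggestion that shows up in several segments is reported once with its highest weight.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical helper; not part of Lucene or Elasticsearch. */
final class SegmentSuggestMerger {

    /** One suggestion with its weight (higher is better). */
    static final class Suggestion {
        final String text;
        final long weight;

        Suggestion(String text, long weight) {
            this.text = text;
            this.weight = weight;
        }
    }

    /**
     * Merges per-segment result lists into the overall top k.
     * Assumption: each inner list is sorted by descending weight.
     */
    static List<Suggestion> mergeTopK(List<List<Suggestion>> perSegment, int k) {
        // Each heap entry is a cursor {segmentIndex, position}; the cursor that
        // points at the heaviest suggestion is polled first.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Long.compare(
                perSegment.get(b[0]).get(b[1]).weight,
                perSegment.get(a[0]).get(a[1]).weight));
        for (int s = 0; s < perSegment.size(); s++) {
            if (!perSegment.get(s).isEmpty()) {
                heap.add(new int[] {s, 0});
            }
        }

        List<Suggestion> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        while (out.size() < k && !heap.isEmpty()) {
            int[] cur = heap.poll();
            List<Suggestion> list = perSegment.get(cur[0]);
            Suggestion best = list.get(cur[1]);
            // The first time a text is seen it carries its highest weight,
            // because cursors are consumed in descending weight order.
            if (seen.add(best.text)) {
                out.add(best);
            }
            // Advance this segment's cursor if it has more entries.
            if (cur[1] + 1 < list.size()) {
                heap.add(new int[] {cur[0], cur[1] + 1});
            }
        }
        return out;
    }
}
```

For example, merging a segment that returns ("lucene", 9), ("lucid", 4) with one that returns ("luke", 7), ("lucene", 5) and k = 3 yields lucene, luke, lucid in that order. Note that each segment should be asked for at least k suggestions, otherwise a globally good entry might be cut off before the merge sees it.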

In reply to Karthik zorfy's message of Mon, Jul 6, 2020 at 8:17 AM.