Enhancing Hunspell support

Peter Gromov
Hi,

I'd like to contribute to the support of Hunspell in Lucene, specifically:
* support the flags necessary for English, German, French, Spanish and Russian dictionaries, possibly more languages later
* provide a public API to check if a word is misspelled
* mirror Hunspell's suggestion algorithm in Lucene, probably in the "src/suggest" module

For context: I work on natural language support for IntelliJ-based IDEs. We'd like to use Hunspell dictionaries there, but interfacing with native binaries proved to be slow and unreliable. So we'd prefer a JVM-only reimplementation of the Hunspell spellchecker and suggester. Lucene's Hunspell-related code currently seems closest to that goal, so we thought we could enhance it further.

Is there anything non-obvious that I should know before diving into the implementation?

The contribution will likely consist of many commits, each dedicated to a specific subtask or small refactoring. Should I file a separate JIRA issue for each of them, or is a single big one (e.g. "Hunspell improvements") enough?

Peter Gromov

Re: Enhancing Hunspell support

Robert Muir
On Mon, Jan 11, 2021 at 9:38 AM Peter Gromov
<[hidden email]> wrote:

>
> Hi,
>
> I'd like to contribute to the support of Hunspell in Lucene, specifically:
> * support the flags necessary for English, German, French, Spanish and Russian dictionaries, possibly more languages later
> * provide a public API to check if a word is misspelled
> * mirror Hunspell's suggestion algorithm in Lucene, probably in the "src/suggest" module
>
> For context: I work on natural language support for IntelliJ-based IDEs. We'd like to use Hunspell dictionaries there, but interfacing with native binaries proved to be slow and unreliable. So we'd prefer a JVM-only reimplementation of the Hunspell spellchecker and suggester. Lucene's Hunspell-related code currently seems closest to that goal, so we thought we could enhance it further.
>
> Is there anything non-obvious that I should know before diving into the implementation?

Great! Note that currently the code only tries to determine stems for
a word. For that, it should already support the dictionaries for the
languages you mentioned (various flag encodings and all that).

There's no decompounding logic to support languages like German (for
search purposes you can find some alternatives for this in the source
tree).

There's no "suggest" logic to try to generate potential correctly
spelled words. I'm not sure how many dictionaries in practice really
provide the options to "tweak" the default Hunspell correction
algorithm.
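
To make the idea of "suggest" logic concrete, here is a minimal sketch,
not Hunspell's actual algorithm and not existing Lucene code. It
generates candidates one edit away from the input (adjacent swap,
replace, delete, insert) and keeps those found in a word list. The
class name ToySuggester, the suggest method and the tryChars parameter
are hypothetical names for illustration; tryChars stands in for the
kind of per-dictionary option (such as TRY in an .aff file) that can
tune correction.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: a simplified one-edit candidate generator.
class ToySuggester {

    // Returns known words that are one simple edit away from the input word.
    static List<String> suggest(String word, Set<String> dictionary, String tryChars) {
        Set<String> found = new LinkedHashSet<>();

        // 1. Swap each pair of adjacent characters.
        for (int i = 0; i + 1 < word.length(); i++) {
            char[] c = word.toCharArray();
            char tmp = c[i];
            c[i] = c[i + 1];
            c[i + 1] = tmp;
            keepIfKnown(new String(c), word, dictionary, found);
        }
        // 2. Replace each character with each of the "try" characters.
        for (int i = 0; i < word.length(); i++) {
            for (char t : tryChars.toCharArray()) {
                keepIfKnown(word.substring(0, i) + t + word.substring(i + 1), word, dictionary, found);
            }
        }
        // 3. Delete each character in turn.
        for (int i = 0; i < word.length(); i++) {
            keepIfKnown(word.substring(0, i) + word.substring(i + 1), word, dictionary, found);
        }
        // 4. Insert each "try" character at each position.
        for (int i = 0; i <= word.length(); i++) {
            for (char t : tryChars.toCharArray()) {
                keepIfKnown(word.substring(0, i) + t + word.substring(i), word, dictionary, found);
            }
        }
        return new ArrayList<>(found);
    }

    // Records the candidate if it differs from the input and is a known word.
    private static void keepIfKnown(String candidate, String original,
                                    Set<String> dictionary, Set<String> found) {
        if (!candidate.equals(original) && dictionary.contains(candidate)) {
            found.add(candidate);
        }
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("cat", "cart", "act");
        System.out.println(suggest("cta", dictionary, "acrt")); // prints [cat]
    }
}
```

A real implementation would check candidates against the affix-expanded
dictionary rather than a flat word set, and would rank the results.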

Most of the code was written based on the documentation in the
hunspell(4) manual page.

It is best to keep the tests small by making a "mini dictionary" and
associated test case when trying to fix something. Of course it is not
always so easy to boil problems down into such a test, but it at least
ensures things are improving and prevents playing whack-a-mole.
Example:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestZeroAffix.java
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/zeroaffix.aff
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/zeroaffix.dic
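
For readers unfamiliar with the file format, the sketch below shows
what a "mini dictionary" can look like and how a stem check against it
works in principle. It is a self-contained toy, not the Lucene
implementation and not the contents of the zeroaffix files linked
above: the class MiniDictionaryDemo and its methods are hypothetical,
only single-character flags and SFX rules are handled, and the rule
condition field is ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a toy reader for a tiny .aff/.dic pair.
class MiniDictionaryDemo {

    // Mini .aff: one suffix class "S" with one rule: strip nothing ("0"), append "s".
    static final String AFF = String.join("\n",
            "SET UTF-8",
            "SFX S Y 1",
            "SFX S 0 s .");

    // Mini .dic: entry count, then entries. "cat" allows suffix S, "dog" does not.
    static final String DIC = String.join("\n", "2", "cat/S", "dog");

    // One suffix rule: flag, characters to strip from the stem, characters to append.
    static final class Suffix {
        final char flag;
        final String strip;
        final String append;

        Suffix(char flag, String strip, String append) {
            this.flag = flag;
            this.strip = strip;
            this.append = append;
        }
    }

    // Parses rule lines ("SFX flag strip append condition"); header lines have 4 fields.
    static List<Suffix> parseSuffixes(String aff) {
        List<Suffix> rules = new ArrayList<>();
        for (String line : aff.split("\n")) {
            String[] p = line.trim().split("\\s+");
            if (p.length == 5 && p[0].equals("SFX")) {
                String strip = p[2].equals("0") ? "" : p[2];
                String append = p[3].equals("0") ? "" : p[3];
                rules.add(new Suffix(p[1].charAt(0), strip, append));
            }
        }
        return rules;
    }

    // Maps each dictionary word to its flag string (empty if it has no flags).
    static Map<String, String> parseWords(String dic) {
        Map<String, String> words = new HashMap<>();
        String[] lines = dic.split("\n");
        for (int i = 1; i < lines.length; i++) { // line 0 is the entry count
            String[] p = lines[i].trim().split("/", 2);
            words.put(p[0], p.length > 1 ? p[1] : "");
        }
        return words;
    }

    // Returns dictionary stems from which the word can be derived.
    static List<String> stem(String word, List<Suffix> rules, Map<String, String> words) {
        List<String> stems = new ArrayList<>();
        if (words.containsKey(word)) {
            stems.add(word);
        }
        for (Suffix r : rules) {
            if (word.endsWith(r.append)) {
                // Undo the rule: remove the appended part, restore the stripped part.
                String candidate =
                        word.substring(0, word.length() - r.append.length()) + r.strip;
                String flags = words.get(candidate);
                if (!candidate.equals(word) && flags != null && flags.indexOf(r.flag) >= 0) {
                    stems.add(candidate);
                }
            }
        }
        return stems;
    }

    public static void main(String[] args) {
        List<Suffix> rules = parseSuffixes(AFF);
        Map<String, String> words = parseWords(DIC);
        System.out.println(stem("cats", rules, words)); // prints [cat]
        System.out.println(stem("dogs", rules, words)); // prints []
    }
}
```

The point of such a small pair is that each test isolates one feature:
here "cats" stems to "cat" because the entry carries flag S, while
"dogs" yields nothing because "dog" does not.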

> The contribution will likely consist of many commits, each dedicated to a specific subtask or small refactoring. Should I file a separate JIRA issue for each of them, or is a single big one (e.g. "Hunspell improvements") enough?
>
> Peter Gromov

IMO smaller issues are better here. If each improvement has a test and
doesn't break the other tests, then the code can keep getting better.
