I was checking some details of the Hunspell implementation and it seems to me that there might be some useful opportunities for performance improvements.
Currently, when HunspellStemmer.stem() is called, the dictionary is checked for all possible prefixes/suffixes of the given word.
For example, when we want to stem the word "externalization", we check the following suffixes: [ externalization, xternalization, ternalization, ..., ion, on, n, "" ] (and similarly for prefixes).
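If I read the code right, the candidate generation is roughly equivalent to the following sketch (the class and method names are mine, purely for illustration, not the actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixCandidates {
    // Illustrative sketch: enumerate every trailing substring of the word,
    // from the whole word down to the empty string.
    static List<String> suffixCandidates(String word) {
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i <= word.length(); i++) {
            candidates.add(word.substring(i));
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> c = suffixCandidates("externalization");
        // 16 candidates for a 15-letter word: the word itself,
        // 14 proper suffixes, and "".
        System.out.println(c.size());
        System.out.println(c.get(0));
    }
}
```

So each token of length n produces n+1 candidate lookups per affix direction, before any recursion.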
My questions are:
1) Why do we test longer suffixes than is necessary?
Would it make sense to learn the maximum pre/suffix length when loading the dictionary and then apply that value when generating possible suffixes? For example, if the English dictionary contained a maximum suffix length of 4, then in the above example for the word "externalization" we could safely test only the following suffixes: tion, ion, on, n, "".
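The proposed optimization could look something like this sketch (again, names are mine; `maxSuffixLength` is a hypothetical value that would be recorded once while the affix file is loaded):

```java
import java.util.ArrayList;
import java.util.List;

public class BoundedSuffixCandidates {
    // Sketch of the proposal: cap the candidate suffix length at the longest
    // suffix actually present in the dictionary, so we never generate a
    // candidate the dictionary could not possibly match.
    static List<String> suffixCandidates(String word, int maxSuffixLength) {
        List<String> candidates = new ArrayList<>();
        int start = Math.max(0, word.length() - maxSuffixLength);
        for (int i = start; i <= word.length(); i++) {
            candidates.add(word.substring(i));
        }
        return candidates;
    }

    public static void main(String[] args) {
        // With a hypothetical max suffix length of 4, "externalization"
        // yields only [tion, ion, on, n, ""] instead of all 16 candidates.
        System.out.println(suffixCandidates("externalization", 4));
    }
}
```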
Given that the same algorithm is applied recursively to all generated folded word forms until we hit the recursion limit, this can amount to a lot of dictionary lookups per token. I noticed that the lookup is implemented using CharArrayMap (which should be fast), but we could still easily skip many lookup calls simply because we KNOW in advance that they cannot return any result. Do you think such an optimization is worth the effort?
2) Can a suffix be the whole input word?
Given that we generate the possible suffix-lookup strings from the input word, why don't we skip the first letter? For example, if the input word is "hey", does it make sense to consider only the suffixes "ey", "y", ""? If I read the code correctly, we include "hey" itself in the set as well (the whole string). Is this correct? My understanding is that if I cut a suffix from a word, I shouldn't be left with an empty string, no? Maybe this is language/dictionary specific, I don't know.
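To make question 2 concrete, here is a sketch of the variant I have in mind (names are mine): starting the loop at index 1 so the whole word is never treated as a strippable suffix, since stripping it would leave an empty stem.

```java
import java.util.ArrayList;
import java.util.List;

public class ProperSuffixCandidates {
    // Sketch: generate only proper suffixes of the word plus "", excluding
    // the word itself, on the assumption that stripping a suffix should
    // always leave a non-empty stem.
    static List<String> properSuffixCandidates(String word) {
        List<String> candidates = new ArrayList<>();
        for (int i = 1; i <= word.length(); i++) {
            candidates.add(word.substring(i));
        }
        return candidates;
    }

    public static void main(String[] args) {
        // For "hey" this yields [ey, y, ""] -- three candidates instead of four.
        System.out.println(properSuffixCandidates("hey"));
    }
}
```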
(I am happy to open an issue for any of these and provide patches.)