Spellchecking and frequency

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Spellchecking and frequency

dan sutton
Hi,

I've recently been looking into Spellchecking in solr, and was struck by how
limited the usefulness of the tool was.

Like most corpora , ours contains lots of different spelling mistakes for
the same word, so the 'spellcheck.onlyMorePopular' is not really that useful
unless you click on it numerous times.

I was thinking that since most of the time people spell words correctly why
was there no other frequency parameter that could enter into the score? i.e.
something like:

spell_score ~ edit_dist * freq

I'm sure others have come across this issue and was wonding what
steps/algorithms they have used to overcome these limitations?

Cheers,
Dan
Reply | Threaded
Open this post in threaded view
|

Re: Spellchecking and frequency

Mark Holland
Hi,

I found the suggestions returned from the standard solr spellcheck not to be
that relevant. By contrast, aspell, given the same dictionary and mispelled
words, gives much more accurate suggestions.

I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
the java aspell library. I also extended the SpellCheckComponent to take the
matrix of suggested words and query the corpus to find the first combination
of suggestions which returned a match. This works well for my use case,
where term frequency is irrelevant to spelling or scoring.

I'd like to publish the code in case someone finds it useful (although it's
a bit crude at the moment and will need a decent tidy up). Would it be
appropriate to open up a Jira issue for this?

Cheers,
~mark

On 27 July 2010 09:33, dan sutton <[hidden email]> wrote:

> Hi,
>
> I've recently been looking into Spellchecking in solr, and was struck by
> how
> limited the usefulness of the tool was.
>
> Like most corpora , ours contains lots of different spelling mistakes for
> the same word, so the 'spellcheck.onlyMorePopular' is not really that
> useful
> unless you click on it numerous times.
>
> I was thinking that since most of the time people spell words correctly why
> was there no other frequency parameter that could enter into the score?
> i.e.
> something like:
>
> spell_score ~ edit_dist * freq
>
> I'm sure others have come across this issue and was wonding what
> steps/algorithms they have used to overcome these limitations?
>
> Cheers,
> Dan
>
Reply | Threaded
Open this post in threaded view
|

RE: Spellchecking and frequency

Dyer, James
Mark,

I'd like to see your code if you open a JIRA for this.  I recently
opened SOLR-2010 with a patch that does something similar to the second
part only of what you describe (find combinations that actually return a
match).  But I'm not sure if my approach is the best one so I would like
to see yours to compare.

James Dyer
E-Commerce Systems
Ingram Book Company
(615) 213-4311

-----Original Message-----
From: Mark Holland [mailto:[hidden email]]
Sent: Tuesday, July 27, 2010 1:04 PM
To: [hidden email]
Subject: Re: Spellchecking and frequency

Hi,

I found the suggestions returned from the standard solr spellcheck not
to be
that relevant. By contrast, aspell, given the same dictionary and
mispelled
words, gives much more accurate suggestions.

I therefore wrote an implementation of SolrSpellChecker that wraps
jazzy,
the java aspell library. I also extended the SpellCheckComponent to take
the
matrix of suggested words and query the corpus to find the first
combination
of suggestions which returned a match. This works well for my use case,
where term frequency is irrelevant to spelling or scoring.

I'd like to publish the code in case someone finds it useful (although
it's
a bit crude at the moment and will need a decent tidy up). Would it be
appropriate to open up a Jira issue for this?

Cheers,
~mark

On 27 July 2010 09:33, dan sutton <[hidden email]> wrote:

> Hi,
>
> I've recently been looking into Spellchecking in solr, and was struck
by
> how
> limited the usefulness of the tool was.
>
> Like most corpora , ours contains lots of different spelling mistakes
for
> the same word, so the 'spellcheck.onlyMorePopular' is not really that
> useful
> unless you click on it numerous times.
>
> I was thinking that since most of the time people spell words
correctly why
> was there no other frequency parameter that could enter into the
score?

> i.e.
> something like:
>
> spell_score ~ edit_dist * freq
>
> I'm sure others have come across this issue and was wonding what
> steps/algorithms they have used to overcome these limitations?
>
> Cheers,
> Dan
>
Reply | Threaded
Open this post in threaded view
|

Re: Spellchecking and frequency

Erick Erickson
In reply to this post by Mark Holland
"Yonik's Law of Patches" reads: "A half-baked patch in Jira, with no
documentation, no tests and no backwards compatibilty is better than no
patch at all."

It'd be perfectly appropriate, IMO, for you to post an outline of what your
enhancements do over on the SOLR dev list and get a reaction from the folks
over there as to whether it should be a Jira or not... see
[hidden email]

Best
Erick

On Tue, Jul 27, 2010 at 2:04 PM, Mark Holland <[hidden email]>wrote:

> Hi,
>
> I found the suggestions returned from the standard solr spellcheck not to
> be
> that relevant. By contrast, aspell, given the same dictionary and mispelled
> words, gives much more accurate suggestions.
>
> I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
> the java aspell library. I also extended the SpellCheckComponent to take
> the
> matrix of suggested words and query the corpus to find the first
> combination
> of suggestions which returned a match. This works well for my use case,
> where term frequency is irrelevant to spelling or scoring.
>
> I'd like to publish the code in case someone finds it useful (although it's
> a bit crude at the moment and will need a decent tidy up). Would it be
> appropriate to open up a Jira issue for this?
>
> Cheers,
> ~mark
>
> On 27 July 2010 09:33, dan sutton <[hidden email]> wrote:
>
> > Hi,
> >
> > I've recently been looking into Spellchecking in solr, and was struck by
> > how
> > limited the usefulness of the tool was.
> >
> > Like most corpora , ours contains lots of different spelling mistakes for
> > the same word, so the 'spellcheck.onlyMorePopular' is not really that
> > useful
> > unless you click on it numerous times.
> >
> > I was thinking that since most of the time people spell words correctly
> why
> > was there no other frequency parameter that could enter into the score?
> > i.e.
> > something like:
> >
> > spell_score ~ edit_dist * freq
> >
> > I'm sure others have come across this issue and was wonding what
> > steps/algorithms they have used to overcome these limitations?
> >
> > Cheers,
> > Dan
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Spellchecking and frequency

dan sutton
In reply to this post by Mark Holland
Hi Mark,

Thanks for that info looks very interesting, would be great to see your
code. Out of interest did you use the dictionary and the phonetic file? Did
you see better results with both?

In regards to the secondary part to check the corpus for matching
suggestions, would another way to do this is to have an event listener to
listen for commits, and then build the dictionary for matching corpus words
that way, then you avoid the performance hit at query time.

Cheers,
Dan

On Tue, Jul 27, 2010 at 7:04 PM, Mark Holland <[hidden email]>wrote:

> Hi,
>
> I found the suggestions returned from the standard solr spellcheck not to
> be
> that relevant. By contrast, aspell, given the same dictionary and mispelled
> words, gives much more accurate suggestions.
>
> I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
> the java aspell library. I also extended the SpellCheckComponent to take
> the
> matrix of suggested words and query the corpus to find the first
> combination
> of suggestions which returned a match. This works well for my use case,
> where term frequency is irrelevant to spelling or scoring.
>
> I'd like to publish the code in case someone finds it useful (although it's
> a bit crude at the moment and will need a decent tidy up). Would it be
> appropriate to open up a Jira issue for this?
>
> Cheers,
> ~mark
>
> On 27 July 2010 09:33, dan sutton <[hidden email]> wrote:
>
> > Hi,
> >
> > I've recently been looking into Spellchecking in solr, and was struck by
> > how
> > limited the usefulness of the tool was.
> >
> > Like most corpora , ours contains lots of different spelling mistakes for
> > the same word, so the 'spellcheck.onlyMorePopular' is not really that
> > useful
> > unless you click on it numerous times.
> >
> > I was thinking that since most of the time people spell words correctly
> why
> > was there no other frequency parameter that could enter into the score?
> > i.e.
> > something like:
> >
> > spell_score ~ edit_dist * freq
> >
> > I'm sure others have come across this issue and was wonding what
> > steps/algorithms they have used to overcome these limitations?
> >
> > Cheers,
> > Dan
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Spellchecking and frequency

Jonathan Rochkind
In reply to this post by Erick Erickson

>> I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
>> the java aspell library. I also extended the SpellCheckComponent to take
>> the
>> matrix of suggested words and query the corpus to find the first
>> combination
>> of suggestions which returned a match. This works well for my use case,
>> where term frequency is irrelevant to spelling or scoring.

This is interesting to me. I also have not been that happy with standard
solr spellcheck.

In addition to possibly filing a JIRA for future fix to Solr itself,
another option would be you could make your 'alternate' SpellCheck
component available as a seperate .jar, so anyone could use it just by
installing and specifying it in their solrconfig.xml.  I would encourage
you to consider that, not as a replacement for suggesting a patch to
Solr itself, but so people can use your improved spellchecker
immediately, without waiting for possible Solr patches.

Jonathan