Spell checking street names

Spell checking street names

Max Metral
I'm using Lucene to spell check street names.  Right now, I'm using
Double Metaphone on the street name (we have a sophisticated regex to
parse out the NAME as opposed to the unit, number, street type, or
suffix).  I think that Double Metaphone is probably overkill/wrong, and
a spell checking approach (n-gram based) would be better.  Part of the
reason is if we look at some common mistakes:

 

For Commonwealth:

Communwealth

Comonwealth

Common wealth

 

Double metaphone will get the first two, but not the last.  Spell check
(I think) would get all 3.  The last is much more common than in typical
generic text search (Fairoaks vs. Fair Oaks, New Market vs. Newmarket,
etc).  However, spell check will only get the third if the n-gram input
is untokenized (right?).

 

 Conceptually, I feel like people will most often misspell or mistype
rather than completely omitting words from the street name.  So running
the n-gram on the untokenized street name seems like a good thing.
Problem is I can't see how I do this, SpellChecker seems to always want
to tokenize things, and I'm a bit confused on how to give it an analyzer
that doesn't tokenize.

 

I feel like this might be a newbie question, so apologies if so.  But,
1) does an untokenized n-gram spell checker seem like a good thing for
this app? 2) Which analyzer can I use for no tokenization at all?

 

--Max


Re: Spell checking street names

Otis Gospodnetic-2
Hmmm, "untokenized
n-gram
spell
checker"... does that really make sense?
lucene as 2-gram: lu uc ce en ne..... but all as a single token?  No, I don't think that will work with the Lucene spellchecker.

As for non-tokenizing Analyzer - KeywordAnalyzer.
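
An untested sketch of what that looks like at index time (Lucene 2.x-era API; the "street" field name is made up):

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class IndexStreets {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            // KeywordAnalyzer emits the entire field value as one token.
            IndexWriter writer = new IndexWriter(dir, new KeywordAnalyzer(), true);
            Document doc = new Document();
            // With KeywordAnalyzer, "Fair Oaks" is indexed as the single term
            // "Fair Oaks" (equivalent to Field.Index.UN_TOKENIZED).
            doc.add(new Field("street", "Fair Oaks",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }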

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Spell checking street names

eks dev
In reply to this post by Max Metral
Otis,
I think the proposal was a spell checker that works on multiple tokens per document:

the field to be searched with the SpellChecker, e.g. "lucene search library", does not get tokenized and then fed to the SpellChecker; rather, it is kept as a single token that gets chopped into n-grams:

"lucene search library"->"lu
uc
ce
en
ne se ea ar"....

So far so good; that would work, but the re-scoring part of the SpellChecker would be far from optimal for these cases, since plain edit distance does not support gaps (some sort of Needleman-Wunsch alignment would work nicely).
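
For illustration, a toy Needleman-Wunsch scorer (nothing in the current SpellChecker does this; the match/mismatch/gap scores are made up):

    public class Align {
        // Global alignment: match +1, mismatch -1, gap -1. A higher score
        // means a better alignment between the two strings.
        static int score(String a, String b) {
            int[][] m = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) m[i][0] = -i;
            for (int j = 0; j <= b.length(); j++) m[0][j] = -j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int diag = m[i - 1][j - 1]
                            + (a.charAt(i - 1) == b.charAt(j - 1) ? 1 : -1);
                    m[i][j] = Math.max(diag,
                            Math.max(m[i - 1][j] - 1, m[i][j - 1] - 1));
                }
            return m[a.length()][b.length()];
        }

        public static void main(String[] args) {
            // The inserted space costs only one gap: 12 matches - 1 gap = 11.
            System.out.println(score("Commonwealth", "Common wealth"));
        }
    }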

Think of this as having an n-gram index of documents rather than of tokens. This approach absolutely makes sense for short "documents" like addresses.

And to come back to the question: yes, you can make a separate field from your address and index it as a keyword, then just feed this field to the SpellChecker. It will not be perfect... but it will do a solid job, and your case will be covered, e.g.:

Commonwealth:
Communwealth
Comonwealth
Common wealth
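
An untested sketch with the contrib SpellChecker, where streets.txt is a made-up file holding one full street name per line (so each name is a single dictionary "word" that gets n-grammed internally):

    import java.io.File;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.RAMDirectory;

    public class StreetSpell {
        public static void main(String[] args) throws Exception {
            SpellChecker spell = new SpellChecker(new RAMDirectory());
            // One street name per line -> one dictionary term per name,
            // so "Common wealth" can still match "Commonwealth".
            spell.indexDictionary(new PlainTextDictionary(new File("streets.txt")));
            for (String s : spell.suggestSimilar("Common wealth", 5))
                System.out.println(s);
        }
    }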



Re: Spell checking street names

Karl Wettin
In reply to this post by Max Metral

On 30 Jan 2008, at 17:34, Max Metral wrote:

> Part of the reason is if we look at some common mistakes:
>
>
> For Commonwealth:
>
> Communwealth
>
> Comonwealth
>
> Common wealth

If they are common mistakes, you can pick them up using reinforcement
learning.

<http://issues.apache.org/jira/browse/LUCENE-626> might help you.



    karl


RE: Spell checking street names

Max Metral
Thanks all.  I've got it working now using a KeywordAnalyzer.  The edit
distance metric I'm using is purely "edit" based, i.e. when I input
"Bennett", I get "Jennett", "Gannett", "Kenneth" and THEN "Bennet".
While I see the logic, it's obviously not the best metric.  Is there an
appropriate edit distance metric that takes phonetics into account?
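
[A hedged sketch of one possibility, since nothing in SpellChecker does this out of the box: re-rank the suggestions it returns by mixing plain edit distance with edit distance between Double Metaphone codes (via commons-codec). The class name and the 2:1 weighting are made up.]

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class PhoneticRank {
        private static final DoubleMetaphone DM = new DoubleMetaphone();

        // Lower is better: phonetic-code distance is weighted over raw edit
        // distance, so "Bennet" (same code as "Bennett") outranks "Jennett".
        static int cost(String input, String candidate) {
            int phonetic = lev(DM.doubleMetaphone(input), DM.doubleMetaphone(candidate));
            int literal = lev(input.toLowerCase(), candidate.toLowerCase());
            return 2 * phonetic + literal; // made-up weights, tune on real typos
        }

        // Plain Levenshtein edit distance.
        static int lev(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1),
                            Math.min(d[i - 1][j], d[i][j - 1]) + 1);
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            for (String c : new String[] {"Jennett", "Gannett", "Kenneth", "Bennet"})
                System.out.println(c + " -> " + cost("Bennett", c));
        }
    }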
