Phonetic Token Filter

Walter Underwood, Netflix
I've written a simple phonetic token filter (and factory) based
on the Double Metaphone implementation in Jakarta Commons Codec to
contribute. Three questions:

1. Does this sound like a generally useful addition?

2. Should we have a Jira issue first?

3. This adds a dependency on the Commons Codec jar. How do we add that
to the distro?

The code is very simple, but I need to learn the contribution
process and build some tests, so this won't happen in one day.
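
For readers new to the idea, here is a minimal sketch of what a phonetic token filter does. It is not the contributed code: it is plain Python rather than a Lucene/Solr filter, and it swaps in classic Soundex for Double Metaphone because Soundex fits in a few lines. The names `soundex` and `phonetic_filter` are made up for this illustration.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key for word, or "" if it has no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a digit
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated digit is kept after it
    return (key + "000")[:4]


def phonetic_filter(tokens, encode=soundex):
    """Replace each token with its phonetic key; pass through tokens with no key."""
    for token in tokens:
        yield encode(token) or token


print(list(phonetic_filter(["Smith", "Smyth", "2006"])))
# ['S530', 'S530', '2006']
```

Because "Smith" and "Smyth" end up as the same indexed term, a query for one matches the other. A real filter would plug a Double Metaphone encoder into the `encode` slot.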

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Phonetic Token Filter

Yonik Seeley-2
On 11/21/06, Walter Underwood <[hidden email]> wrote:
> I've written a simple phonetic token filter (and factory) based
> on the Double Metaphone implementation in Jakarta Commons Codec to
> contribute. Three questions:
>
> 1. Does this sound like a generally useful addition?

Definitely useful.
If it's generally applicable enough and lightweight enough, then it
should go in the core.  Otherwise it could go in contrib (which we
don't really have yet, but we will when the need arises).

This sounds like it should probably go in the core.

> 2. Should we have a Jira issue first?

Yes, please.

> 3. This adds a dependency on the Commons Codec jar. How do we add that
> to the distro?

It would go in the lib directory if it ends up in Solr proper.

-Yonik

Re: Phonetic Token Filter

Chris Hostetter-3

: > 2. Should we have a Jira issue first?

this wiki should have all the info you need...

http://wiki.apache.org/solr/HowToContribute



-Hoss


Re: Phonetic Token Filter

Bertrand Delacretaz
In reply to this post by Walter Underwood, Netflix
On 11/21/06, Walter Underwood <[hidden email]> wrote:
> ...I've written a simple phonetic token filter (and factory) based
> on the Double Metaphone implementation in Jakarta Commons Codec to
> contribute. Three questions:
>
> 1. Does this sound like a generally useful addition?...

Sure!

Do you know if it is supposed to work for non-English languages? I'm
interested in testing it on French (and maybe German) texts, once your
patch is ready.

-Bertrand

Re: Phonetic Token Filter

Walter Underwood, Netflix
On 11/21/06 1:01 AM, "Bertrand Delacretaz" <[hidden email]> wrote:

> Do you know if it is supposed to work for non-English languages? I'm
> interested in testing it on French (and maybe German) texts, once your
> patch is ready.

Double Metaphone has several rules for non-English words, but it
assumes English pronunciation. I think the biggest problem would be
consonants that are silent or vowel sounds. For example, it codes
"Paris" as "PRS" instead of with a silent "s" as in French, and
"Jonas" as "JNS" where in German it would be pronounced "yonas".
And "Wim Winders" is coded as "AM ANTR", treating the "W" as a
vowel instead of a "V" sound.

It is worth a try. Most implementations of Double Metaphone are
well-commented, so you could change it for other languages.
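
One cheap way to experiment before touching the encoder is to respell tokens toward English orthography and then encode the result. Note that this is a different technique from changing the Double Metaphone rules themselves. The sketch below is only an illustration: the rule table and the name `respell` are hypothetical, and the three rules cover nothing beyond the examples in this message.

```python
import re

# Hypothetical, deliberately tiny rule sets keyed by language code.
# Real orthography needs many more rules, including context-sensitive ones.
RESPELLING_RULES = {
    "de": [(r"j", "y"), (r"w", "v")],  # German J sounds like English Y, W like V
    "fr": [(r"s\b", "")],              # drop a word-final S, as in "Paris"
}


def respell(text, lang):
    """Lowercase text and apply the respelling rules for lang, if any exist."""
    out = text.lower()
    for pattern, replacement in RESPELLING_RULES.get(lang, []):
        out = re.sub(pattern, replacement, out)
    return out


print(respell("Jonas", "de"))        # yonas
print(respell("Wim Winders", "de"))  # vim vinders
print(respell("Paris", "fr"))        # pari
```

The respelled token would then go to the unmodified English-oriented encoder. Whether this gives better matches than editing the encoder rules is something only testing on real French or German text can show.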

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Phonetic Token Filter

Bertrand Delacretaz
On 11/21/06, Walter Underwood <[hidden email]> wrote:
> ...It is worth a try. Most implementations of Double Metaphone are
> well-commented, so you could change it for other languages...

OK, I'll see if I can find some time to test that. Thanks for the clarifications!
-Bertrand