string similarity measures

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

string similarity measures

Cam Bazz
Hello,
This came up before but - if we were to make a swear word filter, string
edit distances are no good. for example words like `shot` is confused with
`shit`. there is also problem with words like hitchcock. appearently i need
something like soundex or double metaphone. the thing is - these are
language specific, and i am not operating in english.

I need a fuzzy like curse word filter for turkish, simply.

Best regards,
-C.B.
Reply | Threaded
Open this post in threaded view
|

Re: string similarity measures

Karl Wettin

4 sep 2008 kl. 14.38 skrev Cam Bazz:

> Hello,
> This came up before but - if we were to make a swear word filter,  
> string
> edit distances are no good. for example words like `shot` is  
> confused with
> `shit`. there is also problem with words like hitchcock. appearently  
> i need
> something like soundex or double metaphone. the thing is - these are
> language specific, and i am not operating in english.
>
> I need a fuzzy like curse word filter for turkish, simply.

You probably need to make a large list of words. I would try to learn  
from the users that do swear, perhaps even trust my users to report  
each other. I would probably also look at storing in what context the  
word is used, perhaps by adding the surrounding words (ngrams,  
shingles, markov chains). Compare "go to hell" and "when hell frezes  
over". The first is rather derogatory while the second doen't have to  
be bad at all.

I'm thinking Hidden Markov Models and Neural Networks.


           karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: string similarity measures

Cam Bazz
yes, I already have a system for users reporting words. they fall on an
operator screen and if operator approves, or if 3 other people marked it as
curse, then it is filtered.
in the other thread you wrote:

>I would create 1-5 ngram sized shingles and measure the distance using
Tanimoto coefficient. That would probably work out just fine. ?>You might
want to add more weight the greater the size of the shingle.
>
>There are shingle filters in lucene/java/contrib/analyzers and there is a
Tanimoto distance in lucene/mahout/.

would that apply to my case? tanimoto coefficient over shingles?

Best,


On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[hidden email]> wrote:

>
> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>
>
>  Hello,
>> This came up before but - if we were to make a swear word filter, string
>> edit distances are no good. for example words like `shot` is confused with
>> `shit`. there is also problem with words like hitchcock. appearently i
>> need
>> something like soundex or double metaphone. the thing is - these are
>> language specific, and i am not operating in english.
>>
>> I need a fuzzy like curse word filter for turkish, simply.
>>
>
> You probably need to make a large list of words. I would try to learn from
> the users that do swear, perhaps even trust my users to report each other. I
> would probably also look at storing in what context the word is used,
> perhaps by adding the surrounding words (ngrams, shingles, markov chains).
> Compare "go to hell" and "when hell frezes over". The first is rather
> derogatory while the second doen't have to be bad at all.
>
> I'm thinking Hidden Markov Models and Neural Networks.
>
>
>          karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: string similarity measures

Karl Wettin

4 sep 2008 kl. 15.54 skrev Cam Bazz:

> yes, I already have a system for users reporting words. they fall on  
> an
> operator screen and if operator approves, or if 3 other people  
> marked it as
> curse, then it is filtered.
> in the other thread you wrote:
>
>> I would create 1-5 ngram sized shingles and measure the distance  
>> using
> Tanimoto coefficient. That would probably work out just fine. ?>You  
> might
> want to add more weight the greater the size of the shingle.
>>
>> There are shingle filters in lucene/java/contrib/analyzers and  
>> there is a
> Tanimoto distance in lucene/mahout/.
>
> would that apply to my case? tanimoto coefficient over shingles?

Not really, no.


      karl


>
>
> Best,
>
>
> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[hidden email]>  
> wrote:
>
>>
>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>
>>
>> Hello,
>>> This came up before but - if we were to make a swear word filter,  
>>> string
>>> edit distances are no good. for example words like `shot` is  
>>> confused with
>>> `shit`. there is also problem with words like hitchcock.  
>>> appearently i
>>> need
>>> something like soundex or double metaphone. the thing is - these are
>>> language specific, and i am not operating in english.
>>>
>>> I need a fuzzy like curse word filter for turkish, simply.
>>>
>>
>> You probably need to make a large list of words. I would try to  
>> learn from
>> the users that do swear, perhaps even trust my users to report each  
>> other. I
>> would probably also look at storing in what context the word is used,
>> perhaps by adding the surrounding words (ngrams, shingles, markov  
>> chains).
>> Compare "go to hell" and "when hell frezes over". The first is rather
>> derogatory while the second doen't have to be bad at all.
>>
>> I'm thinking Hidden Markov Models and Neural Networks.
>>
>>
>>         karl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: string similarity measures

Cam Bazz
let me rephrase the problem. I already have a set of bad words. I want to
avoid people inputting typos of the bad words.
for example 'shit' is banned, but someone may enter sh1t.

how can i flag those phonetically similar bad words to the marked bad words?

Best.

On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <[hidden email]> wrote:

>
> 4 sep 2008 kl. 15.54 skrev Cam Bazz:
>
>  yes, I already have a system for users reporting words. they fall on an
>> operator screen and if operator approves, or if 3 other people marked it
>> as
>> curse, then it is filtered.
>> in the other thread you wrote:
>>
>>  I would create 1-5 ngram sized shingles and measure the distance using
>>>
>> Tanimoto coefficient. That would probably work out just fine. ?>You might
>> want to add more weight the greater the size of the shingle.
>>
>>>
>>> There are shingle filters in lucene/java/contrib/analyzers and there is a
>>>
>> Tanimoto distance in lucene/mahout/.
>>
>> would that apply to my case? tanimoto coefficient over shingles?
>>
>
> Not really, no.
>
>
>     karl
>
>
>
>
>>
>> Best,
>>
>>
>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[hidden email]>
>> wrote:
>>
>>
>>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>>
>>>
>>> Hello,
>>>
>>>> This came up before but - if we were to make a swear word filter, string
>>>> edit distances are no good. for example words like `shot` is confused
>>>> with
>>>> `shit`. there is also problem with words like hitchcock. appearently i
>>>> need
>>>> something like soundex or double metaphone. the thing is - these are
>>>> language specific, and i am not operating in english.
>>>>
>>>> I need a fuzzy like curse word filter for turkish, simply.
>>>>
>>>>
>>> You probably need to make a large list of words. I would try to learn
>>> from
>>> the users that do swear, perhaps even trust my users to report each
>>> other. I
>>> would probably also look at storing in what context the word is used,
>>> perhaps by adding the surrounding words (ngrams, shingles, markov
>>> chains).
>>> Compare "go to hell" and "when hell frezes over". The first is rather
>>> derogatory while the second doen't have to be bad at all.
>>>
>>> I'm thinking Hidden Markov Models and Neural Networks.
>>>
>>>
>>>        karl
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: string similarity measures

Mathieu Lecarme

I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA.

On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <[hidden email]> wrote:

> let me rephrase the problem. I already have a set of bad words. I want to
> avoid people inputting typos of the bad words.
> for example 'shit' is banned, but someone may enter sh1t.
>
> how can i flag those phonetically similar bad words to the marked bad
> words?
>
> Best.
>
> On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <[hidden email]> wrote:
>
>>
>> 4 sep 2008 kl. 15.54 skrev Cam Bazz:
>>
>>  yes, I already have a system for users reporting words. they fall on an
>>> operator screen and if operator approves, or if 3 other people marked
> it
>>> as
>>> curse, then it is filtered.
>>> in the other thread you wrote:
>>>
>>>  I would create 1-5 ngram sized shingles and measure the distance using
>>>>
>>> Tanimoto coefficient. That would probably work out just fine. ?>You
> might
>>> want to add more weight the greater the size of the shingle.
>>>
>>>>
>>>> There are shingle filters in lucene/java/contrib/analyzers and there
> is a
>>>>
>>> Tanimoto distance in lucene/mahout/.
>>>
>>> would that apply to my case? tanimoto coefficient over shingles?
>>>
>>
>> Not really, no.
>>
>>
>>     karl
>>
>>
>>
>>
>>>
>>> Best,
>>>
>>>
>>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[hidden email]>
>>> wrote:
>>>
>>>
>>>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>>>
>>>>
>>>> Hello,
>>>>
>>>>> This came up before but - if we were to make a swear word filter,
> string
>>>>> edit distances are no good. for example words like `shot` is confused
>>>>> with
>>>>> `shit`. there is also problem with words like hitchcock. appearently
> i
>>>>> need
>>>>> something like soundex or double metaphone. the thing is - these are
>>>>> language specific, and i am not operating in english.
>>>>>
>>>>> I need a fuzzy like curse word filter for turkish, simply.
>>>>>
>>>>>
>>>> You probably need to make a large list of words. I would try to learn
>>>> from
>>>> the users that do swear, perhaps even trust my users to report each
>>>> other. I
>>>> would probably also look at storing in what context the word is used,
>>>> perhaps by adding the surrounding words (ngrams, shingles, markov
>>>> chains).
>>>> Compare "go to hell" and "when hell frezes over". The first is rather
>>>> derogatory while the second doen't have to be bad at all.
>>>>
>>>> I'm thinking Hidden Markov Models and Neural Networks.
>>>>
>>>>
>>>>        karl
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]