Similarity percentage between two Strings


Similarity percentage between two Strings

Thiago Moreira-5

    Hey all,

    I want to know how similar two Strings are. The thing is: I'm processing a mailbox and I want to group all messages that have similar subjects. Does that make sense? I looked through the documentation but I couldn't find how to accomplish this. It isn't necessary to add the messages or the subjects to any kind of index. I'm using Lucene 2.3.2.

    Does anyone have any ideas?

    Thanks in advance.
--

Thiago Moreira
Software Engineer

[hidden email]
Liferay, Inc.
Enterprise. Open Source. For Life.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Similarity percentage between two Strings

N Hira
I don't know how much of this is a Lucene problem, but -- as I'm sure  
you will inevitably hear from others on the list -- it depends on  
what your definition of "similar" is.

By similar, do you mean:
1.  Identical, except for variations in case (upper/lower)
2.  Allow 1., but also allow prefixes/suffixes (e.g., "FW:  " or "... (summary)")
3.  Allow 1., 2. and permit some new terms ... how many?
4.  Allow all of the above and allow some changes to terms using  
stemming (E.g., "Google releases Chrome" is similar to "Google  
announces the release of its new Chrome web browser")
....

I'm sure you see where this is going.  So ... how do you define similar?

Good luck!

-h
----------------------------------------------------------------------
Hira, N.R.
Cognocys, Inc.


Re: Similarity percentage between two Strings

Thiago Moreira-5
In reply to this post by Thiago Moreira-5

    Well, the definition of "similar" that I'm looking for is number 2, maybe number 3, but number 2 is enough to start with. If you guys think this is not a Lucene problem, what other tool can I use to implement this requirement?
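
    As a rough illustration of definitions 1 and 2 from the list, the sketch below folds case and strips leading reply/forward markers before two subjects are compared. It assumes Java 8 or later; the class name SubjectNormalizer and the set of recognised prefixes ("Re:", "FW:", "Fwd:") are assumptions made for the example, and suffixes such as "(summary)" are not handled.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class SubjectNormalizer {
    // One or more leading markers such as "Re:", "FW:" or "Fwd:" (assumed list).
    private static final Pattern PREFIXES =
            Pattern.compile("^(?:\\s*(?:re|fw|fwd)\\s*:)+\\s*", Pattern.CASE_INSENSITIVE);

    static String normalize(String subject) {
        // Definition 2: drop reply/forward prefixes.
        String stripped = PREFIXES.matcher(subject).replaceFirst("");
        // Definition 1: ignore case; also collapse runs of whitespace.
        return stripped.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("FW:  Re: Quarterly Report"));
    }
}
```

    Two subjects can then be treated as similar when their normalized forms are equal.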

    Thanks

Thiago Moreira
Software Engineer

[hidden email]
Liferay, Inc.
Enterprise. Open Source. For Life.



Re: Similarity percentage between two Strings

N Hira
More details may change my opinion (not quite sure how others feel  
yet), but with the way you've described it so far, it seems like all  
you need is a basic string matcher:

For every message:
        - if <blah>message.subject<blah> is found in the pool, then this message is "similar to" the message in the pool
        - if no such message is found, then add this message to the pool and wait to find more like it
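
The pool idea above could be sketched roughly as follows. This is only one reading of it: the class name SubjectPool is hypothetical, the comparison is case-insensitive, and a substring match in either direction is assumed to count as "similar". Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a subject pool; not a Lucene class.
final class SubjectPool {
    // Representative subject (first one seen, lower-cased) -> subjects grouped under it.
    private final Map<String, List<String>> groups = new LinkedHashMap<>();

    // Assumption: a substring match in either direction means "similar".
    static boolean similar(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.equals(b); // an empty subject only matches another empty one
        }
        return a.contains(b) || b.contains(a);
    }

    void add(String subject) {
        String key = subject.trim().toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> entry : groups.entrySet()) {
            if (similar(entry.getKey(), key)) {
                entry.getValue().add(subject); // found in the pool
                return;
            }
        }
        // Not found: add it to the pool and wait for more like it.
        List<String> members = new ArrayList<>();
        members.add(subject);
        groups.put(key, members);
    }

    Map<String, List<String>> groups() {
        return groups;
    }

    public static void main(String[] args) {
        SubjectPool pool = new SubjectPool();
        pool.add("Budget meeting");
        pool.add("FW: Budget meeting");
        pool.add("Lunch on Friday");
        System.out.println(pool.groups());
    }
}
```

Each new subject is compared against every representative, so the cost grows with the number of groups; for a single mailbox that is probably acceptable.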

Please describe the problem in greater detail if this seems too  
simplistic.

-h


Re: Similarity percentage between two Strings

Ian Lea
In reply to this post by Thiago Moreira-5
Googling for "java string similarity" throws up some stuff you might
find useful.


--
Ian.



Re: Similarity percentage between two Strings

Karl Wettin
In reply to this post by Thiago Moreira-5
I would create shingles (n-grams) of size 1 to 5 and measure the distance  
using the Tanimoto coefficient. That would probably work out just fine. You  
might want to give a shingle more weight the larger it is.

There are shingle filters in lucene/java/contrib/analyzers and there  
is a Tanimoto distance in lucene/mahout/.
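
To make the idea concrete, here is a plain-Java sketch that does not use the Lucene shingle filters or the Mahout distance class. It assumes word-level shingles, treats the Tanimoto coefficient over sets as |A ∩ B| / |A ∪ B|, and leaves out the per-size weighting. The class and method names are hypothetical, and Java 8 or later is assumed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; independent of Lucene and Mahout.
final class ShingleSimilarity {
    // Word shingles of sizes minSize..maxSize (minSize is assumed to be >= 1).
    static Set<String> shingles(String text, int minSize, int maxSize) {
        Set<String> result = new HashSet<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return result;
    }

    // Tanimoto coefficient of two sets: shared elements / all distinct elements.
    static double tanimoto(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen here: two empty inputs are identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        Set<String> s1 = shingles("Google releases Chrome", 1, 5);
        Set<String> s2 = shingles("FW: Google releases Chrome", 1, 5);
        System.out.println(tanimoto(s1, s2));
    }
}
```

A result of 1.0 means the shingle sets are identical and 0.0 means they share nothing; the distance is one minus that value.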

Feel free to report back on how well it works.


      karl


Re: Similarity percentage between two Strings

Thiago Moreira-5
In reply to this post by Thiago Moreira-5

    For those interested in my solution, I used this article as the basis
for implementing the requirements.

    http://www.catalysoft.com/articles/StrikeAMatch.html

    Thanks.
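
    For readers who cannot reach the article: the approach it describes compares the adjacent letter pairs of two strings and scores them as twice the number of shared pairs divided by the total number of pairs. The sketch below is a reconstruction of that idea under those assumptions, not the code actually used for the mailbox; the class and method names are illustrative, and Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative reconstruction of a letter-pair similarity measure.
final class LetterPairSimilarity {
    // Adjacent letter pairs of every whitespace-separated word,
    // e.g. "FRANCE" -> FR, RA, AN, NC, CE.
    static List<String> wordLetterPairs(String text) {
        List<String> pairs = new ArrayList<>();
        for (String word : text.trim().toUpperCase(Locale.ROOT).split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    // Returns a value between 0.0 (no shared pairs) and 1.0 (same pairs).
    static double compare(String s1, String s2) {
        List<String> pairs1 = wordLetterPairs(s1);
        List<String> pairs2 = wordLetterPairs(s2);
        int total = pairs1.size() + pairs2.size();
        if (total == 0) {
            return 0.0; // nothing to compare (choice made for this sketch)
        }
        int shared = 0;
        for (String pair : pairs1) {
            // remove() ensures each pair in pairs2 is matched at most once.
            if (pairs2.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    public static void main(String[] args) {
        System.out.println(compare("France", "French")); // prints 0.4
    }
}
```

    Multiplying the result by 100 gives a similarity percentage, and subjects can be grouped when the score exceeds a threshold chosen by experiment.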

