Text Similarity

Aroop Ganguly-2
Hi Team

This is what I want to do:
1. I have 2 datasets with the schema id-number and company-name.
2. I ultimately want to link the 2 datasets (by a join or any other means) based on the similarity between their company-name fields.

Example:

Dataset 1
---------
Id | Company Name
---|-----------------------
1  | Aroop Inc
2  | Ganguly & Ganguly Corp


Dataset 2
---------
Yo Revenue | Company Name
-----------|--------------------
1K         | aroop and sons
2K         | Ganguly Corp
3K         | Ganguly and Ganguly
2K         | Aroop Inc.
6K         | Ganguly Corporation



I want to be able to get a join in the end, based on a smart similarity score between the company names in the 2 data sets.

Final Dataset
Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
---|------------------------|---------|------------------------------------|-----------------
1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%

How should I proceed? (I have already preprocessed both datasets: the names are lowercased, and non-essential words such as pronouns and abbreviations like LTD or Co. are removed.)

Thanks
Aroop
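
For readers following the thread, the join described in this post can be sketched outside Solr in plain Python. This is only an illustration under stated assumptions: the scorer is the standard library's difflib.SequenceMatcher, which nobody in the thread proposed, and the names normalize, similarity, fuzzy_join and the 0.6 threshold are made up for the example. The sample rows are the ones from the tables above.

```python
from difflib import SequenceMatcher

# Sample rows copied from the example tables in the post.
dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]

def normalize(name):
    """Lowercase, turn punctuation into spaces, collapse repeated spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())

def similarity(a, b):
    """difflib ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_join(left, right, threshold=0.6):
    """Attach to each left row its best-scoring right row, if above threshold.

    The threshold is an arbitrary value chosen for this sketch.
    """
    joined = []
    for row_id, name in left:
        revenue, match = max(right, key=lambda r: similarity(name, r[1]))
        score = similarity(name, match)
        if score >= threshold:
            joined.append((row_id, name, revenue, match, round(score, 2)))
    return joined

for row in fuzzy_join(dataset1, dataset2):
    print(row)
```

This compares every name in one dataset with every name in the other, so it only suits small inputs. Larger datasets would need some way to narrow the candidates first, which is where a search index comes in.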

Re: Text Similarity

Rahul Singh-3
How do you define similarity? There are various methods, and each works for different use cases. In Solr, depending on which index-time analyzer/tokenizer you are using, one company name will be treated as similar in one scenario and not in another.
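
To make the point about tokenization concrete, here is a small Python sketch. It does not use Solr or any Solr analyzer. The functions word_tokens, char_ngrams and jaccard are hypothetical helpers that only imitate, in spirit, a word-level tokenizer and a character n-gram tokenizer, scored with Jaccard overlap.

```python
def word_tokens(name):
    """Whitespace tokens, roughly what a word-level tokenizer would emit."""
    return set(name.lower().split())

def char_ngrams(name, n=3):
    """Overlapping character n-grams, roughly what an n-gram tokenizer emits."""
    text = name.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

left, right = "Ganguly Corp", "Ganguly Corporation"
word_score = jaccard(word_tokens(left), word_tokens(right))
ngram_score = jaccard(char_ngrams(left), char_ngrams(right))

# The same pair of names scores differently under the two tokenizations,
# because "corp" and "corporation" are different words but share n-grams.
print(f"word tokens:  {word_score:.2f}")
print(f"char 3-grams: {ngram_score:.2f}")
```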

This seems like a case of data deduplication. I'm pretty sure the join works only on exact matches.

Consider creating an "identity" collection where you map the different names to a unique identity key. Each of the two datasets could then be joined to that collection, and the results joined again on the key.
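
One way to read the "identity" collection idea, sketched in plain Python rather than Solr: the identity dict stands in for the collection, and its keys ("AROOP", "GANGULY") and the helper names are invented for illustration. How the mapping gets populated (manual curation or a deduplication pass) is left open here.

```python
# Hypothetical identity collection: name variant -> unique identity key.
identity = {
    "aroop inc": "AROOP",
    "aroop inc.": "AROOP",
    "ganguly & ganguly corp": "GANGULY",
    "ganguly and ganguly": "GANGULY",
}

dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [("3K", "Ganguly and Ganguly"), ("2K", "Aroop Inc."), ("1K", "aroop and sons")]

def key_for(name):
    """Return the identity key for a name, or None if the variant is unknown."""
    return identity.get(name.lower().strip())

def join_on_identity(left, right):
    """Exact-match join of the two datasets on the shared identity key."""
    by_key = {}
    for revenue, name in right:
        key = key_for(name)
        if key is not None:
            by_key.setdefault(key, []).append((revenue, name))
    joined = []
    for row_id, name in left:
        # Names without an identity key match nothing.
        for revenue, match in by_key.get(key_for(name), []):
            joined.append((row_id, name, revenue, match))
    return joined

print(join_on_identity(dataset1, dataset2))
```

Once every variant carries a key, the join itself is exact, which fits a join that only works on exact matches.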

Rahul
On Jul 11, 2018, 4:42 PM -0400, Aroop Ganguly <[hidden email]>, wrote:

> [...]

Re: Text Similarity

Aroop Ganguly-2
Thanks for your answer, Rahul. I think I have explained similarity with the example, assuming the natural order.
I would assume this is a common task for people who use Solr and build search-based systems.
I am basically looking for any design patterns that people use to achieve the results shown in the example from my original message.

Please do not take "join" too literally. It has to be a smart join, and I think your approach seems like a step towards vectorizing each name. Thanks.
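
As a rough illustration of what "vectorizing each name" could look like, the Python sketch below turns each name into a sparse vector of character bigram counts and ranks candidates by cosine similarity. The choice of bigrams, the padding with spaces, and the names vectorize and cosine are assumptions made for this example, not something proposed in the thread.

```python
import math
from collections import Counter

def vectorize(name, n=2):
    """Sparse vector of character n-gram counts for a lowercased name."""
    text = f" {name.lower().strip()} "  # pad so word edges form n-grams too
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = vectorize("Ganguly & Ganguly Corp")
candidates = ["aroop and sons", "Ganguly Corp", "Ganguly and Ganguly", "Ganguly Corporation"]

# Rank the candidate names from most to least similar to the query name.
ranked = sorted(candidates, key=lambda c: cosine(query, vectorize(c)), reverse=True)
print(ranked)
```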

Are there any other ways that people have tackled such problems?


> On Jul 15, 2018, at 2:51 PM, Rahul Singh <[hidden email]> wrote:
>
> [...]