Suggestion for short text matching using dictionary

climbingrose
Firstly, my apologies for being off topic. I'm asking this question because
I think there are some machine learning and text processing experts on this
mailing list.

Basically, my task is to normalize a fairly unstructured set of short texts
using a dictionary. We have a pre-defined list of products and periodically
receive product feeds from various websites. Our site is similar to a
shopping comparison engine, but in a different domain. We would like to
normalize the product names in the feeds using our pre-defined list.
For example:

"Nokia N95 8GB Black" ---> "Nokia N95 8GB"
"Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"

My original idea is to index the list of pre-defined names and then query
the index using the product's name from the feed. The highest-scoring
result would then be used as the normalized name.
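
Roughly, what I have in mind looks like this (a minimal Lucene sketch; field
names, analyzer choice and the exact API are just illustrative and vary by
version, exception handling omitted):

// imports: org.apache.lucene.analysis.*, org.apache.lucene.analysis.standard.*,
//          org.apache.lucene.document.*, org.apache.lucene.index.*,
//          org.apache.lucene.queryparser.classic.*, org.apache.lucene.search.*,
//          org.apache.lucene.store.*
Directory dir = new ByteBuffersDirectory();            // RAMDirectory in older Lucene versions
Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
for (String canonical : canonicalNames) {               // our pre-defined product list
    Document doc = new Document();
    doc.add(new TextField("name", canonical, Field.Store.YES));
    writer.addDocument(doc);
}
writer.close();

IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
QueryParser parser = new QueryParser("name", analyzer);
Query query = parser.parse(QueryParser.escape("Black Nokia N95, 8GB + Free bluetooth headset"));
TopDocs hits = searcher.search(query, 1);
// Use the highest-scoring canonical name as the normalized product name.
String normalized = hits.scoreDocs.length > 0
        ? searcher.doc(hits.scoreDocs[0].doc).get("name")
        : null;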

The problem with this is sometimes you get wrong matches because of noise.
For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match
"Nokia Bluetooth Headset" which is desirable.

Is there a better solution for this problem? Thanks in advance.

--
Regards,

Cuong Hoang

Re: Suggestion for short text matching using dictionary

Grant Ingersoll-2
below


On Jun 27, 2008, at 1:18 AM, climbingrose wrote:

> [...]
> For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match
> "Nokia Bluetooth Headset" which is desirable.


I assume you mean "not desirable" here given the context...

Your approach is worth trying.  At a deeper level, you may want to look
into a topic called "record linkage" and an open source project called
Second String by William Cohen's group at Carnegie Mellon
(http://secondstring.sourceforge.net/), which has a whole bunch of
implementations of fuzzy string matching algorithms like Jaro-Winkler,
Levenshtein, etc. that can then be used to implement what you are after.

You could potentially use the spell checking functionality to simulate
some of this a bit better than just a pure vector match.  Index your
dictionary into a spelling index (see SOLR-572) and then send in spell
checking queries.  In fact, you could probably integrate Second String
into the spell checker pretty easily, since one can now plug the
distance measure into the spell checker.
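
A rough sketch of that (exact class names and signatures move around
between Lucene/Solr versions, so treat this as illustrative):

// imports: org.apache.lucene.search.spell.*, org.apache.lucene.store.*,
//          org.apache.lucene.index.IndexWriterConfig,
//          org.apache.lucene.analysis.standard.StandardAnalyzer, java.nio.file.Paths
Directory spellDir = FSDirectory.open(Paths.get("/path/to/spellindex"));  // path is illustrative
SpellChecker checker = new SpellChecker(spellDir);
checker.setStringDistance(new JaroWinklerDistance());   // the pluggable distance measure
// products.txt: one canonical product name per line (your pre-defined list)
checker.indexDictionary(new PlainTextDictionary(Paths.get("products.txt")),
        new IndexWriterConfig(new StandardAnalyzer()), true);
String[] suggestions = checker.suggestSimilar("Black Nokia N95 8GB", 5);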

You may find some help on this by searching http://lucene.markmail.org 
for things like "record linkage" or "record matching" or various other  
related terms.

Another option is to write up a NormalizingTokenFilter that analyzes  
the tokens as they come in to see if they match your dictionary list.
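
Something along these lines (just a sketch using the current TokenStream
API; single-token lookups won't catch multi-word names, so a real version
would probably need shingles or phrase-level matching):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class NormalizingTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Map<String, String> dictionary;   // raw term -> canonical term

  public NormalizingTokenFilter(TokenStream input, Map<String, String> dictionary) {
    super(input);
    this.dictionary = dictionary;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // If the incoming token is in the dictionary, rewrite it to its canonical form.
    String canonical = dictionary.get(termAtt.toString());
    if (canonical != null) {
      termAtt.setEmpty().append(canonical);
    }
    return true;
  }
}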

As with all of these, there is going to be some trial and error here  
to come up with something that hits most of the time, as it will never  
be perfect.

Good luck,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








Re: Suggestion for short text matching using dictionary

climbingrose
Thanks Grant. I did try SecondString before and found that it wasn't
particularly good for doing a lot of text matching. I'm leaning toward a
combination of Lucene and SecondString. Googling around a bit, I came across
this project: http://datamining.anu.edu.au/projects/linkage.html. It looks
interesting, but the implementation is in Python. I think they use a
Hidden Markov Model to label training data and then match records
probabilistically.
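
The combination I'm thinking of is roughly: pull a handful of candidates
from the Lucene index, then re-score them with a string distance and keep
the best one. A sketch, reusing the searcher/parser from my first mail and
Lucene's JaroWinklerDistance as a stand-in for a SecondString metric:

// imports: org.apache.lucene.search.spell.JaroWinklerDistance,
//          org.apache.lucene.search.spell.StringDistance
StringDistance distance = new JaroWinklerDistance();
String rawName = "Black Nokia N95, 8GB + Free bluetooth headset";
TopDocs candidates = searcher.search(parser.parse(QueryParser.escape(rawName)), 10);
String best = null;
float bestScore = -1f;
for (ScoreDoc sd : candidates.scoreDocs) {
    String candidate = searcher.doc(sd.doc).get("name");
    float score = distance.getDistance(rawName.toLowerCase(), candidate.toLowerCase());
    if (score > bestScore) {
        bestScore = score;
        best = candidate;
    }
}
// Only accept the normalization if bestScore clears a threshold; otherwise leave it unmatched.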



--
Regards,

Cuong Hoang

Re: Suggestion for short text matching using dictionary

Grant Ingersoll-2
Yeah, I don't know if SecondString scales.  Note that Lucene now has an
implementation of Jaro-Winkler, which is a pretty good distance measure,
so you may want to give that a try; and if you see speedups, feel free
to contribute a patch ;-)
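
Usage is basically one call (org.apache.lucene.search.spell.JaroWinklerDistance;
sketch only):

StringDistance jw = new JaroWinklerDistance();
// Similarity in [0,1]; compare the feed name against each canonical name and keep the best.
float sim = jw.getDistance("nokia n95 8gb", "black nokia n95, 8gb + free bluetooth headset");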

I'm wondering if Hadoop couldn't help w/ the scale problem.  Perhaps
you might ask over on the Mahout user list, too, as there are a fair
number of text geeks over there.  Might be interesting to think about
a contribution in this department.  Also, Tom and I are likely to have
a chapter on this topic (and other "string"-related issues) in "Taming
Text" (Manning), but that one isn't written yet (shameless plug, I
know, sorry!).  Tom definitely knows more on this particular topic than
I do.

-Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ