Require some advice

Require some advice

Pavan Gupta
Hi,
I am new to text search and mining and have been researching the
available products. My application requires reading an SMS message
(unstructured) and extracting entities such as person name, area, zip,
city and the skills associated with the person. The SMS would be free
text. The parsed data would be stored in a database and used by Solr to
display results.
An SMS message could be in the following form:
"John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
We need to interpret it in the following manner:
first name -> John
last name -> Mayer
city -> Mumbai
zip -> 411004
area -> Juhu
skills -> car driver, body guard


1. Is Solr capable of handling this application, considering that the SMS
messages would be unstructured?
2. How does Solr/Lucene compare to other tools such as UIMA, GATE, CER
(Stanford University) and LingPipe?
3. Is Solr only for text search, or can it be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE?

There are companies that specialize in making sense of
unstructured SMS messages. Is there something similar in the open source
world? Can we extend Solr for the same purpose?

Your reply would be appreciated.
Thanking you.
Regards,
Pavan

RE: Require some advice

Michael Griffiths
Solr is a search engine, not an entity extraction tool.

While there are some decent open source entity extraction tools, they are focused on processing sentences and paragraphs. The structural differences in text messages mean you'd need to do a fair amount of work to get decent entity extraction.

That said, you may want to look into simple word/phrase matching if your domain is sufficiently small. Use RegEx to extract ZIP, use dictionaries to extract city/area, skills, and names. Much simpler and cheaper.
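
The regex-plus-dictionary approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionaries (`CITIES`, `AREAS`, `SKILLS`) are tiny hypothetical stand-ins for curated lists, the ZIP pattern assumes six-digit codes as in the sample message, and the name is naively taken to be whatever precedes the first known city or area.

```python
import re

# Hypothetical dictionaries; a real system would load curated lists instead.
CITIES = {"mumbai", "pune", "delhi"}
AREAS = {"juhu", "andheri", "bandra"}
SKILLS = {"car driver", "body guard", "cook"}

# Assumption: ZIP codes are exactly six digits, as in the sample message.
ZIP_RE = re.compile(r"\b\d{6}\b")


def extract(sms):
    """Return a dict of fields pulled from a free-text SMS."""
    lowered = sms.lower()
    zip_match = ZIP_RE.search(sms)

    # Alphabetic tokens, used for single-word dictionary lookups.
    tokens = re.findall(r"[A-Za-z]+", sms)
    city = next((t for t in tokens if t.lower() in CITIES), None)
    area = next((t for t in tokens if t.lower() in AREAS), None)

    # Skills can be multi-word, so match them as phrases (plain substring
    # match) and report them in order of appearance.
    skills = sorted((s for s in SKILLS if s in lowered), key=lowered.index)

    # Naive assumption: the name is every token before the first city/area.
    name = []
    for token in tokens:
        if token.lower() in CITIES or token.lower() in AREAS:
            break
        name.append(token)

    return {
        "first_name": name[0] if name else None,
        "last_name": " ".join(name[1:]) or None,
        "city": city,
        "zip": zip_match.group(0) if zip_match else None,
        "area": area,
        "skills": skills,
    }


if __name__ == "__main__":
    sample = "John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
    print(extract(sample))
```

The name heuristic is the weakest part: it breaks as soon as a message does not start with the name or contains no known city or area, which is where a name dictionary would come in.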

RE: Require some advice

Nagelberg, Kallin
Try this,

http://viewer.opencalais.com/

They have an open API for that. With your text message:

"John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
 
It gives back:

People: John Mayer Mumbai
Positions: body guard, car driver.

It's not perfect, but it's not bad either.

Regards,
Kallin Nagelberg

Re: Require some advice

Tommaso Teofili
In reply to this post by Michael Griffiths
Hi Pavan,
you may want to plug in UIMA as a custom UpdateRequestProcessor [1] while
indexing data (I am working on such a use case). This way you could extract
entities and add them either as dynamicFields or as predefined (fixed) fields.
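
To make the dynamicFields idea concrete, here is a rough sketch (in Python, independent of UIMA) of how extracted entities might be mapped onto field names before a document is sent to Solr. The helper `to_solr_doc` and the `text` field are hypothetical, and the `_s` / `_ss` suffixes assume the schema defines matching dynamicField patterns for single- and multi-valued strings; adjust them to whatever your schema declares.

```python
import json


def to_solr_doc(doc_id, sms, entities):
    """Map extracted entities onto Solr-style field names.

    Assumption: the schema has dynamicField patterns "*_s" (single string)
    and "*_ss" (multi-valued string). These suffixes are a convention only.
    """
    doc = {"id": doc_id, "text": sms}
    for key, value in entities.items():
        if value is None:
            continue  # skip entities that were not found
        if isinstance(value, list):
            doc[key + "_ss"] = value
        else:
            doc[key + "_s"] = value
    return doc


if __name__ == "__main__":
    entities = {
        "city": "Mumbai",
        "zip": "411004",
        "area": "Juhu",
        "skills": ["car driver", "body guard"],
    }
    doc = to_solr_doc("sms-1", "John Mayer Mumbai 411004 Juhu, car driver", entities)
    # The serialized list is what an indexing client would send to Solr.
    print(json.dumps([doc], indent=2))
```

With predefined (fixed) fields the mapping would instead use the exact field names declared in the schema, which makes queries simpler but requires a schema change for each new entity type.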

2010/8/12 Michael Griffiths <[hidden email]>

>
> While there are some decent open source entity extraction tools, they are
> focused on processing sentences and paragraphs. The structural differences
> in text messages means you'd need to do a fair amount of work to get decent
> entity extraction.
>
> That said, you may want to look into simple word/phrase matching if your
> domain is sufficiently small. Use RegEx to extract ZIP, use dictionaries to
> extract city/area, skills, and names. Much simpler and cheaper.
>
>
>
In UIMA there are some components that may be useful for the above cases
(DictionaryAnnotator, ConceptMapper, Tagger, RegExAnnotator [2]). However, as
Michael pointed out, you have to consider the effort needed to understand,
use and eventually customize such components. UIMA is well suited to
large-scale collections of data and lets you work on a flexible, customizable
analysis pipeline that may change and be enriched in the future, but you
should evaluate carefully whether you really need it.


2010/8/12 Nagelberg, Kallin <[hidden email]>

> Try this,
>
> http://viewer.opencalais.com/


The OpenCalais service is wrapped as a UIMA analysis engine and may be
called inside a UIMA pipeline together with other components (see above) or
services (e.g. the UIMA-wrapped AlchemyAPI service [3]).
That said, this makes sense only if you are strongly focused on searching
over text and its extracted entities.
My 2 cents,
Tommaso

[1] : http://wiki.apache.org/solr/UpdateRequestProcessor
[2] : http://uima.apache.org/annotators.html
[3] : http://svn.apache.org/viewvc/uima/sandbox/trunk/AlchemyAPIAnnotator/