Entity extraction?

Charlie Jackson
During a recent sales pitch to my company by FAST, they mentioned entity
extraction. I'd never heard of it before, but they described it as
basically recognizing people/places/things in documents being indexed
and then being able to do faceting on this data at query time. Does
anything like this already exist in SOLR? If not, I'm not opposed to
developing it myself, but I could use some pointers on where to start.

 

Thanks,

- Charlie


Re: Entity extraction?

rossini
Solr can do simple faceted search like FAST, but entity extraction
requires other technologies. I do not know how FAST does it, but at the company
I work for (www.cortex-intelligence.com) we use a mix of statistical
and language-specific techniques to recognize and categorize entities in the
text. LingPipe is another tool (free) that does this too. If you would
like to see a simple demo: http://www.cortex-intelligence.com/tech/

Rossini


On Fri, Oct 24, 2008 at 6:18 PM, Charlie Jackson <[hidden email]> wrote:

> During a recent sales pitch to my company by FAST, they mentioned entity
> extraction. I'd never heard of it before, but they described it as
> basically recognizing people/places/things in documents being indexed
> and then being able to do faceting on this data at query time. Does
> anything like this already exist in SOLR? If not, I'm not opposed to
> developing it myself, but I could use some pointers on where to start.
>
>
>
> Thanks,
>
> - Charlie
>
>

Re: Entity extraction?

Rogerio Pereira
You can find more about this topic in this book, available on Amazon:
http://www.amazon.com/Building-Search-Applications-Lucene-Lingpipe/dp/0615204252/

2008/10/24 Rafael Rossini <[hidden email]>

> Solr can do simple faceted search like FAST, but entity extraction
> requires other technologies. I do not know how FAST does it, but at the
> company I work for (www.cortex-intelligence.com) we use a mix of statistical
> and language-specific techniques to recognize and categorize entities in the
> text. LingPipe is another tool (free) that does this too. If you would
> like to see a simple demo: http://www.cortex-intelligence.com/tech/
>
> Rossini



--
Regards,

Rogério (_rogerio_)

[Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]  [Twitter:
http://twitter.com/ararog]

"Make a difference! Help your country grow: do not hoard knowledge,
share it and learn more."
(http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

Re: Entity extraction?

Ryan McKinley
In reply to this post by Charlie Jackson
This is not something solr does currently...

It sounds like something that should be added to Mahout:
http://lucene.apache.org/mahout/


On Oct 24, 2008, at 4:18 PM, Charlie Jackson wrote:

> During a recent sales pitch to my company by FAST, they mentioned entity
> extraction. I'd never heard of it before, but they described it as
> basically recognizing people/places/things in documents being indexed
> and then being able to do faceting on this data at query time. Does
> anything like this already exist in SOLR? If not, I'm not opposed to
> developing it myself, but I could use some pointers on where to start.
>
>
>
> Thanks,
>
> - Charlie
>


RE: Entity extraction?

Charlie Jackson
In reply to this post by Rogerio Pereira
Thanks for the replies, guys, that gives me a good place to start looking.

- Charlie

-----Original Message-----
From: Rogerio Pereira [mailto:[hidden email]]
Sent: Friday, October 24, 2008 5:14 PM
To: [hidden email]
Subject: Re: Entity extraction?

You can find more about this topic in this book, available on Amazon:
http://www.amazon.com/Building-Search-Applications-Lucene-Lingpipe/dp/0615204252/


Re: Entity extraction?

Rogerio Pereira
In reply to this post by Ryan McKinley
I agree, Ryan, and I would like to see complete integration between Solr,
Nutch, Tika and Mahout in the future.

2008/10/24 Ryan McKinley <[hidden email]>

> This is not something solr does currently...
>
> It sounds like something that should be added to Mahout:
> http://lucene.apache.org/mahout/
>

Re: Entity extraction?

Julien Nioche-4
Hi,

Open source NLP platforms like GATE (http://gate.ac.uk) or Apache UIMA are
typically used for these types of tasks. GATE in particular comes with an
application called ANNIE which does Named Entity Recognition. OpenCalais
does that as well and should be easy to embed, but unlike UIMA- or GATE-based
applications it can't be tuned to do more specific things.

Depending on the architecture you have in mind, it could be worth
investigating Nutch and adding NER as a custom plugin; since NLP is often a
CPU-intensive task, you could leverage the scalability of Hadoop in Nutch.
There is a patch which allows delegating the indexing to Solr. As someone
else already said, these named entities could then be used as facets.

HTH

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2008/10/24 Rogerio Pereira <[hidden email]>

> I agree, Ryan, and I would like to see complete integration between Solr,
> Nutch, Tika and Mahout in the future.

Re: Entity extraction?

vaiju1981
Hi,

One can use the OpenNLP MaxEnt (maximum entropy) library to create their own
named-entity extractor.
I used it in one of the projects I did with Solr.

It is easy to integrate most NLP libraries with Solr. In our case, though,
named-entity extraction was embedded in our crawler, which populated
a field called entities in the database, which we then ingested into Solr
as yet another field.
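
A minimal sketch of the setup described above, under stated assumptions: the field name `entities` comes from the message, but the gazetteer lookup is only a hypothetical stand-in for a real extractor (OpenNLP, GATE, ...), and the host and path in the printed URL are placeholders. The code only builds a document and a facet query string; nothing is sent to Solr.

```python
import json
from urllib.parse import urlencode

# Hypothetical stand-in for a real extractor (OpenNLP, GATE, ...):
# a tiny gazetteer lookup, only for illustration.
KNOWN_ENTITIES = {"Netflix": "organization", "New Orleans": "place"}


def extract_entities(text):
    """Return the known entity names that occur in the text, sorted."""
    return sorted(name for name in KNOWN_ENTITIES if name in text)


def build_document(doc_id, text):
    """Build an index document with the extracted names as an extra field."""
    return {"id": doc_id, "text": text, "entities": extract_entities(text)}


def facet_query(field="entities", query="*:*"):
    """Build Solr query parameters that facet on the given field."""
    return urlencode({"q": query, "rows": 0, "facet": "true", "facet.field": field})


doc = build_document("doc-1", "Netflix presented at ApacheCon in New Orleans.")
print(json.dumps(doc))
# Placeholder host and path; adjust to the actual Solr installation.
print("http://localhost:8983/solr/select?" + facet_query())
```

For faceting to return whole names rather than individual words, the field would normally be declared as a multi-valued, untokenized string field in the schema.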

--Thanks and Regards
Vaijanath N. Rao


Re: Entity extraction?

Otis Gospodnetic-2
In reply to this post by Charlie Jackson
For the record, LingPipe is not free.  It's good, but it's not free.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Rafael Rossini <[hidden email]>
> To: [hidden email]
> Sent: Friday, October 24, 2008 6:08:14 PM
> Subject: Re: Entity extraction?
>
> Solr can do simple faceted search like FAST, but entity extraction
> requires other technologies. I do not know how FAST does it, but at the company
> I work for (www.cortex-intelligence.com) we use a mix of statistical
> and language-specific techniques to recognize and categorize entities in the
> text. LingPipe is another tool (free) that does this too. If you would
> like to see a simple demo: http://www.cortex-intelligence.com/tech/
>
> Rossini

RE: Entity extraction?

Charlie Jackson
True, though I may be able to convince the powers that be that it's worth the investment.

There are a number of open source or free tools listed on the Wikipedia entry for entity extraction (http://en.wikipedia.org/wiki/Named_entity_recognition#Open_source_or_free) -- does anyone have any experience with any of these?

____________________________________________
Charlie Jackson
312-873-6537
[hidden email]

-----Original Message-----
From: Otis Gospodnetic [mailto:[hidden email]]
Sent: Monday, October 27, 2008 10:23 AM
To: [hidden email]
Subject: Re: Entity extraction?

For the record, LingPipe is not free.  It's good, but it's not free.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Entity extraction?

Walter Underwood, Netflix
The vendor mentioned entity extraction, but that doesn't mean you need it.
Entity extraction is a pretty specific technology, and it has been a
money-losing product at many companies for many years, going back to
Xerox ThingFinder well over ten years ago.

My guess is that very few people really need entity extraction.

Using EE for automatic taxonomy generation is even harder to get right.
At best, that is a way to get a starter set of categories that you can
edit. You will not get a production quality taxonomy automatically.

wunder

On 10/27/08 8:31 AM, "Charlie Jackson" <[hidden email]> wrote:

> True, though I may be able to convince the powers that be that it's worth the
> investment.
>
> There are a number of open source or free tools listed on the Wikipedia entry
> for entity extraction
> (http://en.wikipedia.org/wiki/Named_entity_recognition#Open_source_or_free) --
> does anyone have any experience with any of these?
>
> ____________________________________________
> Charlie Jackson
> 312-873-6537
> [hidden email]

RE: Entity extraction?

Charlie Jackson
Yeah, when they first mentioned it, my initial thought was "cool, but we don't need it." However, some of the higher-ups in the company are saying we might want it at some point, so I've been asked to look into it. I'll be sure to let them know about the flaws in the concept. Thanks for that info.

____________________________________________
Charlie Jackson
[hidden email]


-----Original Message-----
From: Walter Underwood [mailto:[hidden email]]
Sent: Monday, October 27, 2008 11:17 AM
To: [hidden email]
Subject: Re: Entity extraction?

The vendor mentioned entity extraction, but that doesn't mean you need it.
Entity extraction is a pretty specific technology, and it has been a
money-losing product at many companies for many years, going back to
Xerox ThingFinder well over ten years ago.

My guess is that very few people really need entity extraction.

Using EE for automatic taxonomy generation is even harder to get right.
At best, that is a way to get a starter set of categories that you can
edit. You will not get a production quality taxonomy automatically.

wunder


Re: Entity extraction?

rossini
In reply to this post by Walter Underwood, Netflix
Well... IMHO that depends. One of the services we provide is "automatic
clipping", in which our client chooses 20 to 30 texts from the media of the
kind he would like to be kept aware of. Using classification algorithms, we
then keep him aware of every new text that matches his interests. We gained
about 10% in precision just by adding EE information to the algorithm.
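
To illustrate how entity information can help such a classifier, here is a minimal sketch and not the algorithm used in the service described above: it compares texts by cosine similarity over bag-of-words counts and optionally adds one feature per extracted entity. The `ENT=` prefix, the sample sentences and the entity lists are all assumptions made up for the example.

```python
from collections import Counter
import math


def features(tokens, entities=()):
    """Bag-of-words counts, plus one prefixed feature per extracted entity."""
    feats = Counter(t.lower() for t in tokens)
    feats.update("ENT=" + e for e in entities)
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


words = "apple shares rose today".split()
fruit_words = "apple harvest rose today".split()
company_words = "apple profit rose today".split()

# Words only: both candidates look equally similar to the client's sample.
print(cosine(features(words), features(fruit_words)),
      cosine(features(words), features(company_words)))

# With entity features (assumed extractor output), the company story ranks higher.
sample = features(words, entities=["Apple Inc."])
fruit = features(fruit_words)
company = features(company_words, entities=["Apple Inc."])
print(cosine(sample, fruit), cosine(sample, company))
```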

Rossini

On Mon, Oct 27, 2008 at 2:17 PM, Walter Underwood <[hidden email]> wrote:

> The vendor mentioned entity extraction, but that doesn't mean you need it.
> Entity extraction is a pretty specific technology, and it has been a
> money-losing product at many companies for many years, going back to
> Xerox ThingFinder well over ten years ago.
>
> My guess is that very few people really need entity extraction.
>
> Using EE for automatic taxonomy generation is even harder to get right.
> At best, that is a way to get a starter set of categories that you can
> edit. You will not get a production quality taxonomy automatically.
>
> wunder

Re: Entity extraction?

Walter Underwood, Netflix
In reply to this post by Charlie Jackson
Verity sold a lot of features based on "we might need it at some point."
Very few people deployed the advanced features. They just didn't need them.

wunder

On 10/27/08 9:27 AM, "Charlie Jackson" <[hidden email]> wrote:

> Yeah, when they first mentioned it, my initial thought was "cool, but we don't
> need it." However, some of the higher ups in the company are saying we might
> want it at some point, so I've been asked to look into it. I'll be sure to let
> them know about the flaws in the concept, thanks for that info.
>
> ____________________________________________
> Charlie Jackson
> [hidden email]

Re: Entity extraction?

Benson Margulies
Extractors are exactly as good as the data you have to train or
configure them with. An open source extractor platform may still
require you to come up with a rather large heap of data from
somewhere.

Not all the vendors of extractors lose money.

How useful NEE is for search is an ongoing question that depends on
what sort of data you are working with and what sort of precision
challenges most concern you.
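
One way to find out how good an extractor is on your own data is to hand-label a small sample and measure precision and recall against it. The sketch below assumes entities are represented as (text, type) pairs; the sample data is invented for illustration.

```python
def precision_recall(predicted, gold):
    """Compare predicted (text, type) pairs against a hand-labelled gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented sample: what a person labelled vs. what an extractor returned.
gold = {("Netflix", "organization"), ("Xerox", "organization"), ("Chicago", "place")}
predicted = {
    ("Netflix", "organization"),
    ("Xerox", "place"),        # right text, wrong type: counted as an error
    ("Chicago", "place"),
    ("Verity", "place"),       # not in the gold set at all
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```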


On Mon, Oct 27, 2008 at 12:34 PM, Walter Underwood
<[hidden email]> wrote:

> Verity sold a lot of features based on "we might need it at some point."
> Very few people deployed the advanced features. They just didn't need them.
>
> wunder

Re: Entity extraction?

Grant Ingersoll-2
In reply to this post by vaiju1981
Warning: shameless plug: Tom Morton and I have a chapter on NER and
OpenNLP (and Solr, for that matter) in our book "Taming Text" (Manning)
and the code will be open once we have a place to put it (hopefully soon).
In fact, you'll see us doing a lot of this kind of stuff w/ Solr and it
should all be coming back to Solr/Lucene/Mahout at some point (for
instance, see https://issues.apache.org/jira/browse/SOLR-769, as I'm sure
FAST told you they can do clustering, too!)
--end shameless plug ---

As for Mahout, NER  is a classification problem, and there are some  
tools in Mahout to do classification,  but nothing specifically  
targeted at NER at the moment.  Mahout, like Nutch, also takes  
advantage of Hadoop for scaling.  The combination of Mahout in Solr  
makes a lot of sense, IMO.
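
To make "NER is a classification problem" concrete, here is a minimal Python sketch, not taken from Mahout or OpenNLP. It assumes the common BIO labeling scheme (B-PER, I-PER, O, ...), where a classifier assigns one label per token and the labels are then decoded into entity spans. The feature names and the helper functions token_features and decode_bio are hypothetical and only for illustration; a real system would train a model (e.g. maximum entropy) on such features.

```python
from typing import Dict, List, Tuple


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Features a per-token classifier might see for token i (illustrative only)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn per-token BIO labels into (entity_text, entity_type) spans."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for tok, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open span, then open a new one of this type.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [tok], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the span that is currently open.
            current.append(tok)
        else:
            # "O" (outside): close any open span.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], ""
    if current:
        spans.append((" ".join(current), current_type))
    return spans
```

The classifier itself is deliberately left out: anything that maps the feature dictionary for each token to a BIO label would slot in between the two functions.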


On Oct 25, 2008, at 11:25 PM, Vaijanath N. Rao wrote:

> Hi,
>
> One can use the OpenNLP maximum entropy library to create their own
> named-entity extraction.
> I used it in one of the projects I did with Solr.
>
> It is easy to integrate most NLP libraries with Solr. In our case,
> named-entity extraction was embedded in our crawler, which would
> populate a field called entities in the database, which we would
> then ingest into Solr as yet another field.
>
> --Thanks and Regards
> Vaijanath N. Rao
>
> Julien Nioche wrote:
>> Hi,
>>
>> Open source NLP platforms like GATE (http://gate.ac.uk) or Apache UIMA are
>> typically used for these types of tasks. GATE in particular comes with an
>> application called ANNIE which does Named Entity Recognition. OpenCalais
>> does that as well and should be easy to embed, but it can't be tuned to do
>> more specific things, unlike UIMA- or GATE-based applications.
>>
>> Depending on the architecture you have in mind, it could be worth
>> investigating Nutch and adding the NER as a custom plugin; NLP being often a
>> CPU-intensive task, you could leverage the scalability of Hadoop in Nutch.
>> There is a patch which allows delegating the indexing to SOLR. As someone
>> else already said, these named entities could then be used as facets.
>>
>> HTH
>>
>> Julien
>>
>
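
The "entities as yet another field" approach quoted above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the documents are plain dictionaries standing in for what a crawler would produce, the multi-valued field name entities comes from the quoted post, and facet_counts and facet_query_params are hypothetical helpers. The query parameters themselves (facet, facet.field, facet.mincount, rows) are standard Solr faceting parameters.

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlencode


def facet_counts(docs: List[Dict[str, object]], field: str = "entities") -> Counter:
    """Count how many documents carry each value of a multi-valued field.

    This mimics, in memory, the per-value document counts a facet on that
    field would report.
    """
    counts: Counter = Counter()
    for doc in docs:
        # A document is counted once per distinct value, like a facet count.
        counts.update(set(doc.get(field, [])))
    return counts


def facet_query_params(q: str, field: str = "entities") -> str:
    """Build the query string for a facet-only request on the given field."""
    return urlencode(
        {
            "q": q,
            "rows": 0,  # only the facet counts are wanted, not the documents
            "facet": "true",
            "facet.field": field,
            "facet.mincount": 1,
        }
    )


# Documents as a crawler with embedded NER might emit them (made-up data).
docs = [
    {"id": "1", "entities": ["Solr", "FAST"]},
    {"id": "2", "entities": ["Solr"]},
    {"id": "3"},
]
```

With this layout the NER step stays outside Solr entirely; Solr only needs the field to be indexed and multi-valued so it can be faceted on at query time.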

--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










Re: Entity extraction?

Ryan McKinley

On Oct 27, 2008, at 6:10 PM, Grant Ingersoll wrote:

> Warning: shameless plug: Tom Morton and I have a chapter on NER and
> OpenNLP (and Solr, for that matter) in our book "Taming Text" (Manning)
> and the code will be open once we have a place to put it (hopefully
> soon). In fact, you'll see us doing a lot of this kind of stuff w/ Solr
> and it should all be coming back to Solr/Lucene/Mahout at some point (for
> instance, see https://issues.apache.org/jira/browse/SOLR-769, as I'm sure
> FAST told you they can do clustering, too!)
> --end shameless plug ---
>

That's great!

I just got the MEAP copy; it looks really good.
http://www.manning.com/ingersoll/


> As for Mahout, NER  is a classification problem, and there are some  
> tools in Mahout to do classification,  but nothing specifically  
> targeted at NER at the moment.  Mahout, like Nutch, also takes  
> advantage of Hadoop for scaling.  The combination of Mahout in Solr  
> makes a lot of sense, IMO.
>

Perhaps this is more appropriate to ask on the Mahout list, but...
when you say "Mahout, like Nutch, also takes advantage of Hadoop for
scaling", does that mean that much of Mahout requires Hadoop? Is it
possible to do smaller-scale problems on a simple setup and only
invoke Hadoop when required?

ryan




Re: Entity extraction?

Grant Ingersoll-2

On Oct 27, 2008, at 8:53 PM, Ryan McKinley wrote:

>
> On Oct 27, 2008, at 6:10 PM, Grant Ingersoll wrote:
>
>> Warning: shameless plug: Tom Morton and I have a chapter on NER and
>> OpenNLP (and Solr, for that matter) in our book "Taming Text" (Manning)
>> and the code will be open once we have a place to put it (hopefully
>> soon). In fact, you'll see us doing a lot of this kind of stuff w/ Solr
>> and it should all be coming back to Solr/Lucene/Mahout at some point (for
>> instance, see https://issues.apache.org/jira/browse/SOLR-769, as I'm sure
>> FAST told you they can do clustering, too!)
>> --end shameless plug ---
>>
>
> thats great!
>
> I just got the MEAP copy, it looks really good
> http://www.manning.com/ingersoll/

Thanks!

>
>
>
>> As for Mahout, NER  is a classification problem, and there are some  
>> tools in Mahout to do classification,  but nothing specifically  
>> targeted at NER at the moment.  Mahout, like Nutch, also takes  
>> advantage of Hadoop for scaling.  The combination of Mahout in Solr  
>> makes a lot of sense, IMO.
>>
>
> Perhaps this is more appropriate to ask on the mahout list, but...  
> when you say "Mahout, like Nutch, also takes advantage of Hadoop for  
> scaling", does that mean that much of Mahout requires hadoop?  Is it  
> possible to do smaller scale problems on a simple setup and only  
> invoke hadoop when required?

Yes, this is probably better asked on the Mahout list, but to answer your
question: most of the implementations so far require Hadoop, though it is
not a strict requirement. That said, it is fairly easy to run them
on a simple setup (i.e. a single node).