How to tune Analyzer for Text Extraction

How to tune Analyzer for Text Extraction

xs2Abhishek
Hi,

I am trying to decide whether or not I can use Lucene for my requirements, which mainly involve data tagging. I have to be able to parse or index a .txt file and then extract text from it. For example, if the input document contains text like "Location: New York", I should be able to extract "New York" whenever the keyword "Location" is present. I am trying to learn about Lucene and looked into "tokensFromAnalysis(analyzer, text)", but I'm still not sure how I could extract data using Lucene. Can I use queries to extract this piece of information?

Any help on this would be appreciated.

Thanks,
Abhishek

Re: How to tune Analyzer for Text Extraction

Michael Wechner
Before feeding the content into Lucene, you might want to pre-parse your
content with Apache Tika or something similar.

Cheers

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: How to tune Analyzer for Text Extraction

Shai Erera
In reply to this post by xs2Abhishek
If this file has a predefined construct, e.g.:
title: something
location: new york
....
then you can write a simple parser that extracts that information.
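
A minimal sketch of such a parser, in plain Java with no Lucene involved. It assumes one "key: value" pair per line; the class name KeyValueParser and its parse method are hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: parses "key: value" lines.
class KeyValueParser {

    // Returns a map from lower-cased key to trimmed value, in input order.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no key before a colon: not a "key: value" line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String doc = "title: something\nlocation: new york\n";
        Map<String, String> fields = parse(doc);
        System.out.println("location = " + fields.get("location"));
    }
}
```

Only the first colon on a line is treated as the separator, so a value that itself contains a colon is kept whole.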

But I think otherwise this falls outside the scope of Lucene, unless I
misunderstood you.

If I had to give it a long shot though, I'd try to index all the data using
WhitespaceAnalyzer, and then query for "Location". I'd also use the
Highlighter in contrib to find matching segments of text, and take whatever
has come after "Location". You should know though how much to take after
Location ...
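
To make the "how much to take" question concrete, here is a rough plain-Java stand-in for that idea. It does not use Lucene or the Highlighter: it only splits on whitespace (as a whitespace-based analyzer would) and returns a fixed number of tokens after the keyword. The class TokensAfterKeyword and the maxTokens parameter are assumptions made for this sketch. Note that whitespace splitting leaves "Location:" as a single token, so the sketch strips trailing colons before comparing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "find the keyword, then take what follows".
class TokensAfterKeyword {

    // Returns up to maxTokens whitespace-separated tokens that follow
    // the first occurrence of keyword (case-sensitive).
    static List<String> after(String text, String keyword, int maxTokens) {
        String[] tokens = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // "Location:" and "Location" should both match the keyword.
            String bare = tokens[i].replaceAll(":+$", "");
            if (!bare.equals(keyword)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length && result.size() < maxTokens; j++) {
                if (tokens[j].equals(":")) {
                    continue; // skip a free-standing separator, as in "Location : X"
                }
                result.add(tokens[j]);
            }
            break; // only the first occurrence is used
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(after("Location: New York", "Location", 2));
    }
}
```

A fixed token count is the weak point: two tokens suits "New York" but would truncate a longer place name, which is exactly the open question raised above.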

Maybe if you post here a sample input, it'll trigger something in me :).

Shai

On Wed, Aug 12, 2009 at 12:27 AM, xs2Abhishek <[hidden email]> wrote:

>
> Hi,
>
> I am trying to decide whether or not I can use Lucene for my
> requirements, which mainly involve data tagging. I have to be able to parse
> or index a .txt file and then extract text from it. For example,
> if the input document contains text like "Location: New York", I should be
> able to extract "New York" whenever the keyword "Location" is
> present. I am trying to learn about Lucene and looked into
> "tokensFromAnalysis(analyzer, text)", but I'm still not sure how I could
> extract data using Lucene. Can I use queries to extract this piece of
> information?
>
> Any help on this would be appreciated.
>
> Thanks,
> Abhishek
> --
> View this message in context:
> http://www.nabble.com/How-to-tune-Analyzer-for-Text-Extraction-tp24926082p24926082.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: How to tune Analyzer for Text Extraction

Grant Ingersoll-2
In reply to this post by xs2Abhishek

You will likely need to write your own TokenFilter that can do the  
extraction.  It is feasible to plug in something like OpenNLP or other  
extraction toolkits into the Analysis stream and then provide these  
capabilities.  That, combined with the Tee/Sink Tokenizer/TokenFilter  
capabilities can make for some lightweight, but still powerful  
extraction capabilities.  You might also look at UIMA, which is in the  
Apache Incubator.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search



Re: How to tune Analyzer for Text Extraction

Julien Nioche-4
In reply to this post by xs2Abhishek
Hi,

You should also have a look at GATE (http://gate.ac.uk), which comes with a
NER application called ANNIE. You could use it to analyse your docs before
indexing them with Lucene or Solr.

As Grant mentioned, UIMA can also be used for that, as there are a number of
NER annotators available for it (OpenCalais, Stanford NER).

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com


Re: How to tune Analyzer for Text Extraction

xs2Abhishek

Hi,

Thanks for your replies; they really helped me a lot.

Thanks & Regards,
Abhishek

--
View this message in context: http://www.nabble.com/How-to-tune-Analyzer-for-Text-Extraction-tp24926082p24938899.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: How to tune Analyzer for Text Extraction

xs2Abhishek
In reply to this post by Shai Erera
Hi,

Well, you understood my problem completely. The point you mentioned about how much to extract after the word "Location" is something I'll have to figure out. Let's say the input to my system is:
"
Location : Montvale, NJ
Duration : 7 months
"
Now the problem is when the input changes to:
"
located in Montvale,NJ... For a duration of 7 months
"

Question 1) After reading about Lucene and some other APIs such as GATE, suggested in replies to this post, I'd like to know whether this kind of extraction is possible with Lucene.
Question 2) I have been reading some of the Lucene documentation, but I could not figure out a way to extract data from the index. For example, say I successfully found the word "location" in an indexed document: how can I extract the next 15 characters, or some text, after the word "location"?
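
For what it's worth, both inputs above can be handled with plain Java once the document text is available as a string (for example, read back from a stored field). The sketch below is only an illustration and does not use Lucene: the class LocationExtractor, its method names, and the regular expression are all assumptions, and the pattern covers only the two phrasings shown in this post.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts text that follows a location marker.
class LocationExtractor {

    // Matches "Location : <value>" or "located in <value>", ignoring case.
    // The value stops at the first period or at the end of the line.
    private static final Pattern LOCATION = Pattern.compile(
            "(?i)\\b(?:location[ \\t]*:|located\\s+in)[ \\t]*([^.\\r\\n]+)");

    static Optional<String> location(String text) {
        Matcher m = LOCATION.matcher(text);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Returns up to maxChars characters that follow the first
    // case-insensitive occurrence of keyword, or "" if it is absent.
    static String window(String text, String keyword, int maxChars) {
        Matcher m = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE)
                .matcher(text);
        if (!m.find()) {
            return "";
        }
        int from = m.end();
        int to = Math.min(text.length(), from + Math.max(0, maxChars));
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String labelled = "Location : Montvale, NJ\nDuration : 7 months";
        String freeText = "located in Montvale,NJ... For a duration of 7 months";
        System.out.println(location(labelled).orElse("(none)"));
        System.out.println(location(freeText).orElse("(none)"));
        System.out.println(window(labelled, "location", 15));
    }
}
```

Stopping at the first period is a deliberate simplification: it handles the "Montvale,NJ..." input but would cut a name such as "St. Louis" short, which is where a NER tool like the ones suggested in the replies becomes the better fit.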

Thanks & Regards,
Abhishek