Beginner: Specific indexing


MilleB
Hi guys,

Fairly new to Lucene, and just finished reading Lucene in Action.

My problem is the following: in a mass of documents, I need to index only the ones that contain the following pattern(s):

<tag> <#1> <#2>

<tag> is a fixed list of words
<#x> are small numbers <100

My idea is to simply build a TokenFilter that will look for those... do I have it right?

Some side questions:
what if I want to index <tag> <#1> <#2> as keywords?
what if I also want to offer full-text search on the selected documents?
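
The pattern being described could also be matched with a plain regex before any Lucene analysis happens. A minimal sketch, where the tag list and the exact pattern shape are made-up placeholders for the real fixed word list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TripletFinder {
    // Hypothetical tag list -- substitute the real fixed list of words.
    static final Set<String> TAGS = new HashSet<>(Arrays.asList("alpha", "beta", "gamma"));

    // Word, whitespace, a 1-2 digit number, whitespace, another 1-2 digit number.
    static final Pattern TRIPLET =
        Pattern.compile("\\b(\\p{Alpha}+)\\s+(\\d{1,2})\\s+(\\d{1,2})\\b");

    // Return every <tag> <#1> <#2> occurrence whose tag is in the list.
    public static List<String[]> find(String text) {
        List<String[]> hits = new ArrayList<>();
        Matcher m = TRIPLET.matcher(text);
        while (m.find()) {
            if (TAGS.contains(m.group(1).toLowerCase())) {
                hits.add(new String[] { m.group(1), m.group(2), m.group(3) });
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String[]> hits = find("see alpha 12 34 in this memo, but not delta 5 6");
        System.out.println(hits.size());                    // 1 ("delta" is not a tag)
        System.out.println(String.join(" ", hits.get(0)));  // alpha 12 34
    }
}
```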

Thx for your help

Re: Beginner: Specific indexing

hossman
I may be misunderstanding your question, but I wouldn't attempt to tackle
this with a TokenFilter unless you want both the "tag" and the numbers to
appear in the same field.  I think what you want to do is first parse
whatever file format you are dealing with, then build Documents based on
the individual Fields.

A TokenFilter comes into play when you are analyzing individual Field
values.

But since I have very little understanding of your problem, and what you
are trying to achieve, I may be way off base.

: <tag> <#1> <#2>
:
: <tag> is a fixed list of words
: <#x> are small numbers <100
:
: My idea is to simply build a TokenFilter that will look for those... do I
: have it right ?
:
: Some side questions:
: what if I want to index <tag> <#1> <#2> as keywords ?
: what if I also want to give full text search on the select documents ?


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Beginner: Specific indexing

MilleB
OK, I was not clear enough.

I have documents in which I'm looking for 3 consecutive elements:
<string1> <#1> <#2> (string1 comes from a predefined list)

I want to disregard documents without this sequence and inverted-index those with
these markers... it looks to me that parsing won't do the job, since my
documents are unstructured and do not have a specific grammar. The triplets
are to be found in the body of the document (if any).

At the moment I have the impression that I need a double pass on
the document stream:
1. Pass 1: extract the triplets (with a TokenFilter???); if there are no
triplets, disregard the document.
2. Pass 2: index the document stream, with keywords for the found triplets and
text (StandardAnalyzer) for the actual body.
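
The two passes described above could be sketched in plain Java before Lucene gets involved. Note the pattern here is a simplification (it accepts any leading word rather than checking a fixed tag list), and the field names are illustrative assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassSketch {
    // Illustrative pattern: a word followed by two numbers below 100.
    static final Pattern TRIPLET =
        Pattern.compile("\\b(\\p{Alpha}+)\\s+(\\d{1,2})\\s+(\\d{1,2})\\b");

    // Pass 1: the gate -- does the document contain at least one triplet?
    public static boolean hasTriplet(String body) {
        return TRIPLET.matcher(body).find();
    }

    // Pass 2: collect the field values that would go into a Lucene Document --
    // a keyword-style field holding the triplets, plus the full body for
    // normal tokenized indexing.
    public static Map<String, String> buildFields(String body) {
        StringBuilder triplets = new StringBuilder();
        Matcher m = TRIPLET.matcher(body);
        while (m.find()) {
            if (triplets.length() > 0) triplets.append("; ");
            triplets.append(m.group(0));
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("triplets", triplets.toString());
        fields.put("body", body);
        return fields;
    }

    public static void main(String[] args) {
        String doc = "memo text alpha 12 34 more text";
        if (hasTriplet(doc)) {
            System.out.println(buildFields(doc).get("triplets")); // alpha 12 34
        }
    }
}
```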

An additional issue: I'm finding that the numbers might be in Roman
notation... any idea how to check whether a token could be a Roman numeral or just
another random string?
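
For the Roman-numeral question, one way is a strict regex over the standard subtractive form; a sketch limited to 1-99, matching the "<100" constraint described earlier:

```java
import java.util.regex.Pattern;

public class RomanCheck {
    // Strict standard form for 1-99:
    // tens place (XC|XL|L?X{0,3}) then units place (IX|IV|V?I{0,3}).
    private static final Pattern ROMAN_1_TO_99 =
        Pattern.compile("(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})");

    public static boolean couldBeRoman(String token) {
        // Both groups can match empty, so reject the empty string explicitly.
        return !token.isEmpty() && ROMAN_1_TO_99.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(couldBeRoman("XIV"));   // true  (14)
        System.out.println(couldBeRoman("IIII"));  // false (non-standard form)
    }
}
```

A token passing this test could still be an ordinary word ("XL", "I", ...), so in practice the check would be combined with the surrounding triplet context.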

-Ray-



Re: Beginner: Specific indexing

hossman

Honestly: your problem doesn't sound like a Lucene problem to me at all
... I would write custom code to check your files for the pattern you are
looking for.  If you find it *then* construct a Document object, and add
your 3 fields.  I probably wouldn't even use an analyzer.


-Hoss




Re: Beginner: Specific indexing

MilleB
I understand your point; I did not say it was a Lucene problem, but was
rather checking whether my intended design was correct... basically not.
Since I thought I would first break my stream into tokens to do my special
filtering, I thought I could do it in one step...

Interesting that you are not going to use an analyzer... what then? I'm
thinking of using JavaCC, because I somewhat oversimplified the 3-field
string structure, so I need a kind of small grammar for that.

Thx anyway for preventing me from doing odd things.





Re: Beginner: Specific indexing

hossman

: Interesting if you are not going to use an analyser... what then ? I'm
: thinking of using javacc, because I oversimplified somewhat the 3 field
: string structure, so I need a kind of small grammar for that.

Well, the specifics of "what else" is in your files are going to be the
biggest factor in deciding how to find the bits of info you need.

Let me try to put in perspective for you how your question sounded to me,
as someone unfamiliar with your specific problem.  The question sounded
equivalent to if someone had asked:


"I have a bunch of XML files, some of these XML files contain syntax that
looks like this...
   <property name="${keyword}" min="${x}" max="${y}" />
where ${x} and ${y} are small numbers, and ${keyword} is from a fixed list
of words.  My idea is to simply build a TokenFilter that will look for
those... do I have it right?"

...and I would say: "Not really.  Use an XML parser to parse your XML and
extract your structured data, then add them to your Lucene Document."

Your files may not be XML, but the basic premise is the same; use
whatever code makes the most sense to parse whatever file format you are
dealing with, given what you know about the files (not just the parts you
want, but the other parts as well).


Where an Analyzer might make sense is if you want to do processing on
those bits of data after you find them ... stemming your keywords, or
mapping them to synonyms, etc.
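
A minimal sketch of that kind of post-extraction processing -- normalizing an extracted keyword through a synonym table before indexing it. The table here is a made-up placeholder, not the project's real word list:

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordNormalizer {
    // Hypothetical synonym table: variant form -> canonical form.
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("sect", "section");
        SYNONYMS.put("chap", "chapter");
    }

    // Lowercase the keyword and map it to its canonical form if one exists.
    public static String normalize(String keyword) {
        String k = keyword.toLowerCase();
        return SYNONYMS.getOrDefault(k, k);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Chap")); // chapter
        System.out.println(normalize("memo")); // memo
    }
}
```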


-Hoss




Re: Beginner: Specific indexing

MilleB
I think I'm getting you. But the files I'm going to parse have many formats:
PDF, HTML, Word.
They don't have a particular structure; memos, if you will. But the ones I'm
interested in will have the triplets I described.

Yes, building a TokenFilter as you suggest should do the job.
I guess my initial idea was to use Lucene to give me the token stream so I
could run this specific filter on it.
But I want to do two contradictory things on the same token stream:
+ a low-band filter which only gives the "triplets" (actually I do want
to add synonyms to the keywords too, you are bringing up a good point, but
maybe I could handle it in the query)
+ a high-band filter which indexes the text as normal/general text
(StandardAnalyzer from Lucene)

I just have to pass the token stream twice.
Thx a lot for helping me clarify my ideas.



Re: Beginner: Specific indexing

hossman

: I think I'm getting you. But the files I'm  going to parse have many formats
: : PDF, HTML, Word.
: they don't have a particular structure, memos if you will. But the ones I'm
: interested in will have the triplets I described

Ahhhh...  see, this is something I completely didn't realize.  "Lucene" as
a library really doesn't provide any sort of mechanism for doing text
extraction from unknown file formats ... With some small exceptions (like
the HTMLStripTokenizer in Solr), the TokenStream concept is much more about
finding "tokens" in a stream of plain text -- not about finding "text"
in arbitrary (possibly binary) files.

You'll probably want to check out the Tika subproject...
    http://incubator.apache.org/tika/
...or some of the various "How do I index _____ documents?" FAQs...
    http://wiki.apache.org/lucene-java/LuceneFAQ


-Hoss




Re: Beginner: Specific indexing

MilleB
Well, that is well explained in "Lucene in Action": if you want to search
files you have to build a file parser, and there is a good example given. So
that's not really my problem.

But I thought I could go through the token stream only once, whereas I have to go
twice: 1. for detecting my triplets, 2. for indexing the text.

-Raymond-


RE: Beginner: Specific indexing

steve_rowe
Hi Raymond,

Check out SinkTokenizer/TeeTokenFilter:
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/TeeTokenFilter.html>

Look at the unit tests for usage hints:
<http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/analysis/TeeSinkTokenTest.java?revision=687357&view=markup>

Steve
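
The Tee/Sink shape, rendered in plain Java to show the single pass: every token flows through to the main stream unchanged, while tokens of interest are copied into a side sink as they go by. The real TeeTokenFilter/SinkTokenizer classes linked above do this inside Lucene's analysis chain; the number-matching rule here is just an illustrative assumption:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TeeSketch {
    // Pass every token through to the main stream (full-text indexing)
    // while copying tokens of interest into the side sink.
    public static List<String> tee(List<String> tokens, List<String> sink) {
        List<String> mainStream = new ArrayList<>();
        for (String t : tokens) {
            mainStream.add(t);
            if (t.matches("\\d{1,2}")) {  // e.g. the small numbers of a triplet
                sink.add(t);
            }
        }
        return mainStream;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        List<String> all = tee(Arrays.asList("alpha", "12", "34", "memo"), sink);
        System.out.println(all.size() + " " + sink); // 4 [12, 34]
    }
}
```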

On 09/09/2008 at 9:11 AM, Raymond Balmès wrote:

> Well that is well explained in "Lucene in Action" if you want
> to search files you have to build a file parser and there is a good
> example given. So not really my problem.
>
> But I thought I could go thru the token stream only once,
> where I have to go twice 1. for detecting my triplets ,
> 2. for indexing the text.
>
> -Raymond-
>
> On Tue, Sep 9, 2008 at 12:27 AM, Chris Hostetter
> <[hidden email]>wrote:
>
> >
> > > I think I'm getting you. But the files I'm  going to
> parse have many
> > formats
> > > > PDF, HTML, Word.
> > > they don't have a particular structure, memos if you
> will. But the ones
> > I'm
> > > interested in will have the triplets I described
> >
> > Ahhhh...  see this is something i completley didn't realize.  "Lucene"
> > as a library really doesn't provide any sort of mechanism for doing
> > text extraction from unknown file formats ... With some small
> > exceptions (like the HTMLStripTokenizer in Solr) the TokenStream
> > concept is much more about finding "Tokens" from a stream of plain text
> > -- not about finding "Text" in arbitrary (possibly binary) files.
> >
> > You'll probably wantto check out the Tika subproject...
> >    http://incubator.apache.org/tika/ ...or some of the various "How do
> >    i index _____ documents?" FAQs...
> >    http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> >
> > -Hoss
