skip document header while indexing

Pablo Gomes Ludermir
Hello all,

Is it possible to skip the first "xx" words while indexing a document?
For instance, in the code below, I would like to skip the first "xx"
words of "file" in the "CONTENTS_FIELD". Is that possible?

Document doc = new Document();
FileInputStream is = new FileInputStream(file);
Reader reader = new BufferedReader(new InputStreamReader(is));
doc.add(Field.Text(PATH_FIELD, artifactModel));
doc.add(Field.Text(CONTENTS_FIELD, reader, true));

Regards,
Pablo

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: skip document header while indexing

Erik Hatcher

On Apr 29, 2005, at 7:50 AM, Pablo Gomes Ludermir wrote:

> Hello all,
>
> Is it possible to skip the first "xx" words while indexing a document?
> For instance, in the code below, I would like to skip the first "xx"
> words of "file" in the "CONTENTS_FIELD". Is that possible?
>
> Document doc = new Document();
> FileInputStream is = new FileInputStream(file);
> Reader reader = new BufferedReader(new InputStreamReader(is));
> doc.add(Field.Text(PATH_FIELD, artifactModel));
> doc.add(Field.Text(CONTENTS_FIELD, reader, true));

I believe your best bet will be a custom Analyzer that skips those
words.  It wouldn't be too hard to code one as a wrapper around an
existing analyzer.

        Erik


Re: skip document header while indexing

Pablo Gomes Ludermir
Could you give me some pointers (an example or a website) on how I could do that?

Re: skip document header while indexing

Erik Hatcher
On Apr 29, 2005, at 8:30 AM, Pablo Gomes Ludermir wrote:
> Could you give me some pointers (an example or a website) on how I
> could do that?

Lucene's own source code has several analyzers that are worth
investigating.  We also include several in Lucene in Action that
demonstrate additional features like incorporating synonym lookup with
WordNet and metaphone (soundex-like) replacements.  Visit
http://www.lucenebook.com to grab the source code download.

The trick would be to add a TokenFilter that drops Tokens until N
tokens have been dropped, and passes every later Token through.
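
To make the token-dropping idea concrete, here is a minimal sketch. It
deliberately does not use Lucene's classes, so that it runs on its own:
SimpleTokenStream, ListTokenStream, SkipFirstTokensFilter and drain are
all hypothetical names invented for this illustration. In real Lucene
code the filter would presumably be a TokenFilter subclass that discards
Tokens from its input stream in the same way, placed in the chain built
by a custom Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SkipTokensDemo {

    /** Hypothetical stand-in for a token stream: next token, or null at the end. */
    interface SimpleTokenStream {
        String next();
    }

    /** Stand-in tokenizer that hands out words from a fixed list. */
    static class ListTokenStream implements SimpleTokenStream {
        private final Iterator<String> words;

        ListTokenStream(List<String> words) {
            this.words = words.iterator();
        }

        public String next() {
            return words.hasNext() ? words.next() : null;
        }
    }

    /** Drops the first n tokens of the wrapped stream, then passes the rest through. */
    static class SkipFirstTokensFilter implements SimpleTokenStream {
        private final SimpleTokenStream input;
        private int remainingToSkip;

        SkipFirstTokensFilter(SimpleTokenStream input, int n) {
            this.input = input;
            this.remainingToSkip = n;
        }

        public String next() {
            // Consume and discard tokens until the skip budget is used up
            // or the underlying stream runs out.
            while (remainingToSkip > 0) {
                remainingToSkip--;
                if (input.next() == null) {
                    remainingToSkip = 0;
                    return null;
                }
            }
            return input.next();
        }
    }

    /** Drains a stream into a list, for demonstration only. */
    static List<String> drain(SimpleTokenStream stream) {
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        SimpleTokenStream words = new ListTokenStream(
                Arrays.asList("header", "line", "real", "content", "here"));
        // Skip the two "header" words; prints [real, content, here]
        System.out.println(drain(new SkipFirstTokensFilter(words, 2)));
    }
}
```

Because the filter counts tokens rather than characters, where it sits
in the chain matters: placed directly after the tokenizer it skips the
first N raw words, while placed after a stop-word filter it would skip
the first N words that survived stop-word removal.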

For an example, here's the Analyzer I wrote for the lucenebook.com site:

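import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;

// Note: DashSplitterFilter, HyphenatedFilter, DashDashFilter, LiaTokenizer
// and (judging by its argument order) LengthFilter appear to be custom
// classes from the site's own code; they are not shown here.
public class LiaAnalyzer extends Analyzer {
   private Set stopSet;
   boolean stem = true;

   public LiaAnalyzer() {
     // start from Lucene's standard English stop words
     stopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

     // just a few words that would not be queried on
     stopSet.add("isn");
     stopSet.add("xyz");
     stopSet.add("bcd");
     stopSet.add("blt");
     stopSet.add("dhb");
     stopSet.add("ttc");
     stopSet.add("you");
     stopSet.add("our");
   }

   public LiaAnalyzer(boolean stem) {
     this();
     this.stem = stem;
   }

   public TokenStream tokenStream(String fieldName, Reader reader) {
     // tokenize, then apply the dash/hyphen handling filters
     TokenFilter filter = new DashSplitterFilter(
               new HyphenatedFilter(
                 new DashDashFilter(
                   new LiaTokenizer(reader))));

     // length filtering, then stop-word removal
     filter = new LengthFilter(3, filter);
     filter = new StopFilter(filter, stopSet);

     // stemming is optional
     if (stem) {
       filter = new SnowballFilter(filter, "English");
     }

     return filter;
   }
}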

        Erik

