Indexing bigrams and trigrams in Lucene


Venkateshprasanna
I need to index bigrams and trigrams in a document. Here is an example:

Text:
This is a text document written by someone. Read this and post your comments

Words that must be indexed:
text
document
written
someone
read
post
your
comments
text document
document written
post your
your comments
text document written
post your comments

So, I made changes to StandardAnalyzer.java and StandardTokenizer.jj to try and achieve this.

I increased the LOOKAHEAD option value to 4:

options {
  LOOKAHEAD = 4;
  FORCE_LA_CHECK = true;
  .
  .
}


I made the following changes to StandardTokenizer.jj:

org.apache.lucene.analysis.Token next() throws IOException :
  :
  :
    {
      if (token.kind == EOF) {
        return null;
      }
      else if (token.kind == ALPHANUM) {
        // Peek at the parser's token chain for the following token.
        Token nextToken = token.next;
        if (nextToken != null && nextToken.kind == ALPHANUM) {
          // Two adjacent ALPHANUM tokens: emit them as one bigram token.
          return new org.apache.lucene.analysis.Token(
              token.image + " " + nextToken.image,
              token.beginColumn, nextToken.endColumn,
              tokenImage[token.kind]);
        }
        // No ALPHANUM follower: emit this token on its own.
        return new org.apache.lucene.analysis.Token(
            token.image,
            token.beginColumn, token.endColumn,
            tokenImage[token.kind]);
      }
      else {
        return new org.apache.lucene.analysis.Token(
            token.image,
            token.beginColumn, token.endColumn,
            tokenImage[token.kind]);
      }
    }


That is, I am using token.next to get information about the following token, but it is always null. What is the reason, and is there a better way of doing this?


Re: Indexing bigrams and trigrams in Lucene

Chris Hostetter

: This is a text document written by someone. Read this and post your comments
:
: words that must be indexed:
: text
: document
        ...
: text document
: document written

typically when people talk about indexing n-grams -- they mean character-wise
n-grams (so they can find words with simple spelling mistakes) ... it's not
really clear why you would need word-wise n-grams; why not just search for
phrases with no slop?
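
for reference, a quick sketch of the no-slop phrase search, using the
classic PhraseQuery API ("contents" is a hypothetical field name):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseSearchExample {
  // Builds an exact-phrase query for "text document". Slop 0 (the
  // default) requires the two terms to be adjacent and in order.
  public static PhraseQuery exactPhrase() {
    PhraseQuery query = new PhraseQuery();
    query.add(new Term("contents", "text"));
    query.add(new Term("contents", "document"));
    query.setSlop(0);
    return query;
  }
}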


: So, I made changes to StandardAnalyzer.java and StandardTokenizer.jj to try
: and achieve this.

if you really need/want an analyzer that produces those tokens, I would
suggest you do it with a TokenFilter -- no reason to complicate the
tokenizing process when you can leave that alone and combine the tokens
instead.
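
for what it's worth, here is a minimal sketch of that kind of filter,
written against the same old-style Token next() API as the code above
(BigramFilter is just an illustrative name, and the position handling
is one reasonable choice, not the only one):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Passes every incoming token through unchanged and additionally
// emits a "prev cur" bigram token stacked on the position of the
// first word of the pair.
public class BigramFilter extends TokenFilter {
  private Token prev;     // the unigram emitted most recently
  private Token pending;  // a unigram buffered behind its bigram

  public BigramFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    // First flush any unigram buffered behind a bigram.
    if (pending != null) {
      Token t = pending;
      pending = null;
      return t;
    }
    Token current = input.next();
    if (current == null) {
      prev = null;
      return null;
    }
    if (prev == null) {
      // First token of the stream: nothing to pair it with yet.
      prev = current;
      return current;
    }
    // Combine the previous and current terms into one token spanning
    // both words; position increment 0 stacks it on the position of
    // the previous (already emitted) unigram.
    Token bigram = new Token(prev.termText() + " " + current.termText(),
                             prev.startOffset(), current.endOffset());
    bigram.setPositionIncrement(0);
    pending = current;   // emit the current unigram on the next call
    prev = current;
    return bigram;
  }
}

trigrams follow the same pattern with a two-token buffer, and later
Lucene releases ship a ShingleFilter (in contrib/analyzers) that
produces word n-grams like these out of the box.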



-Hoss

