Custom TokenStream + custom Attributes

Custom TokenStream + custom Attributes

Michal Krajňanský
Dear Lucene users,

I have implemented a custom tokenizer (derived from TokenStream).

In addition to the standard Lucene attributes (PositionIncrementAttribute,
OffsetAttribute), I need to pass an attribute that represents the word's
position in the tokenized sentence counted in words rather than in
characters, which is what one usually passes through OffsetAttribute.
(I need both.)

Is there a way of achieving this?

I tried to implement my own Attribute class (a new interface derived from
Attribute, plus an implementing class). The code compiles fine, but I get
a class cast exception at runtime.
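
One plausible cause of a class cast failure here is the naming-convention lookup that Lucene's default attribute factory is understood to perform: it loads the class named after the attribute interface plus "Impl" and expects it to extend AttributeImpl as well as implement the interface. The sketch below imitates that lookup using only the JDK. Attr, AttrImpl, WordIndexAttribute, BrokenAttribute and AttrLookup are hypothetical stand-ins, not Lucene classes, and the real factory may differ in detail.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-ins for Lucene's Attribute and AttributeImpl.
interface Attr {}
abstract class AttrImpl {}

// A custom attribute: an interface plus a class named <interface> + "Impl".
interface WordIndexAttribute extends Attr {
    void setWordIndex(int wordIndex);
    int getWordIndex();
}

class WordIndexAttributeImpl extends AttrImpl implements WordIndexAttribute {
    private int wordIndex;
    @Override public void setWordIndex(int wordIndex) { this.wordIndex = wordIndex; }
    @Override public int getWordIndex() { return wordIndex; }
}

// A deliberately broken pair: the Impl does not extend the base class.
interface BrokenAttribute extends Attr {}
class BrokenAttributeImpl implements BrokenAttribute {}

final class AttrLookup {
    private AttrLookup() {}

    // Resolves and instantiates the implementation by name.
    static <A extends Attr> A create(Class<A> attClass) {
        try {
            Class<? extends AttrImpl> implClass = Class
                .forName(attClass.getName() + "Impl", true, attClass.getClassLoader())
                .asSubclass(AttrImpl.class); // ClassCastException if not an AttrImpl
            Constructor<? extends AttrImpl> ctor = implClass.getDeclaredConstructor();
            ctor.setAccessible(true);
            // A second ClassCastException is possible here if the Impl
            // does not implement the requested interface.
            return attClass.cast(ctor.newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot instantiate implementation for " + attClass.getName(), e);
        }
    }
}
```

In this sketch, AttrLookup.create(WordIndexAttribute.class) returns a usable instance, while AttrLookup.create(BrokenAttribute.class) fails with a ClassCastException. If the same pattern applies to the real code, the things to check are the Impl class name, its package, its superclass and the interface it implements.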

Thank you very much in advance,

MK

FYI, the code looks like this:

/**
 *
 */
package com.newstin.nlp.analysis;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Emits an already tokenized list of terms as a TokenStream.
 *
 * @author michal
 */
// Declared final: Lucene asserts that a TokenStream subclass (or its
// incrementToken() method) is final when assertions are enabled.
public final class TermsListTokenizer extends TokenStream
{
    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt =
        addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);

    private final Iterator<Term> termIterator;
    // Word index of the previously emitted term; -1 before the first term.
    private int lastTermPos;

    public TermsListTokenizer(List<Term> terms)
    {
        termIterator = terms.iterator();
        lastTermPos = -1;
    }

    @Override
    public boolean incrementToken() throws IOException
    {
        clearAttributes();

        // TODO: check: compute the positions right for term variants !!!
        if (termIterator.hasNext()) {
            Term term = termIterator.next();

            termAtt.append(term.getTerm());
            // Character offsets; need to also save the position in the
            // number of words.
            offsetAtt.setOffset(term.getStart(), term.getEnd());
            // The increment is the distance in words from the previous term.
            posIncrAtt.setPositionIncrement(term.getWordIndex() - lastTermPos);
            lastTermPos = term.getWordIndex();
            return true;
        }

        return false;
    }
}
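
The position arithmetic in incrementToken() can be checked in isolation. The following is a minimal, JDK-only sketch of the same computation (word index minus the previous word index, starting from -1). PositionIncrements and fromWordIndexes are hypothetical names used only for illustration. It also touches on the TODO about term variants: two terms with the same word index yield an increment of 0, which stacks them at the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the increment computation in the tokenizer.
final class PositionIncrements {
    private PositionIncrements() {}

    // Converts absolute word indexes (in emission order) to position increments.
    static List<Integer> fromWordIndexes(List<Integer> wordIndexes) {
        List<Integer> increments = new ArrayList<>();
        int lastTermPos = -1; // same initial value as in the tokenizer
        for (int wordIndex : wordIndexes) {
            int increment = wordIndex - lastTermPos;
            if (increment < 0) {
                // Terms must arrive in non-decreasing word order; Lucene's
                // setPositionIncrement rejects negative values as well.
                throw new IllegalArgumentException(
                    "Word index " + wordIndex + " precedes " + lastTermPos);
            }
            increments.add(increment);
            lastTermPos = wordIndex;
        }
        return increments;
    }
}
```

For word indexes 0, 1, 1, 3 this gives increments 1, 1, 0, 2: the third term is a variant at the same position as the second, and the last term skips one word.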

Re: Custom TokenStream + custom Attributes

sarowe
Hi Michal,

Please repost on the java-user list.  [hidden email] has fewer subscribers, and it’s not focussed on Lucene usage questions.

More info: <http://lucene.apache.org/core/discussion.html#java-user-list-java-userlucene>

--
Steve
www.lucidworks.com

> On May 31, 2016, at 9:58 AM, Michal Krajňanský <[hidden email]> wrote:
>
> [...]