contrib: keywordTokenStream


contrib: keywordTokenStream

Wolfgang Hoschek
Here's a convenience add-on method to MemoryIndex. If it turns out that
this could be of wider use, it could be moved into the core analysis
package. For the moment the MemoryIndex might be a better home.
Opinions, anyone?

Wolfgang.

        /**
         * Convenience method; creates and returns a token stream that generates a
         * token for each keyword in the given collection, "as is", without any
         * transforming text analysis. The resulting token stream can be fed into
         * {@link #addField(String, TokenStream)}, perhaps wrapped into another
         * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
         *
         * @param keywords
         *            the keywords to generate tokens for
         * @return the corresponding token stream
         */
        public TokenStream keywordTokenStream(final Collection keywords) {
                if (keywords == null)
                        throw new IllegalArgumentException("keywords must not be null");

                return new TokenStream() {
                        private Iterator iter = keywords.iterator();
                        private int start = 0;

                        public Token next() {
                                if (!iter.hasNext()) return null;

                                Object obj = iter.next();
                                if (obj == null)
                                        throw new IllegalArgumentException("keyword must not be null");

                                // emit the keyword unmodified; offsets treat the
                                // keywords as if separated by one blank character
                                String term = obj.toString();
                                Token token = new Token(term, start, start + term.length());
                                start += term.length() + 1;
                                return token;
                        }
                };
        }
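
For context, here's a minimal usage sketch. It assumes the MemoryIndex contrib API of the time (addField(String, TokenStream), search(Query) returning a score) plus the core LowerCaseFilter, TermQuery and Term classes; the "keywords" field name and sample terms are made up:

        // minimal sketch; assumes Java 1.4-era raw collections
        MemoryIndex index = new MemoryIndex();
        Collection keywords = Arrays.asList(
                new String[] {"Lucene", "Index", "Search"});
        // feed the keywords in "as is", here wrapped in a TokenFilter
        // that lowercases them before indexing
        index.addField("keywords", new LowerCaseFilter(
                index.keywordTokenStream(keywords)));
        float score = index.search(
                new TermQuery(new Term("keywords", "lucene")));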




Re: contrib: keywordTokenStream

Erik Hatcher
Wolfgang,

I've now added this. I'm not seeing how this could be generally useful. I'm curious how you are using it and why it is better suited for what you're doing than any other analyzer.

"keyword tokenizer" is a bit overloaded terminology-wise, though - look in the contrib/analyzers/src/java area to see what I mean.

     Erik



Re: contrib: keywordTokenStream

Wolfgang Hoschek
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote:

> Wolfgang,
>
> I've now added this.

Thanks :-)

> I'm not seeing how this could be generally useful.  I'm curious how
> you are using it and why it is better suited for what you're doing
> than any other analyzer.
>
> "keyword tokenizer" is a bit overloaded terminology-wise, though -
> look in the contrib/analyzers/src/java area to see what I mean.
>
>     Erik

The difference between this and the KeywordTokenizer from contrib/analyzers is that it

- can operate on multiple keywords rather than just a single one, so it's slightly more general.
- takes a collection (typically of String values) as input rather than a Reader. I can see the java.io.Reader scalability rationale used throughout the analysis APIs, but for many use cases (including my own) Strings are a lot handier (and more efficient to deal with) - the string values are small anyway.

So it's a convenient way to add terms (keywords, if you like) that have been parsed/massaged into strings by some existing external means (e.g. grouped regex scanning of legacy formatted text files into various fields) into an index "as is", without any further transforming analysis. Most folks could write such a (non-essential) utility themselves, but it's handy in much the same way as the Field.Keyword convenience infrastructure...
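
To make the contrast concrete, here's a rough sketch. It assumes a MemoryIndex instance named index and the KeywordTokenizer(Reader) constructor from contrib/analyzers; the sample strings are made up:

        // KeywordTokenizer: the entire Reader becomes exactly one token
        TokenStream one = new KeywordTokenizer(
                new StringReader("Apache Lucene"));
        // -> single token "Apache Lucene"

        // keywordTokenStream: one token per collection element, "as is"
        TokenStream many = index.keywordTokenStream(
                Arrays.asList(new String[] {"Apache", "Lucene"}));
        // -> two tokens: "Apache", "Lucene"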

> "keyword tokenizer" is a bit overloaded terminology-wise, though

If you come up with a better name, feel free to rename it.

Wolfgang.


