Solr Analysis Package

Elmo Bleek
I'd like to use the filter factories in the org.apache.solr.analysis package
to tokenize text in a separate application. I need to chain a tokenizer and
a couple of token filters together, as Solr does for indexing and query
parsing. I looked into the TokenizerChain class for this and have a working
tokenization chain, but I was wondering whether there is an established way
to do it. I just hacked together something that happened to work. Below is
a code snippet. Any advice would be appreciated.
Dependencies: solr-core-1.4.0, lucene-core-2.9.3, lucene-snowball-2.9.3. I
am not tied to these and could use different versions.
P.S. Is this more of a question for the solr-dev mailing list?

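<code>
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.solr.analysis.*;

// Tokenizer at the head of the chain: splits the input on whitespace.
TokenizerFactory tokenizer = new WhitespaceTokenizerFactory();

// Snowball stemmer, initialized with no arguments (factory defaults apply).
Map<String,String> args = new HashMap<String,String>();
SnowballPorterFilterFactory porterFilter = new SnowballPorterFilterFactory();
porterFilter.init(args);

// Word delimiter filter: emit word and number parts, plus catenated forms.
args = new HashMap<String,String>();
args.put("generateWordParts", "1");
args.put("generateNumberParts", "1");
args.put("catenateWords", "1");
args.put("catenateNumbers", "1");
args.put("catenateAll", "0");
WordDelimiterFilterFactory wordFilter = new WordDelimiterFilterFactory();
wordFilter.init(args);

LowerCaseFilterFactory lowercaseFilter = new LowerCaseFilterFactory();

// Filters are applied in array order: word delimiter, lowercase, stemmer.
TokenFilterFactory[] filters = new TokenFilterFactory[] {
    wordFilter, lowercaseFilter, porterFilter
};
TokenizerChain chain = new TokenizerChain(tokenizer, filters);

// builder is defined elsewhere in the application and holds the input text.
// No field name is supplied, so null is passed as the first argument.
TokenStream stream = chain.tokenStream(null, new StringReader(builder.toString()));
TermAttribute tm = (TermAttribute) stream.getAttribute(TermAttribute.class);
while (stream.incrementToken()) {
    System.out.println(tm.term());
}
stream.close();
</code>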
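
For anyone following along without the Solr jars on hand, the composition behind the snippet above can be sketched in plain Java: one tokenizer produces the tokens, then each filter transforms them in the order given. This is only a conceptual illustration. `ChainSketch`, `analyze`, `splitOnHyphen` and `lowerCase` are hypothetical names made up for this sketch, not Solr or Lucene classes, and the hyphen split is a crude stand-in for what a word-delimiter filter does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Conceptual sketch only: hypothetical stand-ins, not Solr or Lucene classes.
public class ChainSketch {

    /** Splits on whitespace, standing in for a whitespace tokenizer. */
    public static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Runs the tokenizer first, then applies each filter in the order given. */
    @SafeVarargs
    public static List<String> analyze(String text,
                                       Function<List<String>, List<String>>... filters) {
        List<String> tokens = whitespaceTokenize(text);
        for (Function<List<String>, List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    /** Splits each token on hyphens (crude stand-in for word-delimiter splitting). */
    public static List<String> splitOnHyphen(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            for (String part : t.split("-")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    /** Lowercases every token. */
    public static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Filter order matters, just as it does in the filters array above.
        System.out.println(analyze("Wi-Fi Router SETUP",
                ChainSketch::splitOnHyphen, ChainSketch::lowerCase));
    }
}
```

Swapping the order of the two filters gives the same result here, but with a stemmer at the end of the chain, as in the Solr snippet, lowercasing before stemming would generally matter.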