Use two Analyzers in Lucene

Use two Analyzers in Lucene

Kostas Vel
Hello,
I'm new to both Java and Lucene, and I have a small problem.
I have to index and search some papers with Lucene that are written in both
English and Greek. By "both" I mean that the same text file contains
both Greek and English words.

I have Analyzers for both languages (they do stemming as well), but I
don't know how to use them together. I imagine that I have to do two passes
for each paper. Or is that not correct?
The following line is how I use my English Analyzer:

IndexWriter writer = new IndexWriter(indexPath, new PorterStemAnalyzer(), true);


And this is how I use the Greek one:

IndexWriter writer = new IndexWriter(indexPath, new GreekAnalyzer(), true);


Is this possible?
And when I search, how can the program use both Analyzers as well?
I was told to make a mixed Analyzer, but I don't know if that is possible.

 
Thanks in advance everyone for your help.

Kostas

Re: Use two Analyzers in Lucene

Daniel Noll-3
Kostas V. wrote:

> I have the Analyzers for both languages (they do stemming as well) but I
> don't know how to use them together. I imagine that I have to do two passes
> for each paper  ?? or this is not correct?
> The following line is how I use my English Analyzer
>
> IndexWriter writer = new IndexWriter(indexPath,new PorterStemAnalyzer() ,
> true);
>
> And this about the Greek
>
> IndexWriter writer = new IndexWriter(indexPath,new GreekAnalyzer() , true);
>
> Is it possible?
> And when I make the search, how the program can use both Analyzers as well?
> They told me to make a mixed Analyzer but I don't know if this is possible.

The general idea would be to make an analyser which chooses which
analyser to pass the text to.  In general this would be rather
difficult, but in your particular situation, Greek and English use
different alphabets so it may not be too hard.

Having a quick look at the GreekAnalyzer, it still uses the
StandardTokenizer.  And it looks like the filters that are being used
for this and the English analyser wouldn't interfere with each other
either.  So you could probably make an analyser which performs both,
something like this:

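   import java.io.Reader;
   import org.apache.lucene.analysis.*;
   import org.apache.lucene.analysis.el.GreekAnalyzer;
   import org.apache.lucene.analysis.standard.StandardFilter;

   // Untested sketch; assumes the Lucene 2.x analysis API.
   public class CombinedAnalyser extends Analyzer {
     private final GreekAnalyzer greek = new GreekAnalyzer();

     public TokenStream tokenStream(String fieldName, Reader reader) {
       // Greek chain: StandardTokenizer plus the Greek filters
       TokenStream tokens = greek.tokenStream(fieldName, reader);

       // English chain, applied on top of the same token stream
       tokens = new StandardFilter(tokens);
       tokens = new LowerCaseFilter(tokens);
       // StopFilter needs an explicit stop word list
       tokens = new StopFilter(tokens, StopAnalyzer.ENGLISH_STOP_WORDS);
       tokens = new PorterStemFilter(tokens);

       return tokens;
     }
   }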

Another way to go about it would be to detect the Greek fragments of the
text up-front and pass those fragments through the Greek analyser, and
anything else through the other analyser.
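
As a rough illustration of that second approach, the detection step could
look something like the sketch below. It uses only the JDK (no Lucene), and
the names ScriptSplitter, Fragment, split and isGreek are made up for this
example. It assumes that a word counts as Greek if it contains at least one
character in the Unicode Greek script, and that splitting on whitespace is
good enough for the papers in question.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: groups whitespace-separated
// words into alternating Greek and non-Greek fragments.
public class ScriptSplitter {

    /** A run of consecutive words that are all Greek or all non-Greek. */
    public static final class Fragment {
        public final boolean greek;
        public final String text;

        Fragment(boolean greek, String text) {
            this.greek = greek;
            this.text = text;
        }
    }

    /** Splits text into maximal runs of Greek and non-Greek words. */
    public static List<Fragment> split(String text) {
        List<Fragment> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        boolean runGreek = false;

        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            boolean greek = isGreek(word);
            // Close the current run when the script changes.
            if (run.length() > 0 && greek != runGreek) {
                out.add(new Fragment(runGreek, run.toString()));
                run.setLength(0);
            }
            if (run.length() > 0) {
                run.append(' ');
            }
            run.append(word);
            runGreek = greek;
        }
        if (run.length() > 0) {
            out.add(new Fragment(runGreek, run.toString()));
        }
        return out;
    }

    /** Assumption: a word is Greek if any of its code points is Greek. */
    static boolean isGreek(String word) {
        return word.codePoints()
                   .anyMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.GREEK);
    }
}
```

Each fragment with greek == true would then go to the Greek analyser and the
rest to the English one. The same split would presumably have to be applied
to query text as well, so that search terms are stemmed the same way as the
indexed text.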

Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902

