whats the correct way to do normalisation?

whats the correct way to do normalisation?

joe-2
Hi,

Lucene indexes documents from three different countries here
(English, German and French). I want to normalize some
characters such as umlauts, e.g. ä -> ae.
I did it in the following way.

New analyzer:
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class SpecialCharsAnalyzer extends StandardAnalyzer {

    public SpecialCharsAnalyzer() {
    }

    public SpecialCharsAnalyzer(Set stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(String[] stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(File stopwords) throws IOException {
        super(stopwords);
    }

    public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
        super(stopwords);
    }

    // Wrap the standard token stream with the character-replacing filter.
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = super.tokenStream(fieldName, reader);
        ts = new SpecialCharacterFilter(ts);
        return ts;
    }
}
Is SpecialCharsAnalyzer::tokenStream implemented correctly?

New filter:
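import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SpecialCharacterFilter extends TokenFilter {

    public SpecialCharacterFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null)
            return null;

        String str = t.termText();
        if (str.indexOf("ä") != -1) {
            str = str.replace("ä", "ae");
            // The end offset is shifted by one here because the text grew
            // (this is the part the question is about).
            t = new Token(str, t.startOffset(), t.endOffset() + 1);
        }
        return t;
    }
}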
Is SpecialCharacterFilter::next implemented correctly
for the "ä" case?

Is this the correct way to do normalisation?

Thanks

 

Re: whats the correct way to do normalisation?

Patrek
Hi,

Did you take a look at ISOLatin1AccentFilter?

Patrick


Re: whats the correct way to do normalisation?

joe-2
Hi,

> Did you take a look at IsoLatin1AccentFilter ?

It does nearly what I need, but not quite.

    public final Token next() throws java.io.IOException {
        final Token t = input.next();
        if (t == null)
            return null;
        return new Token(removeAccents(t.termText()), t.startOffset(), t.endOffset(), t.type());
    }

A new Token is created here as well. My question: why is the end offset not
corrected for the newly created token? Sometimes the new token is longer than the original.
Link to the complete code:
http://developer.spikesource.com/spikewatch.logs/fedora-3-i386/2221/lucene/reports/clover/org/apache/lucene/analysis/ISOLatin1AccentFilter.html
 


 

 

Re: whats the correct way to do normalisation?

Erik Hatcher

On Nov 6, 2006, at 11:27 AM, hans meiser wrote:

> Hi,
>
>> Did you take a look at IsoLatin1AccentFilter ?
>
> It nearly do the same i need, but not perfectly.
>
>     public final Token next() throws java.io.IOException {
>         final Token t = input.next();
>         if (t == null)
>             return null;
>         return new Token(removeAccents(t.termText()), t.startOffset(), t.endOffset(), t.type());
>     }
>
> Here also a new Token is created. The question i have, why the endoffset is not
> corrected for the new created token? Some times the new token is bigger than before.
> Complete code link:
> http://developer.spikesource.com/spikewatch.logs/fedora-3-i386/2221/lucene/reports/clover/org/apache/lucene/analysis/ISOLatin1AccentFilter.html

For highlighting purposes, it's best to keep the offsets in the  
original text, not adjusted for token mutation.

        Erik
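
To illustrate the point about offsets, here is a minimal sketch in plain Java. SimpleToken and OffsetPreservingDemo are hypothetical stand-ins written for this example, not Lucene classes, and the umlaut mapping is only an assumption about what the poster wants. The term text is rewritten, while the offsets keep pointing at the word as it appears in the original document, which is what a highlighter would need.

```java
/** Hypothetical stand-in for a token: term text plus offsets into the original input. */
final class SimpleToken {
    final String text;
    final int startOffset;
    final int endOffset;

    SimpleToken(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class OffsetPreservingDemo {

    /** Expands German umlauts and sharp s (illustrative mapping only). */
    static String expandUmlauts(String s) {
        return s.replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00c4", "Ae")
                .replace("\u00d6", "Oe")
                .replace("\u00dc", "Ue")
                .replace("\u00df", "ss");  // sharp s
    }

    /** Rewrites the term text but copies both offsets unchanged. */
    static SimpleToken normalize(SimpleToken t) {
        return new SimpleToken(expandUmlauts(t.text), t.startOffset, t.endOffset);
    }

    public static void main(String[] args) {
        String original = "Der B\u00e4r schl\u00e4ft";
        SimpleToken token = new SimpleToken("B\u00e4r", 4, 7);
        SimpleToken normalized = normalize(token);

        // The indexed term is one character longer than the source word ...
        System.out.println(normalized.text);
        // ... but the offsets still select the word as written in the document.
        System.out.println(original.substring(normalized.startOffset, normalized.endOffset));
    }
}
```

If the end offset were moved to match the longer term, the substring taken from the original text would run one character past the word.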


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: whats the correct way to do normalisation?

joe-2
Hi,
 
On Nov 6, 2006, at 11:27 AM, hans meiser wrote:
>> public final Token next() throws java.io.IOException {
>> final Token t = input.next();
>> if (t == null)
>> return null;
>> return new Token(removeAccents(t.termText()), t.startOffset(),
>> t.endOffset(), t.type());
>> }
>>

> For highlighting purposes, it's best to keep the offsets in the
> original text, not adjusted for token mutation.
   
OK, I corrected it.

A "normal" search without a "*" works now. But when I search with a "*" or
a "?", my new filter is not called, so for example my umlauts are not
replaced by the analyzer (filter).

This is what I do:
  Analyzer analyzer = new SpecialCharsAnalyzer();
  QueryParser parser = new QueryParser(DocumentFields.TEXT, analyzer);
  query = parser.parse(queryStr);

For wildcard queries, the tokenStream method of my analyzer isn't called.
What am I doing wrong?

 

Re: whats the correct way to do normalisation?

Daniel Naber-5
On Tuesday 07 November 2006 12:41, hans meiser wrote:

>   For a  "normal" search without a "*" it works now. But when i do a
>   search with an "*" or a "?" my newly implemented filter is not called
> and for example my umlauts are not replaced by the analyzer(filter).

See
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a

--
http://www.danielnaber.de


Re: whats the correct way to do normalisation?

Chris Hostetter-3

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a

Are Wildcard, Prefix, and Fuzzy queries case sensitive?

Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries
are not passed through the Analyzer, which is the component that performs
operations such as stemming and lowercasing.

The reason for skipping the Analyzer is that if you were searching for
"dogs*" you would not want "dogs" first stemmed to "dog", since that would
then match "dog*", which is not the intended query.





-Hoss



Re: whats the correct way to do normalisation?

joe-2
Hi,
> http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
>
> Are Wildcard, Prefix, and Fuzzy queries case sensitive?
>
> Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries
> are not passed through the Analyzer, which is the component that performs
> operations such as stemming and lowercasing

OK, thanks.

I want "Überraschung" to be found by both of these queries:

Überr*
Ueberr*

So the best I can do is to normalise the text manually (not in an
analyzer) before both indexing and searching?
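
The manual normalisation asked about here could be sketched roughly as follows. ManualNormalizer is a hypothetical helper for this example, not part of Lucene. The idea is to pass both the document text (before indexing) and the raw query string (before parsing) through the same function, so that wildcard terms, which skip the analyzer, still end up in the same form as the indexed terms. Lower-casing is included on the assumption that the index holds lower-cased terms, as it does with StandardAnalyzer.

```java
import java.util.Locale;

final class ManualNormalizer {

    /**
     * Lower-cases the input and expands German umlauts and sharp s.
     * Wildcard characters such as '*' and '?' pass through untouched.
     */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00df", "ss");  // sharp s
    }

    public static void main(String[] args) {
        // Text as it would be handed to the indexer.
        String indexed = normalize("\u00dcberraschung");

        // Both query spellings collapse to the same prefix.
        String q1 = normalize("\u00dcberr*");
        String q2 = normalize("Ueberr*");

        System.out.println(indexed);
        System.out.println(q1);
        System.out.println(q2);

        // Strip the trailing '*' and check the prefix by hand.
        String prefix = q1.substring(0, q1.length() - 1);
        System.out.println(indexed.startsWith(prefix));
    }
}
```

One consequence of this design is that the original spelling is no longer in the index, so a search cannot distinguish "ü" from a literal "ue".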


               

Re: whats the correct way to do normalisation?

Chris Hostetter-3

: I want "Überraschung" is found by
:
: Überr*
: Ueberr*
:
: So the best i can do is to do the normalisation manually(not by an
: analyzer) before the indexing/searching process?

Or use an Analyzer at index time that puts both the UTF-8 version of the
string and the Latin-1 version of the string in the same field (at the
same position so they still work with phrases) and at query time just
search for the text the user types in as is ... that should work for both
straight term queries and prefix/wildcard queries that don't get analyzed
at query time.




-Hoss
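
A rough sketch of the "both forms at the same position" idea, in plain Java rather than against the Lucene API. PositionedTerm and DualFormExpander are hypothetical names for this example, and reading "UTF-8 version" and "Latin-1 version" as the original and the transliterated spelling is an interpretation. A position increment of 0 means "same position as the previous term", which is the role Token.setPositionIncrement(0) plays in a real Lucene TokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a token with a position increment. */
final class PositionedTerm {
    final String text;
    final int positionIncrement; // 1 = next position, 0 = same position as the previous term

    PositionedTerm(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class DualFormExpander {

    /** Transliterates lower-case German umlauts and sharp s (illustrative mapping only). */
    static String fold(String s) {
        return s.replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue")
                .replace("\u00df", "ss");
    }

    /**
     * Emits every word as written; when folding changes the word,
     * also emits the folded form stacked on the same position.
     */
    static List<PositionedTerm> expand(List<String> words) {
        List<PositionedTerm> out = new ArrayList<>();
        for (String word : words) {
            out.add(new PositionedTerm(word, 1));
            String folded = fold(word);
            if (!folded.equals(word)) {
                out.add(new PositionedTerm(folded, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumes the words were already lower-cased by an earlier step.
        List<String> words = Arrays.asList("\u00fcberraschung", "party");
        for (PositionedTerm t : expand(words)) {
            System.out.println(t.text + " (+" + t.positionIncrement + ")");
        }
    }
}
```

Because both spellings sit at one position, phrase queries keep working, and an unanalyzed prefix query in either spelling can match one of the two stored terms. The cost is a somewhat larger index.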



Re: whats the correct way to do normalisation?

joe-2
Hi,

> : I want "Überraschung" is found by
> :
> : Überr*
> : Ueberr*
> :
> : So the best i can do is to do the normalisation manually(not by an
> : analyzer) before the indexing/searching process?
>
> Or use an Analyzer at index time that puts both the UTF-8 version of the
> string and the Latin-1 version of the string in the same field (at the
> same position so they still work with phrases) and at query time just
> search for the text the user types in as is ... that should work for both
> straight term queries and prefix/wildcard queries that don't get analyzed
> at query time.
>  
Oh yes, that sounds good too.




               