Removing diacritics with ISOLatin1AccentFilter

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Removing diacritics with ISOLatin1AccentFilter

luther.blisset
Hi folks,
I just upgrading Hibernate Search library of my app and so I had to upgrade Lucene too and pass from 2.2 to 2.4 version.
In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't figure how it works.
I use a TwoWayFieldBridge to index the data and this is my set method:

public void set(String s, Object o, Document document, Field.Store store, Field.Index index, Float aFloat){

        //MyObject has a field name
        MyObject objectToIndex;

        //casting from Object to MyObject
        try{
            objectToIndex = MyObject.class.cast(o);
        }catch(ClassCastException cEx ){}



        if (objectToIndex.getName() != null) {
           
            ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new StandardTokenizer(new StringReader(objectToIndex.getName())));
            filter.removeAccents(objectToIndex.getName().toCharArray(), objectToIndex.getName().length());
            Field name = new Field( "name", String.valueOf(objectToIndex.getName()).toLowerCase() , Field.Store.YES, Field.Index.UN_TOKENIZED );

            document.add(name);
        }
}


but it doesn't work. And if pass an accented word for the property objectToIndex.getName(), it remains with accent :(
I think there is something wrong in my code when I create the new instance of ISOLatin1AccentFilter  but I can' t get it works properly.
Could someone help me?
thanks a lot
Reply | Threaded
Open this post in threaded view
|

Re: Removing diacritics with ISOLatin1AccentFilter

Simon Willnauer
On Fri, Jul 24, 2009 at 11:41 AM, luther blisset<[hidden email]> wrote:

>
> Hi folks,
> I just upgrading Hibernate Search library of my app and so I had to upgrade
> Lucene too and pass from 2.2 to 2.4 version.
> In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't figure
> how it works.
> I use a TwoWayFieldBridge to index the data and this is my set method:
>
> public void set(String s, Object o, Document document, Field.Store store,
> Field.Index index, Float aFloat){
>
>        //MyObject has a field name
>        MyObject objectToIndex;
>
>        //casting from Object to MyObject
>        try{
>            objectToIndex = MyObject.class.cast(o);
>        }catch(ClassCastException cEx ){}
>
>
>
>        if (objectToIndex.getName() != null) {
>
>            ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
> StandardTokenizer(new StringReader(objectToIndex.getName())));
>            filter.removeAccents(objectToIndex.getName().toCharArray(),
> objectToIndex.getName().length());
>            Field name = new Field( "name",
> String.valueOf(objectToIndex.getName()).toLowerCase() , Field.Store.YES,
> Field.Index.UN_TOKENIZED );
>
>            document.add(name);
>        }
> }
>
I do not really understand what you are trying to do. do you just
wanna remove the accents from the string and index it without passing
it through an analyzer?! (Field.Index.UN_TOKENIZED will not pass the
field value to an analyzer).
do you wanna index this without an analyzer?!

If you pass an array to ISOLantin1AccentFilter#removeAccents() the
processed chars will be written to an private internal char array
inside the ISOLantin1AccentFilter. You can not use the removeAccents
method just removing the accents. what you could do as a dirty
workaround is the following:
String foo = "HÄllo HÄllo HÄllo HÄllo HÄllo";
  ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
      new Tokenizer(new StringReader(foo)){
        private boolean isRead = false;
        public Token next(final Token reusableToken) throws IOException {
          if(isRead){
            return null;
          }
          BufferedReader reader = new BufferedReader(this.input);
          StringBuilder builder = new StringBuilder();

         char[] buffer = new char[1024];
         int read = -1;
         while((read = reader.read(buffer)) > 0){
           builder.append(buffer, 0, read);
         }
         reusableToken.setTermText(builder.toString());
         isRead = true;
         return reusableToken;
        }
  });
    Token t = filter.next();
    String foo_without_accents = t.term();
    System.out.println(foo_without_accents);
yields: HAllo HAllo HAllo HAllo HAllo


simon

>
> but it doesn't work. And if pass an accented word for the property
> objectToIndex.getName(), it remains with accent :(
> I think there is something wrong in my code when I create the new instance
> of ISOLatin1AccentFilter  but I can' t get it works properly.
> Could someone help me?
> thanks a lot
> --
> View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24641618.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Removing diacritics with ISOLatin1AccentFilter

iorixxx
In reply to this post by luther.blisset

Or alternatively:

String test = "HÄllo HÄllo HÄllo HÄllo HÄllo";

ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
                KeywordTokenizer(new StringReader(test)));

    final Token reusableToken = new Token();
    Token nextToken;
       
    if ((nextToken = filter.next(reusableToken)) != null)
         System.out.print(nextToken.term());
       
    filter.close();




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Removing diacritics with ISOLatin1AccentFilter

luther.blisset
In reply to this post by Simon Willnauer
I'm trying to index all the words without accent.
I do the same when I'm querying, I remove the accent and lower case the search term.
Why should I pass the string through the analyzer?
or what is wrong if don't pass it through the analyzer?
and what are the benefits?
I'm just a newbie with Lucene..
Thanks a lot for your reply :]



Simon Willnauer wrote
On Fri, Jul 24, 2009 at 11:41 AM, luther blisset<sabri.bruni@gmail.com> wrote:
>
> Hi folks,
> I just upgrading Hibernate Search library of my app and so I had to upgrade
> Lucene too and pass from 2.2 to 2.4 version.
> In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't figure
> how it works.
> I use a TwoWayFieldBridge to index the data and this is my set method:
>
> public void set(String s, Object o, Document document, Field.Store store,
> Field.Index index, Float aFloat){
>
>        //MyObject has a field name
>        MyObject objectToIndex;
>
>        //casting from Object to MyObject
>        try{
>            objectToIndex = MyObject.class.cast(o);
>        }catch(ClassCastException cEx ){}
>
>
>
>        if (objectToIndex.getName() != null) {
>
>            ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
> StandardTokenizer(new StringReader(objectToIndex.getName())));
>            filter.removeAccents(objectToIndex.getName().toCharArray(),
> objectToIndex.getName().length());
>            Field name = new Field( "name",
> String.valueOf(objectToIndex.getName()).toLowerCase() , Field.Store.YES,
> Field.Index.UN_TOKENIZED );
>
>            document.add(name);
>        }
> }
>
I do not really understand what you are trying to do. do you just
wanna remove the accents from the string and index it without passing
it through an analyzer?! (Field.Index.UN_TOKENIZED will not pass the
field value to an analyzer).
do you wanna index this without an analyzer?!

If you pass an array to ISOLantin1AccentFilter#removeAccents() the
processed chars will be written to an private internal char array
inside the ISOLantin1AccentFilter. You can not use the removeAccents
method just removing the accents. what you could do as a dirty
workaround is the following:
String foo = "HÄllo HÄllo HÄllo HÄllo HÄllo";
  ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
      new Tokenizer(new StringReader(foo)){
        private boolean isRead = false;
        public Token next(final Token reusableToken) throws IOException {
          if(isRead){
            return null;
          }
          BufferedReader reader = new BufferedReader(this.input);
          StringBuilder builder = new StringBuilder();

         char[] buffer = new char[1024];
         int read = -1;
         while((read = reader.read(buffer)) > 0){
           builder.append(buffer, 0, read);
         }
         reusableToken.setTermText(builder.toString());
         isRead = true;
         return reusableToken;
        }
  });
    Token t = filter.next();
    String foo_without_accents = t.term();
    System.out.println(foo_without_accents);
yields: HAllo HAllo HAllo HAllo HAllo


simon
>
> but it doesn't work. And if pass an accented word for the property
> objectToIndex.getName(), it remains with accent :(
> I think there is something wrong in my code when I create the new instance
> of ISOLatin1AccentFilter  but I can' t get it works properly.
> Could someone help me?
> thanks a lot
> --
> View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24641618.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Reply | Threaded
Open this post in threaded view
|

Re: Removing diacritics with ISOLatin1AccentFilter

luther.blisset
In reply to this post by iorixxx
yes Ahmet Arslan ...this works!!
I've just tested it and works nicely...
really thanks..



Ahmet Arslan wrote
Or alternatively:

String test = "HÄllo HÄllo HÄllo HÄllo HÄllo";

ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
                KeywordTokenizer(new StringReader(test)));

    final Token reusableToken = new Token();
    Token nextToken;
       
    if ((nextToken = filter.next(reusableToken)) != null)
         System.out.print(nextToken.term());
       
    filter.close();




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org