Question about special characters

Dan Wiggin
I need some functionality and I don't know how to implement it.
The problem is special characters in the text, such as the Latin characters
à, ä, ç or ñ.
Right now I use the ISO Latin filter, but the problem comes when I want to
obtain the most frequently used terms. These terms are stored without ` ´ ^ or
any other "character attribute" (diacritic).
For example, "plàntïuç" (it isn't a real word) is stored as the term
"plantiuc".
How can I get the word "plàntïuç" into the term vector?

Thanks for all replies.
PS: sorry if this question has already been answered somewhere, but I didn't see it.

Re: Question about special characters

Dan Wiggin
My own solution, until I find a better one, is to use a FuzzyQuery for every
term in the phrase.
For example "My work is the worst" -> My~ work~ is~ the~ worst~
What do you think about this ugly solution? I don't have any more
ideas.
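
For context on why the fuzzy approach finds the accented form at all: FuzzyQuery matches terms that are within some edit distance of the query term. The sketch below is a plain-Java Levenshtein distance; the class and method names are made up for illustration, it does not use Lucene, and it does not try to reproduce Lucene's similarity scoring. It only shows that the accented and unaccented spellings of the example word are three substitutions apart, while unrelated short words can be a single edit apart.

```java
// Plain-Java sketch; EditDistanceSketch is a hypothetical name, not a Lucene class.
class EditDistanceSketch {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;                        // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(cur[j - 1] + 1, prev[j] + 1);
                cur[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;                  // reuse the two rows
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "pl\u00e0nt\u00efu\u00e7" is the example word "plàntïuç".
        System.out.println(levenshtein("pl\u00e0nt\u00efu\u00e7", "plantiuc")); // 3
        System.out.println(levenshtein("work", "word"));                        // 1
    }
}
```

So a query like work~ can also match "word", which is part of what makes putting ~ on every term imprecise.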

2006/5/24, Dan Wiggin <[hidden email]>:

> [...]

Re: Question about special characters

Chris Hostetter-3

I think I'm missing something here.  The whole point of the
ISOLatin1AccentFilter is to replace accented characters with their
unaccented equivalents -- it sounds like that's working just fine. If you
want the words in the term vector to contain the accents, why don't you
stop using that filter?

If the problem is that you need to be able to match on both the accented
form and the unaccented form, perhaps you should have two fields, or
modify the ISOLatin1AccentFilter so it puts both versions of the token in
the TokenStream at the same position?
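
A minimal sketch of the "two fields" option, written as a plain-Java model rather than against the Lucene API. Everything here is an assumption made for illustration: the class name, the two maps standing in for fields, and the use of java.text.Normalizer as a stand-in for accent removal. The idea is that the "exact" field keeps the accents (so most-used-term statistics show the real spelling), while the "folded" field is the one you match against.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the "two fields" idea; no Lucene classes are used.
class TwoFieldSketch {
    // Hypothetical stand-in for accent removal:
    // decompose each character, then drop the combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // "exact" field: term -> count, accents preserved (for most-used terms).
    final Map<String, Integer> exact = new HashMap<>();
    // "folded" field: term -> count, accents removed (for matching).
    final Map<String, Integer> folded = new HashMap<>();

    void add(String text) {
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (tok.isEmpty()) {
                continue;                      // leading whitespace yields an empty token
            }
            exact.merge(tok, 1, Integer::sum);
            folded.merge(fold(tok), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        TwoFieldSketch idx = new TwoFieldSketch();
        String word = "pl\u00e0nt\u00efu\u00e7";           // the example word "plàntïuç"
        idx.add(word + " " + word + " plantiuc");
        System.out.println(idx.exact.get(word));           // 2
        System.out.println(idx.folded.get("plantiuc"));    // 3
    }
}
```

The cost of this design is that the text is analyzed and stored twice, once per field.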



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Question about special characters

Dan Wiggin
Thanks for the reply, but I don't know how to make this change in
ISOLatin1AccentFilter.
Can you give me some advice on how to do it?

2006/5/25, Chris Hostetter <[hidden email]>:

> [...]

Re: Question about special characters

Chris Hostetter-3

: Thanks for the reply, but I don't know how to make this change in
: ISOLatin1AccentFilter.
: Can you give me some advice on how to do it?

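I've never really looked at the internals of ISOLatin1AccentFilter, but
the basic idea is to replace it with a new TokenFilter that maintains a
one-token "buffer" of the token stream: every other time next is
called you return either the original token (as is) or the buffered
token with the accents stripped. Since ISOLatin1AccentFilter has a method
called removeAccents, I'm guessing it would look something like
this...

   import java.io.IOException;
   import org.apache.lucene.analysis.*;

   // Use in place of ISOLatin1AccentFilter. Assumes removeAccents is a
   // public static method of ISOLatin1AccentFilter.
   public class YourTokenFilter extends TokenFilter {
     private Token bufToken = null;  // stripped copy, returned on the next call

     public YourTokenFilter(TokenStream in) {
       super(in);
     }

     public Token next() throws IOException {
       if (null != bufToken) {
          Token t = bufToken;
          bufToken = null;
          return t;
       }
       Token t = input.next();
       if (null == t) return null;  // end of stream
       String stripped = ISOLatin1AccentFilter.removeAccents(t.termText());
       if (!stripped.equals(t.termText())) {  // skip terms without accents
          bufToken = new Token(stripped,
                               t.startOffset(), t.endOffset(), t.type());
          bufToken.setPositionIncrement(0);  // same position as the original
       }
       return t;
     }
   }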


...but I haven't tested that (or ever written a TokenFilter of my own, for
that matter).
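
To make the buffering easier to follow without a Lucene setup, here is a small plain-Java model of the same idea. Tok and DualFormStream are hypothetical names, not Lucene classes, and java.text.Normalizer is used as a stand-in for removeAccents. Each input term is returned with a position increment of 1; if stripping the accents changes it, the stripped form is returned on the following call with a position increment of 0, so both forms sit at the same position. This model skips the extra token when the term has no accents, which avoids counting the same term twice.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical token type for this sketch: term text plus position increment.
class Tok {
    final String text;
    final int positionIncrement;

    Tok(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

// Plain-Java model of the one-token buffer; not a Lucene TokenFilter.
class DualFormStream {
    private final Iterator<String> input;
    private Tok buffered = null;               // stripped form waiting to be returned

    DualFormStream(Iterator<String> input) {
        this.input = input;
    }

    // Stand-in for accent removal: decompose, then drop combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Returns the next token, or null at the end of the stream.
    Tok next() {
        if (buffered != null) {
            Tok t = buffered;
            buffered = null;
            return t;
        }
        if (!input.hasNext()) {
            return null;
        }
        String term = input.next();
        String stripped = stripAccents(term);
        if (!stripped.equals(term)) {
            buffered = new Tok(stripped, 0);   // same position as the original
        }
        return new Tok(term, 1);
    }

    // Helper for inspection: renders every token as "text/positionIncrement".
    static List<String> drain(List<String> terms) {
        DualFormStream s = new DualFormStream(terms.iterator());
        List<String> out = new ArrayList<>();
        for (Tok t = s.next(); t != null; t = s.next()) {
            out.add(t.text + "/" + t.positionIncrement);
        }
        return out;
    }
}
```

For the input terms "la" and "plàntïuç", drain returns la/1, plàntïuç/1, plantiuc/0: the accented form is still there for term statistics, and the unaccented form is there for matching.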



-Hoss

