UTF8 accents & umlauts filter?

Michael Imbeault
Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
removes accents from ISO-8859-1 text. What about a UTF8AccentFilter? Are
there plans to add such a filter? It would be very useful, as
ISOLatin1AccentFilter can't remove some of the more complex accents used
in some languages encoded in UTF-8. (I would paste examples, but I'm not
sure they would display correctly.) I think I saw a post long ago on this
mailing list about something like that, but it was never officially
released.

See:

2001, first post about UTF-8 accents:
http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
2004, a good solution, but still incomplete:
http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
2006, best attempt yet, but sadly undelivered:
http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142

I think Lucene would benefit from a complete UTF-8 accent remover...
right now the best solution I have is to process everything in PHP
before indexing and at query time (and it's a little slow).

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: UTF8 accents & umlauts filter?

Yonik Seeley-2
Thanks for the links Michael... this one does look interesting:
http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
The challenge would be to make it fast... perhaps a custom hash table,
or look into the cost of a perfect hash function.
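
For a bounded range of code points, one way to avoid hashing cost
altogether is a direct-indexed array. The sketch below is only an
illustration of that idea: the class name TableFolder, the
U+0000..U+017F range, and the handful of mappings are assumptions made
for the example, not taken from LUStringBasicLatin.txt or from Lucene.

```java
/** Sketch: fold accented characters with a direct-indexed lookup table. */
final class TableFolder {
    // Assumed range: Basic Latin through Latin Extended-A.
    private static final int TABLE_SIZE = 0x0180;
    private static final char[] FOLD = new char[TABLE_SIZE];

    static {
        // Every character maps to itself unless overridden below.
        for (int i = 0; i < TABLE_SIZE; i++) {
            FOLD[i] = (char) i;
        }
        // A few illustrative entries; a real table would be generated
        // from a complete mapping file.
        map("\u00C0\u00C1\u00C2\u00C3\u00C4\u00C5", 'A');
        map("\u00E0\u00E1\u00E2\u00E3\u00E4\u00E5", 'a');
        map("\u00E8\u00E9\u00EA\u00EB", 'e');
        map("\u0141", 'L'); // L with stroke
        map("\u0142", 'l'); // l with stroke
    }

    private static void map(String accented, char base) {
        for (int i = 0; i < accented.length(); i++) {
            FOLD[accented.charAt(i)] = base;
        }
    }

    /** Replaces each character inside the table range with its folded form. */
    static String fold(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] < TABLE_SIZE) {
                chars[i] = FOLD[chars[i]];
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(fold("d\u00E9j\u00E0 vu")); // prints "deja vu"
    }
}
```

A char-to-char table like this cannot express one-to-many foldings
(for example a ligature that should become two letters); that would
need a table of strings, or the hash-based approach mentioned above for
sparse ranges.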

Just to clear up some unicode/terminology issues:

There are latin1 characters (the actual glyphs) represented by unicode
code points 0->255
There is also a latin1 encoding for unicode (which can only represent
unicode code points 0->255)
UTF8 is another encoding for unicode characters (or code points), but
that's not really relevant to a filter.

So ISOLatin1AccentFilter removes accents from characters <= 255, and
it doesn't matter what the original encoding was (ascii, latin1, UTF8,
UTF16, etc)

-Yonik


On 9/12/06, Michael Imbeault <[hidden email]> wrote:
> [...]

Re: UTF8 accents & umlauts filter?

kkrugler
>Thanks for the links Michael... this one does look interesting:
>http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
>The challenge would be to make it fast... perhaps a custom hash table,
>or look into the cost of a perfect hash function.
>
>Just to clear up some unicode/terminology issues:

Some additional clarification below:

>There are latin1 characters (the actual glyphs) represented by unicode
>code points 0->255

Only U+00A0 -> U+00FF would be considered Latin-1 by Unicode.

Unicode calls the block of code points from U+0000 -> U+007F
"C0 Controls and Basic Latin".

The block from U+0080 -> U+00FF is "C1 Controls and Latin-1 Supplement".

>There is also a latin1 encoding for unicode (which can only represent
>unicode code points 0->255)

There's an ISO 8859-1 charset (combination of character set, code
points and encoding) that matches Unicode code points for 0x00 ->
0x7F and 0xA0 -> 0xFF. Or rather, the Unicode code points for these
two ranges were selected to match ISO 8859-1.

>UTF8 is another encoding for unicode characters (or code points), but
>that's not really relevant to a filter.
>
>So ISOLatin1AccentFilter removes accents from characters <= 255, and
>it doesn't matter what the original encoding was (ascii, latin1, UTF8,
>UTF16, etc)

This isn't really about the "original encoding" - by the time
ISOLatin1AccentFilter is called, it's dealing with Java strings,
which use the UTF-16 Unicode encoding.
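
A small sketch of this point (the class and method names are made up
for the example): the same word decoded from Latin-1 bytes and from
UTF-8 bytes ends up as the same Java string, so a token filter working
on strings cannot tell which encoding the input file used.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: the source encoding disappears once bytes are decoded to a String. */
final class EncodingDemo {
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, in two encodings.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        byte[] utf8 = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};

        String fromLatin1 = decodeLatin1(latin1);
        String fromUtf8 = decodeUtf8(utf8);

        // Both strings hold the single UTF-16 code unit U+00E9 at index 3.
        System.out.println(fromLatin1.equals(fromUtf8)); // prints "true"
    }
}
```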

I think what Michael is asking for is the implementation of one of
the Unicode-defined normalization forms (see
http://www.unicode.org/reports/tr15/) along with diacritical
stripping and other folding. Basically it's a way of mapping
characters to a primary sort key.
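
As a rough illustration of what the normalization forms do, the sketch
below uses java.text.Normalizer from the JDK (Java 6 and later) rather
than ICU. The class name NormalizationDemo and its helper methods are
hypothetical. It shows that an accented letter can be stored either as
one precomposed code point or as a base letter plus a combining mark,
and that normalization converts between the two.

```java
import java.text.Normalizer;

/** Sketch: composed (NFC) versus decomposed (NFD) forms of the same text. */
final class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute, one code point
        String decomposed = "e\u0301"; // e followed by combining acute accent

        // Different code unit sequences, so plain equals() fails.
        System.out.println(composed.equals(decomposed));      // prints "false"
        // After normalizing both to the same form they match.
        System.out.println(nfd(composed).equals(decomposed)); // prints "true"
        System.out.println(nfc(decomposed).equals(composed)); // prints "true"
    }
}
```

Locale-specific folding of the kind described next goes beyond this and
is not covered by the sketch.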

This is pretty complex, especially when you start considering
locale-specific details - we used ICU support for this in the past,
which is where I'd probably start. ICU needs a lot of data to handle
this properly across most locales, so it's not lightweight, but it
would give you a general (albeit slower) solution.

-- Ken




--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: UTF8 accents & umlauts filter?

Michael Imbeault
In reply to this post by Yonik Seeley-2
Thanks Yonik & Ken for both answers; the explanations went a
little over my head, but I think you understood what I was talking
about! Basically, I'd like a better filter that removes all possible
accents (& umlauts as a bonus, for completeness' sake; I personally
would have no use for that).

I think it's way more work and way more complicated than I initially
thought it would be. Does anyone feel able to do this?

Michael Imbeault



RE: UTF8 accents & umlauts filter?

pbinkley
We use ICU4J to do the filtering based on Unicode blocks. See
http://icu.sourceforge.net/userguide/Transform.html for a sense of what
you can do. It's worth it for us because we need to normalize Cyrillic
as well as Roman text; it might be overkill for other situations. But it
does good work. The first example on the page linked above shows
accent stripping: you normalize to NFD (decomposed Unicode, where
accents are represented as separate non-spacing characters), then delete
all the non-spacing characters, and finally normalize back to composed
Unicode (NFC).
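
The three steps can be sketched without ICU4J, using only the JDK's
java.text.Normalizer (Java 6 and later) and a regular expression. This
is a stand-in for the ICU transform, not the ICU API itself, and the
class name AccentStripper is hypothetical.

```java
import java.text.Normalizer;

/** Sketch: strip accents by decomposing, removing marks, and recomposing. */
final class AccentStripper {
    static String strip(String input) {
        // Step 1: decompose, so each accent becomes a separate combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: delete all non-spacing marks (Unicode general category Mn).
        String withoutMarks = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: recompose whatever is left.
        return Normalizer.normalize(withoutMarks, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(strip("caf\u00E9"));  // prints "cafe"
        System.out.println(strip("\u00FCber"));  // prints "uber"
    }
}
```

One limitation to keep in mind: letters that have no canonical
decomposition, such as U+0141 (L with stroke), pass through this
pipeline unchanged, so a complete filter would still need an extra
mapping table for them.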

Peter

-----Original Message-----
From: Michael Imbeault [mailto:[hidden email]]
Sent: Wednesday, September 13, 2006 9:34 PM
To: [hidden email]
Subject: Re: UTF8 accents & umlauts filter?

[...]