Accented search

Accented search

climbingrose
Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google does with UTF-8 languages.

My requirements include:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
Trình Viên" is highlighted.
  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters "Lập Trình Viên", then Doc A should be given a higher
score than Doc B.
  if the query is "Lap Trinh Vien", Doc A should be given higher score.

Any ideas guys? Thanks in advance!

--
Regards,

Cuong Hoang

Re: Accented search

petercline
I'm not sure about a way to boost scores in this case, but you can
achieve the basic matching by applying a filter to the index and the
queries.  The ISOLatin1AccentFilter seems like it may work for you,
though I'm not entirely certain it will cover all the accented
characters you need.

My approach has been to write new filters, one to normalize the unicode
into the "decomposed" version, then one to manually strip out all of the
"add-on" characters (with decimal codepoint greater than 256).  I don't
know if this will always work, but it's worked well for me so far.

I would test out adding a <filter class="ISOLatin1AccentFilterFactory"/>
to your analyzer.  It might do the trick.  Once again, with this
approach I'm not sure how to boost either score, so someone else may
have better ideas.  I'm pretty new to all of this stuff.

Peter


RE: Accented search

pbinkley
In reply to this post by climbingrose
We've done this in a pre-Solr Lucene context by using the position increment: when a token contains accented characters, you add a stripped version of that token with a zero increment, so that for matching purposes the original and the stripped version are at the same position. Accents are not stripped from queries. The effect is that an accented search matches your Doc A, and an unaccented search matches Docs A and B. We do that after lower-casing the token.

There are some limitations: users might start to expect that they can freely add accents to restrict their search to accented hits, but if they don't match the accents exactly they won't get any hits: e.g. if a word contains two accented characters and the user only accents one of them in their query, they won't match the accented or the unaccented version.

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [hidden email]

~ The code is willing, but the data is weak. ~



RE: Accented search

Renaud Waldura-5
Peter:

Very interesting. To take care of the issue you mention, could you add
multiple "synonyms" with progressively fewer accents?

E.g. you'd index "préférence" as 4 tokens:
 préférence (unchanged)
 preférence (stripped one accent)
 préference (stripped the other accent)
 preference (stripped both accents)

Or does it yield too many tokens to be useful?

And how does this take care of scoring? Do you get a higher score with a
closer match?


 


Re: Accented search

Walter Underwood, Netflix
Generally, the accented version will have a higher IDF, so it
will score higher.

wunder
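
Wunder's point can be sketched with classic Lucene's DefaultSimilarity idf
formula, 1 + ln(numDocs / (docFreq + 1)): the rarer accented term gets the
larger idf. The document counts below are hypothetical, just to illustrate:

```java
public class IdfExample {
    // Classic Lucene (DefaultSimilarity) idf: rarer terms score higher.
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical index of 1000 docs: the accented form "lập" occurs
        // in 10 docs, while the stripped form "lap" (indexed for both
        // spellings) occurs in 200.
        double accented = idf(1000, 10);
        double stripped = idf(1000, 200);
        // The accented term's idf is larger, so an accented match
        // contributes more to the score.
        System.out.println(accented > stripped); // true
    }
}
```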


Re: Accented search

climbingrose
In reply to this post by pbinkley
Hi Peter,

It looks like a very promising approach for us. I'm going to implement a
custom Tokeniser based on your suggestions and see how it goes. Thank you
all for your comments!

Cheers



--
Regards,

Cuong Hoang

Re: Accented search

hossman
: It looks like a very promising approach for us. I'm going to implement
: a custom Tokeniser based on your suggestions and see how it goes. Thank
: you all for your comments!

you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next with a positionIncrement of 0.

I'm kind of surprised ISOLatin1AccentFilter doesn't have an option to do
this already -- it would certainly be a worthy patch to commit if someone
wants to submit it back to lucene-java.

: > don't match the accents exactly they won't get any hits: e.g. if a word
: > contains two accented characters and the user only accents one of them in
: > their query, they won't match the accented or the unaccented version.

this could be accounted for by generating all of the permutations of
unaccented characters when indexing -- it wouldn't solve the problem of a
source term containing only one accent and the user querying with only one
accent but on a different character ... you could work around this by
putting all of the permutations in at index time, but querying on the exact
term and the no-accent term at query time.


-Hoss
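
The permutation idea Renaud raised earlier in the thread could be sketched
like this (the class and method names are mine, not from any Solr or Lucene
API; char-based iteration assumes BMP text, which holds for Vietnamese and
Latin-1): each accented character either keeps its accent or is stripped, so
a word with k accented characters yields up to 2^k variants, which is where
the "too many tokens" concern comes from.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

public class AccentPermutations {
    // Strip accents from one character via NFD decomposition plus
    // removal of combining marks (needs Java 6's java.text.Normalizer).
    static String strip(char c) {
        return Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    // All variants of the word with each accented character either kept
    // or stripped, e.g. "préférence" yields the 4 variants Renaud lists.
    public static Set<String> permutations(String word) {
        Set<String> results = new LinkedHashSet<String>();
        permute(word, 0, new StringBuilder(), results);
        return results;
    }

    private static void permute(String word, int i, StringBuilder sb,
                                Set<String> out) {
        if (i == word.length()) {
            out.add(sb.toString());
            return;
        }
        char c = word.charAt(i);
        int mark = sb.length();
        sb.append(c);                         // branch 1: keep the accent
        permute(word, i + 1, sb, out);
        sb.setLength(mark);
        String stripped = strip(c);
        if (!stripped.equals(String.valueOf(c))) {
            sb.append(stripped);              // branch 2: strip it
            permute(word, i + 1, sb, out);
            sb.setLength(mark);
        }
    }
}
```

permutations("préférence") gives exactly the four variants from Renaud's
example; a word with many accented characters grows exponentially.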

Re: Accented search

Phillip Farber
I've seen mention of these filters:

  <filter class="schema.UnicodeNormalizationFilterFactory"/>
  <filter class="schema.DiacriticsFilterFactory"/>

But I don't see them in the 1.2 distribution.  Am I looking in the wrong
place?  What will the UnicodeNormalizationFilterFactory do for me?  I
can't find any documentation on it.

Thanks,

Phil

UnicodeNormalizationFilterFactory

Phillip Farber

Apologies for reposting.  I should have posted this in a new thread.

I've seen mention of these filters:

  <filter class="schema.UnicodeNormalizationFilterFactory"/>
  <filter class="schema.DiacriticsFilterFactory"/>

But I don't see them in the 1.2 distribution.  Am I looking in the wrong
place?  What will the UnicodeNormalizationFilterFactory do for me?  I
can't find any documentation on it.

Thanks,

Phil

Re: Accented search

Phillip Farber
In reply to this post by hossman
Regarding indexing words with accented and unaccented characters with
positionIncrement zero:

Chris Hostetter wrote:
>
> you don't really need a custom tokenizer -- just a buffered TokenFilter
> that clones the original token if it contains accent chars, mutates the
> clone, and then emits it next with a positionIncrement of 0.
>

Could someone expand on how to implement this technique of buffering and
cloning?

Thanks,

Phil

Re: Accented search

climbingrose
Here is how I did it (the code is from memory, so it might not be 100%
correct):

private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
  if (hasAccents) {
    hasAccents = false;
    return filteredToken;
  }
  Token t = input.next();
  if (t == null) { // end of stream
    return null;
  }
  String filteredText = removeAccents(t.termText());
  if (filteredText.equals(t.termText())) { // no accents
    return t;
  }
  filteredToken = (Token) t.clone();
  filteredToken.setTermText(filteredText);
  filteredToken.setPositionIncrement(0); // same position as the original
  hasAccents = true;
  return t;
}
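
The removeAccents(...) helper above isn't shown; a minimal version, assuming
Java 6's java.text.Normalizer is available (it is not part of any Solr API),
could look like this:

```java
import java.text.Normalizer;

public class AccentStripper {
    // Decompose to NFD so each accent becomes a separate combining mark,
    // then delete the marks (\p{M} is the Unicode mark category).
    public static String removeAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAccents("Lập Trình Viên")); // Lap Trinh Vien
        System.out.println(removeAccents("préférence"));     // preference
    }
}
```

Unlike a fixed character table, this handles any script with combining
marks, which matters for Vietnamese letters like ậ that fall outside the
Latin-1 range.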




--
Regards,

Cuong Hoang

Re: Accented search

Haschart, Robert J (rh9ec)

I was just facing the same issue and came up with the following solution.

I changed the schema.xml file so that the analyzers and filters for the
text field are as follows:


   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

These two lines are the new ones:
        <filter class="schema.UnicodeNormalizationFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>

The first line invokes a custom filter that I borrowed and modified, which
turns decomposed Unicode (like Pe'rez) into the composed form (Pérez).
The second line replaces accented characters with their unaccented
equivalents (Perez).

For the custom filter to work, you must create a lib directory as a
sibling to the conf directory and place the jar files containing the
custom filter there.

The Jars can be downloaded from the blacklight subversion repository at:

http://blacklight.rubyforge.org/svn/trunk/solr/lib/

The SolrPlugin.jar contains the classes UnicodeNormalizationFilter and
UnicodeNormalizationFilterFactory, which merely invoke the
Normalizer.normalize function in the normalizer jar (which is taken from
the marc4j distribution and is a subset of the icu4j library).

-Robert Haschart

Re: UnicodeNormalizationFilterFactory

hossman
In reply to this post by Phillip Farber

: I've seen mention of these filters:
:
:  <filter class="schema.UnicodeNormalizationFilterFactory"/>
:  <filter class="schema.DiacriticsFilterFactory"/>

Are you asking because you saw these in Robert Haschart's reply to your
previous question?  I think those are custom Filters that he has in his
project ... not open source (but I may be wrong).

they are certainly not something that comes out of the box w/ Solr.


-Hoss


RE: UnicodeNormalizationFilterFactory

Lance Norskog-2
ISOLatin1AccentFilterFactory works quite well for us. It solves our basic
euro-text keyboard searching problem, where "protege" should find
"protégé" (which has two accents).


Re: UnicodeNormalizationFilterFactory

Haschart, Robert J (rh9ec)
The ISOLatin1AccentFilter works well in the case described above by
Lance Norskog, i.e. for words containing accented characters where the
accented character is a single Unicode character for the letter with the
accent mark, as in protégé. However, in the data that we work with,
accented characters are often represented by a plain unaccented
character followed by the Unicode combining character for the accent
mark, roughly like this: prote'ge'. Such tokens emerge from the
ISOLatin1AccentFilter unchanged.

After some research I found the UnicodeNormalizationFilter mentioned
above, which did not work on my development system (because it relies on
features only available in Java 6), and which, when combined with the
DiacriticsFilter also mentioned above, would remove diacritics from
characters but also discard any Chinese or Russian characters, or
anything else outside the 0x0--0x7f range. Which is bad.

I first modified the filter to normalize the characters to the composed
normalized form (changing prote'ge' to protégé) and then pass the
results through the ISOLatin1AccentFilter. However, for accented
characters that have no composed normalized form (such as the n and s in
Zarin̦š), the accents are not removed.

So I took the approach of decomposing the accented characters and then
removing only the valid diacritics and zero-width combining characters
from the result, and the resulting filter works quite well. And since it
was developed as part of the Blacklight project at the University of
Virginia, it is open source under the Apache License.

If anyone is interested in evaluating or using the
UnicodeNormalizationFilter with their Solr installation, get the
UnicodeNormalizeFilter.jar from:

http://blacklight.rubyforge.org/svn/trunk/solr/lib/

and place it in a lib directory next to the conf directory in your Solr
home directory.

Robert Haschart

RE: UnicodeNormalizationFilterFactory

steve_rowe
Hi Robert,

Could you create a JIRA issue and attach your code to it?  That makes it easier for people to evaluate it (rather than just a binary distribution).

This sounds general enough to me that it would be a useful addition to Lucene itself.  Solr's factory could just be sugar on top then.

Thanks,
Steve
