dealing with utf-8 characters

carlos orrego
I have this issue:
If I query for pérez, I should get results that include both pérez and perez (without the accent).

This is the case on Google and on Solr, which I use on other projects. Why is Nutch not giving me the same results?

Any ideas?

Thanks

Re: dealing with utf-8 characters

Otis Gospodnetic-2
I cannot tell for sure without looking at the code, but my guess is that diacritics are simply not being stripped anywhere. I imagine you could modify the NutchAnalyzer to include that ISO...Filter, the same class that you must have configured in your Solr schema.xml.
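
To illustrate what "stripping diacritics" means here, below is a minimal sketch that uses only the JDK's java.text.Normalizer. It is not the Lucene filter mentioned above and not Nutch code; the class name AccentFolding and the method fold are hypothetical, chosen only for this example. The idea is that both the indexed text and the query are reduced to the same accent-free form, so that pérez and perez compare as equal.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch, Lucene, or Solr.
final class AccentFolding {

    // Decompose accented characters (NFD), drop the combining marks,
    // then lowercase so that matching is also case-insensitive.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexedTerm = "p\u00e9rez"; // "pérez", written with an escape to avoid source-encoding issues
        String queryTerm = "perez";

        // Both sides are folded the same way before comparison.
        System.out.println(fold(indexedTerm).equals(fold(queryTerm))); // prints true
    }
}
```

In a real search setup the equivalent step would presumably run inside the analyzer, and it has to be applied at both index time and query time; folding only one side would still leave pérez and perez as different terms.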

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: carlos orrego <[hidden email]>
To: [hidden email]
Sent: Saturday, April 5, 2008 12:50:23 AM
Subject: dealing with utf-8 characters


--
View this message in context: http://www.nabble.com/dealing-with-utf-8-characters-tp16502905p16502905.html
Sent from the Nutch - User mailing list archive at Nabble.com.