Cyrillic characters

Cyrillic characters

pgwillia
Hi all,

    I'm trying to adapt our old Cocoon/Lucene-based web search application
to one that is more Solr-ish.  Our old web app could handle queries
containing Cyrillic characters.  I'm finding that, in the packaged example
admin interface, entering a query with a string of Cyrillic characters
causes a java.lang.ArrayIndexOutOfBoundsException.  I've also noticed that
the URL built from the search form is not UTF-8 encoded.  If I manipulate
the query string by inserting a UTF-8 encoded string in the q= parameter,
the values are interpreted incorrectly, so I cannot use this approach as a
workaround.  My sample query is: ...... (the English word _canada_
translated into Russian) or
%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (UTF-8) or
%26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
(solr url encoding)

    I would appreciate any advice or suggestions that would allow me
to search for Cyrillic text in Solr.  If anyone knows why Solr behaves as
it does with the strange encoding, a brief explanation of what causes this
behaviour, and of what the encoding is (Unicode?), would be helpful.  If
anyone else has successfully forced Solr to accept UTF-8 encoded q=
parameters, I would love to know how you did it.

Thanks in advance!
Tricia

ps.  I am using Mozilla Firefox as my main browser, which leads to the
behaviour I reported above.  IE 6.0 works fine for Cyrillic, although
there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for
the same query as before).
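
For reference, the three encoded forms quoted above can be reproduced with a short Python sketch. It assumes (the thread does not confirm this) that the Firefox form is the query written as HTML numeric character references and then percent-encoded, and that the IE form is Windows-1251 bytes. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Recover the query text from the UTF-8 form quoted in the post.
utf8_form = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"
word = unquote(utf8_form, encoding="utf-8")

# 1) Percent-encoded UTF-8 bytes.
as_utf8 = quote(word, safe="", encoding="utf-8")

# 2) Assumption: the browser could not represent the text in an
#    ISO-8859-1 page, so it substituted numeric character references
#    (&#1050; and so on) and percent-encoded the resulting ASCII.
ncr_bytes = word.encode("latin-1", errors="xmlcharrefreplace")
as_ncr = quote(ncr_bytes, safe="")

# 3) Assumption: IE 6.0 sent the text as Windows-1251 bytes.
as_cp1251 = quote(word, safe="", encoding="cp1251")

for label, value in [("utf-8", as_utf8), ("ncr", as_ncr), ("cp1251", as_cp1251)]:
    print(f"{label:7} {value}")
```

If those assumptions hold, none of the three forms is specific to Solr: each is the same six characters serialized under a different charset before percent-encoding.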

Re: Cyrillic characters

WHIRLYCOTT
Crap, you're right.  I have a well-tested application that's using  
UTF-8 everywhere possible and I just tested with some Russian text.  
Solr's coughing up this as an exception:

Jul 18, 2006 6:00:05 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 1
         at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:141)
         at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:96)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:592)
         at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:94)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
         at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
         at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
         at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
         at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
         at org.mortbay.http.HttpServer.service(HttpServer.java:909)
         at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
         at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
         at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
         at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
         at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
         at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

You're going directly against Solr/Jetty, right?  Not proxied or  
mod_rewrite'd through to Apache?

Solr isn't properly encoding the data being received by the servlet.  
I think that I can fix this using some of the tricks that I've  
learned in building my site.  More later.

How much testing have people done using UTF-8 data on Solr?

phil.



On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:

> Hi all,
>
>    I'm trying to adapt our old cocoon/lucene based web search  
> application to one that is more solrish.  Our old web app was  
> capable of searching for queries with cyrillic characters in them.  
> I'm finding that using the packaged example admin interface  
> entering a query with a string of cyrillic characters causes a  
> java.lang.ArrayIndexOutOfBoundsException. I've also noted that the  
> url built from the search form is not utf-8 encoded.  So obviously  
> if I try to manipulate the query string by inserting a utf-8  
> encoded string in the q= parameter the values are interpreted  
> incorrectly and as such I cannot use this approach as a work-
> around.  My sample query is: ...... (the english word _canada_  
> translated into russian) or %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0  
> (utf-8) or %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%
> 231076%3B%26%231072%3B (solr url encoding)
>
>    I would appreciate any advice or suggestions that would allow me  
> to search for cyrillics in solr.  If anyone knows why solr is  
> behaving as it does with the strange encoding, a brief explanation  
> of what causes this behaviour could be helpful and what the  
> encoding is (unicode?).  If anyone else has force solr to accept  
> utf-8 encoded q= parameters with success I would love to know how  
> you did it.
>
> Thanks in advance!
> Tricia
>
> ps.  I am using mozilla firefox as my main browser which leads to  
> the behaviour I reported above.  IE 6.0 works fine for cyrillics  
> although there is still a strange but different encoding (%CA%E0%ED%
> E0%E4%E0 for the same query as before).


--
                                    Whirlycott
                                    Philip Jacob
                                    [hidden email]
                                    http://www.whirlycott.com/phil/



Re: Cyrillic characters

Yonik Seeley
On 7/18/06, WHIRLYCOTT <[hidden email]> wrote:
> How much testing have people done using UTF-8 data on Solr?

UTF-8 query *output* is well tested with Resin within CNET.
Indexing UTF-8 is also well tested (again, mostly with Resin).
UTF-8 query input is not really tested at all AFAIK (the q param to
the standard request handler).

-Yonik

Re: Cyrillic characters

Yonik Seeley
OK, let's split up the indexing side from the query side for a moment
and assume that you are indexing correctly (setting the content-type
correctly, etc).

I just added a new value to the multi-valued features field to the
solr.xml example document:
  "Good unicode support: héllo (hello with an accent over the e)"
or in the XML:
  <field name="features">Good unicode support: h&#xE9;llo (hello with
an accent over the e)</field>

I used a numeric entity because post.sh doesn't specify any
content-type (ascii or latin1 may be assumed).  But as I said, let's
assume things are indexed correctly for now.

The URI standard says the following:
'''When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded. For example, the character A would be represented as
"A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
represented as "%C3%80", and the character KATAKANA LETTER A would be
represented as "%E3%82%A2".'''

http://www.gbiv.com/protocols/uri/rfc/rfc3986.html

So, the Unicode code point for the e with an acute accent is \u00E9.
In UTF-8 encoding it's a two-byte sequence: 0xC3, 0xA9

In both Firefox and IE, the following URI works fine to find the document:
http://localhost:8983/solr/select/?stylesheet=&q=h%C3%A9llo

If I try pasting héllo from notepad directly into the URL, IE works
fine, but Firefox substitutes the accented e with %E9, which is
incorrect.
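
The same byte sequences can be checked outside a browser. The following Python sketch (standard library only, not part of Solr) encodes the test word both ways and reproduces the two RFC 3986 examples quoted above.

```python
from urllib.parse import quote

word = "h\u00e9llo"  # "hello" with U+00E9 in place of the e

# U+00E9 is the two-byte sequence 0xC3 0xA9 in UTF-8.
utf8_bytes = "\u00e9".encode("utf-8")

# Percent-encoding the UTF-8 bytes gives the URI that finds the document.
utf8_query = quote(word, encoding="utf-8")

# Percent-encoding the single ISO-8859-1 byte gives what Firefox substituted.
latin1_query = quote(word, encoding="latin-1")

# The two non-ASCII examples from the RFC text (UTF-8 is quote's default).
a_grave = quote("\u00c0")     # LATIN CAPITAL LETTER A WITH GRAVE
katakana_a = quote("\u30a2")  # KATAKANA LETTER A

print(utf8_bytes.hex(), utf8_query, latin1_query, a_grave, katakana_a)
```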

I haven't tried more complicated examples yet, and I haven't tried
wget, etc, but things look like they are working as expected so far
(with the exception of a firefox bug).

-Yonik

Re: Cyrillic characters

Chris Hostetter-3
In reply to this post by pgwillia

: ps.  I am using mozilla firefox as my main browser which leads to the
: behaviour I reported above.  IE 6.0 works fine for cyrillics although
: there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for
: the same query as before).

The problem may not be in the Solr internals as much as in the form on the
admin screen -- I'm not on a computer where I can do any testing, but the
problem may be that the <form> tag in index.jsp/form.jsp doesn't specify
any charset options, so the browser is making an assumption (and the Solr
internals are making a different one).

Another possibility is that this is "yet another Jetty issue".

Things I'd try if I had the time/resources:

1) Make a JUnit test that executes the query you are trying -- this should
rule out the possibility of a Lucene/SolrCore bug.

2) Try running Solr in Tomcat and see if that has the same problem.

3) Try adding an accept-charset attribute to the form on the admin screens
and see if that fixes the problem.



-Hoss


Re: Cyrillic characters

Yonik Seeley
In reply to this post by Yonik Seeley
Definitely some Firefox bugs with UTF8 at least:
If I go to the admin screen, and paste in héllo into the query box,
then kill Solr and run netcat to see exactly what I get, it's the
following:

$ nc -l -p 8983
GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1&start=0&rows=10&indent=on HTTP/1.1
Host: localhost:8983
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://localhost:8983/solr/admin/
Cookie: JSESSIONID=3nqupchdew5mh


URLs should be percent-encoded UTF-8 bytes, or at least UTF-8 bytes.
ISO-latin1 isn't acceptable.
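
One way to tell which encoding a captured request line used is to percent-decode the q value to raw bytes and test whether those bytes are valid UTF-8. A minimal Python sketch follows; the helper name `query_is_utf8` is hypothetical and not part of Solr or any servlet container.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def query_is_utf8(request_line: str, param: str = "q") -> bool:
    """Return True if the percent-decoded value of `param` is valid UTF-8."""
    # A request line looks like: METHOD <target> HTTP/<version>
    target = request_line.split(" ")[1]
    query = urlsplit(target).query
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        if name == param:
            # Decode to bytes only, so no charset is assumed yet.
            raw = unquote_to_bytes(value.replace("+", " "))
            try:
                raw.decode("utf-8")
                return True
            except UnicodeDecodeError:
                return False
    raise KeyError(param)


captured = ("GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1"
            "&start=0&rows=10&indent=on HTTP/1.1")
print(query_is_utf8(captured))
```

Pure ASCII queries pass this check under either encoding, so it only distinguishes the two cases when the query contains non-ASCII characters.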

-Yonik

Re: Cyrillic characters

WHIRLYCOTT
In reply to this post by Yonik Seeley
I've started poking around and have already fixed one bug related to
URL encoding of data.  I'm going to work some more on this tonight
and will hopefully have a patch for you soon.

phil.

On Jul 18, 2006, at 6:19 PM, Yonik Seeley wrote:

> On 7/18/06, WHIRLYCOTT <[hidden email]> wrote:
>> How much testing have people done using UTF-8 data on Solr?
>
> UTF-8 query *output* is well tested with Resin within CNET.
> Indexing UTF-8 is also well tested (again, mostly with Resin).
> UTF-8 query input is not really tested at all AFAIK (the q param to
> the standard request handler).
>
> -Yonik


--
                                    Whirlycott
                                    Philip Jacob
                                    [hidden email]
                                    http://www.whirlycott.com/phil/



Re: Cyrillic characters

Yonik Seeley
In reply to this post by pgwillia
On 7/18/06, Tricia Williams <[hidden email]> wrote:
>  My sample query is: ...... (the english word _canada_
> translated into russian) or
> %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or
> %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
> (solr url encoding)

Hi Tricia,
Could you clarify what you mean by "solr url encoding"?  Where do you see this?
The servlet container decodes URLs, and I'm not sure where in Solr
that URLs are encoded.

-Yonik

Re: Cyrillic characters

WHIRLYCOTT
In reply to this post by pgwillia
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:

> that using the packaged example admin interface entering a query  
> with a string of cyrillic characters causes a  
> java.lang.ArrayIndexOutOfBoundsException

... I have this much fixed as well.

However, I'm still walking data through the stack and I'm not yet  
convinced that my data is being stored properly as UTF-8 strings.  It  
could be a character encoding issue in the client that I'm using to  
hit the /solr/update servlet or it could be something more insidious.

But I need this stuff working for my own site (www.stylefeeder.com,  
in case you care...), so I will continue with this and report back.

phil.


--
                                    Whirlycott
                                    Philip Jacob
                                    [hidden email]
                                    http://www.whirlycott.com/phil/



Re: Cyrillic characters

pgwillia
In reply to this post by Yonik Seeley
Hi Yonik,

    I was incorrect to describe it as _solr encoding_.  Hoss suggested that
it might be a form error - I haven't checked this yet but it sounds
plausible.  What I called the _solr url encoding_ was the q= parameter
translated into <I'm not sure what> encoding in the URL.  As I mention in
my ps, this translated value is not the same as when I use IE to post the
same form values.

    You mentioned in an earlier post that q=h%C3%A9llo would find
matching hits.  My experience shows that while the UTF-8 encoded query
doesn't generate any exceptions, no results are matched.  However,
q=h%e9llo would find matching results (the result set I'd match in Luke).
So assuming that I can fix the form encoding errors so that the characters
are encoded as UTF-8, I believe that I would continue to get incorrect
results.  Will Cyrillic characters be treated any differently than the
diacritic in your example?

    I have Solr running in Tomcat 5.5.17.

Thanks for all your help,
Tricia


On Tue, 18 Jul 2006, Yonik Seeley wrote:

> On 7/18/06, Tricia Williams <[hidden email]> wrote:
>>  My sample query is: ...... (the english word _canada_
>> translated into russian) or
>> %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or
>> %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
>> (solr url encoding)
>
> Hi Tricia,
> Could you clarify what you mean by "solr url encoding"?  Where do you see
> this?
> The servlet container decodes URLs, and I'm not sure where in Solr
> that URLs are encoded.
>
> -Yonik
>

Re: Re: Cyrillic characters

Bertrand Delacretaz
On 7/19/06, Tricia Williams <[hidden email]> wrote:

> ...What I called the _solr url encoding_ was the q= parameter
> translated into <I'm not sure what> encoding in the url...

I think I've seen the same problem, haven't investigated deeper but
IIUC the encoding used when posting a form is related to both the
encoding indicated by the web server in the HTTP headers, and the
encoding indicated (optionally) in the HTML page with something like
<meta content="text/html; charset=UTF-8" http-equiv="content-type"/>

In my case I've found that, running SOLR from start.jar with default settings:

-If I search "désormais" from the solr/admin page, it is translated to
q=d%E9sormais in the URL, and nothing's found (the word is in my
index)

-If I replace the q= value with q=d%C3%A9sormais (which is the
encoding that I get when entering this word in the Google search
form), my query works

I haven't seen the problem with my own search form, which includes the
above http-equiv meta and is served as a static page from my web
server.

So I think something's wrong with the encoding on the solr/admin/
search page, but I haven't investigated further.

Hope this helps...not sure if it does but the above scenario looks
similar to yours.

-Bertrand

Re: Cyrillic characters

Yonik Seeley
In reply to this post by pgwillia
On 7/19/06, Tricia Williams <[hidden email]> wrote:
>     You mentioned in another earlier post that q=h%c3%e9 would find
> matching hits.  My experience shows that while the UTF-8 encoded query
> doesn't generate any exceptions, no results are matched.  However
> q=h%e9llo would find matching results.

Confirmed in Tomcat 5.5.17, LOL!

So Firefox->Tomcat works for latin1 at least
and IE->Jetty also works for latin1

By my reading of the standards, UTF8 (or percent encoded UTF8 bytes)
is the only correct format for a URI to be in.

Can anyone else shed some light on this?

-Yonik

Re: Cyrillic characters

WHIRLYCOTT
I submitted two patches that fix one problem with URL encoding and  
another with the screens on the webapp.

        http://issues.apache.org/jira/browse/SOLR-35

phil.

On Jul 19, 2006, at 11:58 AM, Yonik Seeley wrote:

> On 7/19/06, Tricia Williams <[hidden email]> wrote:
>>     You mentioned in another earlier post that q=h%c3%e9 would find
>> matching hits.  My experience shows that while the UTF-8 encoded  
>> query
>> doesn't generate any exceptions, no results are matched.  However
>> q=h%e9llo would find matching results.
>
> Confirmed in Tomcat 5.5.17, LOL!
>
> So Firefox->Tomcat works for latin1 at least
> and IE->Jetty also works for latin1
>
> By my reading of the standards, UTF8 (or percent encoded UTF8 bytes)
> is the only correct format for a URI to be in.
>
> Can anyone else shed some light on this?
>
> -Yonik


--
                                    Whirlycott
                                    Philip Jacob
                                    [hidden email]
                                    http://www.whirlycott.com/phil/



Re: Re: Cyrillic characters

Bertrand Delacretaz
In reply to this post by Yonik Seeley
On 7/19/06, Yonik Seeley <[hidden email]> wrote:

> ...Can anyone else shed some light on this?..

I have to run now but I *think* there are encoding settings in
web.xml, and IIRC they might be different for Tomcat or Jetty. Setting
UTF-8 everywhere should help.

-Bertrand

Re: Cyrillic characters

WHIRLYCOTT
In reply to this post by Bertrand Delacretaz
On Jul 19, 2006, at 11:44 AM, Bertrand Delacretaz wrote:

> -If I search "désormais" from the solr/admin page, it is translated to
> q=d%E9sormais in the URL, and nothing's found (the word is in my
> index)

http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset

"The default value for this attribute is the reserved string  
“UNKNOWN”. User agents may interpret this value as the character  
encoding that was used to transmit the document containing this FORM  
element."

Solr-trunk currently uses ISO-8859-1 as the character encoding for  
the admin pages.  One of the patches I submitted changes the admin  
pages to use UTF-8 and that fixes the problem.

phil.


--
                                    Whirlycott
                                    Philip Jacob
                                    [hidden email]
                                    http://www.whirlycott.com/phil/



Re: Cyrillic characters

Yonik Seeley
On 7/19/06, WHIRLYCOTT <[hidden email]> wrote:
> Solr-trunk currently uses ISO-8859-1 as the character encoding for
> the admin pages.  One of the patches I submitted changes the admin
> pages to use UTF-8 and that fixes the problem.

OK, we are closer to working correctly.  It appears that the web
browsers are trying to be smart when submitting form data and using
the encoding of the received page to submit the HTTP-GET (non-standard
behaviour as I read it, but it may be to support legacy stuff).

So changing the admin pages to use UTF-8, and clearing the browser
caches, does indeed make both Firefox and IE send percent-encoded
UTF-8 (h%C3%A9llo).

Now the problem: Tomcat 5.5.17 isn't decoding percent-encoded UTF-8,
but instead treating %C3%A9 as two separate characters.  Soooo, I
think Bertrand is right about there being some web.xml setting....
time to hit the tomcat docs, and if that fails, grab Yoav's attention
:-)
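
The symptom described here, %C3%A9 treated as two separate characters, is what results when UTF-8 bytes are decoded as ISO-8859-1. The Python sketch below imitates that decoding step; it illustrates the effect only and assumes nothing about Tomcat's actual code path.

```python
from urllib.parse import unquote

sent = "h%C3%A9llo"  # what the browsers send once the page is UTF-8

# Decoding the bytes as ISO-8859-1 turns the two UTF-8 bytes into two
# characters (U+00C3 and U+00A9), so the term no longer matches the index.
as_latin1 = unquote(sent, encoding="latin-1")

# Decoding them as UTF-8 recovers the single character U+00E9.
as_utf8 = unquote(sent, encoding="utf-8")

# The reverse case: under ISO-8859-1 decoding, the legacy form h%E9llo
# yields the intended word, which is why it matched earlier.
legacy = unquote("h%E9llo", encoding="latin-1")

print(len(as_latin1), [hex(ord(c)) for c in as_latin1])
print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(legacy == as_utf8)
```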

I would be interested to know what some of the built-in HTTP client
libs out there do:
  - HTTPClient, Python, Ruby, Rhino, etc.
Hopefully most do the right thing w.r.t. UTF-8, but if not, one can
always POST queries with a Content-Type that specifies charset=UTF-8.
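
As one data point for Python: the standard library's `urllib.parse.urlencode` uses UTF-8 by default in Python 3, so a client built on it sends percent-encoded UTF-8 without extra work. The sketch below also builds (but does not send) a POST request that states its charset explicitly. Whether a given Solr handler honours that header is an assumption to verify, and the URL is the example one used earlier in the thread.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "h\u00e9llo", "start": 0, "rows": 10}

# GET: urlencode percent-encodes str values as UTF-8 by default.
query_string = urlencode(params)
get_url = "http://localhost:8983/solr/select/?" + query_string

# POST: same form body, with the charset declared in the Content-Type header.
post_request = Request(
    "http://localhost:8983/solr/select/",
    data=query_string.encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
)

print(get_url)
print(post_request.get_method(), post_request.data)
```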


-Yonik

Re: Cyrillic characters

Yonik Seeley
On 7/19/06, Yonik Seeley <[hidden email]> wrote:
> Now the problem: Tomcat 5.5.17 isn't decoding percent-encoded UTF-8,
> but instead treating %C3%A9 as two separate characters.

Here's the magic for Tomcat:
http://split-s.blogspot.com/2005/12/internationalized-get-parameters-with.html

Edit server.xml and add the following attribute to the Connector element:


<Server ...>
  <Service ...>
    <Connector ... URIEncoding="UTF-8">
      ...
    </Connector>
  </Service>
</Server>


-Yonik