UTF-8 encoding problem on one of two Solr setups

UTF-8 encoding problem on one of two Solr setups

Mario Knezovic
Hi all,

I have set up identical Solr 1.1 installations on two different machines. One
works fine, the other one has a UTF-8 encoding problem.

#1 is my local Windows XP machine. Solr is running basically in a
configuration like in the tutorial example with Jetty/5.1.11RC0 (Windows
XP/5.1 x86 java/1.6.0). Everything works fine here as expected.

#2 is a Linux machine with Solr running inside Tomcat 6. The problem happens
here. This is where Solr will ultimately run.

To rule out problems in my PHP and Java code, I reproduced the problem with
the Solr admin page, and it happens there as well. (Tested with Firefox 2
with the page's character encoding set to UTF-8.)

When entering an arbitrary search string containing UTF-8 chars I get a
correct response from the local Windows Solr setup:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="start">0</str>
  <str name="q">München</str>  <-- sample string containing a German umlaut-u
  <str name="rows">10</str>
  <str name="version">2.2</str>
 </lst>
</lst>
[...]

When I do exactly the same, just on the admin page of the other Solr setup
(but from exactly the same browser), I get the following response:

[...]
<str name="q">item$searchstring_de:MÃ¼nchen</str>
[...]

Obviously the two UTF-8 bytes of the umlaut-u, 0xC3 0xBC, were interpreted as
two 8-bit chars instead of one UTF-8 char.
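
For illustration, here is a minimal Java sketch of the misinterpretation described above. It is not taken from Solr or Tomcat; the class name MojibakeDemo and the helper decodeUtf8BytesAs are hypothetical, and it assumes a Java version that provides java.nio.charset.StandardCharsets (Java 7 or later). It encodes the query as UTF-8 and then decodes the same bytes as ISO-8859-1, which turns the single umlaut-u into two characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode text as UTF-8, then decode those bytes
    // with another (possibly wrong) charset, as a misconfigured server might.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        String query = "M\u00FCnchen"; // "Muenchen" written with umlaut-u (U+00FC)

        // The umlaut-u occupies two bytes in UTF-8.
        byte[] utf8 = query.getBytes(StandardCharsets.UTF_8);
        System.out.printf("umlaut-u bytes: 0x%02X 0x%02X%n",
                utf8[1] & 0xFF, utf8[2] & 0xFF);

        // Correct decoding keeps 7 characters; decoding as ISO-8859-1
        // yields 8, because each byte becomes its own character.
        String correct = decodeUtf8BytesAs(query, StandardCharsets.UTF_8);
        String garbled = decodeUtf8BytesAs(query, StandardCharsets.ISO_8859_1);
        System.out.println("correct length: " + correct.length());
        System.out.println("garbled length: " + garbled.length());
    }
}
```

The garbled string contains U+00C3 followed by U+00BC in place of U+00FC, which matches the symptom in the second response.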

Unfortunately I am pretty new to Solr, Tomcat and related topics, so I have
not been able to find the problem yet. My guess is that it is outside of Solr,
maybe in the Tomcat configuration, but so far I have spent the entire day
without finding any further clue.

But apart from that Solr really rocks. Indexing tons of content and
searching works just fine and fast and it was pretty easy to get into
everything. Now I am changing all data to UTF-8 and ran into my first
serious obstacle... after a few weeks of Solr usage!

Any hint/help appreciated. Thank you very much.

Mario

RE: UTF-8 encoding problem on one of two Solr setups

Charlie Jackson
You might want to check out this page
http://wiki.apache.org/solr/SolrTomcat

Tomcat needs a small config change out of the box to properly support UTF-8.


Thanks,
Charlie


Re: UTF-8 encoding problem on one of two Solr setups

Sean Timm
In reply to this post by Mario Knezovic
This may be your problem. The docs below are for the HTTP connector; a similar configuration can be made for the AJP and other connectors.

See
http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

URIEncoding

This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.
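
As a rough sketch of where this attribute goes: assuming the stock HTTP connector in Tomcat's conf/server.xml, the change would look something like the fragment below. The port, protocol, connectionTimeout and redirectPort values are only illustrative defaults and may differ in your installation; the URIEncoding attribute is the relevant part.

```xml
<!-- conf/server.xml (excerpt). Only URIEncoding matters here;
     the other attribute values are illustrative and may differ. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```

Tomcat has to be restarted for the change to take effect. If requests reach Solr through another connector (for example AJP), that connector would need the same attribute.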

-Sean

RE: UTF-8 encoding problem on one of two Solr setups

Mario Knezovic
In reply to this post by Charlie Jackson
> You might want to check out this page
> http://wiki.apache.org/solr/SolrTomcat
>
> Tomcat needs a small config change out
> of the box to properly support UTF-8.

This solved the problem exactly.

Thanks a lot!

Mario