Problem with surrogate characters in utf-8

Problem with surrogate characters in utf-8

Burkamp, Christian

Hi all,

I have a problem after updating to Solr 1.2. I'm using the bundled Jetty that comes with the latest Solr release.
Some of the content stored in my index contains characters from the Unicode private use area at and above 0x100000. (They are used by some proprietary software, and the text extraction does not strip them out.)

In contrast to Solr 1.1, the current release returns these characters encoded as a sequence of two surrogate characters. Could this result from a UTF-16 conversion taking place somewhere in the system? A look into the index with Luke suggests that Lucene stores its data in UTF-16 encoding: the code point 0x100058 is stored as the two surrogate characters 0xDBC0 and 0xDC58. This behaviour is the same in Solr 1.1 and 1.2. But while Solr 1.1 combines the pair into one 4-byte UTF-8 character in the result, Solr 1.2 returns the UTF-8 encoding of each of the two surrogate characters that I see in Luke. Unfortunately, this results in invalid UTF-8 text that, for example, cannot be displayed by Internet Explorer.
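
To make the two encodings concrete, here is a minimal Python sketch of the arithmetic described above. It is only an illustration and not code from Solr, Lucene, or Jetty; the helper names `to_surrogates` and `is_valid_utf8` are made up for this example. It splits 0x100058 into its UTF-16 surrogate pair and compares the regular 4-byte UTF-8 form with the 6-byte form you get when each surrogate is encoded separately.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def is_valid_utf8(data):
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


CODE_POINT = 0x100058
high, low = to_surrogates(CODE_POINT)

# Regular UTF-8: one 4-byte sequence for the whole code point.
valid = chr(CODE_POINT).encode("utf-8")

# What the thread describes for Solr 1.2: each surrogate encoded on its own.
# "surrogatepass" is needed because Python refuses this by default.
broken = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(hex(high), hex(low))                 # 0xdbc0 0xdc58
print(valid.hex(), is_valid_utf8(valid))   # f4808198 True
print(broken.hex(), is_valid_utf8(broken)) # edaf80edb198 False
```

A strict decoder rejects the second form, which would be consistent with the browser error described here.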

A request like http://localhost:8983/solr/select?q=*:* results in an error message from the browser.

This is easy to reproduce if someone wants to debug it. I have attached a valid UTF-8 encoded XML document that contains the 4-byte encoded code point 0x100058. It can be indexed with post.jar. Sending this request via Internet Explorer then results in an error: http://localhost:8983/solr/select?q=*:*
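
For anyone without the attachment, a similar test document could be generated along these lines. This is a sketch, not the attached file: the function name `build_add_document`, the document id, and the field names `id` and `name` are assumptions (they would have to match your schema), and the standard `<add><doc><field>` update format is assumed.

```python
import xml.etree.ElementTree as ET


def build_add_document(doc_id, text):
    """Build a Solr-style <add> document as UTF-8 bytes.

    The field names "id" and "name" are assumptions for illustration;
    adjust them to whatever your schema defines.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("name", text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    # With encoding="utf-8" the supplementary character is written
    # directly as a 4-byte sequence, not as a character reference.
    return ET.tostring(add, encoding="utf-8")


payload = build_add_document("utf-test-1", "surrogate test " + chr(0x100058))
print(b"\xf4\x80\x81\x98" in payload)  # True: 4-byte form is present
```

Writing `payload` to a file in binary mode gives something that can be posted the same way as the attachment.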

<<utf.xml>>
I tried the new Solr 1.2 war file with the old example distribution (Solr 1.1 and Jetty 5.1). Surprisingly enough, this does not show the problem. So the whole story might even be a Jetty issue.
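
To compare the two setups without relying on what a browser does, the raw response bytes can be checked directly. The following is a rough sketch (the function name `find_encoded_surrogates` is made up for this example): it scans a response body for 3-byte sequences of the form ED A0..BF 80..BF, which is how a surrogate looks when it is encoded on its own. Such sequences never occur in well-formed UTF-8.

```python
def find_encoded_surrogates(body):
    """Return byte offsets where a lone surrogate was encoded as 3 bytes."""
    offsets = []
    for i in range(len(body) - 2):
        # U+D800..U+DFFF encode as ED, then A0..BF, then 80..BF.
        if (body[i] == 0xED
                and 0xA0 <= body[i + 1] <= 0xBF
                and 0x80 <= body[i + 2] <= 0xBF):
            offsets.append(i)
    return offsets


# Hand-built sample bodies; a real check would use the bytes from the server.
good_body = b"<str>" + chr(0x100058).encode("utf-8") + b"</str>"
bad_body = b"<str>\xed\xaf\x80\xed\xb1\x98</str>"

print(find_encoded_surrogates(good_body))  # []
print(find_encoded_surrogates(bad_body))   # [5, 8]
```

The body could come, for example, from `urllib.request.urlopen(url).read()` against each server; an empty list from one setup and hits from the other would help narrow down where the conversion goes wrong.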

Any ideas?

-- Christian


Attachment: utf.xml (214 bytes)

Re: Problem with surrogate characters in utf-8

Yonik Seeley-2
On 6/14/07, Burkamp, Christian <[hidden email]> wrote:
> I tried the new Solr 1.2 war file with the old example distribution (Solr
> 1.1 and Jetty 5.1). Surprisingly enough, this does not show the problem. So
> the whole story might even be a Jetty issue.

That definitely points to it being a Jetty issue.

-Yonik