International Charsets in embedded XML

Fabio Confalonieri
(Sorry, the last one was posted by mistake.)

Here I am again with charset encoding problems:

I need to store XML in a document field. I declare it as a string and wrap it in CDATA when I post the add XML.
Now the problem is that I have some international characters in the XML: say ì or à and also € (I don't know if you can read these).

When I get the XML field back from Solr, strange things happen:

- first: € gets converted to ? (I can see it in the index using Luke)

- if there is an ì (accented i), I get malformed XML back, in both Firefox and IE:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader><status>0</status><QTime>0</QTime></responseHeader>
  <result numFound="1" start="0">
    <doc>
      <str name="categoryid">/relazioni/</str>
      <str name="facetXML"><?xml version="1.0" encoding="UTF-8"?><xml>
        <filter field="typecamper_s">
        <item value="autocaravanmansardato">Autocaravan ìMansardato</item>
                                                                                   ^ HERE the problem begins: from this point on, "<" is no longer escaped

        <item value="semintegrale">Semintegrale</item>
        </filter>
        </xml>
       
        HERE the output continues, which should have been escaped after the problem point above:
       
        </item><item value="semintegrale">Semintegrale</item></filter>
        </xml>
      </str>
      ...
    </doc>
  </result>
</response>

But if I fetch the same document in my request handler (as a Document structure), I have no problem parsing the XML and I get the correct characters.
I have traced XML.escape and the problem is not there, so it's somewhere between XMLWriter and Jetty (I've tried the latest one, 5.1.11).

- if I put some international characters in a normal string field, I see that Solr stores the (I think) UTF-8 encoded characters in a string field just as in a text field type.

The question is: apart from the malformed XML issue, what is the best way to deal with international charsets?

Thank You

Fabio

Re: International Charsets in embedded XML

Mike Klaas
On 6/13/06, Fabio Confalonieri <[hidden email]> wrote:
>
> (sorry the last one got wrongly posted)

Are you sending Content-Type headers with the appropriate charset
indicated?  Is your XML fully escaped in your update message?

-Mike

Re: International Charsets in embedded XML

Fabio Confalonieri
Klaas-2 wrote:
> Are you sending Content-Type headers with appropriate charset
> indicated?  Is your xml fully escaped in your update message?

...no, actually I simply do

                        URLConnection conn = url.openConnection();
                        conn.setRequestProperty("ContentType", "text/xml");
                        conn.setDoOutput(true);
                        wr = new OutputStreamWriter(conn.getOutputStream());
                        wr.write(data);
                        wr.flush();

to post the add XML command, and my XML is embedded in a CDATA section without further escaping... do I have to do something else?

I'm getting the data from a MySQL DB, and I found some problems in retrieving data from there.

I've made some steps forward by connecting to the DB with "characterEncoding=utf8" in the JDBC URL, and then converting with:

new String(mysqlXMLField.getBytes("latin1"));

But I'm really not into charsets and encodings...


Re: International Charsets in embedded XML

kkrugler
>Klaas-2 wrote:
>>
>>  Are you sending Content-Type headers with appropriate charset
>>  indicated?  Is your xml fully escaped in your update message?
>>
>
>...no, actually I simply make a
>
> URLConnection conn = url.openConnection();
> conn.setRequestProperty("ContentType", "text/xml");
> conn.setDoOutput(true);
> wr = new OutputStreamWriter(conn.getOutputStream());
> wr.write(data);
> wr.flush();
>
>to post the add xml and my XML is embedded in a CData without further
>escaping... do I have to do something else?
>
>I'm getting data from a MySQL db and I found some problems in
>retrieving data from there.
>
>I've made some steps forward connecting to the db with
>"characterEncoding=utf8" in the jdbc URL, and then converting with:
>
>new String(mysqlXMLField.getBytes("latin1"));

If you use "characterEncoding=utf8", then I think you'll get back a
stream of UTF-8 bytes from the DB.

I don't know what mysqlXMLField's type is (from above), but you
should start with the array of bytes returned from the JDBC call, and
then create the string from this array using "UTF-8" as the encoding
name. Or just use those bytes directly when writing out the XML.
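
As a minimal sketch of the point above (not code from this thread): the class name DecodeDemo and the sample text are made up for illustration, and the byte array stands in for whatever the JDBC call returns. It decodes the same UTF-8 bytes once as UTF-8 and once as latin1, which shows why the charset name passed to the String constructor matters. It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** Decodes bytes that are known to be UTF-8 into a Java String. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1 (latin1); multi-byte characters come out garbled. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for bytes returned by the database: "città €" encoded as UTF-8.
        // Unicode escapes keep the example independent of the source file encoding.
        byte[] raw = "citt\u00e0 \u20ac".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(raw).length());   // 7 characters, as intended
        System.out.println(decodeLatin1(raw).length()); // 10 characters, one per byte
    }
}
```

With latin1 every byte becomes its own character, so à (two bytes in UTF-8) and € (three bytes) turn into sequences of unrelated characters.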

>But I'm really not into charsets and encodings...

The best thing to do is:

1. Make sure the XML you send to Solr starts with this line:

<?xml version="1.0" encoding="utf-8"?>

2. Make sure you've converted all of the text in the XML fields to
the UTF-8 character set.

Then don't wrap those fields with CDATA.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: International Charsets in embedded XML

Mike Klaas
In reply to this post by Fabio Confalonieri
On 6/13/06, Fabio Confalonieri <[hidden email]> wrote:
> Klaas-2 wrote:
> > Are you sending Content-Type headers with appropriate charset
> > indicated?  Is your xml fully escaped in your update message?

> ...no, actually I simply make a
>
>                         URLConnection conn = url.openConnection();
>                         conn.setRequestProperty("ContentType", "text/xml");

That should be 'Content-Type' (note the dash).  To specify a charset,
set the value to
"text/xml; charset=utf-8".

> to post the add xml and my XML is embedded in a CData without further
> escaping... do I have to do something else?

Yes.  How are you creating your xml?  You should be using an XML
writer, which should do the necessary escaping of <>'s.  I'm
unfamiliar with the specifics of java, so I'll leave others to suggest
something.
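
For Java, one possible approach is the StAX XMLStreamWriter in javax.xml.stream, which is part of the JDK from Java 6 on. The following is only a sketch under assumptions: the class and method names (AddXmlBuilder, buildAdd) are hypothetical, and the field names categoryid and facetXML are taken from the response shown in the first post. writeCharacters escapes "<" and "&" in the embedded XML, so no CDATA section is needed.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class AddXmlBuilder {
    /** Builds an <add> message with one document; the XML-valued field is escaped as plain text. */
    static String buildAdd(String categoryId, String facetXml) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("add");
            w.writeStartElement("doc");
            writeField(w, "categoryid", categoryId);
            writeField(w, "facetXML", facetXml);
            w.writeEndElement(); // </doc>
            w.writeEndElement(); // </add>
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not build add message", e);
        }
        return out.toString();
    }

    /** Writes <field name="...">value</field>; the writer escapes markup characters in value. */
    private static void writeField(XMLStreamWriter w, String name, String value)
            throws XMLStreamException {
        w.writeStartElement("field");
        w.writeAttribute("name", name);
        w.writeCharacters(value);
        w.writeEndElement();
    }

    public static void main(String[] args) {
        String facetXml = "<item value=\"x\">Autocaravan \u00ecMansardato \u20ac</item>";
        System.out.println(buildAdd("/relazioni/", facetXml));
    }
}
```

The resulting string still has to be sent as UTF-8 bytes, for example through an OutputStreamWriter created with "UTF-8", so that the bytes match the encoding named in the XML declaration and the Content-Type header.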

> I'm getting data from a MySQL db and I found some problems in
> retrieving data from there.
>
> I've made some steps forward connecting to the db with
> "characterEncoding=utf8" in the jdbc URL, and then converting with:

Is the column collation utf-8 as well?  Both the connection and the
column collation should be set to utf-8.

> new String(mysqlXMLField.getBytes("latin1"));

Why are you converting using the latin1 charset when you opened the
connection in utf-8?

> But I'm really not into charsets and encodings...

I'm afraid basic knowledge in this area is essential for all developers:
http://www.joelonsoftware.com/articles/Unicode.html

-Mike

Re: International Charsets in embedded XML

Fabio Confalonieri
In reply to this post by kkrugler
OK, thanks to your posts I've read some basics on encoding and made some changes to my code. Now it's all much clearer... but I still have some problems.

This is what I do (in case it helps someone with the same problems I had):

- I get the data from a DB, telling the JDBC connector to use UTF-8.

- then I convert it to Java's internal string encoding (UTF-16, I have learned) in this way:

                new String(rs.getBytes(rsField), "UTF-8")

This way I get the UTF-8 byte array from my result set (from MySQL) and tell the String constructor that the array is to be interpreted as UTF-8.

When I have to write the update XML document to Solr:

                URLConnection conn = url.openConnection();
                conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
                conn.setDoOutput(true);
                wr = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
                wr.write(data);
                wr.flush();

So I'm sure everything is converted back to UTF-8 when writing to the Solr update URL.

This way everything is fine when getting normal fields from documents (we get back all our diacritical characters and the Euro sign)... but:

- I cannot search using diacritics.
If I have a doc with a field containing "città", I cannot get it back with q=field:città (in the URL the à gets encoded as the single byte E0, like this: "citt%E0").
The strange thing is that with an old Solr on Jetty 6.0 beta, searching with diacritics worked, but responses came back from Solr doubly UTF-8 encoded (we had to decode twice). With the latest version of Solr on Jetty 5.1.x, responses are UTF-8 encoded once (as you would expect), but searching with diacritics does not work. Is there a particular way to do this?

- I still have problems getting back stored XML fields that contain diacritics. I've followed your advice and escaped the < sign myself, but the result is the same as using CDATA (I don't use DOM here). By the way, why did you say not to use CDATA?
I get the same malformed XML that I showed in my first post.
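
On the search problem above: the byte E0 in "citt%E0" is how ISO-8859-1 (latin1) encodes à, whereas UTF-8 encodes it as the two bytes C3 A0. The sketch below uses the standard java.net.URLEncoder to produce both forms; the class name QueryEncoding and its encode helper are made up for illustration. Which form a given servlet container decodes correctly depends on that container and its configuration, so treat this as a way to compare the two encodings rather than as a fix.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryEncoding {
    /** Percent-encodes a query value using the named charset. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String q = "field:citt\u00e0"; // field:città

        System.out.println(encode(q, "UTF-8"));      // field%3Acitt%C3%A0
        System.out.println(encode(q, "ISO-8859-1")); // field%3Acitt%E0
    }
}
```

The encoded value would then be appended to the select URL as the q parameter.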

Thank You again

   Fabio

Re: International Charsets in embedded XML

Fabio Confalonieri
OK, I found the cause:
the problem is Jetty; with Tomcat everything works fine.

I can search with diacritics (I found that Jetty required an extra UTF-8 encoding of query values in the URL)
AND
there are no more problems in responses with fields containing XML with diacritics and the Euro sign (and everything else, I suppose).

It's a pity, because Jetty is much slimmer to deploy and install, and perhaps faster, but anyway I think these problems should be documented somewhere.

Thanks to all

    Fabio

RE: International Charsets in embedded XML

Brian Lucas
It sounds like many people (myself included) try to avoid using yet another
application server (Tomcat) initially.  While I think it's already been
alluded to pretty well, it might be a good idea to stress on the Solr wiki
that the Jetty instance isn't fully debugged and is only recommended for
proof-of-concept.


-----Original Message-----
From: Fabio Confalonieri [mailto:[hidden email]]
Sent: Friday, June 16, 2006 3:55 AM
To: [hidden email]
Subject: Re: International Charsets in embedded XML


OK, I found the cause:
the problem is Jetty; with Tomcat everything works fine.

I can search with diacritics (I found that Jetty required an extra UTF-8
encoding of query values in the URL)
AND
there are no more problems in responses with fields containing XML with
diacritics and the Euro sign (and everything else, I suppose).

It's a pity, because Jetty is much slimmer to deploy and install, and
perhaps faster, but anyway I think these problems should be documented
somewhere.

Thanks to all

    Fabio

--
View this message in context:
http://www.nabble.com/International-Charsets-in-embedded-XML-t1780147.html#a4897795
Sent from the Solr - User forum at Nabble.com.


RE: International Charsets in embedded XML

Chris Hostetter-3

: application server (Tomcat) initially.  While I think it's already been
: alluded to pretty well, it might be a good idea to stress on the Solr wiki
: that the Jetty instance isn't fully debugged and is only recommended for
: proof-of-concept.

I think the goal was to avoid advocating for or against any particular
application server in all the documentation, because we didn't want people
to assume that they had to use a specific implementation (although if
providing Jetty for the example makes people think that Jetty is the way
to go, then maybe our problem isn't documentation).

But I have added a FAQ about multibyte characters that mentions people
have more success with Tomcat than Jetty in that regard ... if anyone has
any additional tips/comments to make on the subject, feel free to add them
to the Wiki...

http://wiki.apache.org/solr/FAQ#head-e4563a0de1698b4933b4056a7db2df8faec16cf1



-Hoss