Adding data as UTF-8

Adding data as UTF-8

Morten Fangel-3
Hi,

I've been working on adding some Solr integration to my current project, but
have run into a problem with non-ASCII characters.

I send a document like the following:

---
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
  <field name="question_id">228</field>
  <field name="question_title">Vedhæft billede til min formular</field>
  <field name="userid">26</field>
  <field name="question_text">Jeg har lavet en side som skal info om
værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen -
dvs nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om
deres håndværk udført på stedet.
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</field>
  <field name="question_date">2006-05-17T08:44:23Z</field>
  <field name="question_tags">Upload</field>
  <field name="question_tags">HTML</field>
  <field name="question_tags">Email</field>
  <field name="question_tags">Vedhæftning</field>
</doc></add>
---

But when I do a search like "/solr/select/?q=billede" (the default search
field is "text", which is a multiValued copyField from question_title and
question_text), I get the document back as:

---
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 ...
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <date name="question_date">2006-05-17T08:44:23Z</date>
  <int name="question_id">228</int>
  <arr name="question_tags"><str>Upload</str><str>HTML</str><str>Email</str>
        <str>Vedhæftning</str></arr>
  <str name="question_text">Jeg har lavet en side som skal info om værkstedet
Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs
nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om
deres håndværk udført på stedet.
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</str>
  <str name="question_title">Vedhæft billede til min formular</str>
  <str name="userid">26</str>
 </doc>
</result>
</response>
---

Which is basically the same text, but displayed as ISO-8859-1. How can this be?
Do I have to send some header saying it is UTF-8, or should I just send
the data as UTF-8 (that produces the correct encoding in answers, but sounds
like a silly way of doing it)?
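
To make the symptom concrete: the garbled characters in the response are what UTF-8 bytes look like when something along the way decodes them as ISO-8859-1. A minimal Python sketch of that effect follows; `misread_as_latin1` is a hypothetical helper for illustration, not part of Solr.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1.

    This mimics a receiver that assumes the wrong charset for UTF-8 input.
    """
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode("iso-8859-1")


def undo_misread(garbled: str) -> str:
    """Reverse the mistake: recover the bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    title = "Vedhæft billede til min formular"
    garbled = misread_as_latin1(title)
    print(garbled)
    print(undo_misread(garbled))
```

Each non-ASCII character becomes two characters because UTF-8 stores it in two bytes, and ISO-8859-1 maps every byte to one character.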

Any ideas?

By the way, the install script listed at http://wiki.apache.org/solr/SolrTomcat
is a bit wrong. Should I just contribute the fixes (new Solr dir and name to
fetch) to the wiki, or would any of you rather do it yourselves?

Regards
 -fangel

Re: Adding data as UTF-8

Bertrand Delacretaz
On 3/10/07, Morten Fangel <[hidden email]> wrote:

> ...I send a document like the following:
>
> ---
> <?xml version="1.0" encoding="UTF-8"?>...

I assume you're using your own code to "send" the document?

Currently you need to include a "Content-type: text/xml;
charset=UTF-8" header in your HTTP POST request, and (as you're doing)
the XML needs to be encoded in UTF-8.

See the source code of
src/java/org/apache/solr/util/SimplePostTool.java for example.
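
For client code not written in Java, the same two requirements (UTF-8 encoded body, matching Content-type header) could look roughly like the Python sketch below. The update URL and the function name `build_update_request` are assumptions for illustration; the request is only built here, not sent.

```python
import urllib.request

# Assumed URL of a local Solr update handler; adjust for your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text: str,
                         url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a POST request whose body is UTF-8 and whose header says so."""
    body = xml_text.encode("utf-8")  # the bytes on the wire must be UTF-8
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=UTF-8"},
        method="POST",
    )


if __name__ == "__main__":
    doc = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<add><doc><field name="question_title">Vedhæft billede</field>'
        "</doc></add>"
    )
    req = build_update_request(doc)
    print(req.get_method(), req.full_url)
    print(req.header_items())
    # Sending is left out of the sketch; urllib.request.urlopen(req) would do it.
```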

-Bertrand

Re: Adding data as UTF-8

Morten Fangel-3
On Saturday 10 March 2007 21:39, Bertrand Delacretaz wrote:
> On 3/10/07, Morten Fangel <[hidden email]> wrote:
> > ...I send a document like the following:
> >
> > ---
> > <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to "send" the document?
Indeed. Solr will be integrated (almost) transparently into my framework.. ;)

It'll work pretty much like the acts_as_solr RoR implementation, if I'm not
totally mistaken about that particular implementation..
>
> Currently you need to include a "Content-type: text/xml;
> charset=UTF-8" header in your HTTP POST request, and (as you're doing)
> the XML needs to be encoded in UTF-8.
Super. Indeed that fixed it, yes...

-fangel


Re: Adding data as UTF-8

Walter Underwood, Netflix
In reply to this post by Bertrand Delacretaz
It is better to use "application/xml". See RFC 3023.
Using "text/xml; charset=UTF-8" will override the XML
encoding declaration. "application/xml" will not.
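
As a rough illustration of the RFC 3023 precedence, here is a Python sketch of how a receiver could pick the encoding. `effective_encoding` is a hypothetical helper, not Solr code, and the byte-order-mark and declaration checks are simplified compared with what a real XML parser does.

```python
import re
from email.message import Message

# Simplified match for the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_encoding(content_type: str, body: bytes) -> str:
    """Return the encoding a receiver following RFC 3023 would use."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    charset = msg.get_content_charset()  # lowercased, or None if absent

    if charset:
        # An explicit charset parameter is authoritative for both media types.
        return charset
    if media_type == "text/xml":
        # text/xml without a charset parameter defaults to us-ascii.
        return "us-ascii"
    # application/xml without charset: fall back to XML's own rules.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML default when nothing is declared


if __name__ == "__main__":
    latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><add/>'.encode("ascii")
    print(effective_encoding("text/xml; charset=UTF-8", latin1_doc))
    print(effective_encoding("application/xml", latin1_doc))
```

With "application/xml" and no charset parameter the declaration inside the document decides, which is why it is the safer choice when the header and the document might disagree.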

wunder

On 3/10/07 12:39 PM, "Bertrand Delacretaz" <[hidden email]> wrote:

> On 3/10/07, Morten Fangel <[hidden email]> wrote:
>
>> ...I send a document like the following:
>>
>> ---
>> <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to "send" the document?
>
> Currently you need to include a "Content-type: text/xml;
> charset=UTF-8" header in your HTTP POST request, and (as you're doing)
> the XML needs to be encoded in UTF-8.
>
> See the source code of
> src/java/org/apache/solr/util/SimplePostTool.java for example.
>
> -Bertrand


Re: Adding data as UTF-8

Morten Fangel-3
On Saturday 10 March 2007 22:18, Walter Underwood wrote:
> It is better to use "application/xml". See RFC 3023.
> Using "text/xml; charset=UTF-8" will override the XML
> encoding declaration. "application/xml" will not.
Thanks for the info. I've changed the header accordingly.

-fangel

Re: Adding data as UTF-8

Bertrand Delacretaz
In reply to this post by Walter Underwood, Netflix
On 3/10/07, Walter Underwood <[hidden email]> wrote:
> It is better to use "application/xml". See RFC 3023.
> Using "text/xml; charset=UTF-8" will override the XML
> encoding declaration. "application/xml" will not...

I agree, but did you try this with our example setup, started with
"java -jar start.jar"?

It doesn't seem to work here: If I change our example/exampledocs/post.sh to use

   curl $URL --data-binary @$f -H 'Content-type:application/xml'

instead of

  curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'

the encoding declaration of my posted XML is ignored, and characters are
interpreted according to my JVM encoding (-Dfile.encoding makes a
difference in that case).
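
The difference between the two behaviours can be sketched in Python. This only illustrates the general mechanism, not what Solr or the servlet container actually does internally; the "platform default" of ISO-8859-1 is an assumption chosen for the example.

```python
import xml.etree.ElementTree as ET

# A small UTF-8 document that declares its own encoding.
RAW = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<field>Vedhæft</field>"
).encode("utf-8")


def parse_from_bytes(raw: bytes) -> str:
    """Hand the raw bytes to the XML parser, which reads the declaration."""
    return ET.fromstring(raw).text


def decode_with_platform_default(raw: bytes,
                                 platform_default: str = "iso-8859-1") -> str:
    """Decode the bytes before parsing, ignoring the XML declaration.

    platform_default stands in for a JVM-style default charset (assumed).
    """
    return raw.decode(platform_default)


if __name__ == "__main__":
    print(parse_from_bytes(RAW))
    print(decode_with_platform_default(RAW))
```

When the bytes reach the parser untouched, the declaration is honoured; when they are turned into characters first, the default charset decides and the declaration no longer matters.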

Are you seeing something different, or do you know why this is so?

-Bertrand

Re: Adding data as UTF-8

Walter Underwood, Netflix
If it does something different, that is a bug. RFC 3023 is clear. --wunder

On 3/10/07 1:49 PM, "Bertrand Delacretaz" <[hidden email]> wrote:

> On 3/10/07, Walter Underwood <[hidden email]> wrote:
>> It is better to use "application/xml". See RFC 3023.
>> Using "text/xml; charset=UTF-8" will override the XML
>> encoding declaration. "application/xml" will not...
>
> I agree, but did you try this with our example setup, started with
> "java -jar start.jar"?
>
> It doesn't seem to work here: If I change our example/exampledocs/post.sh to
> use
>
>    curl $URL --data-binary @$f -H 'Content-type:application/xml'
>
> instead of
>
>   curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
>
> the encoding declaration of my posted XML is ignored, characters are
> interpreted according to my JVM encoding (-Dfile.encoding makes a
> difference in that case).
>
> Are you seeing something different, or do you know why this is so?
>
> -Bertrand


Re: Adding data as UTF-8

Bertrand Delacretaz
On 3/10/07, Walter Underwood <[hidden email]> wrote:
> If it does something different, that is a bug. RFC 3023 is clear. --wunder..

Sure - just wanted to confirm what I'm seeing, thanks!

-Bertrand