Invalid XML returned from Solr

OneWhoMikes
I have an application that I recently ported to Solr, and I am running
into a few problems with the XML responses from Solr.  One response to
a Solr query contained XML data that was not properly escaped (no CDATA
section and no entity substitution).  In particular, the "summary" field
contains '<' characters.  An example of such a response can be found
here: http://www.willetts.com/mike/response.xml

I looked through the source code for XMLWriter and it appears to be
using util.XML.escape to escape the data, so I do not see how this
response is able to occur.  Does anyone have any ideas?
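
For reference, escaping character data on output amounts to something
like the minimal sketch below.  It is an illustration only and not the
actual body of util.XML.escape; the class and method names are made up
for this example, and it escapes '>' as well, which is stricter than XML
requires for ordinary text.

```java
/** Illustrative sketch only; NOT the actual code of Solr's util.XML.escape. */
final class XmlEscapeSketch {

    /** Replaces the markup-significant characters in XML character data with entities. */
    static String escapeCharData(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break; // optional in most contexts
                case '&': out.append("&amp;"); break;
                default:  out.append(c);              // everything else passes through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Wrap an escaped value in a <str> element, as in the responses discussed here.
        System.out.println("<str>" + escapeCharData("mail <email> & more") + "</str>");
    }
}
```

If escaping of this kind is applied to every field value, a literal '<'
should never reach the client, which is why the response above is
surprising.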

Here is the requestHandler tag in the Solr config file:
<requestHandler name="standard" class="solr.StandardRequestHandler" />

On another note:
I also noticed that I get non-UTF-8 characters in the response even
though the XML declaration at the top of the document specifies UTF-8
encoding.  I did not see anywhere in the XMLWriter code that checked
the encoding of the output.  Is this by design, or am I missing
something?


Thanks in advance.  The feedback I have received from the user lists
has been invaluable.


Regards,

Mike

Re: Invalid XML returned from Solr

Yonik Seeley
On 6/20/06, Mike Richmond <[hidden email]> wrote:
> I have a application that I recently ported to Solr and am running
> into a few problems with the XML responses from Solr.  An XML response
> which came from a Solr query, returned XML data that was not properly
> escaped (no CDATA tag, or entity substitution).  In particular the
> "summary" field contains '<' characters. An example of such a response
> can be found here: http://www.willetts.com/mike/response.xml

Hmmm, that is interesting... I haven't seen that before.
I'll try and duplicate it with your example "summary" field.

> On another note:
> I also noticed that I get non-utf8 characters in the response even
> though the encoding line at the top of the XML document specifies utf8
> encoding.

Are you using the bundled version of Jetty?  People have been having
problems with international chars with that.  You might try using
Tomcat.

> I did not see anywhere in the XMLWriter code that checked
> the encoding of the output.  Is this by design, or am I missing
> something?

By design... XMLWriter writes Java characters and strings, and the
servlet container handles encoding them to UTF-8.
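
A minimal sketch of that division of labour, with no servlet API
involved: writeBody stands in for the response writer and only ever sees
Java chars, while render stands in for the container by wrapping the
byte stream in a UTF-8 OutputStreamWriter.  Both names are hypothetical
and are not Solr or container code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch; the names are hypothetical and not taken from Solr. */
final class EncodingLayerSketch {

    /** Plays the role of the response writer: it writes chars and knows nothing about bytes. */
    static void writeBody(Writer out, String value) throws IOException {
        out.write("<str>");
        out.write(value);
        out.write("</str>");
    }

    /** Plays the role of the container: it owns the byte stream and picks the encoding. */
    static byte[] render(String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writeBody(out, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Any character >= 128 becomes two or more bytes at the OutputStreamWriter
layer, so a bug in that layer can corrupt output that the char-level
writer produced correctly.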

-Yonik

Re: Invalid XML returned from Solr

OneWhoMikes
Hi Yonik,

Thanks for the quick reply.  I am willing to give you access to my
index, config files, or any other pieces that you may need if it would
help.  I am basically running the example application (which uses
Jetty), but with a modified schema.xml and a couple other small
changes.

I'll look into giving Tomcat a try over Jetty.


--Mike



Re: Invalid XML returned from Solr

Yonik Seeley
I've confirmed this is a bug in Jetty's output writer that is triggered
by international chars (>=128).  When I moved the example to Tomcat
5.5, everything worked as expected.

For the exact same Lucene index file,
Tomcat outputs
  <str>I¹ll &lt;email></str>
and Jetty outputs
  <str>I¹ll <email>&lt;email></str>

We should really look into switching the appserver we bundle for the example.
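
A quick way to check strings like the two outputs above for
well-formedness is to feed them to the JDK's own XML parser.  This is
only a sketch: the class and method names are made up here, and it
checks a fragment held in a String rather than a live Solr response.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative sketch; the class and method names are hypothetical. */
final class WellFormedCheck {

    /** Returns true if the JDK parser accepts the text as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. because an element is never closed.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parser unavailable or unreadable input", e);
        }
    }
}
```

Passing the Tomcat line to isWellFormed should return true, while the
Jetty line should return false, because the stray <email> tag is never
closed.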

-Yonik

Re: Invalid XML returned from Solr

OneWhoMikes
Hi Yonik,

Thanks again for the quick help.  I switched to Tomcat and all the
problems went away.

I am not sure what the process would be, but I'd be willing to migrate
the example application to Tomcat and update the existing documentation.
I would like to give back to this project, as it has done quite a bit
for me.


--Mike

