invalid XML character


Brian Whitman
Once in a while we get this:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
[14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was
found in the element content of the document.
[14:32:21.877] at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
[14:32:21.877] at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
[14:32:21.877] at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
...

Our data comes from all sorts of places and although we've tried to be
UTF-8 wherever we can, there are still cracks.

I would much rather have a document added with a replacement character
than have this error prevent the addition of 8K documents (as happened
here: this one character was in an 8K <add><doc>..<doc... run,
and only the documents before it were added).

Is there something I can do on the Solr side to ignore/replace invalid
characters?






Re: invalid XML character

Yonik Seeley-2
On Sat, Mar 1, 2008 at 4:22 PM, Brian Whitman <[hidden email]> wrote:
> Once in a while we get this
>
>  javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
>  [14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was
[...]
>  Our data comes from all sorts of places and although we've tried to be
>  utf8 wherever we can, there are still cracks.

The issue is that XML unfortunately cannot represent all of Unicode (it
prohibits some code points).
This holds even when they are escaped as character references, so &#6;
will still cause the XML parser to throw an exception.

$ echo '<foo>&#6;</foo>' | xmllint -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 6
<foo>&#6;</foo>


>  I would much rather a document get added with replacement character
>  than to have this error prevent the addition of 8K documents (as has
>  happened here, this one character was in a 8K <add><doc>..<doc... run,
>  and only the ones before this character were added.)
>
>  Is there something I can do on the solr side to ignore/replace invalid
>  characters?

Since it's the XML parser, not really.

If your documents are basic (no index-time boost, fixed fields), you
could try using CSV.
You could also scan for such chars on the client side before the XML
is produced.
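
A minimal Python sketch of such a client-side scan, assuming the client
builds its <field> elements by hand. The names sanitize_for_xml and
field_xml are hypothetical and not part of Solr or any client library;
the character class follows the Char production in section 2.2 of the
XML 1.0 spec.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches any code point outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_for_xml(value, replacement="\uFFFD"):
    """Replace every character that XML 1.0 forbids with `replacement`."""
    return _INVALID_XML_CHARS.sub(replacement, value)


def field_xml(name, value):
    """Hypothetical helper: serialize one field, sanitizing before escaping."""
    return "<field name=%s>%s</field>" % (
        quoteattr(name),
        escape(sanitize_for_xml(value)),
    )


print(field_xml("title", "x\x06 & y"))
# <field name="title">x\uFFFD &amp; y</field>  (with U+FFFD as a literal character)
```

Passing replacement="" drops the offending characters instead of
marking them with U+FFFD.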

-Yonik

Re: invalid XML character

Leonardo Santagada

On 01/03/2008, at 18:35, Yonik Seeley wrote:

> You could also scan for such chars on the client side before the XML
> is produced.


Couldn't he run this code on the server, before the XML parsing? I
would do as you said and handle it on the client, but just out of
curiosity, is this really impossible?

--
Leonardo Santagada




Re: invalid XML character

Yonik Seeley-2
On Sat, Mar 1, 2008 at 6:47 PM, Leonardo Santagada <[hidden email]> wrote:
>  On 01/03/2008, at 18:35, Yonik Seeley wrote:
>  > You could also scan for such chars on the client side before the XML
>  > is produced.
>
>  Can't he put this code on the server before the xml parsing somehow? I
>  would do like you said and do it on the client, but just out of
>  curiosity is this really impossible?

We'd have to implement our own XML parser (or a subset of one) for that.
A simple search+replace of &#xx; could do the wrong thing, I think
(it might be an actual literal in a CDATA block, for example).  The
easiest place to fix it is before the field values are serialized into
XML.
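
To illustrate the CDATA case, here is a small sketch using Python's
standard xml.etree.ElementTree (an assumption for demonstration only;
Solr itself uses a Java StAX parser). The helper name parses is
hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A character reference to U+0006 is rejected...
print(parses("<foo>&#6;</foo>"))      # False
# ...and so is the raw control character itself.
print(parses("<foo>\x06</foo>"))      # False

# Inside CDATA, "&#6;" is not a reference at all, just five literal characters,
# so a blind search+replace on the raw stream would corrupt this value.
cdata_doc = ET.fromstring("<foo><![CDATA[&#6;]]></foo>")
print(repr(cdata_doc.text))           # '&#6;'
```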

-Yonik

Re: invalid XML character

Christian Wittern
Yonik Seeley wrote:
> On Sat, Mar 1, 2008 at 6:47 PM, Leonardo Santagada <[hidden email]> wrote:
>  
>>  Can't he put this code on the server before the xml parsing somehow? I
>>  would do like you said and do it on the client, but just out of
>>  curiosity is this really impossible?
>>    
>
> We'd have to implement our own xml parser (or a subset of one) for that.
>  
I am not sure this is such a good idea.  After all, XML does not allow
these characters, so if you wrote your own parser, it would not be a
standards-compliant XML parser, and you would need to more or less
re-invent the whole tool-chain for your
slightly-modified-but-not-quite-XML format.

A better strategy I think would be to put the responsibility on the
client to send correct XML if they say they send XML.  If necessary, a
different escaping mechanism like the \u<codepoint> used in many
programming languages could be used for the XML transport layer.


> A simple search+replace of &#xx; could do the wrong thing I think
> (might be an actual literal in a CDATA block for example).  
This would also not get you beyond the XML parser, since to the parser
&#6; looks exactly the same as the character expressed with its binary
value.

> The
> easiest place to fix it is before the field values are serialized into
> XML.
>  

Indeed!

All the best,

Christian

--

 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN


Re: invalid XML character

Yonik Seeley-2
On Sat, Mar 1, 2008 at 11:26 PM, Christian Wittern <[hidden email]> wrote:

> Yonik Seeley wrote:
>  > On Sat, Mar 1, 2008 at 6:47 PM, Leonardo Santagada <[hidden email]> wrote:
> >>  Can't he put this code on the server before the xml parsing somehow? I
>  >>  would do like you said and do it on the client, but just out of
>  >>  curiosity is this really impossible?
>  >>
>  >
>  > We'd have to implement our own xml parser (or a subset of one) for that.
>  >
>  I am not sure this is such a good idea.

I'm pretty sure it's a bad idea :-)  I was just explaining why it
wasn't really feasible to do on the server side.

-Yonik

Re: invalid XML character

Brian Whitman
>
>
> I'm pretty sure it's a bad idea :-)  I was just explaining why it
> wasn't really feasible to do on the server side.


This particular case came from this solr.py: https://issues.apache.org/jira/browse/SOLR-216

By the way, is that going to become the official Solr 1.3 Python
client? It would be nice, because then someone who knows more about
Unicode/Python will go in and fix this :) If someone wants to point me
to a place that lists invalid XML characters, I can probably figure it
out.







Re: invalid XML character

Walter Underwood, Netflix
Section 2.2 of the XML spec. Only three characters from the 0x00-0x1F
block are allowed: 0x09, 0x0A, 0x0D.

Annotated version: http://www.xml.com/axml/testaxml.htm

Section 2.2 in current official spec: http://www.w3.org/TR/REC-xml/#charsets
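
For reference, the Char production from section 2.2 can be written as a
small Python predicate. This is a sketch only; the names is_xml_char
and find_invalid are hypothetical.

```python
def is_xml_char(code_point):
    """True if the code point matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def find_invalid(text):
    """Return (index, hex code point) for every character XML 1.0 forbids."""
    return [
        (index, hex(ord(ch)))
        for index, ch in enumerate(text)
        if not is_xml_char(ord(ch))
    ]


# The only control characters below 0x20 that are allowed:
print([hex(cp) for cp in range(0x20) if is_xml_char(cp)])
# ['0x9', '0xa', '0xd']

print(find_invalid("ok\x06 bad\x0b"))
# [(2, '0x6'), (7, '0xb')]
```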

wunder

On 3/2/08 6:44 AM, "Brian Whitman" <[hidden email]> wrote:

> If someone wants to point me
> to a place that lists invalid xml characters I can probably figure it
> out.