Illegal xml/html character; unicode problems near solr

Illegal xml/html character; unicode problems near solr

petercline
Hi all,

I'm new to the list, but I've been struggling with this problem for some
time. I'm getting Illegal xml/html character errors and I'm trying to
track down the source. The characters in question seem to be in the
128-159 (decimal) range, which is illegal in XML. The characters are
mostly diacritics and other types of accents.

The original data is encoded in UTF-8. I have verified that the data
doesn't contain any of these characters prior to indexing, and when I
get the records in question back in a list of results, they display
fine. The problem arises when the characters occur in a facet value and
I try to pass it through the URL.

As an example, consider a facet value:
Brasseur de Bourbourg, abb%C3%A9, 1814-1874, former owner

The %C3%A9 is the percent-encoded UTF-8 form of é (an e with an acute
accent), so the word is abbé.
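
For reference, here is a minimal sketch of where %C3%A9 comes from. Python is
used purely for illustration; the thread does not say what generated the
link, and the variable names are hypothetical.

```python
from urllib.parse import quote, unquote

name = "abbé"

# In UTF-8, é is the two bytes 0xC3 0xA9.
utf8_bytes = name.encode("utf-8")

# quote() percent-encodes non-ASCII characters as UTF-8 by default,
# and unquote() decodes them as UTF-8 by default.
encoded = quote(name)
decoded = unquote(encoded)

print(utf8_bytes, encoded, decoded)
```

The round trip is only lossless when both sides agree on UTF-8.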

The following is a snippet of a link to use a facet:
search-faceted.html?q=[* TO
*]&facet=true&rows=25&fq=name_facet:"Brasseur de
Bourbourg, abb%C3%A9, 1814-1874, former owner""

These characters are correctly specified. When it returns, I get an
illegal character error. Examining the XML, I get an fq value of:
name_facet:"Brasseur de Bourbourg, abbÃÂ(c), 1814-1874, former owner"

I'm not sure how that will display in the email, but in short, it's not
what I put in. Further, it's not legal html and things break.
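
One possible way to end up with characters in the 128-159 range is for UTF-8
bytes to be read as Latin-1 more than once. The sketch below simulates that;
whether this is exactly what happened in this setup is an assumption, and the
helper names are hypothetical.

```python
def misdecode(text):
    """Simulate UTF-8 bytes being read as Latin-1 (one round of mojibake)."""
    return text.encode("utf-8").decode("latin-1")


def c1_code_points(text):
    """Return the code points that fall in the 128-159 (decimal) range."""
    return [ord(ch) for ch in text if 128 <= ord(ch) <= 159]


once = misdecode("abbé")   # é becomes two characters: U+00C3 and U+00A9
twice = misdecode(once)    # a second round introduces U+0083

print(repr(once), c1_code_points(once))
print(repr(twice), c1_code_points(twice))
```

After one round there is nothing in the 128-159 range yet; after the second
round U+0083 (decimal 131) appears.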

Does anyone have any thoughts about this? I apologize if this has been
asked somewhere in the past, but I did some digging and couldn't come up
with anything. I welcome any input.

Regards,

Peter

----
Peter Cline, Digital Library Applications Programmer
University of Pennsylvania Library
email: pcline at pobox dot upenn dot edu

Re: Illegal xml/html character; unicode problems near solr

Yonik Seeley-2
On Fri, Mar 7, 2008 at 12:30 PM, Peter Cline <[hidden email]> wrote:
>  The following is a snippet of a link to use a facet:
>  search-faceted.html?q=[* TO
>  *]&amp;facet=true&amp;rows=25&amp;fq=name_facet:&#34;Brasseur de
>  Bourbourg, abb%C3%A9, 1814-1874, former owner&#34;"
>
>  These characters are correctly specified. When it returns, I get an
>  illegal character error. Examining the XML, I get an fq value of:
>  name_facet:"Brasseur de Bourbourg, abbÃÂ(c), 1814-1874, former owner"

Is this bad XML part of the responseHeader (parameters that are simply
being echoed back)?
If so, it's most likely the config on whatever servlet container you
are using... you need to configure it to accept UTF-8 URLs rather than
latin-1 (Tomcat defaults to the old-style latin-1 AFAIK)
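
A minimal sketch of the difference, in Python purely for illustration (the
servlet container does this decoding itself; the variable names here are
hypothetical):

```python
from urllib.parse import unquote

raw = "abb%C3%A9"

# The same percent-encoded bytes, decoded with two different charsets.
as_utf8 = unquote(raw, encoding="utf-8")      # what the sender intended
as_latin1 = unquote(raw, encoding="latin-1")  # what a latin-1 decoder yields

print(as_utf8)    # abbé
print(as_latin1)  # abbÃ©

# A wrongly decoded value can be repaired by reversing the mistake,
# provided no further conversions have been applied in between.
repaired = as_latin1.encode("latin-1").decode("utf-8")
```

Fixing the container's URL decoding avoids the need for this kind of repair.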

-Yonik

RE: Illegal xml/html character; unicode problems near solr

nicolas.dessaigne
I think Tomcat falls back to the operating system's default encoding, e.g.
cp1252 on a typical Windows install.

You need to add an attribute URIEncoding="UTF-8" to the Connector you use in
the server.xml conf.
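
As a rough sketch, the Connector element might look like the fragment below.
Only URIEncoding="UTF-8" comes from this thread; the port, protocol, and
other attribute values are assumptions based on a typical HTTP connector and
should be kept as they already are in your own server.xml.

```xml
<!-- conf/server.xml: add URIEncoding to the existing HTTP Connector. -->
<!-- All values except URIEncoding are placeholders (assumed). -->
<Connector port="8080"
           protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```

Tomcat needs a restart for the change to take effect.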

Nicolas


Re: Illegal xml/html character; unicode problems near solr

petercline
Nicolas and Yonik,

Thank you both for your excellent responses--this fixed my problem.  Now
it's time to go back and remove all the hacks I was using to pin this
thing together without proper utf-8 support.

Thanks again,
Peter
