Question about solr config files encoding.


Dawid Weiss-2
Guys, should the encoding of config files really be platform-dependent?
Currently Solr tests fail massively on setup because of things like
this:

    public OpenExchangeRates(InputStream ratesStream) throws IOException {
      parser = new JSONParser(new InputStreamReader(ratesStream));

This reader, when confronted with UTF-16 as file.encoding, results in
funky exceptions like:

   > Caused by: org.apache.noggit.JSONParser$ParseException: JSON
Parse Error: char=笊,position=0 BEFORE='笊'
AFTER='†≤楳捬慩浥爢㨠≔桩猠摡瑡⁩猠捯汬散瑥搠晲潭⁶慲楯畳⁰牯癩摥牳⁡湤⁰牯癩摥搠晲'
   > at org.apache.noggit.JSONParser.err(JSONParser.java:221)
   > at org.apache.noggit.JSONParser.next(JSONParser.java:620)
   > at org.apache.noggit.JSONParser.nextEvent(JSONParser.java:661)
   > at org.apache.solr.schema.OpenExchangeRatesOrgProvider$OpenExchangeRates.<init>(OpenExchangeRatesOrgProvider.java:189)
   > at org.apache.solr.schema.OpenExchangeRatesOrgProvider.reload(OpenExchangeRatesOrgProvider.java:129)
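
For illustration, here is a minimal sketch of how this kind of garbage can arise. It assumes the rates file is stored as UTF-8 and starts with `{` followed by a newline, which is consistent with the characters in the trace above. The class and method names are hypothetical and are not part of Solr or noggit.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows what a Reader sees when UTF-8 bytes are
// decoded with a different charset (as happens when file.encoding=UTF-16
// and InputStreamReader is created without an explicit charset).
final class DefaultCharsetPitfall {
    private DefaultCharsetPitfall() {}

    /** Encodes the text as UTF-8, then decodes those bytes with the assumed charset. */
    static String decodeUtf8BytesAs(String text, Charset assumed) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, assumed);
    }
}
```

Decoding `"{\n  \"disclaimer\""` this way with `StandardCharsets.UTF_16` pairs the bytes 0x7B and 0x0A into the single character U+7B0A (笊), which is the character the parser rejects at position 0.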

Can we fix the encoding of these input files to UTF-8 or something?
According to the JSON RFC (RFC 4627, section 3):

http://tools.ietf.org/html/rfc4627#section-3

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Could we just enforce/require UTF-8? Alternatively, we could auto-detect
the encoding from the binary stream in a custom Reader class.
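
The auto-detect option could look roughly like the sketch below. It applies the null-byte table from RFC 4627 section 3 to the first four octets and then re-reads the stream from the start using mark()/reset(). The class and method names are hypothetical. It also assumes the JDK provides the UTF-32BE and UTF-32LE charsets (common JDKs do, but the Java spec does not guarantee it), and it does not handle byte order marks.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of RFC 4627 section 3 encoding detection.
final class JsonEncodingSniffer {
    private JsonEncodingSniffer() {}

    /** Guesses the charset from the null pattern of the first four octets. */
    static Charset detect(byte[] head, int len) {
        if (len < 4) {
            return StandardCharsets.UTF_8; // too short to classify; use the default
        }
        boolean z0 = head[0] == 0, z1 = head[1] == 0, z2 = head[2] == 0, z3 = head[3] == 0;
        if (z0 && z1 && z2 && !z3) return Charset.forName("UTF-32BE");  // 00 00 00 xx
        if (z0 && !z1 && z2 && !z3) return StandardCharsets.UTF_16BE;   // 00 xx 00 xx
        if (!z0 && z1 && z2 && z3) return Charset.forName("UTF-32LE");  // xx 00 00 00
        if (!z0 && z1 && !z2 && z3) return StandardCharsets.UTF_16LE;   // xx 00 xx 00
        return StandardCharsets.UTF_8;                                  // xx xx xx xx
    }

    /** Peeks at up to four bytes, rewinds, and returns a Reader for the guessed charset. */
    static Reader open(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(4);
        byte[] head = new byte[4];
        int len = 0;
        while (len < 4) {
            int n = buffered.read(head, len, 4 - len);
            if (n < 0) break; // stream ended early
            len += n;
        }
        buffered.reset(); // hand the peeked bytes back to the decoder
        return new InputStreamReader(buffered, detect(head, len));
    }
}
```

With this, `JsonEncodingSniffer.open(ratesStream)` would replace the bare `new InputStreamReader(ratesStream)`. The simpler alternative is to skip detection and always decode as UTF-8.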

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

RE: Question about solr config files encoding.

Uwe Schindler
Config files are XML, and I changed them to be handled by the XML parser (via InputStreams), so the XML parser reads the encoding from the XML header.

But JSON is defined to be UTF-8, so we must supply the encoding (IOUtils.UTF8_CHARSET).
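
A minimal sketch of supplying the encoding explicitly. It uses the JDK's `StandardCharsets.UTF_8` (Java 7+) in place of the `IOUtils.UTF8_CHARSET` constant mentioned above. The class and method names are hypothetical, and `readAll` exists only so the sketch can be exercised on its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: never rely on the platform default charset for JSON.
final class Utf8Config {
    private Utf8Config() {}

    /** Wraps the stream in a Reader that always decodes UTF-8, whatever file.encoding is. */
    static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Drains and closes the Reader, returning its contents as a String. */
    static String readAll(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n = r.read(buf); n != -1; n = r.read(buf)) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the constructor from the original post, this would amount to passing `Utf8Config.utf8Reader(ratesStream)` to `new JSONParser(...)` instead of the bare `InputStreamReader`.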

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]



Re: Question about solr config files encoding.

Dawid Weiss
> But JSON is defined to be UTF-8, so we must supply the encoding (IOUtils.UTF8_CHARSET).

The RFC says it can be any Unicode encoding... that said, I agree with you
that we can probably assume it's UTF-8 and not worry about anything else.

Dawid


RE: Question about solr config files encoding.

Uwe Schindler
3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

:-)

I think we can safely assume it is UTF-8; otherwise we would have to do the same thing XML parsers do, with mark() on a BufferedInputStream... Most libraries out there can only read UTF-8, and Solr itself produces only UTF-8 JSON, right? Those tests only check the response from Solr.

Uwe


Re: Question about solr config files encoding.

Yonik Seeley-2-2
In reply to this post by Dawid Weiss-2
On Thu, Jul 5, 2012 at 10:59 AM, Dawid Weiss <[hidden email]> wrote:
> According to JSON RFC:
>
> http://tools.ietf.org/html/rfc4627#section-3
>
> JSON text SHALL be encoded in Unicode.

One of my little pet peeves with the RFC: I think this was a bad
requirement.  JSON should have been text, and then there should have
been an optional way to detect the encoding if other mechanisms (like
HTTP headers) don't cover it.  This effectively means that something
like ["hi"] is not valid JSON for many of you reading this email (if
your email client is internally representing it as something other
than Unicode, for example).


> We could just enforce/require UTF-8?

Yes, Solr has normally always required/assumed UTF-8 for config files.
Any place that doesn't is simply an oversight.

-Yonik
http://lucidimagination.com


RE: Question about solr config files encoding.

Uwe Schindler
Let me just add:

Solr's XML files are parsed according to the XML spec, so you can choose any
charset; you only have to declare it according to the XML spec! Also, an XML
POST to the updatehandler can be in any encoding (it does not need to be
declared in header anymore, the <?xml...> header is fine). There is already a
test! I fixed all this in endless sessions, but I was happy to do it, as my
favourite data format is XML :-) [I refuse to fix this for DIH, but that's
another story, SOLR-2347].
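
To illustrate the point about the XML declaration, here is a small sketch using the JDK's built-in JAXP DOM parser. This is an assumption for demonstration purposes, not Solr's actual config-loading code, and the class name is hypothetical. Because the parser is given raw bytes rather than a Reader, it picks the charset from the `<?xml ... encoding="..."?>` declaration itself.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo: the XML parser, not the caller, decides the charset.
final class XmlDeclaredEncoding {
    private XmlDeclaredEncoding() {}

    /** Parses the raw bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes)); // bytes in, no Reader
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

A document declared and stored as ISO-8859-1 is read correctly this way regardless of file.encoding. Wrapping the same bytes in an `InputStreamReader` first would bypass the declaration and reintroduce the platform dependency.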

Uwe


RE: Question about solr config files encoding.

Uwe Schindler
> updatehandler can be any encoding (it does not need to be declared in header

...HTTP header..., sorry


Re: Question about solr config files encoding.

Dawid Weiss
In reply to this post by Uwe Schindler
Sure, I don't have a problem with XML. I'll assume UTF-8 for JSON and
go through the issues later today.

Dawid
