character encoding issue...

character encoding issue...

Chris-3
Hi All,

I get characters like -

������������������ - CTA������������ -

in the Solr index. I am adding Java beans to Solr with the addBean() method.

This seems to be a character encoding issue. Any pointers on how to
resolve it?

I have seen that this occurs mostly with Japanese and Chinese characters.

Re: character encoding issue...

Rajinimaski
Hi,

   If you are using Apache Tomcat, check that you are not missing the
configuration below:

 <Connector port="port Number" protocol="HTTP/1.1"
 connectionTimeout="20000"
 redirectPort="8443" URIEncoding="UTF-8"/>

I faced a similar issue with Chinese characters and resolved it with the
above config.

Links for reference :
http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8


Thanks



(Replying to Chris's post of Tue, Oct 29, 2013 at 9:20 PM.)

Re: character encoding issue...

Chris-3
Hi Rajani,

I followed the steps exactly as described in
http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/

However, when I send a query to this new instance in Tomcat, I again get
the same problem -

  <str name="fulltxt">Scheduled Groups Maintenance
In preparation for the new release roll-out,���� Diigo groups won’t be
accessible on Sept 28 (Mon) around midnight 0:00 PST for several
hours.
Stay tuned to say hello to Diigo V4 soon!

Location of the text -
http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

The same problem occurs at - http://cn.nytimes.com/business/20130926/c26alibaba/

All text in the title comes out like -

������������������������������������ - ��������������������� ������������</str>
    <arr name="text">
      <str>������������������������������������ -
��������������������� ������������</str>
    </arr>


Can you please advise?

Chris




(Replying to Rajani Maski's post of Tue, Oct 29, 2013 at 11:33 PM.)

Re: character encoding issue...

Rajinimaski
How are you extracting the text from the website [1] you are
referring to? With Apache Nutch or some other crawler? If so, first check
whether the crawler is giving you correctly encoded data before you
invoke the Solr index method.

[1] http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

URI encoding should resolve this problem.




(Replying to Chris's post of Fri, Nov 1, 2013 at 10:50 AM.)

Re: character encoding issue...

Erick Erickson
The problem is that there are about a dozen places where the character
encoding can be misconfigured. The problem you're seeing above
actually looks like a problem with the character set configured in
your browser; it may have nothing to do with what's actually in Solr.

You might write a small SolrJ program and see if you can dump the contents
in binary and examine them...
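
A minimal sketch of the "dump the contents" step, using only the JDK. It assumes the field value has already been read back as a Java String (for example from a SolrJ query result); the class and method names are hypothetical. If the dump shows U+FFFD (the replacement character), the text was probably already damaged before or during indexing. If it shows the expected CJK code points, the index is likely fine and the problem is on the display side.

```java
import java.nio.charset.StandardCharsets;

final class CodePointDump {
    /** Returns one "U+XXXX" token per Unicode code point in the text. */
    static String codePoints(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    /** Returns the UTF-8 bytes of the text as space-separated hex pairs. */
    static String utf8Hex(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical value; in practice this would be a field read back from Solr.
        String value = "CTA\uFFFD";
        System.out.println(codePoints(value)); // U+0043 U+0054 U+0041 U+FFFD
        System.out.println(utf8Hex(value));    // 43 54 41 EF BF BD
    }
}
```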

Best
Erick


(Replying to Rajani Maski's post of Sun, Nov 3, 2013 at 6:39 AM.)

Re: character encoding issue...

Chris-3
Sorry, I was away for a bit, hence the delay.

I am inserting Java strings into a Java bean class, and then calling the
addBean() method to insert the POJO into Solr.

When I query using either Tomcat or Jetty, I get these special characters. But
I have noticed that if I change the output encoding to "Shift-JIS", those
characters appear as what I think are Japanese characters.

But this solution doesn't work for all special characters, as I can
still see some of them. Isn't there an encoding that can cover all
characters, whatever they might be? Any ideas on what I should do?

Regards,
Chris


(Replying to Erick Erickson's post of Mon, Nov 4, 2013 at 6:27 PM.)

Re: character encoding issue...

T. Kuro Kurosaka-2
It sounds like the characters were mishandled at index build time.
I would use Luke to see whether a character that appears correctly
when you change the output to Shift-JIS is actually
stored as one Unicode character. I bet it's stored as two characters,
each holding the value of what happened
to be the high or low byte of the Shift-JIS character.
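
The two-characters-per-character effect described above can be reproduced with the JDK alone. This is only an illustration of the hypothesis, not a diagnosis of the actual index. The class and method names are made up for the example, and it assumes the JVM ships the Shift_JIS charset, which standard JDK distributions do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Simulates a reader that wrongly treats Shift-JIS bytes as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] sjisBytes = original.getBytes(SHIFT_JIS);
        return new String(sjisBytes, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the mistake: recover the raw bytes, then decode them as Shift-JIS. */
    static String repair(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, SHIFT_JIS);
    }

    public static void main(String[] args) {
        String original = "\u3042"; // HIRAGANA LETTER A, bytes 0x82 0xA0 in Shift-JIS
        String garbled = misdecode(original);
        // One Japanese character has become two unrelated characters, one per byte.
        System.out.println(garbled.length());
        // Re-interpreting those two characters as Shift-JIS bytes restores the original.
        System.out.println(repair(garbled).equals(original));
    }
}
```

This would also explain why switching the browser to Shift-JIS makes some text readable: the browser is undoing the wrong decode for pages that really were Shift-JIS, but not for pages in other encodings.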

There are many possible causes of this. If you are indexing
HTML documents from HTTP servers, the HTTP server may
be configured to send the wrong charset= info in the Content-Type
header. If the document comes directly from a file system
and doesn't have a META header declaring
the charset, then the system assumes a default charset,
which is typically ISO-8859-1 or UTF-8, and misinterprets
Shift-JIS encoded characters.
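
As a rough sketch of the header check mentioned above, the following JDK-only helper pulls the charset= parameter out of a Content-Type value and falls back to a caller-supplied default when none is declared or the name is unknown. The class name, method name and regular expression are assumptions made for illustration. A declared charset can itself be wrong, so this only tells you what the server claims.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Matches charset=NAME or charset="NAME", ignoring case and optional spaces.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a Content-Type value, or the fallback if none is usable. */
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) return fallback;
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromHeader("text/html; charset=Shift_JIS", StandardCharsets.UTF_8));
        System.out.println(fromHeader("text/html", StandardCharsets.UTF_8));
    }
}
```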

You need to debug to find out where the characters
get corrupted.

(Replying to Chris's post of 11/04/2013 11:15 PM.)

--
-----------------------------------------
T. "Kuro" Kurosaka • Senior Software Engineer


Re: character encoding issue...

Chris-3
I tried a lot of things and am almost at my wit's end :(


Here is the code I used to get the strings -

String htmlContent = readPage(page.getWebURL().getURL());

I even tried -
Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", url);
String htmlContent = doc.html();

and
Document doc = Jsoup.parse(htmlContent, "UTF-8");

No improvement so far. Any advice for me, please?



function that gets the html ----------------------------------------
public static String readPage(String urlString) {
    try {
        URL url = new URL(urlString);
        DefaultHttpClient client = new DefaultHttpClient();
        client.getParams().setParameter(ClientPNames.COOKIE_POLICY,
                CookiePolicy.BROWSER_COMPATIBILITY);

        HttpGet request = new HttpGet(url.toURI());
        HttpResponse response = client.execute(request);

        if (response.getStatusLine().getStatusCode() == 200
                && response.getEntity().getContentType().toString().contains("text/html")) {
            Reader reader = null;
            try {
                reader = new InputStreamReader(response.getEntity().getContent());

                StringBuffer sb = new StringBuffer();
                int read;
                char[] cbuf = new char[1024];
                while ((read = reader.read(cbuf)) != -1)
                    sb.append(cbuf, 0, read);

                return sb.toString();
            } finally {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        } else {
            return "";
        }
    } catch (Exception e) {
        return "";
    }
}

---------------------------------------------------------------------------
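
One detail in the function above is worth checking: `new InputStreamReader(InputStream)` with no charset argument decodes bytes using the JVM's default charset, which may not match the page's encoding. Note also that in jsoup, `Jsoup.parse(String html, String baseUri)` treats its second argument as a base URI, not a charset, so it cannot repair a string that was already decoded wrongly. The sketch below uses only the JDK. `ExplicitCharsetRead` and `readAll` are hypothetical names, and the right charset would still have to be taken from the response's Content-Type header or from the page itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetRead {
    /** Reads the whole stream, decoding bytes with the given charset rather than the JVM default. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two Chinese characters encoded as UTF-8 (three bytes each).
        byte[] utf8 = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        // Decoded with the matching charset: 2 characters.
        System.out.println(readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8).length());
        // Decoded with the wrong charset: 6 characters, one per byte.
        System.out.println(readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1).length());
    }
}
```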



(Replying to T. Kuro Kurosaka's post of Wed, Nov 6, 2013 at 2:53 AM.)

Re: character encoding issue...

Michael Sokolov-3
Don't feel bad: character encoding problems are often said to be among
the hardest in software engineering.

There's no simple answer to problems like this since as Erick said, any
tool in your chain could be the culprit. I doubt anyone on this list
will be able to guess "the answer" since the question hasn't even really
been properly arrived at yet.

My advice is to start as far upstream as you can (where you acquire the
data), and make sure you understand how it is encoded.  Keep in mind
that *it may not be encoded consistently*. Just because it may be
declared to be UTF-8, or Shift-JIS, or something, doesn't mean that the
characters are actually going to come out sensibly when interpreted in
that encoding.  You may just be getting garbage.  However, assuming
that's not the case, you should be able to determine the character set
somehow: look at the HTTP headers; look at the characters themselves. If
it's HTML or XML, look at the encoding that may be declared at the
beginning of the file itself (in a META tag or the XML declaration).
Keep in mind that when you look at these things, you are looking at them
through the lens of a tool (wget, or Java's HTTP API, your shell, or a
text editor) that will have applied its own processing to the
characters.  My advice is to use a low-level tool like wget, and maybe
od or some other hex character-dumper as a sanity check.  Maybe try a
few different tools to make sure they agree. Understand all the
character-set-related options in your tools so that you can try
different settings.  Learn about character encodings so you can
recognize the byte patterns. In the end, you will only be successful if
you master your tools.
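
One way to sanity-check a declared encoding, sketched here with the JDK only: decode the raw bytes strictly and see whether the decoder reports an error instead of silently substituting replacement characters. The class and method names are hypothetical. A clean decode does not prove the encoding is right (ISO-8859-1 accepts any byte sequence), but a failed one shows the bytes are not in the charset you assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeCheck {
    /** Returns true only if the bytes decode in the given charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "\u4E2D".getBytes(StandardCharsets.UTF_8);
        byte[] sjisBytes = {(byte) 0x82, (byte) 0xA0}; // HIRAGANA LETTER A in Shift-JIS

        System.out.println(decodesCleanly(utf8Bytes, StandardCharsets.UTF_8)); // true
        System.out.println(decodesCleanly(sjisBytes, StandardCharsets.UTF_8)); // false
    }
}
```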

Good luck!

-Mike Sokolov

(Replying to Chris's post of 11/9/13 2:20 PM.)