encoding problem


encoding problem

bernieh
We have an encoding problem with our Solr application: non-ASCII characters display fine in Solr, but as gobbledegook in our application.

Our Tomcat server.xml file already contains URIEncoding="UTF-8" under the relevant <Connector>.

A Google search suggests that I should set the encoding for the JVM, but I have no idea how to do this. I'm running Windows, and there is no Tomcat process in my Windows Services.

TIA

Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
MSN: [hidden email]
Email: [hidden email]
Website: http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B (Vic)

Important Notice: The contents of this email are intended solely for the named addressee and are confidential; any unauthorised use, reproduction or storage of the contents is expressly prohibited. If you have received this email in error, please delete it and any attachments immediately and advise the sender by return email or telephone.
Deakin University does not warrant that this email and any attachments are error or virus free


Re: encoding problem

Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 10:24 AM, Bernadette Houghton <
[hidden email]> wrote:

> We have an encoding problem with our Solr application: non-ASCII
> characters display fine in Solr, but as gobbledegook in our application.
>
> Our Tomcat server.xml file already contains URIEncoding="UTF-8" under the
> relevant <Connector>.
>
> A Google search suggests that I should set the encoding for the JVM, but
> I have no idea how to do this. I'm running Windows, and there is no Tomcat
> process in my Windows Services.
>

Add the following parameter to the JVM:

-Dfile.encoding=UTF-8

--
Regards,
Shalin Shekhar Mangar.

RE: encoding problem

bernieh
Hi Shalin, stupid question (I'm an Apache/Solr newbie), but how do I access the JVM?

Regards
Bern



Re: encoding problem

Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 12:42 PM, Bernadette Houghton <
[hidden email]> wrote:

> Hi Shalin, stupid question (I'm an Apache/Solr newbie), but how do I
> access the JVM?
>

When you execute the java executable, just add -Dfile.encoding=UTF-8 as a
command line argument to the executable.
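
To confirm that the flag actually reached the JVM, a small check like the one below can be run with and without the argument. This is only a minimal sketch: the class name EncodingCheck is hypothetical, and on recent JDKs the default charset may already be UTF-8, in which case the flag makes no visible difference.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    /** Returns a one-line summary of the charset settings this JVM started with. */
    static String describe() {
        // The system property as passed with -Dfile.encoding (or the platform default).
        String property = System.getProperty("file.encoding");
        // The charset the JVM actually uses when none is specified explicitly.
        Charset effective = Charset.defaultCharset();
        return "file.encoding=" + property + ", defaultCharset=" + effective.name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Once compiled, it could be run as "java -Dfile.encoding=UTF-8 EncodingCheck" and again without the flag to compare the two lines of output.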

How are you consuming Solr? You mentioned there is no Tomcat; is your Solr
client a desktop Java application?

--
Regards,
Shalin Shekhar Mangar.

RE: encoding problem

bernieh
Thanks for your quick reply, Shalin.

Tomcat is running on my Windows machine, but does not appear in Windows Services (as I was expecting it would ... am I wrong?). I'm running it from a startup.bat on my desktop (see below). Do I add the -Dfile.encoding line to startup.bat?

Solr is part of the repository software that we are running.

Thanks!

BERN

Startup.bat -
@echo off
if "%OS%" == "Windows_NT" setlocal
rem ---------------------------------------------------------------------------
rem Start script for the CATALINA Server
rem
rem $Id: startup.bat 302918 2004-05-27 18:25:11Z yoavs $
rem ---------------------------------------------------------------------------

rem Guess CATALINA_HOME if not defined
set CURRENT_DIR=%cd%
if not "%CATALINA_HOME%" == "" goto gotHome
set CATALINA_HOME=%CURRENT_DIR%
if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
cd ..
set CATALINA_HOME=%cd%
cd %CURRENT_DIR%
:gotHome
if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
echo The CATALINA_HOME environment variable is not defined correctly
echo This environment variable is needed to run this program
goto end
:okHome

set EXECUTABLE=%CATALINA_HOME%\bin\catalina.bat

rem Check that target executable exists
if exist "%EXECUTABLE%" goto okExec
echo Cannot find %EXECUTABLE%
echo This file is needed to run this program
goto end
:okExec

rem Get remaining unshifted command line arguments and save them in the
set CMD_LINE_ARGS=
:setArgs
if ""%1""=="""" goto doneSetArgs
set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
shift
goto setArgs
:doneSetArgs

call "%EXECUTABLE%" start %CMD_LINE_ARGS%

:end




Re: encoding problem

Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 12:52 PM, Bernadette Houghton <
[hidden email]> wrote:

> Thanks for your quick reply, Shalin.
>
> Tomcat is running on my Windows machine, but does not appear in Windows
> Services (as I was expecting it would ... am I wrong?). I'm running it from
> a startup.bat on my desktop (see below). Do I add the -Dfile.encoding line
> to startup.bat?
>
> Solr is part of the repository software that we are running.
>

Tomcat respects an environment variable called JAVA_OPTS, through which you
can pass any JVM argument (e.g. heap size, file encoding). Set
JAVA_OPTS="-Dfile.encoding=UTF-8" either through the GUI or by adding the
following to startup.bat:

set JAVA_OPTS="-Dfile.encoding=UTF-8"

--
Regards,
Shalin Shekhar Mangar.

RE: encoding problem

Fuad Efendi
In reply to this post by bernieh
If the encoding problem is in a web application other than Solr (probably
one sitting behind Apache HTTPD), try troubleshooting it with Mozilla
Firefox and the Live HTTP Headers plugin.

Look at the "Content-Type" HTTP response header (specifically its charset
parameter), and don't forget about the <meta http-equiv... > tag inside the
HTML...
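
The character set a server declares normally travels in the charset parameter of the Content-Type response header. As a rough illustration, the sketch below pulls that parameter out of a header value copied from a tool such as Live HTTP Headers. The class and method names are hypothetical, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.util.Locale;
import java.util.Optional;

public class ContentTypeCharset {
    /** Extracts the charset parameter from a Content-Type header value, if present. */
    static Optional<String> charsetOf(String contentType) {
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            // Parameter names are matched case-insensitively.
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return Optional.of(value);
            }
        }
        // No charset declared: the browser falls back to other hints.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

If the header carries no charset at all, or one that differs from the encoding the page is actually written in, that mismatch is a likely place to start looking.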


-Fuad
http://www.tokenizer.org




RE: encoding problem

bernieh
In reply to this post by Shalin Shekhar Mangar
Hi Shalin, strangely, things still aren't working. I've tried setting JAVA_OPTS both through the GUI and in startup.bat, but it has had absolutely no impact. I have also tried reindexing, still with no impact; results still look like this:

“My Universe is Here�
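
Output like this is typical of UTF-8 bytes being decoded with a single-byte Windows code page somewhere along the way. The sketch below reproduces the effect for the opening curly quote and shows that it is reversible as long as the garbled characters are still intact. It assumes windows-1252 is the wrong charset involved (the thread does not confirm which one was) and that the JVM ships that charset, as mainstream JDKs do; the class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumption: the misapplied charset is windows-1252.
    static final Charset WRONG = Charset.forName("windows-1252");

    /** Decodes the UTF-8 bytes of the text with the wrong charset. */
    static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), WRONG);
    }

    /** Reverses garble(): re-encodes with the wrong charset, then decodes as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(WRONG), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String quote = "\u201C";          // left double quotation mark
        String garbled = garble(quote);   // three characters: U+00E2 U+20AC U+0153
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(quote));
    }
}
```

The closing quote (U+201D) ends in the byte 0x9D, which windows-1252 leaves undefined. That is consistent with it showing up as a replacement character rather than as three readable characters.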

bern


Re: encoding problem

Yonik Seeley-2-2
Have you determined if the problem is on the indexing side or the
query side?  I don't see any reason you should have to set/change any
encoding in the JVM.

-Yonik
http://www.lucidimagination.com



On Thu, Aug 27, 2009 at 7:03 PM, Bernadette
Houghton<[hidden email]> wrote:

> Hi Shalin, strangely, things still aren't working. I've tried setting JAVA_OPTS both through the GUI and in startup.bat, but it has had absolutely no impact. I have also tried reindexing, still with no impact; results still look like this:
>
> “My Universe is Here�
>
> bern

RE: encoding problem

bernieh
Shalin, the XML from solr admin for the relevant field is displaying as -

<str name="citation_t"><a title="Browse by Author Name for Moncrieff, Joan" href="/fez/list/author/Moncrieff%2C+Joan/">Moncrieff, Joan</a>, <a title="Browse by Author Name for Macauley, Peter" href="/fez/list/author/Macauley%2C+Peter/">Macauley, Peter</a> and <a title="Browse by Author Name for Epps, Janine" href="/fez/list/author/Epps%2C+Janine/">Epps, Janine</a> <a title="Browse by Year 2006" href="/fez/list/year/2006/">2006</a>, <a title="Click to view Journal, Media Article: &ldquo;My Universe is Here&rdquo;: Implications For the Future of Academic Libraries From the Results of a Survey of Researchers" href="/fez/view/changeme:156">“My Universe is Here�: Implications For the Future of Academic Libraries From the Results of a Survey of Researchers</a><i></i>, vol. 38, no. 2, pp. 71-83.</str>


The weird thing is that the title displays OK in one place (the "title" attribute), but not in the link text of the same "href" anchor.

bern

RE: encoding problem

bernieh
I'm still having a few issues with encoding, although I've been able to resolve the particular issue below by just re-editing the affected record.

The other encoding issue is with Greek characters. With Solr turned off in our user-facing application, Greek characters such as α and ω (small alpha, small omega) display correctly. With Solr turned on, garbage displays instead. If we enter the characters as decimal character references (e.g. &#969;), everything displays OK with or without Solr. Does this suggest anything to anyone?
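
One possible reading of this: a decimal character reference such as &#969; is plain ASCII, so it passes unchanged through any step that mishandles the encoding, whereas the raw UTF-8 bytes of α or ω do not. As an illustration only (not something the repository software is known to do), the sketch below converts non-ASCII characters to decimal references; the class and method names are hypothetical.

```java
public class NumericEntities {
    /** Replaces every non-ASCII code point with a decimal character reference. */
    static String toDecimalReferences(String text) {
        StringBuilder out = new StringBuilder();
        // Iterate by code point so characters outside the BMP are handled too.
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);                 // ASCII passes through unchanged
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+03C9 becomes &#969;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u03B1 is small alpha, \u03C9 is small omega
        System.out.println(toDecimalReferences("\u03B1,\u03C9")); // prints &#945;,&#969;
    }
}
```

This would work around the symptom rather than fix it; the underlying mismatch between the component that writes the bytes and the component that reads them would still be there.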

TIA
bern


RE: encoding problem

bernieh
Finally resolved the problem! The solution was three-pronged on my Windows PC:

Added to my.ini under [mysqld]:
default-character-set=utf8
collation_server=utf8_unicode_ci
character_set_server=utf8
skip-character-set-client-handshake

Added to the JAVA_OPTS environment variable:
-Dfile.encoding=UTF-8

Added to the beginning of the Tomcat startup.bat (positioning is important!):
set JAVA_OPTS="-Dfile.encoding=UTF-8"

Thanks to everyone for their much appreciated help!

Bern
