Solr interprets UTF-8 as ISO-8859-1


Solr interprets UTF-8 as ISO-8859-1

Daniel Löfquist-2
Hello,

We're building a web application that uses Solr for searching, and I've
come upon a problem that I can't seem to get my head around.

We have a servlet that accepts input via XML-RPC and, based on that input,
constructs the URL used to perform a search with the Solr servlet.

I know that the call to Solr (the URL) from our servlet looks like this
(which is what it should look like):

http://myserver:8080/solrproducts/select/?q=all_SV:ljusblå+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25

But Solr reports the input fields (the GET variables in the URL) as:

INFO: /select/
fl=id,artno,title_SV,titleSort_SV,description_SV,&sort=titleSort_SV+asc,id+asc&start=0&q=all_SV:ljusblÃ¥+status:online&q.op=AND&rows=25

which is all fine except where it says "ljusblÃ¥". Apparently Solr is
interpreting the UTF-8 string "ljusblå" as ISO-8859-1 and thus creates
this garbage, which makes the search return 0 hits when it should in
reality return 3.
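
The misinterpretation described here can be reproduced outside Solr. The sketch below is only an illustration (the class and method names are made up for this example): it encodes the search term as UTF-8 and then decodes those bytes as ISO-8859-1, which turns the two bytes of "å" (0xC3 0xA5) into two separate characters.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e5 is "å"; the escape keeps the example independent of the
        // source file encoding.
        String term = "ljusbl\u00e5";
        String garbled = misdecode(term);

        // One character became two: 7 characters in, 8 characters out.
        System.out.println(term.length() + " -> " + garbled.length());
        System.out.println(garbled);
    }
}
```

The garbled term no longer matches anything in the index, which would explain the 0 hits.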

All other searches that don't use special characters work 100% fine.

I'm new to Solr so I'm not sure what I'm doing wrong here. Can anybody
help me out and point me in the direction of a solution?

Sincerely,

Daniel Löfquist


Re: Solr interprets UTF-8 as ISO-8859-1

Sean Timm
Send the URL with the å character percent-encoded as %C3%A5. That is the
URL encoding of its UTF-8 bytes.

http://myserver:8080/solrproducts/select/?q=all_SV:ljusbl%C3%A5+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25
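
If the calling servlet builds the URL in Java, the parameter value can be encoded with java.net.URLEncoder instead of by hand. This is a minimal sketch under that assumption; the class and helper names are hypothetical and not taken from the poster's code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodingDemo {

    /** Percent-encodes a single query parameter value using UTF-8. */
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00e5 is "å". Note that the colon is encoded and the space
        // becomes "+".
        String q = "all_SV:ljusbl\u00e5 status:online";
        System.out.println("q=" + encodeParam(q));
        // prints q=all_SV%3Aljusbl%C3%A5+status%3Aonline
    }
}
```

Encode each parameter value separately and join the pairs with "&" afterwards; encoding the whole query string at once would also escape the "&" and "=" separators.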

-Sean



Re: Solr interprets UTF-8 as ISO-8859-1

Siegfried Goeschl
In reply to this post by Daniel Löfquist-2
Hi Daniel,

the following FAQ entry might help (at least it did the trick for me with
German characters):

http://wiki.apache.org/solr/FAQ - Why don't International Characters Work?

So I wrote the following servlet filter (taken from the wiki/mailing list):

import org.apache.solr.servlet.SolrDispatchFilter;

import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import java.io.IOException;

/**
 * Workaround: the URL parameters are encoded as UTF-8, but the request
 * declares no character encoding. Enforce UTF-8 so that German characters
 * work.
 */
public class CdpSolrDispatchFilter extends SolrDispatchFilter {

  @Override
  public void doFilter(ServletRequest request, ServletResponse response,
      FilterChain chain) throws IOException, ServletException {

    // Apply the default only when the client declared no encoding;
    // an encoding that is already set is left untouched.
    if (request.getCharacterEncoding() == null) {
      // Set your default encoding here
      request.setCharacterEncoding("UTF-8");
    }

    super.doFilter(request, response, chain);
  }
}

Cheers,

Siegfried Goeschl


Re: Solr interprets UTF-8 as ISO-8859-1

uweklosa
In reply to this post by Daniel Löfquist-2
CONTENTS DELETED
The author has deleted this message.

Re: Solved! Solr interprets UTF-8 as ISO-8859-1

Daniel Löfquist-2
That did the trick. I actually figured it out on my own 10 minutes after
I posted to the mailing list. Typical ;-)
Thanks for the help anyway, everybody!

//Daniel

Uwe Klosa wrote:

> You should set URIEncoding="UTF-8" in your application server. For Tomcat
> you can do that in the server.xml. For GlassFish you have to create a
> sun-web.xml containing the corresponding parameters. Your application
> server should provide a similar mechanism.
>
> Uwe
>
> On Mon, Mar 31, 2008 at 4:32 PM, Daniel Löfquist <[hidden email]> wrote:
>
>> [...]
>
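
For Tomcat, the setting Uwe describes is the URIEncoding attribute on the HTTP connector in conf/server.xml. A minimal sketch, assuming the connector listens on port 8080 as in the URL above; the other attribute values are illustrative defaults, not taken from this thread:

```xml
<!-- conf/server.xml: decode the request URI and query string as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```

Tomcat reads server.xml at startup, so the change takes effect after a restart.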

--
Daniel Löfquist
Application Manager / Software Engineer
