Problem with Unicode query

Frederic Bouchery
Hello,

I'm using Solr 3.5 on Tomcat 6 and I'm having some problems with Unicode queries.

Here is my text field configuration:
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>

When I run this request: select/?q=hygiene sécurité&debugQuery=true
here is the debug output:
<str name="rawquerystring">hygiene sécurité</str>
<str name="querystring">hygiene sécurité</str>
<str name="parsedquery">searchText:hygien (searchText:sa searchText:curit)</str>
<str name="parsedquery_toString">searchText:hygien (searchText:sa searchText:curit)</str>

As you can see, the Unicode query fails: I get "searchText:sa searchText:curit"
instead of "searchText:securite".
I've tried "ISOLatin1AccentFilterFactory" and I've changed the filter order, but
it makes no difference :(

Any ideas?

Thanks

Frederic

Re: Problem with Unicode query

Em
Hi Frederic,

I have seen similar issues when such a request is sent without proper
URL-encoding. Note that the string must already be encoded as UTF-8
before it is URL-encoded.
What happens if you send that query via Solr's admin panel?

Have a look at this page for troubleshooting:
http://wiki.apache.org/solr/SolrTomcat

Kind regards,
Em

On 23.02.2012 18:15, Frederic Bouchery wrote:

> [...]

Re: Problem with Unicode query

Frederic Bouchery
Thanks!!

This is a Tomcat issue, not a Solr one: URIEncoding="UTF-8" was missing from
the <Connector> element in Tomcat's server.xml.
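
For anyone hitting the same thing, the change looks roughly like the sketch below. The port, protocol, connectionTimeout and redirectPort values are assumptions taken from a stock Tomcat 6 server.xml and will likely differ in your setup; the only attribute that matters here is URIEncoding.

```xml
<!-- conf/server.xml: HTTP connector. All values except URIEncoding are assumed defaults. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

Without this attribute, Tomcat 6 decodes the query string as ISO-8859-1, so the two UTF-8 bytes of "é" arrive as two separate characters, which would explain the "sa" and "curit" tokens in the parsed query. Tomcat has to be restarted for the change to take effect.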

Frederic

On 2012/2/23, Em wrote:

> [...]

--
*Frédéric BOUCHERY*
OuestFranceMultimédi@
*BU - Emploi* : 0.22.33.55.88.9