Problems querying Russian content


Problems querying Russian content

dma_bamboo
Hi

I'm having trouble issuing queries against Solr when the "q" parameter
contains Russian content (the same applies to Chinese and Arabic).

The problem is that I can't send Russian characters in URLs because they
fall outside the ASCII range, so I'm doing a POST instead.

My application gets the request and logs it (the Russian characters
appear correctly in my logs) and then calls the Solr server, but Solr is not
receiving it correctly: in the Solr log the non-ASCII characters show up
as question marks.

Has anyone faced problems like that? My whole system is set to work in UTF-8
(browser, application servers).

Regards,
Daniel


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
                                       

Re: Problems querying Russian content

Yonik Seeley-2
On 6/28/07, Daniel Alheiros <[hidden email]> wrote:
> I'm in trouble now about how to issue queries against Solr using in my "q"
> parameter content in Russian (it applies to Chinese and Arabic as well).
>
> The problem is I can't send any Russian special character in URL's because
> they don't fit in ASCII domain, so I'm doing a POST to accomplish that.

You can send unicode in URLs (it's done as the UTF-8 bytes percent encoded).
http://www.ietf.org/rfc/rfc3986.txt

But a POST should work too.  You just need to make sure the
Content-type contains the character encoding, and that it actually
matches what is being sent.
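
A minimal sketch of such a POST, assuming a Python client and the standard
library only. The URL (http://localhost:8983/solr/select), the helper name
build_solr_post, and the sample query are assumptions for illustration, not
something taken from this thread. The point is only that the body is encoded
as UTF-8 and the Content-Type header says so. Nothing is sent over the
network here; the request is just built and inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for illustration; adjust to your own Solr host and path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_solr_post(query: str, url: str = SOLR_SELECT_URL) -> Request:
    """Build (but do not send) a form POST whose body is UTF-8 percent-encoded."""
    # urlencode percent-encodes the UTF-8 bytes, so the result is pure ASCII.
    body = urlencode({"q": query}, encoding="utf-8").encode("ascii")
    request = Request(url, data=body)
    # Declare the same charset that was actually used to encode the body.
    request.add_header(
        "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
    )
    return request


request = build_solr_post("Россия")
print(request.get_method())
print(request.get_header("Content-type"))
print(request.data)
```

Passing the result to urllib.request.urlopen would actually send it. Pointing
it at a netcat listener first is an easy way to check the bytes on the wire.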

If this is a browser doing the POST, it can be a bit tricky to get it
to post UTF-8... basically, I think the browser uses the charset of
the HTML page containing the form when it does the POST (so make sure
that's UTF-8).

Shut down Solr and use something like netcat (nc -l -p8983) to see
exactly what is being sent.

-Yonik

Re: Problems querying Russian content

Jérôme Etévé-2
On 6/28/07, Yonik Seeley <[hidden email]> wrote:

> On 6/28/07, Daniel Alheiros <[hidden email]> wrote:
> > I'm in trouble now about how to issue queries against Solr using in my "q"
> > parameter content in Russian (it applies to Chinese and Arabic as well).
> >
> > The problem is I can't send any Russian special character in URL's because
> > they don't fit in ASCII domain, so I'm doing a POST to accomplish that.
>
> You can send unicode in URLs (it's done as the UTF-8 bytes percent encoded).
> http://www.ietf.org/rfc/rfc3986.txt
>
> But a POST should work too.  You just need to make sure the
> Content-type contains the character encoding, and that it actually
> matches what is being sent.
>
> If this is a browser doing the POST, it can be a bit tricky to get it
> to post UTF-8... basically, I think the browser uses the charset of
> the HTML page containing the form when it does the POST (so make sure
> that's UTF8).

You can also ensure the browser sends a UTF-8 encoded POST by adding
accept-charset to the form:
<form accept-charset="UTF-8" ...
It works even if the page the form is in is not a UTF-8 page.


--
Jerome Eteve.
[hidden email]
http://jerome.eteve.free.fr/

Re: Problems querying Russian content

Chris Hostetter-3

: You can also ensure the browser sends an utf8 encoded post by
: <form accept-charset="UTF-8" ...
: It works even if the page the form is in is not an UTF-8 page.

the solr admin pages already set the charset to UTF-8, so this is really
only an issue if you are using your own form.

but this is only the first half of the problem.

the second half is that servlet containers don't always do the right thing
with percent encoded UTF-8 strings...

http://www.nabble.com/Cyrillic-characters-t1963293.html#a5402562
http://wiki.apache.org/solr/SolrTomcat (see URI charset section)
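
A rough sketch of what "not doing the right thing" can look like, in Python
for illustration only. It assumes the container decodes the percent-encoded
bytes as ISO-8859-1 instead of UTF-8; that is one plausible failure mode, not
a diagnosis of this particular setup. The helper name decode_param is
hypothetical.

```python
from urllib.parse import unquote_plus


def decode_param(raw: str, charset: str) -> str:
    """Decode a percent-encoded parameter value using the given charset."""
    return unquote_plus(raw, encoding=charset)


raw = "%D0%91%D0%B0%D0%BC"  # UTF-8 bytes for "Бам", percent-encoded

good = decode_param(raw, "utf-8")       # the three Cyrillic characters
bad = decode_param(raw, "iso-8859-1")   # six unrelated Latin-1 characters

# Writing Cyrillic text out in a charset that has no Cyrillic letters
# replaces every character with '?', which is what a log might then show.
lossy = good.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

print(good)
print(repr(bad))
print(lossy)
```

Either way the bytes on the wire were fine; the damage happens when one side
picks a different charset to interpret or print them.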


-Hoss


Re: Problems querying Russian content

funtick-2
In reply to this post by dma_bamboo
Hi Daniel,

Ensure that UTF-8 is everywhere... SOLR, WebServer, AppServer, HTTP  
Headers, etc.

And do not use:
q=&#1041;&#1072;&#1084;&#1073;&#1072;&#1088;&#1073;&#1080;&#1072; &#1050;&#1080;&#1088;&#1082;&#1091;&#1076;&#1091;
Use this instead (percent-encoded URL):
q=%D0%91%D0%B0%D0%BC%D0%B1%D0%B0%D1%80%D0%B1%D0%B8%D0%B0+%D0%9A%D0%B8%D1%80%D0%BA%D1%83%D0%B4%D1%83
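
For reference, a small sketch of how that encoded value can be produced,
assuming Python and its standard library (the helper name encode_query_value
is made up for illustration). The Cyrillic text below is simply the decoded
form of the query above.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_query_value(text: str) -> str:
    """Percent-encode text as UTF-8 bytes; spaces become '+', as in form encoding."""
    return quote_plus(text, encoding="utf-8")


encoded = encode_query_value("Бамбарбиа Киркуду")
print("q=" + encoded)

# Decoding with the same charset recovers the original text.
print(unquote_plus(encoded, encoding="utf-8"))
```

Each Cyrillic letter takes two bytes in UTF-8, which is why every letter
turns into two %XX groups.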

http://www.tokenizer.org is a search engine, SOLR powered... I need to  
add some large Internet shops to the crawler, from Russia...

Quoting Daniel Alheiros:

> Hi
>
> I'm in trouble now about how to issue queries against Solr using in my "q"
> parameter content in Russian (it applies to Chinese and Arabic as well).
>
> The problem is I can't send any Russian special character in URL's because
> they don't fit in ASCII domain, so I'm doing a POST to accomplish that.
>
> My application gets the request and logs it (and the Russian characters
> appear correctly on my logs) and then calls the Solr server and Solr is not
> receiving it correctly... I can just see in the Solr log the special
> characters as question marks...
>
> Did anyone faced problems like that? My whole system is set to work in UTF-8
> (browser, application servers).
>
> Regards,
> Daniel
>
>



Re: Problems querying Russian content

dma_bamboo
In reply to this post by Chris Hostetter-3
Thanks a lot!

Now it is working. It was the Tomcat connector setup ....

Regards,
Daniel


On 28.06.2007 17:19, "Chris Hostetter" <[hidden email]> wrote:

>
> : You can also ensure the browser sends an utf8 encoded post by
> : <form accept-charset="UTF-8" ...
> : It works even if the page the form is in is not an UTF-8 page.
>
> the solr admin pages already set the charset to UTF-8, so this is really
> only an issue if you are using your own form.
>
> but this is only the first half of the problem.
>
> the second half is that servlet containers don't always do the right thing
> with percent encoded UTF-8 strings...
>
> http://www.nabble.com/Cyrillic-characters-t1963293.html#a5402562
> http://wiki.apache.org/solr/SolrTomcat (see URI charset section)
>
>
> -Hoss
>



Re: Problems querying Russian content

dma_bamboo
In reply to this post by funtick-2
Thanks.

Yes I will do it.

So you may be the best person to talk to about Russian content indexing. :)
My indexing process is as follows:
    1. RussianTokenizer
    2. RussianLowerCaseFilter
    3. RussianStopFilter
    4. RussianStemFilter

Seems OK to me, as I'm using the same structure as Lucene's
RussianAnalyzer... Do you think I can improve it somehow?

Regards,
Daniel



On 28.06.2007 17:37, "[hidden email]" <[hidden email]> wrote:

> Hi Danier,
>
> Ensure that UTF-8 is everywhere... SOLR, WebServer, AppServer, HTTP
> Headers, etc.
>
> And do not use  
> q=&#1041;&#1072;&#1084;&#1073;&#1072;&#1088;&#1073;&#1080;&#1072;
> &#1050;&#1080;&#1088;&#1082;&#1091;&#1076;&#1091;
> use this instead (encoded URL):
> q=%D0%91%D0%B0%D0%BC%D0%B1%D0%B0%D1%80%D0%B1%D0%B8%D0%B0+%D0%9A%D0%B8%D1%80%D0%BA%D1%83%D0%B4%D1%83
>
> http://www.tokenizer.org is a search engine, SOLR powered... I need to
> add some large Internet shops to the crawler, from Russia...
>
> Quoting Daniel Alheiros:
>
>> Hi
>>
>> I'm in trouble now about how to issue queries against Solr using in my "q"
>> parameter content in Russian (it applies to Chinese and Arabic as well).
>>
>> The problem is I can't send any Russian special character in URL's because
>> they don't fit in ASCII domain, so I'm doing a POST to accomplish that.
>>
>> My application gets the request and logs it (and the Russian characters
>> appear correctly on my logs) and then calls the Solr server and Solr is not
>> receiving it correctly... I can just see in the Solr log the special
>> characters as question marks...
>>
>> Did anyone faced problems like that? My whole system is set to work in UTF-8
>> (browser, application servers).
>>
>> Regards,
>> Daniel
>>
>>
>
>



Re: Problems querying Russian content

funtick-2
I know Russian better than Russians ;)
I currently use the default configuration for "dismax" provided by SOLR
1.1; I can add a few URLs to the crawler tonight to see what happens. As
far as I know, Lucene/Nutch can even detect the language of a web page
(pdf, txt, html) by checking the raw byte array (the raw HTTP response,
without "language" clues in the HTML). The code in Nutch trunk is huge,
with a lot of useful stuff...

Quoting Daniel Alheiros:
> My indexing process follows:
>     1. RussianTokenizer
>     2. RussianLowerCaseFilter
>     3. RussianStopFilter
>     4. RussianStemFilter


I haven't tried it yet... I'll probably need a separate SOLR + website
for Russian (?)

Currently http://www.tokenizer.org has pages in French (Canadian shops  
are bilingual), and Google correctly "understands" that such pages are  
in French (without additional HTML/HTTP language clues); I don't know  
French and can't test...

Unfortunately query "écran" does not retrieve anything. However, I  
have a lot of "d'Intel", including "d’Intel et écran".

I need to work on it too... Thanks!