Searching combined English-Japanese index

Max Hütter
Hi,

I know there has been quite some discussion about Multilanguage
searching already, but I am not quite sure this applies to my case.

I have an index with fields that contain Japanese and English at the
same time. Is this possible? Tokenizing is not the big problem here; the
StandardTokenizerFactory is good enough, judging by the Solr Admin
Field Analysis.

My problem is that searches for Japanese text don't return any results. I
get results for the English parts, but not for the Japanese.

Using Limo I can see that it is correctly indexed as UTF-8. But using
the Solr Admin Query, I don't get any results. As I understood it, Solr
should just match the characters and return something.

When I search using an English term, I get results but the Japanese is
not encoded correctly in the response. (although it is UTF-8 encoded)

I am using Solr 1.2.

Any ideas what I might be doing wrong?

Best regards,

Max

--
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart

Tel                :  (+49) 0711 - 45 10 17 578
Fax                :  (+49) 0711 - 45 10 17 573
e-mail             :  [hidden email]
Registered office  :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
Managing directors :  Joachim Hörnle, Thomas Gentsch, Holger Dietrich

Re: Searching combined English-Japanese index

Yonik Seeley-2
On 10/1/07, Maximilian Hütter <[hidden email]> wrote:
> When I search using an English term, I get results but the Japanese is
> not encoded correctly in the response. (although it is UTF-8 encoded)

One quick thing to try is the python writer (wt=python) to see the
actual unicode values of what you are getting back (since the python
writer automatically escapes non-ascii).  That can help rule out
incorrect charset handling by clients.

-Yonik

Re: Searching combined English-Japanese index

Max Hütter
Yonik Seeley wrote:

> On 10/1/07, Maximilian Hütter <[hidden email]> wrote:
>> When I search using an English term, I get results but the Japanese is
>> not encoded correctly in the response. (although it is UTF-8 encoded)
>
> One quick thing to try is the python writer (wt=python) to see the
> actual unicode values of what you are getting back (since the python
> writer automatically escapes non-ascii).  That can help rule out
> incorrect charset handling by clients.
>
> -Yonik
>
Thanks for the tip, it turns out that the unicode values are wrong... I
mean the browser correctly displays what is sent. But I don't know how
Solr gets these values.

For example python output is:

'key':'honshu_server_ovo:application_List VPO NT Templates_integrated',
         'backend':'honshu_server',
         'service':'ovoconfig',
         'objectclass':'ovo:application',
         'objecttype':'integrated',
         'name':'List VPO NT Templates',
         'label':u'VPO \u00e3\u0083\u0086\u00e3\u0083\u00b3\u00e3\u0083\u0097\u00e3\u0083\u00ac\u00e3\u0083\u00bc\u00e3\u0083\u0088',
         'path':'',
         'context':'',
         'revision':'',
         'description':'',
         'ovo:application_name':'List VPO NT Templates'},

But in Limo the doc looks like this:

key   honshu_server_ovo:application_List VPO NT Templates_integrated
backend honshu_server
service ovoconfig
objectclass ovo:application
objecttype integrated
name List VPO NT Templates
label VPO テンプレート
path
context
revision
description
ovo:application_name List VPO NT Templates

I hope you can view the Japanese katakana in the label field.

But somehow this is changed to completely different unicode characters
in the search result.

Max

Re: Searching combined English-Japanese index

Yonik Seeley-2
On 10/1/07, Maximilian Hütter <[hidden email]> wrote:

> Yonik Seeley wrote:
> > On 10/1/07, Maximilian Hütter <[hidden email]> wrote:
> >> When I search using an English term, I get results but the Japanese is
> >> not encoded correctly in the response. (although it is UTF-8 encoded)
> >
> > One quick thing to try is the python writer (wt=python) to see the
> > actual unicode values of what you are getting back (since the python
> > writer automatically escapes non-ascii).  That can help rule out
> > incorrect charset handling by clients.
> >
> > -Yonik
> >
> Thanks for the tip, it turns out that the unicode values are wrong... I
> mean the browser displays correctly what is send. But I don't know how
> solr gets these values.

OK, so they never got into the index correctly.
The most likely explanation is that the charset wasn't set correctly
when the update message was sent to Solr.

-Yonik
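
The escaped values in the python output above are consistent with this explanation. Each katakana character shows up as three code points in the range \u0080-\u00ff, which is what UTF-8 bytes look like after being decoded as ISO-8859-1. The following Python 3 sketch reverses that step; the function name is made up for illustration, and ISO-8859-1 as the wrong charset is an assumption that the result happens to support.

```python
# The 'label' value exactly as the python writer reported it in this thread.
reported = (
    "VPO \u00e3\u0083\u0086\u00e3\u0083\u00b3\u00e3\u0083\u0097"
    "\u00e3\u0083\u00ac\u00e3\u0083\u00bc\u00e3\u0083\u0088"
)


def undo_latin1_misdecoding(text):
    """Reverse a 'UTF-8 bytes read as ISO-8859-1' mix-up (hypothetical helper)."""
    # ISO-8859-1 maps code points 0-255 one-to-one onto bytes, so this
    # recovers the original byte sequence.
    raw_bytes = text.encode("iso-8859-1")
    # Decode those bytes the way they should have been decoded originally.
    return raw_bytes.decode("utf-8")


repaired = undo_latin1_misdecoding(reported)
print(repaired)  # prints: VPO テンプレート
```

If the text had reached the index correctly, the python writer would have reported the katakana as \u30c6\u30f3\u30d7\u30ec\u30fc\u30c8 instead.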

RE: Searching combined English-Japanese index

Lance Norskog-2
Some servlet containers don't do UTF-8 out of the box. There is information
about this on the wiki.

Re: Searching combined English-Japanese index

Max Hütter
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote:

> On 10/1/07, Maximilian Hütter <[hidden email]> wrote:
>> Yonik Seeley wrote:
>>> On 10/1/07, Maximilian Hütter <[hidden email]> wrote:
>>>> When I search using an English term, I get results but the Japanese is
>>>> not encoded correctly in the response. (although it is UTF-8 encoded)
>>> One quick thing to try is the python writer (wt=python) to see the
>>> actual unicode values of what you are getting back (since the python
>>> writer automatically escapes non-ascii).  That can help rule out
>>> incorrect charset handling by clients.
>>>
>>> -Yonik
>>>
>> Thanks for the tip, it turns out that the unicode values are wrong... I
>> mean the browser displays correctly what is send. But I don't know how
>> solr gets these values.
>
> OK, so they never got into the index correctly.
> The most likely explanation is that the charset wasn't set correctly
> when the update message was sent to Solr.
>
> -Yonik
>
Are you sure they are wrong in the index? When I use the Lucene Index
Monitor (http://limo.sourceforge.net/) to look at the document in the
index, the Japanese is displayed correctly.
I am using Jetty 6.0.1, by the way.

Best regards,

Max

RE: Searching combined English-Japanese index

Lance Norskog-2
Python does not handle Unicode strings natively; you have to handle them
explicitly. It is possible that your Python receiver is not doing the right
thing with the incoming strings. Also, Jetty has problems with UTF-8; the
wiki has more on this.

Lance

Re: Searching combined English-Japanese index

Yonik Seeley-2
In reply to this post by Max Hütter
On 10/2/07, Maximilian Hütter <[hidden email]> wrote:
> Are you sure, they are wrong in the index?

It's not an issue with Jetty output encoding since the python writer
takes the string and converts it to ascii before that.  Since Solr
does no charset encoding itself on output, that must mean that it's in
the index incorrectly.

> When I use the Lucene Index
> Monitor (http://limo.sourceforge.net/) to look at the document in the
> index the Japanese is displayed correctly.

I've never really used limo, but it's possible it's incorrectly
interpreting what's in the index (and by luck doing the reverse
transformation that got the data in there incorrectly).

Try indexing a document with a unicode character specified via an
entity, to remove the issues of input char encodings.  For example, if
a Japanese char has a unicode value of \u1234, then in the XML doc,
use &#x1234;

-Yonik
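
One way to produce such a test document is sketched below in Python 3. It builds an update message whose bytes are pure ASCII, so no charset setting between client and server can alter it. The `key` and `label` field names come from the document shown earlier in the thread; the function name and the `charref_test` key value are invented for illustration. ElementTree writes decimal references (&#12486;) rather than hexadecimal ones (&#x30C6;), and the two forms are equivalent in XML.

```python
import xml.etree.ElementTree as ET


def build_ascii_add_message(fields):
    """Serialize an <add> update message using only ASCII bytes (hypothetical helper)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    # With an ASCII target encoding, ElementTree writes every non-ASCII
    # character as a numeric character reference.
    return ET.tostring(add, encoding="us-ascii")


payload = build_ascii_add_message(
    {"key": "charref_test", "label": "VPO テンプレート"}
)
print(payload.decode("ascii"))
```

If a document indexed this way comes back from wt=python with \u30c6 and friends, the index and the response writer are fine and the problem lies in how the original update was sent.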

Re: Searching combined English-Japanese index

Max Hütter
You were right, the indexing is already wrong. I debugged Solr and saw
that the IndexWriter gets the wrong values. The cause was a missing
charset in the Content-Type header of the update requests: it was just
text/xml without charset=utf-8, so the body was interpreted as
ISO-8859-1, I think. Adding charset=utf-8 fixed the index. The XML
declaration had the encoding set, but Solr seems to ignore that.
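
For anyone hitting the same problem, here is a minimal Python 3 sketch of an update request that declares its charset. The URL is an assumption (a default local install), the function name is invented, and the request is only built, not sent, so the sketch runs without a server.

```python
import urllib.request

# Assumed endpoint; adjust host, port, and path to your own installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text, url=SOLR_UPDATE_URL):
    """Build a POST request whose body and declared charset agree (hypothetical helper)."""
    body = xml_text.encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # The charset parameter is the part that was missing in this thread.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_update_request(
    '<add><doc><field name="label">VPO テンプレート</field></doc></add>'
)
# urllib.request.urlopen(request) would send it to a running server.
```

The point is that the bytes in the body and the charset named in the header have to match; encoding as UTF-8 while sending a bare text/xml leaves the server to guess.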

Limo really does seem to have converted it back correctly by chance.

Thanks for the help! Now I just have to figure out how to correctly
encode the query string...
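
On the client side, the usual approach is to percent-encode the UTF-8 bytes of the query. The Python 3 sketch below shows that step only; the function name is invented, `label` is the field from the document above, and whether the servlet container decodes the query string as UTF-8 is a separate setting this sketch does not cover.

```python
from urllib.parse import parse_qs, urlencode


def build_select_query(term, field="label"):
    """Return a percent-encoded query string for a field search (hypothetical helper)."""
    params = {"q": f"{field}:{term}", "wt": "python"}
    # Non-ASCII characters are converted to UTF-8 bytes and then
    # percent-encoded, so the result contains only ASCII.
    return urlencode(params, encoding="utf-8")


query_string = build_select_query("テンプレート")
print(query_string)
```

The result can be appended after the `?` of a select URL; each katakana character becomes three escapes such as %E3%83%86.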

Best regards,

Max
