no support for CJK characters from Extension B in Solr

no support for CJK characters from Extension B in Solr

Christian Wittern
Hi there,

The documents I am trying to index with Solr contain characters from the CJK
Extension B, which had been added to Unicode in version 3.1 (March 2001).
Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not
yet support these characters.

Solr seems to accept the documents without problem, but when I retrieve the
documents, there are strange placeholders like #0; etc. in its place.  Might
this be a configuration issue?

While most of the characters in this range are very rare, due to the latest
mapping tables between Unicode and the Japanese JIS coded character sets,
some of the characters in everyday use in Japan are now encoded in this
area.  It therefore seems highly desirable that this problem gets
solved.  I am testing this on a Mac OS X 10.5.2 system, with Java 1.5.0_13
and Solr 1.2.0.

Any hints appreciated,

Christian Wittern


--
 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN

Re: no support for CJK characters from Extension B in Solr

Leonardo Santagada

On 28/02/2008, at 00:23, Christian Wittern wrote:

> The documents I am trying to index with Solr contain characters from  
> the CJK
> Extension B, which had been added to Unicode in version 3.1 (March  
> 2001).


Just to ask for more information, does Java support this? I believe it
doesn't support characters that need more than 2 bytes, so maybe that is
the case...

--
Leonardo Santagada




Re: no support for CJK characters from Extension B in Solr

Christian Wittern
Leonardo Santagada wrote:

>
> On 28/02/2008, at 00:23, Christian Wittern wrote:
>
>> The documents I am trying to index with Solr contain characters from
>> the CJK
>> Extension B, which had been added to Unicode in version 3.1 (March
>> 2001).
>
>
> Just to give more information, does java suport this? I beleive they
> don't support characters with more than 2 bytes so maybe this is the
> case...
It has been supported in Java since 5 (or 1.5.0):

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.UnicodeBlock.html#CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
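(A minimal sketch, not from the original message, that verifies this on any 1.5+ JVM: the block is recognized, and the code point occupies a surrogate pair in Java's internal UTF-16 representation.)

```java
// Quick check that the JVM (1.5+) recognizes Extension B code points.
public class ExtBCheck {
    public static void main(String[] args) {
        int cp = 0x20000; // first code point of CJK Unified Ideographs Extension B
        System.out.println(Character.UnicodeBlock.of(cp)); // CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
        // In UTF-16 (Java's internal encoding) this is a surrogate pair:
        char[] pair = Character.toChars(cp);
        System.out.println(pair.length);                   // 2
        // Reassembling the pair round-trips back to the same code point:
        String s = new String(pair);
        System.out.println(s.codePointAt(0) == cp);        // true
    }
}
```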

Christian

--
 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN


Re: no support for CJK characters from Extension B in Solr

Erik Hatcher
In reply to this post by Christian Wittern
Christian,

Is this an issue with the encoding used when adding the documents to  
the index?   There are two encodings that need to be right: the one  
for the XML content POSTed to Solr, and also the HTTP header on that  
POST request.   If you are getting mangled content back from a stored  
field, it sounds like something went awry in getting that document  
into Solr in the first place.
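(Editorial sketch, not from Erik's message, of what "getting both encodings right" looks like in Java: the charset is declared in the XML prolog AND in the HTTP Content-Type header, and the body bytes are actually encoded that way. The update URL and field name are assumptions for illustration only.)

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PostUtf8 {
    // Build an <add> body with the encoding declared in the XML prolog.
    static String body(String text) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
             + "<add><doc><field name=\"text\">" + text + "</field></doc></add>";
    }

    public static void main(String[] args) throws Exception {
        // U+20000, the first Extension B ideograph, as a surrogate pair
        String extB = new String(Character.toChars(0x20000));
        byte[] payload = body(extB).getBytes("UTF-8"); // encode as declared

        HttpURLConnection con = (HttpURLConnection)
            new URL("http://localhost:8983/solr/update").openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        // ...and declare the SAME encoding in the HTTP header:
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(payload);
        out.close();
        System.out.println(con.getResponseCode());
    }
}
```

If either declaration disagrees with the actual bytes, the server may decode the surrogate pair incorrectly even though the document is accepted.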

        Erik



On Feb 27, 2008, at 10:23 PM, Christian Wittern wrote:

> Hi there,
>
> The documents I am trying to index with Solr contain characters  
> from the CJK
> Extension B, which had been added to Unicode in version 3.1 (March  
> 2001).
> Unfortunately, it seems to be the case that Solr (and maybe Lucene)  
> do not
> yet support these characters.
>
> Solr seems to accept the documents without problem, but when I  
> retrieve the
> documents, there are strange placeholders like #0; etc. in its  
> place.  Might
> this be a configuration issue?
>
> While most of the characters in this range are very rare, due to  
> the latest
> mapping tables between Unicode and the Japanese JIS coded character  
> sets,
> some of the characters in everyday use in Japan are now encoded in  
> this
> area.  It does therefore seems highly desirable that this problem gets
> solved.  I am testing this on a Mac OS X 10.5.2 system, with Java  
> 1.5.0_13
> and Solr 1.2.0.
>
> Any hints appreciated,
>
> Christian Wittern
>
>
> --
>  Christian Wittern
>  Institute for Research in Humanities, Kyoto University
>  47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN


Re: no support for CJK characters from Extension B in Solr

Erik Hatcher
In reply to this post by Christian Wittern
Christian,

This bit of trivia is probably useful to you as well.  Lucene's  
StandardTokenizer uses these Unicode ranges for CJK characters:

KOREAN     = [\uac00-\ud7af\u1100-\u11ff]

// Chinese, Japanese
CJ         = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF
\u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff
\uff65-\uff9f]

I haven't done my homework to correlate that with CJK Extension B,  
but I bet you know!  :)
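(A quick editorial check, not from the thread: every range in the KOREAN and CJ macros above is a BMP range, topping out below U+FFFF, while Extension B starts at U+20000, so the tokenizer spec cannot match it.)

```java
// Extension B starts at U+20000, which cannot even be expressed as a
// single Java char, so it falls outside every \uXXXX range in the
// StandardTokenizer spec quoted above.
public class RangeCheck {
    public static void main(String[] args) {
        int extBFirst = 0x20000;
        int bmpMax = 0xFFFF;       // highest value any \uXXXX range can reach
        System.out.println(extBFirst > bmpMax);  // true: outside the BMP entirely
        System.out.println(Character.charCount(extBFirst)); // 2 UTF-16 code units
    }
}
```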

        Erik


On Feb 27, 2008, at 10:23 PM, Christian Wittern wrote:

> Hi there,
>
> The documents I am trying to index with Solr contain characters  
> from the CJK
> Extension B, which had been added to Unicode in version 3.1 (March  
> 2001).
> Unfortunately, it seems to be the case that Solr (and maybe Lucene)  
> do not
> yet support these characters.
>
> Solr seems to accept the documents without problem, but when I  
> retrieve the
> documents, there are strange placeholders like #0; etc. in its  
> place.  Might
> this be a configuration issue?
>
> While most of the characters in this range are very rare, due to  
> the latest
> mapping tables between Unicode and the Japanese JIS coded character  
> sets,
> some of the characters in everyday use in Japan are now encoded in  
> this
> area.  It does therefore seems highly desirable that this problem gets
> solved.  I am testing this on a Mac OS X 10.5.2 system, with Java  
> 1.5.0_13
> and Solr 1.2.0.
>
> Any hints appreciated,
>
> Christian Wittern
>
>
> --
>  Christian Wittern
>  Institute for Research in Humanities, Kyoto University
>  47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN


Re: no support for CJK characters from Extension B in Solr

kkrugler
In reply to this post by Christian Wittern
Hi Christian,

>The documents I am trying to index with Solr contain characters from the CJK
>Extension B, which had been added to Unicode in version 3.1 (March 2001).
>Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not
>yet support these characters.
>
>Solr seems to accept the documents without problem, but when I retrieve the
>documents, there are strange placeholders like #0; etc. in its place.  Might
>this be a configuration issue?

1. What encoding are you using when pushing these documents to Solr?
Both as specified in the XML, and the POST request. And there's a
separate issue about the mime-type you use for the POST, if you're
doing it yourself (not using the latest scripts from Solr).

2. What do these characters look like in the XML you're pushing? For
example, if they are encoded as two surrogate characters instead of
one code point from the extension B set, most XML parsers will not
handle it correctly. This is the most common source of similar issues I've
seen.

3. Do the base plane characters (code points < U+10000) round-trip correctly?

One potential issue is the XML parser being used - most have been
updated to handle extended Unicode code points, but there were a few
older parsers that still failed to handle &#x20103;, for example.
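(Editorial illustration of Ken's point 2, not from his message: the same Extension B character can appear in XML as one numeric character reference for the code point, or, incorrectly, as two references to its UTF-16 surrogate halves. The surrogate form is ill-formed XML and many parsers reject or mangle it.)

```java
// U+20000 written two ways: one code-point reference (well-formed)
// versus two surrogate-half references (ill-formed XML).
public class SurrogateDemo {
    public static void main(String[] args) {
        int cp = 0x20000;
        char[] pair = Character.toChars(cp);
        System.out.printf("well-formed: &#x%X;%n", cp);          // &#x20000;
        System.out.printf("ill-formed:  &#x%X;&#x%X;%n",
                          (int) pair[0], (int) pair[1]);         // &#xD840;&#xDC00;
        System.out.println(Character.isHighSurrogate(pair[0]));  // true
        System.out.println(Character.isLowSurrogate(pair[1]));   // true
    }
}
```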

-- Ken

>
>While most of the characters in this range are very rare, due to the latest
>mapping tables between Unicode and the Japanese JIS coded character sets,
>some of the characters in everyday use in Japan are now encoded in this
>area.  It does therefore seems highly desirable that this problem gets
>solved.  I am testing this on a Mac OS X 10.5.2 system, with Java 1.5.0_13
>and Solr 1.2.0.
>
>Any hints appreciated,
>
>Christian Wittern


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: no support for CJK characters from Extension B in Solr

kkrugler
In reply to this post by Christian Wittern
Hi Christian,

>The documents I am trying to index with Solr contain characters from the CJK
>Extension B, which had been added to Unicode in version 3.1 (March 2001).
>Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not
>yet support these characters.
>
>Solr seems to accept the documents without problem, but when I retrieve the
>documents, there are strange placeholders like #0; etc. in its place.  Might
>this be a configuration issue?

And as Erik mentioned, it appears that line 114 of StandardTokenizerImpl.jflex:

http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

needs to be updated to include the Extension B character range.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: no support for CJK characters from Extension B in Solr

Erik Hatcher
To elaborate.... StandardTokenizer comes into play for indexing and  
querying (and only if you have that configured for that field in  
schema.xml).  But the original issue seems to be with actually  
parsing the content properly and storing it in the Lucene index,  
which is separate from the tokenization process altogether - I just  
wanted to point it out as something else you might encounter along  
the way.

        Erik



On Feb 28, 2008, at 11:26 AM, Ken Krugler wrote:

> Hi Christian,
>
>> The documents I am trying to index with Solr contain characters  
>> from the CJK
>> Extension B, which had been added to Unicode in version 3.1 (March  
>> 2001).
>> Unfortunately, it seems to be the case that Solr (and maybe  
>> Lucene) do not
>> yet support these characters.
>>
>> Solr seems to accept the documents without problem, but when I  
>> retrieve the
>> documents, there are strange placeholders like #0; etc. in its  
>> place.  Might
>> this be a configuration issue?
>
> And as Erik mentioned, it appears that line 114 of  
> StandardTokenizerImpl.jflex:
>
> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> needs to be updated to include the Extension B character range.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"


RE: no support for CJK characters from Extension B in Solr

steve_rowe
In reply to this post by kkrugler
On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
> And as Erik mentioned, it appears that line 114 of
> StandardTokenizerImpl.jflex:
>
> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> needs to be updated to include the Extension B character range.

JFlex 1.4.1 (the latest release) does not support supplementary code points (those above the BMP - Basic Multilingual Plane: [U+0000-U+FFFF]), and CJK Ideograph Extension B is definitely a supplementary range - see the first column from <http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt> (the extent of this range is unchanged through the latest [beta] version, 5.1.0):

        20000;<CJK Ideograph Extension B, First> ...
        2A6D6;<CJK Ideograph Extension B, Last> ...

I am working with Gerwin Klein on the development version of JFlex, and am hoping to get Level 1 [Regular Expression] Basic Unicode Support into the next release (see <http://unicode.org/reports/tr18/>) - among other things, this entails accepting supplementary code points.

However, the next release of JFlex will require Java 1.5+, and Lucene 2.X requires Java 1.4, so until Lucene reaches release 3.0 and begins requiring Java 1.5 (and Solr incorporates it), JFlex support of supplementary code points is moot.

In short, it'll probably be at least a year before the StandardTokenizer can be modified to accept supplementary characters, given the processes involved.
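(An editorial aside, not part of Steve's message, making the limitation concrete: a JFlex 1.4.1 scanner consumes one 16-bit char at a time, but a supplementary character like U+20000 occupies two chars in a Java String, so char-by-char matching can never see it as a single unit.)

```java
// One Extension B ideograph: two UTF-16 code units, one code point.
public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 character
        System.out.println(
            Character.isSupplementaryCodePoint(s.codePointAt(0))); // true
    }
}
```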

Steve

Re: no support for CJK characters from Extension B in Solr

Erik Hatcher
Wow - great stuff Steve!

As for StandardTokenizer and Java version - no worries there really,  
as Solr itself requires Java 1.5+, so when such a tokenizer is made  
available it could be  used just fine in Solr even if it isn't built  
into a core Lucene release for a while.

        Erik



On Feb 28, 2008, at 12:08 PM, Steven A Rowe wrote:

> On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
>> And as Erik mentioned, it appears that line 114 of
>> StandardTokenizerImpl.jflex:
>>
>> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>>
>> needs to be updated to include the Extension B character range.
>
> JFlex 1.4.1 (the latest release) does not support supplementary code
> points (those above the BMP - Basic Multilingual Plane:
> [U+0000-U+FFFF]), and CJK Ideograph Extension B is definitely a
> supplementary range - see the first column from
> <http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt> (the
> extent of this range is unchanged through the latest [beta] version,
> 5.1.0):
>
> 20000;<CJK Ideograph Extension B, First> ...
> 2A6D6;<CJK Ideograph Extension B, Last> ...
>
> I am working with Gerwin Klein on the development version of JFlex,
> and am hoping to get Level 1 [Regular Expression] Basic Unicode
> Support into the next release (see <http://unicode.org/reports/tr18/>)
> - among other things, this entails accepting supplementary code
> points.
>
> However, the next release of JFlex will require Java 1.5+, and  
> Lucene 2.X requires Java 1.4, so until Lucene reaches release 3.0  
> and begins requiring Java 1.5 (and Solr incorporates it), JFlex  
> support of supplementary code points is moot.
>
> In short, it'll probably be at least a year before the  
> StandardTokenizer can be modified to accept supplementary  
> characters, given the processes involved.
>
> Steve


Re: no support for CJK characters from Extension B in Solr

Christian Wittern
Thanks to all for clearing this up.  It seems we are still quite far
away from full Unicode support :-(

As to the questions about the encoding in previous messages, all of the
other characters in the documents come through without a glitch, so
there is definitely no other issue involved.

Christian

Erik Hatcher wrote:

> Wow - great stuff Steve!
>
> As for StandardTokenizer and Java version - no worries there really,
> as Solr itself requires Java 1.5+, so when such a tokenizer is made
> available it could be  used just fine in Solr even if it isn't built
> into a core Lucene release for a while.
>
>     Erik
>
>
>
> On Feb 28, 2008, at 12:08 PM, Steven A Rowe wrote:
>
>> On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
>>> And as Erik mentioned, it appears that line 114 of
>>> StandardTokenizerImpl.jflex:
>>>
>>> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex 
>>>
>>>
>>> needs to be updated to include the Extension B character range.
>>
>> JFlex 1.4.1 (the latest release) does not support supplementary code
>> points (those above the BMP - Basic Multilingual Plane:
>> [U+0000-U+FFFF]), and CJK Ideograph Extension B is definitely a
>> supplementary range - see the first column from
>> <http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt> (the
>> extent of this range is unchanged through the latest [beta] version,
>> 5.1.0):
>>
>>     20000;<CJK Ideograph Extension B, First> ...
>>     2A6D6;<CJK Ideograph Extension B, Last> ...
>>
>> I am working with Gerwin Klein on the development version of JFlex,
>> and am hoping to get Level 1 [Regular Expression] Basic Unicode
>> Support into the next release (see
>> <http://unicode.org/reports/tr18/>) - among other things, this
>> entails accepting supplementary code points.
>>
>> However, the next release of JFlex will require Java 1.5+, and Lucene
>> 2.X requires Java 1.4, so until Lucene reaches release 3.0 and begins
>> requiring Java 1.5 (and Solr incorporates it), JFlex support of
>> supplementary code points is moot.
>>
>> In short, it'll probably be at least a year before the
>> StandardTokenizer can be modified to accept supplementary characters,
>> given the processes involved.
>>
>> Steve
>
>


--

 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN


Re: no support for CJK characters from Extension B in Solr

kkrugler
>Thanks to all for clearing this up.  It seems we are still quite far
>away from full Unicode support:-(
>As to the questions about the encoding in previous messages, all of
>the other characters in the documents come through without a glitch,
>so there is definitely no other issue involved.

What was the actual format of the Extension B characters in the XML
being posted?

-- Ken

>Erik Hatcher wrote:
>>Wow - great stuff Steve!
>>
>>As for StandardTokenizer and Java version - no worries there
>>really, as Solr itself requires Java 1.5+, so when such a tokenizer
>>is made available it could be  used just fine in Solr even if it
>>isn't built into a core Lucene release for a while.
>>
>>     Erik
>>
>>
>>
>>On Feb 28, 2008, at 12:08 PM, Steven A Rowe wrote:
>>
>>>On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
>>>>And as Erik mentioned, it appears that line 114 of
>>>>StandardTokenizerImpl.jflex:
>>>>
>>>>http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>>>>
>>>>needs to be updated to include the Extension B character range.
>>>
>>>JFlex 1.4.1 (the latest release) does not support supplementary
>>>code points (those above the BMP - Basic Multilingual Plane:
>>>[U+0000-U+FFFF]), and CJK Ideograph Extension B is definitely a
>>>supplementary range - see the first column from
>>><http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt>
>>>(the extent of this range is unchanged through the latest [beta]
>>>version, 5.1.0):
>>>
>>>     20000;<CJK Ideograph Extension B, First> ...
>>>     2A6D6;<CJK Ideograph Extension B, Last> ...
>>>
>>>I am working with Gerwin Klein on the development version of
>>>JFlex, and am hoping to get Level 1 [Regular Expression] Basic
>>>Unicode Support into the next release (see
>>><http://unicode.org/reports/tr18/>) - among other things, this
>>>entails accepting supplementary code points.
>>>
>>>However, the next release of JFlex will require Java 1.5+, and
>>>Lucene 2.X requires Java 1.4, so until Lucene reaches release 3.0
>>>and begins requiring Java 1.5 (and Solr incorporates it), JFlex
>>>support of supplementary code points is moot.
>>>
>>>In short, it'll probably be at least a year before the
>>>StandardTokenizer can be modified to accept supplementary
>>>characters, given the processes involved.
>>>
>>>Steve


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: no support for CJK characters from Extension B in Solr

Erik Hatcher
In reply to this post by Christian Wittern

On Feb 28, 2008, at 6:56 PM, Christian Wittern wrote:
> Thanks to all for clearing this up.  It seems we are still quite  
> far away from full Unicode support:-(
> As to the questions about the encoding in previous messages, all of  
> the other characters in the documents come through without a  
> glitch, so there is definitely no other issue involved.

How are you POSTing the documents to Solr?   What content-type are  
you using with the HTTP header?  And what encoding are you using with  
the XML (file?) being POSTed, and is that encoding specified in the  
XML file itself?

        Erik


Re: no support for CJK characters from Extension B in Solr

Christian Wittern
In reply to this post by kkrugler
Ken Krugler wrote:
>
> What was the actual format of the Extension B characters in the XML
> being posted?
>
I tried both a binary (UTF-8) format and a numeric character
representation of the type &#x20000; -- the results were the same.

Christian


--

 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN


Re: no support for CJK characters from Extension B in Solr

Christian Wittern
In reply to this post by Erik Hatcher
Erik Hatcher wrote:
> How are you POSTing the documents to Solr?   What content-type are you
> using with the HTTP header?  And what encoding are you using with the
> XML (file?) being POSTed, and is that encoding specified in the XML
> file itself?
For these tests I used the script post.sh from the example directory --
I am just assuming that this is doing The Right Thing:-)  The encoding
is (also?) specified in the XML file itself as UTF-8.

Christian

--

 Christian Wittern
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN