Unicode Tokenizer problem with Registered Trademark Search

Bruce.Nawrocki

I am having a problem when searching for certain Unicode characters, such as the Registered Trademark. That's the Unicode character 00AE. It's also a problem searching for a Japanese Yen symbol (Unicode character 00A5).

I'm using the Lucene 2.0.0 jar file. We previously used the Lucene 1.4.2 jar file, where this search worked OK, but Lucene 2.0.0 doesn't behave the same way.

I see that the registered trademark is in the Lucene index file, so that's good. The problem comes when I try to search for these characters.

I see that my query starts off OK, as this:

( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you cannot see the Japanese Yen symbol, it comes directly after "Digital")

Note: the "^95" is just a boost factor, and is OK.

I'm using StandardAnalyzer and StandardTokenizer to create a new QueryParser, and after I call the "parse" method of the QueryParser, my query becomes this:

 +Locale:en +productName:digital^95.0

Notice that the Japanese Yen symbol is gone! I think it's because the StandardTokenizer.jj file doesn't handle this character, and so it throws it away.
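
One way to check how Java itself classifies these two characters is sketched below. It uses only the JDK, not Lucene, and the class and method names are made up for illustration. It assumes a simplified rule in which terms are built only from letters and digits; this is not the actual StandardTokenizer grammar, just a way to see why a symbol could be dropped while "Digital" survives.

```java
import java.util.Locale;

// Illustration only: plain JDK, no Lucene. Class and method names are hypothetical.
final class SymbolCategories {

    /** Names the Unicode general category for the few cases of interest here. */
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CURRENCY_SYMBOL:  return "currency symbol";
            case Character.OTHER_SYMBOL:     return "other symbol";
            case Character.UPPERCASE_LETTER: return "uppercase letter";
            case Character.LOWERCASE_LETTER: return "lowercase letter";
            default:                         return "other";
        }
    }

    /** Lowercases the text and keeps only letters and digits (an assumed, simplified rule). */
    static String keepLettersAndDigits(String text) {
        StringBuilder kept = new StringBuilder();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i < lower.length(); ) {
            int cp = lower.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                kept.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(category(0x00A5));                       // Yen sign
        System.out.println(category(0x00AE));                       // Registered sign
        System.out.println(keepLettersAndDigits("Digital\u00A5"));  // symbol is dropped
    }
}
```

Neither U+00A5 nor U+00AE is a letter or digit as far as the JDK is concerned, so any analysis step that keeps only letters and digits would reduce "Digital" followed by the Yen symbol to "digital".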

Is there any way to use a different Analyzer and/or Tokenizer, rather than building my own?

And if I had created my Lucene indexes with the StandardAnalyzer, must I use the StandardAnalyzer and StandardTokenizer to search the index?

Thanks.

RE: Unicode Tokenizer problem with Registered Trademark Search

steve_rowe
Hi Bruce,

On 04/02/2008 at 4:58 PM, [hidden email] wrote:
> I am having a problem when searching for certain Unicode
> characters, such as the Registered Trademark. That's the
> Unicode character 00AE. It's also a problem searching for a
> Japanese Yen symbol (Unicode character 00A5).
>
> I'm using the Lucene 2.0.0 jar file, and we used to use
> Lucene 1.4.2 jar file, where this used to work OK. But Lucene
> 2.0.0 doesn't work the same way.

I don't see anything that would have caused such a change - below is a colored side-by-side diff of StandardTokenizer.jj at revisions 150560 and 409716, corresponding to the lucene_1_4_2 and lucene_2_0_0 tags, respectively:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_0_0/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=150560&r2=409716&diff_format=h>

(Note that the JavaCC-targeted StandardTokenizer.jj was replaced at release 2.3.0 by the JFlex-targeted StandardTokenizerImpl.jflex for performance reasons - see <http://issues.apache.org/jira/browse/LUCENE-966>.)

> I see that the registered trademark is in the Lucene index
> file, so that's good. The problem comes when I try to search
> for these characters.
>
> I see that my query starts off OK, as this:
>
> ( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you
> cannot see the Japanese Yen symbol, it comes directly after "Digital")
>
> Note: the "^95" is just a boost factor, and is OK.
>
> I'm using StandardAnalyzer and StandardTokenizer to create a
> new QueryParser, and after I call the "parse" method of the
> QueryParser, my query becomes this:
>
>  +Locale:en +productName:digital^95.0
>
> Notice that the Japanese Yen symbol is gone! I think it's
> because the StandardTokenizer.jj file doesn't handle this
> character, and so it throws it away.
>
> Is there any way to use a different Analyzer and/or
> Tokenizer, rather than building my own?
>
> And if I had created my Lucene indexes with the
> StandardAnalyzer, must I use the StandardAnalyzer and
> StandardTokenizer to search the index?

In order for the Yen and Registered Trademark symbols to appear in the index, you must have used a different analyzer for indexing than the one you're using for querying.  This can lead to problems, as you have discovered.

The short answer is: you should use the same analyzer.

The longer answer is that you should use "compatible" analyzers.  "Compatibility" means that the terms produced by the query-time analyzer have corresponding index terms.  Of course, this condition is satisfied by using the same analyzer at both index- and query-time.  One example of compatible, but different, analyzers is a pair in which only one side, index-time or query-time, performs synonym injection.
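
The compatibility condition can be sketched in plain Java, without Lucene. The two "analyzers" below are toy stand-ins with hypothetical names, not Lucene classes: one lowercases and splits on whitespace (keeping symbols), the other lowercases and keeps only runs of letters and digits. The check simply asks whether every query-time term has a matching index-time term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: toy analyzers in plain JDK code, not Lucene's classes.
final class AnalyzerCompatibility {

    /** Toy analyzer A: lowercases and splits on whitespace, so symbols stay in the terms. */
    static List<String> whitespaceTerms(String text) {
        List<String> terms = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Toy analyzer B: lowercases and emits only runs of letters and digits. */
    static List<String> letterDigitTerms(String text) {
        List<String> terms = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i < lower.length(); ) {
            int cp = lower.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    /** True when every query-time term has a corresponding index-time term. */
    static boolean allTermsIndexed(List<String> indexTerms, List<String> queryTerms) {
        Set<String> indexed = new HashSet<String>(indexTerms);
        return indexed.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String text = "Digital\u00A5";
        // Indexed with symbols kept, queried with symbols dropped: no corresponding term.
        System.out.println(allTermsIndexed(whitespaceTerms(text), letterDigitTerms(text)));
        // Same toy analyzer on both sides: the terms line up.
        System.out.println(allTermsIndexed(letterDigitTerms(text), letterDigitTerms(text)));
    }
}
```

With the symbol kept at index time and dropped at query time, the query term "digital" has no counterpart among the indexed terms, which mirrors the mismatch described above. Using the same toy analyzer on both sides makes the check pass.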

I don't know why you weren't seeing this problem with Lucene 1.4.2, but is it possible that the 1.4.2-created index did *not* have these two symbols?  If that were true, then you would get the hits you're looking for, though you might get some others that you don't want.

Steve
