Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

SR
Hi,

I'm using Solr 1.4 and I need to use a Latin Accent Filter. In the Solr wiki (http://wiki.apache.org/solr/SchemaDesign), it's recommended to use MappingCharFilterFactory instead of ISOLatin1AccentFilterFactory.

Could someone tell me the reason for choosing the first filter instead of the second one?

In the same wiki, they say that CharStreamAwareWhitespaceTokenizerFactory must be used with MappingCharFilterFactory. But when I use this tokenizer and filter together, I get a severe error saying that the field type containing them is unknown. However, it works when I use this filter with StandardTokenizerFactory or WhitespaceTokenizerFactory.

I saw on the Web that others have faced this problem, but I didn't see any solution. Does anyone have an idea how to fix this issue?

Thanks,
-Saïd

Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

Koji Sekiguchi

> In the same wiki, they say that CharStreamAwareWhitespaceTokenizerFactory must be used with MappingCharFilterFactory. But when I use this tokenizer and filter together, I get a severe error saying that the field type containing them is unknown. However, it works when I use this filter with StandardTokenizerFactory or WhitespaceTokenizerFactory.
>
The wiki is no longer correct. Before Lucene 2.9 (and Solr 1.4),
Tokenizers took a Reader argument in their constructor. Since then,
because they can take a CharStream argument in the constructor,
the *CharStreamAware* Tokenizers are no longer needed (all Tokenizers
are aware of CharStream). I'll update the wiki.
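
For illustration, a field type along these lines should load on Solr 1.4 with the ordinary tokenizer. This is only a sketch: the type name text_accent_folded is made up, and it assumes the mapping-ISOLatin1Accent.txt file from the Solr example configuration is present in the conf directory.

```xml
<!-- Hypothetical field type; the name is illustrative only. -->
<fieldType name="text_accent_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filter: rewrites accented characters before tokenization. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Plain tokenizer; no CharStreamAware variant is required. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```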

Koji

--
http://www.rondhuit.com/en/


Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

SR
Thanks Koji for the reply and for updating the wiki. As it's written now in the wiki, it sounds (at least to me) like MappingCharFilterFactory works only with WhitespaceTokenizerFactory.

Did you really mean that? This filter also works with other tokenizers. For instance, in my text type, I'm using StandardTokenizerFactory for document processing and WhitespaceTokenizerFactory for query processing.
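
A setup like this might look roughly as follows. It is a sketch only: the type name text_mixed and the mapping file name are assumptions, not taken from an actual schema.

```xml
<!-- Hypothetical field type with separate index-time and query-time chains. -->
<fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Documents are tokenized with the standard tokenizer. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Queries are split on whitespace only. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```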

I also noticed that, wherever you place this filter in the definition of a field type, it's always applied (during text processing) before the tokenizer and all the other filters. Is there a reason for that? Is it possible to force the filter to be applied at a specific position among the other filters?

Thanks,
-S

On Jul 5, 2010, at 4:28 PM, Koji Sekiguchi wrote:

> The wiki is no longer correct. Before Lucene 2.9 (and Solr 1.4),
> Tokenizers took a Reader argument in their constructor. Since then,
> because they can take a CharStream argument in the constructor,
> the *CharStreamAware* Tokenizers are no longer needed (all Tokenizers
> are aware of CharStream). I'll update the wiki.
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>


Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

Koji Sekiguchi
No, all tokenizers can be used with MappingCharFilterFactory.

Koji Sekiguchi from mobile


On 2010/07/06, at 0:32, Saïd Radhouani wrote:

> Thanks Koji for the reply and for updating the wiki. As it's written now in the wiki, it sounds (at least to me) like MappingCharFilterFactory works only with WhitespaceTokenizerFactory.
>
> Did you really mean that? This filter also works with other tokenizers. For instance, in my text type, I'm using StandardTokenizerFactory for document processing and WhitespaceTokenizerFactory for query processing.
>
> I also noticed that, wherever you place this filter in the definition of a field type, it's always applied (during text processing) before the tokenizer and all the other filters. Is there a reason for that? Is it possible to force the filter to be applied at a specific position among the other filters?
>
> Thanks,
> -S

Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

Jan Høydahl / Cominvent
In reply to this post by SR
Char filters MUST come before the Tokenizer, because they process the character stream and not the tokens.

If you need to apply the accent normalization later in the analysis chain, either use ISOLatin1AccentFilterFactory or help with the implementation of SOLR-1978.
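
As a sketch of the first option, the accent filter can be declared as an ordinary token filter, which runs at whatever position it occupies in the filter list. The type name below is hypothetical and the choice of neighbouring filters is arbitrary.

```xml
<!-- Hypothetical field type; accent folding happens on tokens, not on the character stream. -->
<fieldType name="text_late_folding" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Token filter: applied here, after tokenization and lowercasing. -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>
```

Moving the ISOLatin1AccentFilterFactory line up or down among the filter elements changes when it is applied, which is not possible with a charFilter.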

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 5 July 2010, at 17:32, Saïd Radhouani wrote:

> Thanks Koji for the reply and for updating the wiki. As it's written now in the wiki, it sounds (at least to me) like MappingCharFilterFactory works only with WhitespaceTokenizerFactory.
>
> Did you really mean that? This filter also works with other tokenizers. For instance, in my text type, I'm using StandardTokenizerFactory for document processing and WhitespaceTokenizerFactory for query processing.
>
> I also noticed that, wherever you place this filter in the definition of a field type, it's always applied (during text processing) before the tokenizer and all the other filters. Is there a reason for that? Is it possible to force the filter to be applied at a specific position among the other filters?
>
> Thanks,
> -S