Using wildcard with accented words

Kamran Shadkhast
I have a problem searching for accented words with wildcards, even though I have configured the schema with <filter class="solr.ISOLatin1AccentFilterFactory"/> in both the index and query analyzers.
It works for q=chrétien, which finds documents containing "chretien", but searching for q=chré* does not work, while q=chre* works fine.
Is this a bug, or am I doing something wrong?

Re: Using wildcard with accented words

Erik Hatcher

On Oct 22, 2007, at 3:45 PM, kshadkhast wrote:
> I have problem searching accented words with wild card. although I have
> configured schema using <filter class="solr.ISOLatin1AccentFilterFactory"/>
> both in index and query part.
> it is working for q=chrétien and find documents with "chretien" but
> searching for q=chré* does not work,  but q=chre* works fine.
> is this a bug or I am doing something wrong?

It's a bit tricky here. Lucene's QueryParser, the heart of Solr's
query parsing, does not analyze wildcard query parts. Stemmed words
show why analyzing them would be a problem in general: a stemmer
applied to a partial word could produce a prefix that no longer
matches the indexed terms. In this case it does make sense to run
the term through a filter that normalizes diacritics, but
unfortunately Solr doesn't support what you need at this point.

        Erik
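
To make the behaviour described above concrete, here is a small Python simulation. It is not Solr or Lucene code: `fold_accents`, `term_query`, and `prefix_query_unanalyzed` are hypothetical names, and the accent folding (Unicode decomposition plus dropping combining marks) is only a rough stand-in for ISOLatin1AccentFilterFactory. It assumes, as stated in the thread, that indexed tokens and ordinary query terms are analyzed while wildcard parts are not.

```python
import unicodedata

def fold_accents(text):
    """Rough stand-in for an accent filter: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Index time: every token is analyzed, so only folded terms are stored.
indexed_terms = {fold_accents(t.lower()) for t in ["Chrétien", "chrome", "christian"]}

def term_query(word):
    # Ordinary query terms go through the same analysis as indexed text.
    return fold_accents(word.lower()) in indexed_terms

def prefix_query_unanalyzed(prefix):
    # Wildcard parts skip analysis: the raw prefix is compared to indexed terms.
    return sorted(t for t in indexed_terms if t.startswith(prefix))

print(term_query("chrétien"))           # True
print(prefix_query_unanalyzed("chré"))  # []
print(prefix_query_unanalyzed("chre"))  # ['chretien']
```

The accented prefix finds nothing simply because no stored term contains "é" any more, which matches the symptom reported in the original question.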

Re: Using wildcard with accented words

Erik Hatcher

On Oct 22, 2007, at 4:06 PM, Erik Hatcher wrote:

> It's a bit tricky here.... Lucene's QueryParser, the heart of  
> Solr's query parsing, does not analyze wildcard query parts.  
> Consider stemmed words, for example, on why that is a problem.   In  
> this case it does make sense to run it through a filter that  
> normalizes diacritics on characters, but unfortunately Solr doesn't  
> support what you need at this point.

Further on this, QueryParser does have some settings specific to
wildcard queries, such as lowercasing the prefix part.

Perhaps this is a case that Solr could address with a third analyzer
configuration (it already differentiates between "query" and "index")
that would be applied to wildcard queries. Thoughts on that?

        Erik

Re: Using wildcard with accented words

Yonik Seeley-2
On 10/22/07, Erik Hatcher wrote:
> Perhaps this is a case that Solr could address with a third analyzer
> configuration (it already has "query", and "index" differentiation)
> that could be incorporated for wildcard queries.   Thoughts on that?

I've actually thought about it previously. It would be nice for it
to all work automatically for the user. It seems the implementation
should work at the TokenFilter level, so that things like synonym
filters, stemmers, etc., would do nothing for wildcard terms.

Perhaps add some new methods to BaseTokenFilterFactory to do prefix,
wildcard, etc, transformations?
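
One way to picture this proposal is the Python sketch below. It is purely illustrative: the classes and the `apply_to_prefix` method are hypothetical and do not correspond to any existing Solr or Lucene API. The idea is that each filter opts in to transforming wildcard prefixes, and filters that are unsafe on partial words (here a toy stemmer) inherit a no-op.

```python
import unicodedata

class TokenFilter:
    """Hypothetical base class; not Solr's BaseTokenFilterFactory."""
    def apply(self, token):
        return token

    def apply_to_prefix(self, prefix):
        # Default: leave wildcard prefixes untouched.
        return prefix

class LowerCaseFilter(TokenFilter):
    def apply(self, token):
        return token.lower()
    apply_to_prefix = apply  # lowercasing is safe on partial words

class AccentFilter(TokenFilter):
    def apply(self, token):
        decomposed = unicodedata.normalize("NFD", token)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    apply_to_prefix = apply  # folding is character-wise, so also safe

class ToyStemFilter(TokenFilter):
    def apply(self, token):
        # Toy stemmer: strip one trailing "s". Inherits the no-op for prefixes.
        return token[:-1] if token.endswith("s") else token

def analyze(chain, token):
    for f in chain:
        token = f.apply(token)
    return token

def analyze_prefix(chain, prefix):
    for f in chain:
        prefix = f.apply_to_prefix(prefix)
    return prefix

chain = [LowerCaseFilter(), AccentFilter(), ToyStemFilter()]
print(analyze(chain, "Chrés"))         # chre
print(analyze_prefix(chain, "Chrés"))  # chres
```

With this arrangement a single filter chain serves both cases, and the stemmer never touches the partial word.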

Another gotcha is handling multiple tokens.
What happens if someone queries for myfield:foo-bar*
with a letter tokenizer or a word-delimiter filter?  It's not a simple
prefix query at all!

-Yonik
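
The multi-token gotcha can be seen with a toy tokenizer. The Python sketch below is an assumption-laden illustration, not Solr's letter tokenizer: `letter_tokenize` and `describe_prefix_query` are hypothetical helpers, and the sketch only shows that the input splits into several tokens. How those tokens should then be combined into a query is exactly the open question raised above.

```python
import re

def letter_tokenize(text):
    """Toy letter tokenizer: keeps runs of letters, splits on everything else."""
    return re.findall(r"[^\W\d_]+", text)

def describe_prefix_query(raw):
    # Drop the trailing "*" and tokenize what is left.
    body = raw[:-1] if raw.endswith("*") else raw
    tokens = letter_tokenize(body)
    if len(tokens) <= 1:
        # Single token: an ordinary prefix query.
        return {"prefix": tokens[0] if tokens else ""}
    # Several tokens: complete terms followed by a final prefix.
    return {"terms": tokens[:-1], "prefix": tokens[-1]}

print(describe_prefix_query("foo*"))      # {'prefix': 'foo'}
print(describe_prefix_query("foo-bar*"))  # {'terms': ['foo'], 'prefix': 'bar'}
```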

Re: Using wildcard with accented words

Guillaume Smet
On 10/23/07, Yonik Seeley wrote:
> I've actually thought about it previously.... it would be nice for it
> to all work automatically for the user.  Seems like the implementation
> should be based on the TokenFilter level, then things like synonym
> filters, stemmers, etc, would do nothing.

I agree that it could be really useful. Currently, we have to
reimplement the behaviour of the ISOLatin1AccentFilterFactory filter
in the client part of our applications. It would be great to have
this in Solr directly, and it would be more consistent with the
behaviour of non-wildcard queries.

Regards,

--
Guillaume
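
A client-side workaround of the kind described above might look like the following Python sketch. It is an approximation under stated assumptions: `fold_accents` and `build_query_params` are hypothetical names, and folding via Unicode decomposition may not match the Solr filter's mapping for every character (ligatures such as æ, for instance, are not decomposed by NFD).

```python
import unicodedata

def fold_accents(text):
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_query_params(user_query):
    # Fold the whole query string before it is sent, so wildcard terms
    # line up with the folded terms stored in the index.
    return {"q": fold_accents(user_query)}

print(build_query_params("chré*"))  # {'q': 'chre*'}
```

Folding the entire query string is the simplest choice here; it is harmless for the non-wildcard terms because the server-side filter would fold them anyway.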