foreign characters equivalent in solr search


foreign characters equivalent in solr search

radarghost
We are using Solr 1.2 and don't want to upgrade to 1.3 until there is an official release for Debian.
I want Solr to treat a foreign character and its plain equivalent as the same, to get better results.

For example:

If a user searches for Tiesto, which is indexed as Tiësto in our Solr, we want Solr to return that result as well.
In general, a search for a word typed with a plain a should also return words containing á, à, â, ä, ã, å,
and likewise e for ë, i for ï, o for ö, and so on.

Is there any solution?

I hope I managed to explain what I need despite my poor English.

Thanks


Re: foreign characters equivalent in solr search

iorixxx
I think the best way to do this is to modify org.apache.lucene.index.memory.SynonymTokenFilter and use this filter at index time.

If token.termBuffer() contains one of those characters (á, à, â, ä, ã, å), you replace it with its equivalent ASCII character (a). Then you inject this new Token as a synonym.

I don't know whether it is the best way, but it will give you what you want.
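
The idea can be sketched without any Lucene classes. The snippet below is only an illustration under stated assumptions: the class name AccentSynonymSketch and the methods fold and expand are hypothetical, and the folding uses the JDK's java.text.Normalizer rather than anything from SynonymTokenFilter. It shows which terms would end up in the index for one input token: the original, plus its accent-free form when the two differ.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Solr.
final class AccentSynonymSketch {

    // Decompose each character (NFD) and drop the combining marks,
    // so an accented letter is reduced to its base letter.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Terms to index for one token: the original first, then the
    // folded form, but only if folding actually changed something.
    static List<String> expand(String term) {
        List<String> out = new ArrayList<>();
        out.add(term);
        String folded = fold(term);
        if (!folded.equals(term)) {
            out.add(folded);
        }
        return out;
    }
}
```

In a real token filter the folded term would be emitted at the same position as the original (position increment 0), which is what makes it behave as a synonym.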

--- On Wed, 2/18/09, radarghost <[hidden email]> wrote:

> From: radarghost <[hidden email]>
> Subject: foreign characters equivalent in solr search
> To: [hidden email]
> Date: Wednesday, February 18, 2009, 4:28 PM
> --
> View this message in context:
> http://www.nabble.com/foreign-characters-equivalent-in-solr-search-tp22079912p22079912.html
> Sent from the Solr - User mailing list archive at
> Nabble.com.




Re: foreign characters equivalent in solr search

Koji Sekiguchi
CharFilter will solve the problem, but it comes with Solr 1.4.

https://issues.apache.org/jira/browse/SOLR-822

Koji


Re: foreign characters equivalent in solr search

radarghost
In reply to this post by iorixxx
Thanks.

We will try that and post the results here, but it seems we may run into problems with the highlighting function.



Re: foreign characters equivalent in solr search

radarghost
In reply to this post by Koji Sekiguchi
Solr 1.4 may take too long to arrive.

Is there any other solution for Solr 1.2?

Anyway, thanks for the reply.


Re: foreign characters equivalent in solr search

iorixxx
In reply to this post by radarghost
> we will try that and post the results here but it seems we
> may get problem with highlight function.

No, highlighting works fine with that. I am also using a similar filter for Turkish characters: I replace ç with c, ş with s, and so on at index time.

Another (easier but less efficient) way to implement this filter is to extend org.apache.lucene.index.memory.SynonymMap and override the public String[] getSynonyms(String word) method. In this case your getSynonyms method will return either new String[0] (no synonym) or a one-element array holding the folded word. The constructor can invoke super(null); without problems.

After that you can pass your custom SynonymMap to the constructor of Lucene's SynonymTokenFilter (without modifying SynonymTokenFilter):

stream = new SynonymTokenFilter(stream, new MySynonymMap(), Integer.MAX_VALUE);

This works because SynonymTokenFilter invokes only the getSynonyms method of SynonymMap.
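
A rough sketch of what the body of such a getSynonyms method might look like follows. It is an assumption-laden illustration: the class FoldingSynonyms is hypothetical and deliberately does not extend the Lucene SynonymMap so that it compiles on its own, and the folding relies on java.text.Normalizer from the JDK instead of a hand-written character table.

```java
import java.text.Normalizer;

// Hypothetical stand-in for a SynonymMap subclass; in real use this
// method body would go into the overridden getSynonyms.
final class FoldingSynonyms {

    private static final String[] NONE = new String[0];

    // Returns an empty array when the word has no accents to strip,
    // otherwise a one-element array with the accent-free form.
    public String[] getSynonyms(String word) {
        String folded = Normalizer.normalize(word, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", ""); // remove combining marks
        return folded.equals(word) ? NONE : new String[] { folded };
    }
}
```

Letters that have no canonical decomposition (for example the Turkish dotless ı) pass through unchanged with this approach, so a real implementation would probably need an explicit mapping for them.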




Re: foreign characters equivalent in solr search

hossman
In reply to this post by radarghost
: if a user searches for Tiesto which is indexed in this format Tiësto in our
: solr. we want solr also return result

This is what the ISOLatin1AccentFilter is for.  It's been included in Solr
since 1.1.

It's been deprecated in favor of the newer ASCIIFoldingFilter, which does
a better job with other charsets, but all of your examples seem to be
Latin-1 chars, so I'm guessing it will probably work pretty well in your
case.



-Hoss
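
An accent-folding filter differs from the synonym approach in that, assuming it is configured for both the index and the query analyzer, the same folding is applied on both sides, so only the folded term is stored. The toy in-memory index below illustrates that effect. Everything in it is hypothetical (FoldedIndexSketch, fold, add, search), and the folding is a java.text.Normalizer approximation, not the actual mapping table of ISOLatin1AccentFilter or ASCIIFoldingFilter.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index; not how Solr stores postings.
final class FoldedIndexSketch {

    // folded term -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Strip combining marks after canonical decomposition.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Index time: split on whitespace and store each token folded.
    void add(int docId, String text) {
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(fold(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time: fold the query term the same way before the lookup.
    Set<Integer> search(String term) {
        return postings.getOrDefault(fold(term), Collections.emptySet());
    }
}
```

With a document containing Tiësto indexed this way, a search for either Tiesto or Tiësto looks up the same folded term and finds the same document.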