Accent insensitive search for greek characters

Chitra R
Hi,

   I want to make search accent-insensitive for Greek characters by removing
accent marks or replacing accented characters with their unaccented equivalents.

E.g., when searching for an accented Greek word such as *πῬοἲὅν*, we expect an
accent-insensitive match, i.e. it should be treated like the unaccented form *προιον*.



Moreover, I do not know much about Greek characters, so I am
looking for standard rules for performing accent-insensitive Greek search.


Does *ICUFoldingFilter* solve my case? I have already tried it, and it
works fine for accented Greek characters. But it is not language
specific: it has internationalization support for all languages. So I am
not sure whether it will break the existing behavior of other languages in my index.


Is there any way to make ICUFoldingFilter language specific?



--
Regards,
Chitra

Re: Accent insensitive search for greek characters

Shawn Heisey-2
On 10/13/2017 1:28 AM, Chitra wrote:

>    I want to make search accent-insensitive for Greek characters by removing
> accent marks or replacing accented characters with their unaccented equivalents.
>
> E.g., when searching for an accented Greek word such as *πῬοἲὅν*, we expect an
> accent-insensitive match, i.e. it should be treated like the unaccented form *προιον*.
>
> Moreover, I do not know much about Greek characters, so I am
> looking for standard rules for performing accent-insensitive Greek search.
>
> Does *ICUFoldingFilter* solve my case? I have already tried it, and it
> works fine for accented Greek characters. But it is not language
> specific: it has internationalization support for all languages. So I am
> not sure whether it will break the existing behavior of other languages in my index.
>
> Is there any way to make ICUFoldingFilter language specific?

The entire point of the ICU filters is that they are functional across
all of Unicode -- all languages.  As far as I am aware, there is no way
to adjust what ICUFoldingFilter does.  According to the code, it
offloads all work to IBM's ICU library and does not offer any
configurability.

The following filters also exist, with less functionality than the ICU
filter:

https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ASCIIFoldingFilter
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-LowerCaseFilter

Those filters operate on single characters from the input, which means
they cannot take character context into account like ICU does.  If I am
reading what the ASCII filter does correctly, it may not work for Greek
characters at all -- it says that it folds to the lower range of ASCII,
and that character set doesn't have Greek letters.
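
For reference, the general Unicode technique behind this kind of folding can be
sketched outside of Lucene. The following is a minimal Python sketch using only
the standard unicodedata module. It is an illustration, not what
ICUFoldingFilter or ASCIIFoldingFilter do internally, and the function name
fold_accents is made up for the example. It decomposes each character (NFD),
drops the combining marks, and lowercases what is left.

```python
import unicodedata

def fold_accents(text):
    """Strip combining marks after canonical decomposition, then lowercase."""
    # NFD splits a precomposed letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything that is not a combining mark (general category M*).
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.category(ch).startswith("M"))
    return stripped.lower()

# An accented spelling like the one in the question, written with escapes
# so the code points are explicit.
word = "\u03c0\u1fec\u03bf\u1f32\u1f45\u03bd"
print(fold_accents(word))  # prints προιον
```

Like the ICU filter, this sketch is not language specific: it would also turn
"é" into "e".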

Thanks,
Shawn


Re: Accent insensitive search for greek characters

Chitra R
Hi Shawn,
   Thank you so much for the kind response.


> Those filters operate on single characters from the input, which means
> they cannot take character context into account like ICU does.  If I am
> reading what the ASCII filter does correctly, it may not work for Greek
> characters at all -- it says that it folds to the lower range of ASCII,
> and that character set doesn't have Greek letters


Yes, as of now we are using ASCIIFoldingFilter and LowerCaseFilter for
diacritic removal and case folding, but in some cases they don't work for
accented Greek characters.

That is why I am looking for a better solution.


--
Regards,
Chitra

Re: Accent insensitive search for greek characters

Alexandre Rafalovitch
In reply to this post by Chitra R
There is also ICUTransformFilter, which is insanely powerful and can be configured.

I did something similar for a Thai test at
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml
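
To make the idea of a configurable, script-restricted fold concrete, here is a
rough Python sketch of the logic only. It is not the ICU transform API or its
rule syntax, and fold_greek_only and is_greek are hypothetical helper names. It
assumes that a Greek letter can be recognized by its Unicode character name.
Marks are dropped only when they follow a Greek base letter, so accents in
other scripts are left alone.

```python
import unicodedata

def is_greek(ch):
    # Assumption: a character counts as Greek if its Unicode name starts with "GREEK".
    return unicodedata.name(ch, "").startswith("GREEK")

def fold_greek_only(text):
    out = []
    last_base_is_greek = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            # Combining mark: drop it only when it belongs to a Greek base letter.
            if not last_base_is_greek:
                out.append(ch)
        else:
            last_base_is_greek = is_greek(ch)
            out.append(ch)
    # Recompose so that untouched non-Greek letters get their precomposed form back.
    return unicodedata.normalize("NFC", "".join(out))

# "café" keeps its accent, the Greek word loses its dialytika and tonos.
print(fold_greek_only("caf\u00e9 \u03c0\u03c1\u03bf\u03ca\u03cc\u03bd"))  # prints café προιον
```

Note that the final NFC step also recomposes any non-Greek text that was
already decomposed in the input, which may or may not be what an index expects.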

Regards,
    Alex
