Search with Accent and without accent Character

Rushikesh K
Hello all,
I integrated Nutch with Solr and everything has worked fine so far, but I
have an issue when searching for Spanish words with accented characters:
the results differ. Searching with the accent (example: investigación)
returns the correct results, but searching without it (example:
investigacion) returns zero results. I have tried various filters, but the
issue remains. Here is my configuration on Nutch and Solr.


<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>

I would really appreciate it if any of you could tell me what I am missing.
--
Regards
Rushikesh M
.Net Developer
Re: Search with Accent and without accent Character

kamaci
Hi Rushi,

This is a Solr-specific question, but let me answer it. You can open the
Analysis tab in your Solr dashboard and check whether the index and query
analyses produce the same output. That panel gives you analyzer-by-analyzer
debug output.

Kind Regards,
Furkan KAMACI

On Tue, Feb 13, 2018 at 8:40 PM, Rushi <[hidden email]> wrote:

> [...]

Re: Search with Accent and without accent Character

BlackIce
Hi,

As stated, it's a Solr question, but I'll give you a hint (I don't have
access to the server right now): stemming works differently for Spanish than
for English. If I remember correctly, I had to use the Hunspell tokenizer set
for Spanish, or something similar to that.

Sorry I can't be more precise, but you now have a better starting point than
I had :)


Greetz

RRK


Re: Search with Accent and without accent Character

BlackIce
Also, in order for Spanish accents to be properly stemmed, something had
to be set to ISO Latin, and a proper file had to be supplied to Solr.

I'm on a tablet and can't access the server to look.
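
For reference, a rough sketch of what a Hunspell-based Spanish analyzer chain might look like. Solr exposes Hunspell as a token filter (solr.HunspellStemFilterFactory) rather than as a tokenizer. The field type name text_es_hunspell and the file names es_ES.dic and es_ES.aff are assumptions for illustration only; the dictionary and affix files have to be obtained separately and supplied to Solr. Stemming is a separate concern from accent folding, so this on its own would not be expected to make investigacion match investigación.

```xml
<!-- Hypothetical field type: the name and the dictionary file names are assumptions -->
<fieldType name="text_es_hunspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Dictionary-based stemming; es_ES.dic and es_ES.aff must be supplied by you -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true"/>
    <!-- Fold accents after stemming so the dictionary lookup still sees accented forms -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Placing the folding filter after the stemmer is a design choice in this sketch, not something confirmed in the thread. An unaccented query term may not be found in the dictionary and would then pass through unstemmed, so results for accented and unaccented spellings could still differ for some words.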

RE: Search with Accent and without accent Character

Markus Jelsma-2
In reply to this post by Rushikesh K
Hi,

My guess is that you haven't reindexed after changing the filter configuration, which is required for index-time filters.

Regarding your fieldType, you can drop the lowercase and ASCII folding filters and just keep the ICU folder; it will work for pretty much any character set. It will normalize case and Scandinavian digraphs (AE), and probably Dutch digraphs (IJ) as well. It also deals with German ö and ü, the eszett (ß, "Ringel-S"), and all regular Latin accents, including the Spanish tilde, the circumflex, etc.

If there is a language-specific normalizer/folder, use that instead of ICU, because there can be differences in how accents should be normalized across languages.

And do not forget to reindex, and to use the same normalizers at index time and query time.
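
A minimal sketch of the simplified field type described above: it is the original text_es with the lowercase and ASCII folding filters removed and everything else left as posted. It assumes the ICU analysis classes are available to the core; in Solr they ship with the analysis-extras contrib rather than with the core distribution.

```xml
<!-- Sketch only: the original text_es with the redundant filters removed -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ICU folding covers both case folding and accent removal -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Index-time only: prefix grams, kept exactly as in the original config -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Same folding at query time, so investigacion and investigación fold to the same term -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before the change keep their old terms until they are reindexed, so the unaccented query can only match after reindexing.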

Regards,
Markus

 
 
RE: Search with Accent and without accent Character

Markus Jelsma-2
In reply to this post by Rushikesh K
Checked and confirmed: even the Dutch digraph IJ is folded properly, as are the upper-case dotless Turkish i and the Spanish example you provided.

A correction for German (before Nagel corrects me): ö and ü are not normalized by the ICU folder according to German rules. Their accents are stripped instead of being transformed into oe and ue respectively. This makes the case for language-specific folders, especially when dealing with Scandinavian languages or German. Dutch and Latin can be folded just by removing their accents.
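
As one illustration of a language-specific folder, Solr ships solr.GermanNormalizationFilterFactory. According to the Lucene documentation, it reduces ä, ö, ü and their ae, oe, ue spellings to a common form (with an exception for some occurrences of ue) and rewrites ß as ss. A minimal sketch, assuming a hypothetical field type named text_de that is not part of the configuration discussed in this thread:

```xml
<!-- Hypothetical field type for German; the name text_de is an assumption -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- German-specific normalization instead of generic accent stripping -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
```

With a single analyzer element the same chain applies at index time and query time, which keeps both sides consistent.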

Correct me if I'm wrong!
Markus
 