Hindi, diacritics and search results

Hindi, diacritics and search results

Ostap Bender
Hi All,

I'm using the default setup of Lucene (no custom analyzers configured) and
came across the following issue:

In Hindi, if a phrase contains a letter with a diacritic, Lucene will find
the phrase even if the search string contains the letter without the
diacritic.

Is this expected behavior? Is it perhaps standard for all languages whose
letters take diacritics?

 

From a pure byte standpoint I can see the logic: in UTF-8 the letter with its
diacritic takes 6 bytes (E0 A4 95 E0 A5 87) and the bare letter takes 3
(E0 A4 95), so if I search for *some_letter*, where some_letter is encoded as
(E0 A4 95), Lucene finds the "phrase" (E0 A4 95 E0 A5 87) that includes that
letter.
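
For reference, a minimal sketch of what those bytes decode to. It uses only the JDK, not Lucene; the class and method names (`Utf8CodePoints`, `describe`) are made up for illustration. The six bytes are two separate code points, U+0915 (DEVANAGARI LETTER KA) followed by U+0947 (DEVANAGARI VOWEL SIGN E), so the first three bytes on their own are a complete letter.

```java
import java.nio.charset.StandardCharsets;

class Utf8CodePoints {
    /** Decodes UTF-8 bytes and formats each resulting code point as U+XXXX. */
    static String describe(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Byte sequences quoted in the message above.
        byte[] withSign = {(byte) 0xE0, (byte) 0xA4, (byte) 0x95,
                           (byte) 0xE0, (byte) 0xA5, (byte) 0x87};
        byte[] bare = {(byte) 0xE0, (byte) 0xA4, (byte) 0x95};
        System.out.println(describe(withSign)); // U+0915 U+0947
        System.out.println(describe(bare));     // U+0915
    }
}
```

Whether a search for the bare letter should match then depends on where the analyzer puts token boundaries, not on the byte layout as such.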

 

Any comments much appreciated.

Thanks.

 


Re: Hindi, diacritics and search results

Robert Muir
Which analyzer in particular are you using?

It's probably not doing what you want for Hindi. These "diacritics" are
important (vowels, etc.).
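
A small JDK-only sketch of why such a sign can trip up a tokenizer (the class name `MarkCheck` and its helper are illustrative, not Lucene API): Unicode classifies the vowel sign U+0947 (bytes E0 A5 87 in the original message) as a combining mark, not a letter, so any tokenizer that keeps only "letters" will cut the word right after the consonant.

```java
class MarkCheck {
    /** True if the code point is a combining mark (Unicode category Mn, Mc or Me). */
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        int ka = 0x0915;    // DEVANAGARI LETTER KA
        int signE = 0x0947; // DEVANAGARI VOWEL SIGN E
        System.out.println(Character.isLetter(ka));    // true
        System.out.println(Character.isLetter(signE)); // false
        System.out.println(isMark(signE));             // true
    }
}
```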


On Fri, Jul 10, 2009 at 3:10 PM, OBender<[hidden email]> wrote:

> In Hindi, if a phrase contains a letter with a diacritic, Lucene will find
> the phrase even if the search string contains the letter without the
> diacritic.



--
Robert Muir
[hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Hindi, diacritics and search results

Ostap Bender
I'm using the default analyzer. Actually, it is the one set by default by the Compass framework, but I assume it is the same one that Lucene would use by default.
Which one should I use?

-----Original Message-----
From: Robert Muir [mailto:[hidden email]]
Sent: Friday, July 10, 2009 6:13 PM
To: [hidden email]
Subject: Re: Hindi, diacritics and search results

Which analyzer in particular are you using?

It's probably not doing what you want for Hindi. These "diacritics" are
important (vowels, etc.).




Re: Hindi, diacritics and search results

Robert Muir
There is really no default analyzer in Lucene.

A good start for Hindi would be to try WhitespaceAnalyzer.
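
A rough sketch of the difference, in plain Java and not Lucene code. The letter-only splitter is a hypothetical stand-in for an analyzer that ends a token at any non-letter; the thread does not establish which analyzer Compass actually configures, so treat that part as an assumption. All names (`SplitComparison`, `onWhitespace`, `onNonLetters`) are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitComparison {
    /** Splits on whitespace only, leaving each word (marks included) intact. */
    static List<String> onWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Hypothetical letter-only splitter: any non-letter code point ends the token. */
    static List<String> onNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+0915 U+0947 (letter plus vowel sign), a space, then the bare letter U+0915.
        String text = "\u0915\u0947 \u0915";
        System.out.println(onWhitespace(text)); // two distinct tokens
        System.out.println(onNonLetters(text)); // the same bare letter twice
    }
}
```

With whitespace splitting, the syllable with its vowel sign and the bare letter stay different terms, so a query for one does not match the other. With the letter-only splitter both collapse to the bare letter, which is the behavior described in the original message.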

On Fri, Jul 10, 2009 at 9:13 PM, OBender Hotmail<[hidden email]> wrote:

> I'm using default analyzer. Actually one that is set by default by Compass framework but I assume it is the same that would be used in Lucene by default.
> Which one should I use?


Re: Hindi, diacritics and search results

KK-4
Apart from using WhitespaceAnalyzer, which tokenizes words based on
whitespace, you can try writing a simple custom analyzer that does a bit more.
I did the following to handle Indic languages intermingled with English
content:

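import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.WordDelimiterFilter; // taken from Solr, not part of Lucene

/**
 * Index-time analyzer for Indian-language text mixed with English.
 */
public class IndicAnalyzerIndex extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only, so Indic words are kept whole at this step.
        TokenStream ts = new WhitespaceTokenizer(reader);
        /*
         * WordDelimiterFilter arguments, in order after the token stream:
         *   generateWordParts   if 1, generate parts of words: "PowerShot" => "Power" "Shot"
         *   generateNumberParts if 1, generate number subwords: "500-42" => "500" "42"
         *   catenateWords       if 1, catenate maximum runs of word parts: "wi-fi" => "wifi"
         *   catenateNumbers     if 1, catenate maximum runs of number parts: "500-42" => "50042"
         *   catenateAll         if 1, catenate all subword parts: "wi-fi-4000" => "wifi4000"
         */
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        // Lowercase before stop-word removal: this StopFilter constructor matches
        // case-sensitively, and the English stop-word list is all lowercase.
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        // English stemming; non-English tokens are expected to pass through unchanged.
        ts = new PorterStemFilter(ts);
        return ts;
    }
}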

The above is for indexing. For querying, use the following values for the
WordDelimiterFilter constructor and keep everything else the same:
ts = new WordDelimiterFilter(ts, 1, 1, 0, 0, 0);

I pulled the WordDelimiterFilter class from a Solr nightly build, as nothing
like it is available in Lucene, AFAIK.
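
To make the index/query difference concrete, here is a simplified JDK-only sketch of the splitting and catenation idea. It is not Solr's implementation: `WordDelimiterSketch`, `parts` and `expand` are hypothetical names, the three separate catenate flags are collapsed into one boolean, and treating combining marks as word characters is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class WordDelimiterSketch {
    /** Letters, digits and combining marks count as word characters in this sketch. */
    static boolean isWordChar(char c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    /** Splits at delimiters such as '-' and at lower-to-upper case changes. */
    static List<String> parts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean caseChange = i > 0 && Character.isUpperCase(c)
                    && Character.isLowerCase(token.charAt(i - 1));
            if ((!isWordChar(c) || caseChange) && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (isWordChar(c)) {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    /** Returns the parts, plus their concatenation when catenate is true. */
    static List<String> expand(String token, boolean catenate) {
        List<String> out = parts(token);
        if (catenate && out.size() > 1) {
            out.add(String.join("", out));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("wi-fi", true));  // index time: parts plus "wifi"
        System.out.println(expand("wi-fi", false)); // query time: parts only
    }
}
```

Because the catenated form is added only at index time, a document containing "wi-fi" can be found by a query for "wifi" as well as by a query for "wi-fi".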

In my case it's working perfectly fine for all Indian languages mixed with
English content. As you can see, for English it applies the usual processing
(stemming, stop-word removal, etc.). Try it out and do let us know if you face
any issues.

Thanks,
KK.

On Sat, Jul 11, 2009 at 8:05 AM, Robert Muir <[hidden email]> wrote:

> There is really no default analyzer in Lucene.
>
> A good start for Hindi would be to try WhitespaceAnalyzer.