strange issues with IRISH


Ostap Bender
Hi All,

 

I've come across a very strange issue with the Irish language.

I have the following set of strings in Irish:

 

ag an gcrosbhealach seo,

Lean ar an mórbhealach.,

Lean an bóthar seo.,

An bhfuil ... in am imeacht?,

An ... sin an t-am ceart?

 

And here is a search string: an

 

The search returns nothing instead of all of those phrases. I'm using a simple
analyzer, but I suspect that [an] is still being ignored as a stop word for some
reason.

I've tried a custom analyzer with the following code:

 

TokenStream ts = new WhitespaceTokenizer(reader);
ts = new LowerCaseFilter(ts);
return ts;

 

with no luck.

 

Any ideas?

 

Thanks.

Re: strange issues with IRISH

John Byrne-3
Hi,

"suspect that [an] is still ignored as a stop word for some reason"

Yes, "an" is still a stop word in English of course! (eg. 'an apple')

Your custom analyzer should work; are you making sure to do both your
indexing *and* your searching with the new analyzer?
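
To illustrate why the analyzer has to match on both sides, here is a minimal plain-Java sketch. It does not use the Lucene API: `analyze` is a hypothetical stand-in for a whitespace tokenizer plus lowercase filter, the stop list is an assumed English-style list, and `countHits` is a simplified model of a single-term search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerMismatchSketch {
    // Assumed English-style stop list; only "an" matters for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // The five strings from the original question.
    static final String[] DOCS = {
        "ag an gcrosbhealach seo,",
        "Lean ar an mórbhealach.,",
        "Lean an bóthar seo.,",
        "An bhfuil ... in am imeacht?,",
        "An ... sin an t-am ceart?"
    };

    // Stand-in for whitespace tokenizing followed by lowercasing,
    // with optional stop word removal.
    static List<String> analyze(String text, boolean removeStopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty()) {
                continue;
            }
            if (removeStopWords && STOP_WORDS.contains(token)) {
                continue;
            }
            tokens.add(token);
        }
        return tokens;
    }

    // Counts documents whose index-time tokens contain any query-time token.
    static int countHits(String[] docs, String query, boolean stopAtIndex, boolean stopAtQuery) {
        List<String> queryTokens = analyze(query, stopAtQuery);
        int hits = 0;
        for (String doc : docs) {
            List<String> docTokens = analyze(doc, stopAtIndex);
            for (String q : queryTokens) {
                if (docTokens.contains(q)) {
                    hits++;
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same stop-word-free analysis on both sides.
        System.out.println(countHits(DOCS, "an", false, false));
        // Index built with stop words removed, query analyzed without removal.
        System.out.println(countHits(DOCS, "an", true, false));
        // Index keeps stop words, but the query analyzer removes "an".
        System.out.println(countHits(DOCS, "an", false, true));
    }
}
```

In this model only the first combination finds anything. If an index was already built with a stop-filtering analyzer, switching the analyzer at search time alone would not bring "an" back; the documents would need to be re-indexed.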

I think making a list of Irish stop words could be tricky, since "an"
sometimes means "the", but sometimes forms part of a verb (eg. "an
bhfuil...?")

The safest bet is probably not to bother removing stop words. These days
it doesn't really affect performance much, storage space is generally not
much of an issue, and it makes phrase searching more accurate if you
keep them.
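
The point about phrase searching can be sketched in plain Java as well. This is a simplified model, not Lucene code: the Irish stop list ("an", "ar") and the query phrase "lean an mórbhealach." are hypothetical, and the model simply drops stop words without recording position gaps, which a real search library may handle differently depending on configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordPhraseSketch {
    // Hypothetical Irish stop list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("an", "ar"));

    // Whitespace split plus lowercasing, with optional stop word removal.
    static List<String> tokens(String text, boolean removeStopWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty() || (removeStopWords && STOP_WORDS.contains(token))) {
                continue;
            }
            out.add(token);
        }
        return out;
    }

    // True if the phrase tokens occur consecutively in the document tokens.
    static boolean matches(String doc, String phrase, boolean removeStopWords) {
        List<String> docTokens = tokens(doc, removeStopWords);
        List<String> phraseTokens = tokens(phrase, removeStopWords);
        return !phraseTokens.isEmpty()
                && Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        String doc = "Lean ar an mórbhealach.";
        String phrase = "lean an mórbhealach."; // not the same wording as the document
        // Stop words kept: the differing phrase does not match.
        System.out.println(matches(doc, phrase, false));
        // Stop words removed: both collapse to the same tokens and match.
        System.out.println(matches(doc, phrase, true));
    }
}
```

With stop words kept, the phrase query only matches text that actually contains those words in that order. With them removed, two different wordings become indistinguishable in this model.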

-John
