Arabic words search in solr


mohanmca01
Hi,

In Solr I want to search by product name using Arabic letters.
Arabic users find it a little difficult to search for some product
names, because some characters require extra keystrokes to type.

Ex: إ أ آ


The characters above are typed with a Shift key combination.
Users usually type the plain “ ا ” character instead, and should still
get words such as the one below.

Ex: إبرا


In my Solr schema.xml I defined the product Arabic name field as below:


<field name="productNameArabic" type="text_ar" indexed="true" stored="true"/>

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>



What changes do I have to make in schema.xml? Please help me with this.



 --
Regards,
Mohan.N
096896429683

Re: Arabic words search in solr

sarowe
Hi Mohan,

The analyzer in your text_ar field type looks like an expanded version of the one suggested in the Solr Reference Guide[1].

Can you give an example of a query and the indexed text that you expect to match but doesn't?

ArabicNormalizationFilterFactory, which uses Lucene’s ArabicNormalizer[2], should convert alefs with hamza to plain alef, among several other normalizations.
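
As a rough illustration of the alef normalization mentioned above, here is a minimal Python sketch. It is not Lucene's ArabicNormalizer, only a stand-in for the single rule that maps the hamza and madda forms of alef to plain alef; the function name `normalize_alef` is made up for this example.

```python
# Simplified stand-in for one normalization rule; NOT Lucene's ArabicNormalizer.
ALEF = "\u0627"  # plain alef
ALEF_VARIANTS = {
    "\u0622": ALEF,  # alef with madda above
    "\u0623": ALEF,  # alef with hamza above
    "\u0625": ALEF,  # alef with hamza below
}


def normalize_alef(text: str) -> str:
    """Map the hamza/madda forms of alef to plain alef."""
    return text.translate(str.maketrans(ALEF_VARIANTS))


# The same word spelled with each alef form (alef form + beh, reh, alef).
indexed = [a + "\u0628\u0631\u0627" for a in (ALEF, *ALEF_VARIANTS)]
normalized = {normalize_alef(word) for word in indexed}

# All four spellings collapse to a single term after normalization.
print(len(normalized) == 1)  # True
```

If the real filter behaves the same way for these characters, a query typed with plain alef and documents indexed with any of the other forms end up as the same term.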

The Light 10 stemming algorithm implemented by ArabicNormalizer and ArabicStemmer[3] is described here: <http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf>.

[1] Solr Ref Guide: Language Analysis: Arabic <https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Arabic>
[2] ArabicNormalizer javadocs <https://lucene.apache.org/core/6_4_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizer.html>
[3] ArabicStemmer javadocs <https://lucene.apache.org/core/6_4_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicStemmer.html>

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Hi Steve,

Thanks for sharing the information.

I went through the Solr reference documentation that you linked. It covers Solr version 6.4.0, but the Solr version used in my project is 4.9.0.

As I mentioned earlier, in my Solr schema.xml I defined the product Arabic name field as below:

/*----------------------------------------------*/
<field name="productNameArabic" type="text_ar" indexed="true" stored="true"/>

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
/*----------------------------------------------*/


I am indexing the Arabic content using the “text_ar” field type.

Table 1: Characters
ا
أ
إ
آ
The Shift key is involved in typing the above.

These are examples of characters where I’m facing search difficulty.

Table 2: Example indexed words
ابرا
أبرا
إبرا
آبرا

These are examples of indexed words in Solr.

Table 3: Search word
ابرا

Now my problem is: by searching for the word in Table 3, I should get all the indexed words in Table 2 in the output.

Is Solr version 4.9.0 compatible with Arabic search, or do I need to upgrade to a higher version?

Kindly let me know if I need to give examples of all the characters, since I gave examples for only one character, alef with hamza.

Thanks,
Mohan

Fwd: Arabic words search in solr

mohanmca01
In reply to this post by mohanmca01
Hi,


On Mon, Jan 30, 2017 at 9:21 PM, Steve Rowe <[hidden email]> wrote:

> Hi Mohan,
>
> I answered your question on the solr-user list.  Did you see my response?
>
> I CC’d you on this email, but you should know that Apache mailing lists
> won’t automatically send you email unless you have subscribed to the list.
> For more information, see <http://lucene.apache.org/solr/community.html#mailing-lists-irc>.
>
> --
> Steve
> www.lucidworks.com

Re: Arabic words search in solr

Erick Erickson
If you look in the upper-left corner of any reference guide page,
you'll see a link to previous versions of the docs, and you can download
whichever version you are working with, back to 4.7 IIRC. I'd download
that and see if there's similar functionality.


Re: Arabic words search in solr

sarowe
In reply to this post by mohanmca01
Mohan,

I downloaded and started Solr 4.9.0 and entered your example indexed and queried words into the Admin UI’s Analysis pane using the text_ar field type.  You can see the results here: <http://sarowe.net:8080/solr-4.9.0.admin.ui.text_ar.analysis.png>.

Each of the indexed words and the query word are analyzed to the same string.  They should match and return docs containing them as hits for the query word.
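
The same comparison can probably be made without the Admin UI, through Solr's field analysis request handler. The sketch below only builds the request URL. The host, port and core name `collection1` are assumptions about a local test setup, and the handler must be registered at `/analysis/field` in solrconfig.xml for the request to work.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def analysis_url(base: str, field_type: str, index_text: str, query_text: str) -> str:
    """Build a field-analysis request URL for one field type."""
    params = {
        "analysis.fieldtype": field_type,    # field type to analyze with
        "analysis.fieldvalue": index_text,   # text as it would be indexed
        "analysis.query": query_text,        # text as it would be queried
        "wt": "json",
    }
    return base.rstrip("/") + "/analysis/field?" + urlencode(params)


# Hypothetical local core; adjust to the actual installation.
base_url = "http://localhost:8983/solr/collection1"
index_word = "\u0623\u0628\u0631\u0627"  # alef-with-hamza spelling
query_word = "\u0627\u0628\u0631\u0627"  # plain-alef spelling
url = analysis_url(base_url, "text_ar", index_word, query_word)
print(url)
```

Fetching that URL should return the token stream after each analysis stage for both the index-time and query-time text, which makes it easy to see where two spellings diverge.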

So, what exactly is the problem you are having?  What specifically doesn’t work?

FYI, in general you should be using the most recent release of Solr (6.4.0 right now) unless there are reasons why you can't.  It’s the most stable/performant/supported version.

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Dear Steve,

Thanks for investigating our problem. Our project is basically a business directory search platform, and we have details for more than 100K businesses. I’m providing some examples of Arabic words to reproduce the problem. Please find attached a Word file where I explained everything, along with screenshots.

arabicSearch.docx <http://lucene.472066.n3.nabble.com/file/n4318227/arabicSearch.docx>

Regarding upgrading to the latest version: our project runs on Java 1.7, and to upgrade we would also have to upgrade Java, the JBoss application server, and so on. This is not the right time for that activity.

Re: Arabic words search in solr

sarowe
Hi Mohan,

I ran your Case #1 through Solr 4.9.0’s Admin UI Analysis pane, and I can see that the analyzer for the “text_ar” field type does not remove all diacritics:

Indexed original: المؤسسة التجارية العمانية
Indexed analyzed: مؤسس تجار عمان

Query original: الموسسة التجارية
Query analyzed: موسس تجار

The analyzed query terms are the same as the first two analyzed indexed terms, with one exception: the hamza on the waw in the analyzed indexed term “مؤسس” was not stripped off by the analyzer, and so won’t match the analyzed query term “موسس”, which was entered by the user without the hamza.

Adding ICUFoldingFilterFactory to the “text_ar” field type fixed case #1 for me by stripping the hamza from the waw.  You can read more about this filter in the Solr Reference Guide (yes, this is basically for Solr 6.4, but I don’t think this functionality has changed between 4.9 and 6.4): <https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ICUFoldingFilter>.  If you do this, you can remove the LowerCaseFilterFactory since ICUFoldingFilterFactory performs lowercasing as part of its work.
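
To make the hamza stripping concrete, here is a small Python sketch. It is not the ICU folding filter, which does much more. It only mimics the one effect described above by decomposing characters (Unicode NFD) and dropping combining marks, and the helper name `strip_marks` is invented for this example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose characters, then drop nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Analyzed indexed term from Case #1: meem, waw with hamza above, seen, seen.
indexed_term = "\u0645\u0624\u0633\u0633"
# Analyzed query term: the same letters with a plain waw.
query_term = "\u0645\u0648\u0633\u0633"

print(indexed_term == query_term)               # False
print(strip_marks(indexed_term) == query_term)  # True
```

Waw with hamza above (U+0624) decomposes into plain waw plus a combining hamza, so removing the combining mark leaves the plain letter that the user typed.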

Note that to use ICUFoldingFilterFactory you must add three jars to the lib/ directory in your solr home dir.  Here’s how I did it:

$ mkdir example/solr/lib
$ cp dist/solr-analysis-extras-4.9.0.jar example/solr/lib/
$ cp contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.9.0.jar example/solr/lib/
$ cp contrib/analysis-extras/lib/icu4j-53.1.jar example/solr/lib/

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Hi Steve,

Thanks for your continued investigation of this issue.

I added the ICU folding filter to the schema.xml file and re-indexed all the data. I noticed some improvement in search, but it is still not really as expected.

Below is the changed configuration in the schema file:

-----------------
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
-----------------

I attached the document for your reference; the cases highlighted in red are not working as expected.

I have also raised one point regarding jQuery autocomplete with unique records. Kindly let me know if you have any background on how to implement it.

arabicSearch.docx <http://lucene.472066.n3.nabble.com/file/n4319436/arabicSearch.docx>


 

Re: Arabic words search in solr

sarowe
Hi Mohan,

I haven’t looked at the latest problems, but the ICU folding filter should be the last filter, to allow the Arabic normalization and stemming filters to see the original words.

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Hi Steve,

Any update on this? I am waiting for your input.

Re: Arabic words search in solr

sarowe
Hi Mohan,

Did you change the order of the filters as I suggested?

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Hi Steve,

As per your suggestion, I added ICUFoldingFilterFactory in schema.xml as below:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

I attached the document with the expected results earlier in this thread for your reference.

Kindly check and let me know.

Thanks

Re: Arabic words search in solr

sarowe
Hi Mohan,

When I said “the ICU folding filter should be the last filter, to allow the Arabic normalization and stemming filters to see the original words”, I meant that no filter should follow it.

You did not make that change.

Here’s what I mean:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Hi Steve,

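I changed the order of the ICU folding filter and re-indexed the entire Arabic content, but the problem is still present. I am not able to get the expected results.

I attached screenshots for your reference.
<http://lucene.472066.n3.nabble.com/file/n4321397/Solr_Admin.png>
<http://lucene.472066.n3.nabble.com/file/n4321397/Solr_Admin%281%29.png>
<http://lucene.472066.n3.nabble.com/file/n4321397/Solr_Admin%282%29.png>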

Kindly check and let me know.

Thanks

Re: Arabic words search in solr

sarowe
Hi Mohan,

It looks to me like the example query should match, since the analyzed query terms look like a subset of the analyzed document terms.

Did you re-index your documents after you changed your schema?  If not, then the indexed documents won’t have the same terms as the ones you see on the Admin UI Analysis pane.

If you have re-indexed, and are still not getting matches you expect, please include textual examples of the remaining problems, so that I can copy/paste to reproduce the problem - I can’t copy/paste Arabic from images you pointed to.

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

mohanmca01
Hi Steve,

As per your suggestion, I added the ICU folding filter and re-indexed the entire Solr data, but I am still unable to get the expected results that I highlighted earlier.

I attached an Excel sheet with examples of Arabic names so that you can investigate and reproduce the issue.
Arabic_Characters2.xlsx

Thanks

Re: Arabic words search in solr

sarowe
Hi Mohan,

I indexed your 9 examples as simple documents after mapping dynamic field “*_ar” to the “text_ar” field type:

-----
[{"id":"1", "name_ar":"المؤسسة التجارية العمانية"},
{"id":"2", "name_ar":"شركة التأمين الأهلية ش.م.ع.م"},
{"id":"3", "name_ar":"شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"},
{"id":"4", "name_ar":"شركة ظفار للتأمين ش.م.ع.ع"},
{"id":"5", "name_ar":"طوارئ المستشفيات   - طوارئ مستشفى صحار"},
{"id":"6", "name_ar":"شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"},
{"id":"7", "name_ar":"المؤسسة التجارية العمانية"},
{"id":"8", "name_ar":"وزارة الصحة - المديرية العامة للخدمات الصحية  محافظة الداخلية -  - مستشفى إزكي (البدالة)  - الطوارئ"},
{"id":"9", "name_ar":"أسعار المكالمات الدولية - مونتسرات -  - مونتسرات"}]
-----

Then when I search from the Admin UI for “name_ar:شرطة ازكي” (the query in one of your screenshots with numFound=0) I get the following results:

-----
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": "name_ar:شرطة ازكي",
      "_": "1487912340325",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "id": "6",
        "name_ar": [
          "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"
        ],
        "_version_": 1560170434794619000
      },
      {
        "id": "3",
        "name_ar": [
          "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"
        ],
        "_version_": 1560170434793570300
      }
    ]
  }
}
-----
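
For anyone who wants to repeat this query outside the Admin UI, the request can be built along the following lines. This is only a sketch. The base URL, including the core name `collection1`, is an assumption about a local test install, and the function just constructs the URL without sending it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base: str, field: str, text: str) -> str:
    """Build a /select URL for a simple field:text query."""
    params = {"q": f"{field}:{text}", "wt": "json", "indent": "true"}
    return base.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local core; adjust to the actual installation.
base_url = "http://localhost:8983/solr/collection1"
query_text = "شرطة ازكي"
url = select_url(base_url, "name_ar", query_text)
print(url)
```

`urlencode` percent-encodes the Arabic text as UTF-8, so the URL can be passed to any HTTP client as is.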

So I cannot reproduce the failures you’re seeing.  In fact, I tried all 9 of the queries you listed as not working, and all of them matched at least one of the above 9 documents, except for case 5 (which I give details for below).  Are you absolutely sure that you reindexed your data with the ICUFF last?

The one query that didn’t return any matches for me is “name_ar:طوارى صحار”.  Here’s why:

Indexed original: طوارئ صحار
Indexed analyzed: طواري صحار

Query original: طوارى صحار
Query analyzed: طوار صحار

In the analyzed indexed form, the “ئ” (yeh with hamza above) is left intact by ArabicNormalizationFilter and ArabicStemFilter, and then the ICUFoldingFilter converts it to “ي” (yeh without the hamza).

In the analyzed query, ArabicNormalizationFilter converts “طوارى” to “طواري” (alef maksura->yeh), which ArabicStemFilter converts to “طوار” by removing the trailing yeh.
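
To make the mismatch concrete, here is a minimal Python sketch of the two analysis paths. It is not the Lucene code: `normalize`, `stem` and `fold` are hypothetical stand-ins that implement only the single rule each filter applies in this example, and `fold` merely approximates ICUFoldingFilter by decomposing characters and dropping combining marks.

```python
import unicodedata

ALEF_MAKSURA = "\u0649"  # ى
YEH = "\u064A"           # ي
YEH_HAMZA = "\u0626"     # ئ
STEM_BASE = "\u0637\u0648\u0627\u0631"  # طوار


def normalize(token):
    """Stand-in for ArabicNormalizationFilter: alef maksura -> yeh only."""
    return token.replace(ALEF_MAKSURA, YEH)


def stem(token):
    """Stand-in for ArabicStemFilter: strip one trailing yeh, keeping at least 2 characters."""
    if token.endswith(YEH) and len(token) >= 3:
        return token[:-1]
    return token


def fold(token):
    """Rough stand-in for ICUFoldingFilter: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def analyze(token):
    # Filter order as described above: normalization, stemming, folding last.
    return fold(stem(normalize(token)))


indexed_term = analyze(STEM_BASE + YEH_HAMZA)    # the indexed word, ending in yeh with hamza
query_term = analyze(STEM_BASE + ALEF_MAKSURA)   # the query word, ending in alef maksura

print(indexed_term == STEM_BASE + YEH)  # hamza folded away, yeh kept
print(query_term == STEM_BASE)          # yeh stripped by the stemmer
print(indexed_term == query_term)       # the two terms differ, so there is no match
```

The point of the sketch is the ordering: the stemmer sees a plain yeh only on the query side, because on the index side the hamza is not removed until the folding step, which runs after stemming.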

I don’t know what the correct thing to do is to make alef maksura and yeh match each other, but one possibility is adding a char filter that converts all alefs maksura into yehs with hamza, like this:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ى" replacement="ئ"/>

When I added the above to my “text_ar” field type and re-indexed, I got the following when I queried for “name_ar:طوارى صحار”:

-----
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "indent": "true",
      "q": "name_ar:طوارى صحار",
      "_": "1487915432177",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "id": "5",
        "name_ar": [
          "طوارئ المستشفيات   - طوارئ مستشفى صحار"
        ],
        "_version_": 1560192353894924300
      },
      {
        "id": "8",
        "name_ar": [
          "وزارة الصحة - المديرية العامة للخدمات الصحية  محافظة الداخلية -  - مستشفى إزكي (البدالة)  - الطوارئ"
        ],
        "_version_": 1560192353895972900
      }
    ]
  }
}
-----
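
As a rough illustration of why the char filter makes this query match, here is a minimal Python sketch under simplifying assumptions. The `char_filter` and `analyze` functions are hypothetical stand-ins, not the Solr implementation: they model only the alef maksura replacement, the alef maksura normalization, the trailing-yeh stemming rule, and an approximation of ICU folding.

```python
import unicodedata

ALEF_MAKSURA = "\u0649"  # ى
YEH = "\u064A"           # ي
YEH_HAMZA = "\u0626"     # ئ
STEM_BASE = "\u0637\u0648\u0627\u0631"  # طوار


def char_filter(text):
    """Mimics the suggested char filter: alef maksura -> yeh with hamza."""
    return text.replace(ALEF_MAKSURA, YEH_HAMZA)


def analyze(token):
    # The char filter runs first, before any token filter sees the text.
    token = char_filter(token)
    # Normalization stand-in: alef maksura -> yeh (nothing left to convert here).
    token = token.replace(ALEF_MAKSURA, YEH)
    # Stemming stand-in: strip a trailing plain yeh, keeping at least 2 characters.
    if token.endswith(YEH) and len(token) >= 3:
        token = token[:-1]
    # Folding stand-in: decompose, then drop combining marks such as hamza above.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


indexed_term = analyze(STEM_BASE + YEH_HAMZA)    # indexed word, ending in yeh with hamza
query_term = analyze(STEM_BASE + ALEF_MAKSURA)   # query word, ending in alef maksura

print(indexed_term == query_term)  # both sides now produce the same term
```

In this sketch the query word reaches the stemmer ending in yeh with hamza rather than a plain yeh, so the stemmer leaves it alone and the folding step turns both sides into the same term.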

--
Steve
www.lucidworks.com

Re: Arabic words search in solr

mohanmca01
Hi Steve,

Thank you for your continued investigation.

This has improved the search a little, but I am facing another issue: I am getting records that do not contain one of the words in my query.

Please note that you indexed only 9 records, whereas I shared more than 76 sample records with you (please refer to the Examples sheet of the earlier attachment Arabic_Characters2.xlsx); indexing all of them should let you reproduce the issue.

For example, I searched with (bizNameAr: شرطة ازكي) and got:

{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "indent": "true",
      "q": "bizNameAr: شرطة ازكي",
      "_": "1488089550104",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 4,
    "start": 0,
    "docs": [
      {
        "id": "82",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي",
        "_version_": 1560298301338681300
      },
      {
        "id": "63",
        "bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي",
        "_version_": 1560298301325049900
      },
      {
        "id": "56",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية  -  - مركز شرطة إبراء",
        "_version_": 1560298301319807000
      },
      {
        "id": "79",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء",
        "_version_": 1560298301335535600
      }
    ]
  }
}



The expected result is only:

        "id": "82",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"

because it contains both of the words in the query. The other records contain only one of them:

        "id": "63",
        "bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي"

It has only one word of the query (ازكي).

        "id": "56",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية  -  - مركز شرطة إبراء"

It has only one word of the query (شرطة).

        "id": "79",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"

It has only one word of the query (شرطة).

These 3 records should not appear in the results: the query contains 2 words, and only one record contains both of them.
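
The expectation described above amounts to requiring every query word to be present in a record, rather than any one of them. A minimal Python sketch of the difference, using the four returned records; the whitespace tokenization and the `fold` helper are simplifying assumptions for illustration, not what the Solr analyzer actually does:

```python
import unicodedata


def fold(text):
    """Rough hamza folding: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def terms(text):
    """Split on whitespace after folding; a crude stand-in for real analysis."""
    return set(fold(text).split())


docs = {
    "82": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي",
    "63": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي",
    "56": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية  -  - مركز شرطة إبراء",
    "79": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء",
}

query_terms = terms("شرطة ازكي")

# "Any word" matching: a record qualifies if it shares at least one term with the query.
any_match = [doc_id for doc_id, name in docs.items() if query_terms & terms(name)]

# "All words" matching: a record qualifies only if it contains every query term.
all_match = [doc_id for doc_id, name in docs.items() if query_terms <= terms(name)]

print(any_match)
print(all_match)
```

Under these assumptions, any-word matching returns all four records, which is what the response above shows, while all-words matching keeps only record 82.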


I would suggest that we give you a live demo on our system together with my Arab colleague, so that the issue is clearer to you. Let us know if we can do that.

Thanks

Re: Arabic words search in solr

mohanmca01
Hi Steve,

Any update on this?