Chinese language search in SOLR 3.6.1

Poornima Jay
Hi,

Has anyone faced a problem with Chinese-language search in Solr 3.6.1? Below is the analyzer from my schema.xml file.

<fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ChineseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.ChineseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

It works fine with Chinese strings, but searching by product code or ISBN does not work, even though those fields are defined as string.

Please let me know how the Chinese schema should be configured.

Thanks.
Poornima
Re: Chinese language search in SOLR 3.6.1

Rajinimaski
Hi Poornima,

Your statement that it works with Chinese strings but not with product code or ISBN, even though the fields are defined as string, is confusing.

Did you mean that the product code and ISBN fields are of type text_chinese?

Is it the first or the second?
<field name="product_code" type="string" indexed="true" stored="false"/>
or
<field name="product_code" type="text_chinese" indexed="true" stored="false"/>

What do you mean when you say it's not working? Are you unable to search?

On Tue, Oct 22, 2013 at 6:09 PM, Poornima Jay <[hidden email]> wrote:
> [original message quoted in full; see above]
Re: Chinese language search in SOLR 3.6.1

Poornima Jay
Hi Rajani,

Below is what is configured in my schema.
<fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ChineseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ChineseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ChineseTokenizerFactory"/>
    <filter class="solr.ChineseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="product_code" type="string" indexed="true" stored="false" multiValued="true" />

<field name="author_name" type="text_chinese" indexed="true" stored="false" multiValued="true"/>

<field name="author_name_string" type="string" indexed="true" stored="false" multiValued="true" />

<field name="simple" type="text_chinese" indexed="true" stored="false" multiValued="true" />

<copyField source="product_code" dest="simple" />

<copyField source="author_name" dest="author_name_string" />


If I search with the query q=simple:总评价 it works, but it doesn't work if I search with q=simple:676767667. If the field is defined as string, the Chinese characters work, but the search doesn't work if the field is defined as text_chinese.

Regards,
Poornima

On Tuesday, 22 October 2013 7:52 PM, Rajani Maski <[hidden email]> wrote:
> [earlier messages quoted in full; see above]
Re: Chinese language search in SOLR 3.6.1

Rajinimaski
A string field will work in any case when you do an exact-key search.
text_chinese should also work if you are simply searching with the exact
string "676767667".

The best way to find an answer to this question is to use the Solr
analysis tool: http://localhost:8983/solr/#/collection1/analysis
Enter your field type, the index-time input you supplied, and the query
value you are searching for.

You should be able to find your answers.
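
The same check can also be run over HTTP rather than through the admin page. The fragment below is a minimal sketch, assuming the field analysis handler from the Solr 3.x example solrconfig.xml is registered; the host, port, and single-core path in the sample request are assumptions for a default local install. On Solr 3.6.1 the admin UI path may also differ from the one given above, since the 3.x admin pages are JSP-based (for example admin/analysis.jsp).

```xml
<!-- solrconfig.xml: registers the field analysis handler (present in the
     Solr 3.x example configuration; add it only if your config lacks it). -->
<requestHandler name="/analysis/field"
                class="solr.FieldAnalysisRequestHandler"
                startup="lazy"/>

<!-- Sample request (URL is an assumption for a default local, single-core setup):
     http://localhost:8983/solr/analysis/field?analysis.fieldtype=text_chinese&analysis.fieldvalue=676767667&analysis.query=676767667&analysis.showmatch=true

     The response lists the tokens left after each tokenizer and filter stage,
     for both the index-time value and the query value. -->
```

If the numeric token appears after the tokenizer stage but is gone after a later filter, that filter is the one to look at.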





On Tue, Oct 22, 2013 at 8:06 PM, Poornima Jay <[hidden email]> wrote:
> [earlier messages quoted in full; see above]
Re: Chinese language search in SOLR 3.6.1

Poornima Jay
Hi Rajani,

The string field type is not analyzed. That is not the case for the text_chinese field type, for which ChineseTokenizerFactory and ChineseFilterFactory are added for index and query analysis. Please check the schema and the field definitions in my mail above.

Thanks,
Poornima
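
One possible explanation for the symptom reported in this thread (q=simple:总评价 matches, q=simple:676767667 does not) is the ChineseFilterFactory itself: the Lucene documentation for ChineseFilter states that numeric tokens are removed, so a purely numeric product code copied into a text_chinese field would be dropped at both index and query time. This is not confirmed anywhere in the thread. The fragment below is only a sketch of a field type without that filter; the type name text_chinese_codes is hypothetical, and the field and copyField names are taken from the schema posted above.

```xml
<!-- Sketch only: text_chinese_codes is a hypothetical name, not from the thread. -->
<fieldType name="text_chinese_codes" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CJKTokenizerFactory emits overlapping bigrams for CJK text and, as far
         as I know, keeps a run of Latin letters or digits as a single token. -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <!-- ChineseFilterFactory is deliberately omitted: Lucene's ChineseFilter
         is documented as removing numeric tokens. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The catch-all field from the thread, switched to the sketched type. -->
<field name="simple" type="text_chinese_codes" indexed="true" stored="false" multiValued="true"/>
<copyField source="product_code" dest="simple"/>
```

Any change to the index-time analyzer requires reindexing before it affects search results. Codes containing punctuation, such as hyphenated ISBNs, may still be split into several tokens, so running them through the analysis tool first is worthwhile.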



On Wednesday, 23 October 2013 7:21 AM, Rajani Maski <[hidden email]> wrote:
> [earlier messages quoted in full; see above]