[solr8.7] not relevant results for chinese query

Bruno Mannina
Hello,

I am trying to use Chinese with my index.

My definition is:

<field name="tizh" type="text_zh" multiValued="true" indexed="true"
       stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Simplified Chinese -->
<!-- BRUNO -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

 

However, I get too many irrelevant results.

For example, with the query (phone case):

tizh:(手機殼)

my query is translated to:

tizh:(手 OR 機 OR 殼)

But:

tizh:(手 AND 機 AND 殼)

returns 0 results.

And:

tizh:"手機殼"

also returns 0 results.

Is it possible to improve my fieldType, or must I add something else?

Thanks,
Bruno
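
One possible reading of the symptoms above, offered as an assumption rather than a confirmed diagnosis: 手機殼 is written in Traditional characters, while solr.HMMChineseTokenizerFactory is documented for Simplified Chinese, so words it does not recognize may end up as single-character tokens. A minimal sketch of a field type built on the ICU tokenizer, which the Solr Reference Guide describes as suitable for Traditional Chinese, could look like this. The name text_zh_icu is hypothetical, and the ICU classes assume the analysis-extras contrib jars are on the classpath.

```xml
<!-- Hypothetical field type name; not from the original post. -->
<!-- Assumes the analysis-extras contrib (ICU) jars are loaded. -->
<fieldType name="text_zh_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Dictionary-based word segmentation for CJK text -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Fold full-width and half-width character variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercase any embedded Latin text -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After any analyzer change the documents have to be reindexed. The Analysis screen in the Solr admin UI shows how 手機殼 is tokenized at index time and at query time, which is a quick way to check whether the two sides produce matching terms.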

 



RE: [solr8.7] not relevant results for chinese query

Bruno Mannina
Hi,

After reading this article ( https://opensourceconnections.com/blog/2011/12/23/indexing-chinese-in-solr/ ), I am beginning to understand what is happening.

Has anyone already tried the Paoding algorithm with a recent Solr?


Thanks,
Bruno
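
Independently of Paoding, a dictionary-free option is to index overlapping character bigrams, so that 手機殼 is indexed as 手機 and 機殼 instead of three single characters. The sketch below follows the filter chain of the text_cjk type in Solr's default configset, as far as I recall it. The name text_cjk_bigram is hypothetical, and whether bigrams improve relevance for this particular index is an assumption to verify.

```xml
<!-- Hypothetical field type name; not from the original post. -->
<fieldType name="text_cjk_bigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emits each Han character as its own token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Fold full-width and half-width character variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercase any embedded Latin text -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Combine adjacent CJK characters into overlapping bigrams -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at query time, a query such as tizh:(手機殼) would then look up the bigrams rather than the individual characters. The field has to be reindexed after switching types.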

-----Original Message-----
From: Bruno Mannina [mailto:[hidden email]]
Sent: Sunday, January 10, 2021 17:57
To: [hidden email]
Subject: [solr8.7] not relevant results for chinese query
