Solr search engine configuration

PeterKerk
Since Google's on-site search reaches end of life on April 1, 2018, I'm trying
to set up my own on-site search engine that indexes my site's content and
makes it searchable.

My data config successfully loads data from my database (products,
companies, blogs) into the fields.

I then try to search in both the title and the description fields with
weights. Now for example when users search on "dieren" (this means "animals"
in Dutch):

&q=(title_search_global:(dieren) OR
description_search_global:(dieren))&qf=title_search_global+title_exactmatch^1000+description_search_global+description_exactmatch^100

I get results with "dieren", "huisdieren", but I also get undesired results
with "manieren" and "versieren".

What I want is to find text using the following logic (all case
insensitive):


Exact match "dieren" boost result with 1000
Partial match "huisdieren" boost result with 500
Stem match "dier" boost result with 100
Stem partial match "huisdier" boost result with 70
Other partial matches "die" boost result with 10

My current schema.xml is here: http://www.telefonievergelijken.nl/schema.xml
I tried the solr admin tool for tokenization, but I can't figure out how to
get to the above logic.
I also Googled for an example Solr schema.xml configuration for building
your own search engines and I'm really surprised there's nothing out there.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr search engine configuration

Erick Erickson
I think you're mixing two different parsers.

The "qf" parameter is only relevant if you _are_ using edismax. To use
edismax, either specify defType=edismax on your query or set it up as
the default for, say, the "/select" handler in solrconfig.xml. If you
want to use edismax, your query could look something like
q=dieren&defType=edismax&qf=title_search_global
title_exactmatch^1000 description_search_global
description_exactmatch^100

On the other hand, if you don't want to use edismax, your query would
have to look something like:
q=title_search_global:dieren title_exactmatch:dieren^1000
description_search_global:dieren description_exactmatch:dieren^100

This is guessing a bit, but if you add &debug=query to your URL,
you'll see the parsed query, which can be very useful in figuring out
exactly what Solr thinks the query is.
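
A minimal Python sketch of the edismax form of the request, with debug=query switched on. The host, port and "search-global" collection are assumptions borrowed from the localhost URL that appears elsewhere in this thread, and the boost values are only the ones from the question, not a recommendation.

```python
from urllib.parse import urlencode

# Assumed endpoint: adjust host, collection and handler to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/search-global/select"


def edismax_url(user_query: str) -> str:
    """Build an edismax request: plain terms in q, fields and boosts in qf."""
    params = {
        "q": user_query,        # no field prefixes; edismax spreads q over qf
        "defType": "edismax",
        "qf": "title_search_global title_exactmatch^1000 "
              "description_search_global description_exactmatch^100",
        "debug": "query",       # asks Solr to return the parsed query
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(edismax_url("dieren"))
```

urlencode takes care of escaping the spaces and the ^ characters, which is easy to get wrong when the URL is assembled by hand.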

Best,
Erick

Re: Solr search engine configuration

PeterKerk
Thanks! That gives me some more insight. I changed the search query
to "dieren zaak" to see how queries consisting of more than one word are
handled.
I see that words are broken into grams of three or more characters, I think
because of my NGramFilterFactory with minGramSize of 3.

<lst name="debug">
        <str name="rawquerystring">
        (title_search_global:(dieren zaak) OR description_search_global:(dieren
zaak))
        </str>
        <str name="querystring">
        (title_search_global:(dieren zaak) OR description_search_global:(dieren
zaak))
        </str>
        <str name="parsedquery">
        (+(((title_search_global:die title_search_global:ier
title_search_global:ere title_search_global:ren title_search_global:dier
title_search_global:iere title_search_global:eren title_search_global:diere
title_search_global:ieren title_search_global:dieren)
(title_search_global:zaa title_search_global:aak title_search_global:zaak))
(((description_search_global:dier description_search_global:diere
description_search_global:dieren)/no_coord)
description_search_global:zaak)))/no_coord
        </str>
        <str name="parsedquery_toString">
        +(((title_search_global:die title_search_global:ier title_search_global:ere
title_search_global:ren title_search_global:dier title_search_global:iere
title_search_global:eren title_search_global:diere title_search_global:ieren
title_search_global:dieren) (title_search_global:zaa title_search_global:aak
title_search_global:zaak)) ((description_search_global:dier
description_search_global:diere description_search_global:dieren)
description_search_global:zaak))
        </str>
        <str name="QParser">ExtendedDismaxQParser</str>
        <null name="altquerystring"/>
        <null name="boost_queries"/>
        <arr name="parsed_boost_queries"/>
        <null name="boostfuncs"/>
        <arr name="filter_queries">
                <str>(lang:"nl" OR lang:"all")</str>
        </arr>
        <arr name="parsed_filter_queries">
                <str>lang:nl lang:all</str>
        </arr>
</lst>
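
The gram expansion in the parsed query above also accounts for the unwanted "manieren" and "versieren" hits: those words share several grams with "dieren". The Python sketch below only imitates what an n-gram filter produces. The maximum gram size of 6 is an assumption, chosen because the longest gram in the debug output is the six-letter "dieren".

```python
def ngrams(token: str, min_size: int = 3, max_size: int = 6) -> list[str]:
    """Return every substring of token that is min_size..max_size long."""
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


query_grams = set(ngrams("dieren"))
for word in ("huisdieren", "manieren", "versieren"):
    # Any shared gram is enough for the document to match the query.
    shared = sorted(query_grams & set(ngrams(word)), key=lambda g: (len(g), g))
    print(word, shared)
```

Because a single shared gram such as "ier" is enough for a match, gramming widens recall far beyond what the requirements list asks for.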


I tried the query with and without the &defType=edismax parameter but I'm
getting the EXACT same results. Does that mean some configuration error?

I'm not sure how to progress from here. Can you see if your presumption that
I'm mixing two different parsers is correct? My schema.xml is here:
http://www.telefonievergelijken.nl/schema.xml


Related: do you know of the existence of any sample schema.xml config that
would be usable for a search engine? Seems like something so obvious to
float around out there. I feel that would go a long way.



Not sure if it matters but my requirements are:

Exact match "dieren zaak" boost result with 1000
Exact match "dierenzaak" boost result with 900
Exact match "dieren" or "zaak" boost result with 600

Partial match "huisdierenzaak" or "huisdieren zaak" boost result with 500
Stem match "dier" boost result with 100
Stem partial match "huisdier" boost result with 70
Other partial matches "die" boost result with 10




Re: Solr search engine configuration

Erick Erickson
bq: I tried the query with and without the &defType=edismax parameter but I'm
getting the EXACT same results. Does that mean some configuration error?

Well, not an error at all, this line:
 <str name="QParser">ExtendedDismaxQParser</str>

Means you're using edismax. If that happens both with or without
&defType, that means
that your request handler in solrconfig.xml has this defined as a
default. Look for the
entry like:

<requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
         <str name="defType">edismax</str>

So any search you send to Solr like
http://blah blah/solr/collection/select?

will use edismax if no defType overrides it on the URL.

-------
Let's talk about what "exact match" means ;)


Exact match "dieren zaak". Does "Exact match" here mean it would or
would not be an exact match on "dieren zaak somethingelse"?

If you do NOT consider the above an "exact match", the usual trick is to
use a copyField directive to a field that uses KeywordTokenizerFactory
(probably) followed by LowerCaseFilterFactory etc.
KeywordTokenizerFactory takes the entire input field as a _single_
token, which you can then transform in various ways, such as folding
accents or lowercasing, if desired.
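
To illustrate the difference, here is a small Python sketch that imitates the two analysis chains. It is not Solr code: the function names are made up for this example, and it ignores that a real query parser may split the input on whitespace before analysis, so in practice the query would typically have to reach the keyword field as one quoted phrase.

```python
def keyword_lowercase(value: str) -> list[str]:
    """Imitate KeywordTokenizer + LowerCaseFilter: the whole value is one token."""
    return [value.lower()]


def whitespace_lowercase(value: str) -> list[str]:
    """Imitate WhitespaceTokenizer + LowerCaseFilter: one token per word."""
    return value.lower().split()


def matches(query: str, field_value: str, analyze) -> bool:
    """A document matches when every analyzed query token occurs in the field."""
    field_tokens = set(analyze(field_value))
    return all(token in field_tokens for token in analyze(query))


for title in ("Dieren Zaak", "dieren zaak somethingelse"):
    print(title,
          matches("dieren zaak", title, keyword_lowercase),
          matches("dieren zaak", title, whitespace_lowercase))
```

With the keyword chain only the first title matches, because the second one is indexed as the single token "dieren zaak somethingelse".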

If you DO consider the above an "exact match", take a look at the pf, pf2
and pf3 parameters in edismax. They boost documents in which the query
terms occur close together: pf uses the whole query as a phrase, while
pf2 and pf3 use word bigrams and trigrams respectively.
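
A minimal sketch of how these parameters might be passed, written as a Python helper that builds the query string. The parameter names (pf, pf2, pf3, ps) are edismax parameters; the field name is the one used in this thread, and the boost values are arbitrary assumptions that would need tuning.

```python
from urllib.parse import urlencode


def phrase_boost_params(user_query: str,
                        field: str = "title_search_global") -> dict:
    """Return edismax parameters that reward query terms appearing together."""
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": field,              # fields searched for the individual terms
        "pf": f"{field}^1000",    # boost when all terms form one phrase
        "pf2": f"{field}^600",    # boost for each adjacent pair of terms
        "pf3": f"{field}^300",    # boost for each adjacent triple of terms
        "ps": "0",                # phrase slop for pf: 0 means strictly adjacent
    }


print(urlencode(phrase_boost_params("dieren zaak")))
```

These parameters only affect ranking. Which documents match at all is still decided by q and qf, and on a stemmed field the phrase is compared stem by stem.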

Exact match "dierenzaak". This one is tricky. There's little out of the
box that understands that "dieren zaak" is equivalent to "dierenzaak". I
know that for German there's prior art on "decompounding" filters; I
don't know about Dutch. Further, given my total lack of understanding of
the rules of either language, I don't know if it does "compounding" too,
i.e. understanding that "dieren zaak" is equivalent to "dierenzaak".
Can't help much there.

For a start I'd get rid of the gramming until I'd explored other
alternatives. Gramming is generally a good thing for pre-and-post
wildcards, i.e. matching *some*. Since you're concerned with
relevance, I suspect that gramming will make your task harder.

And if you haven't discovered the admin UI/analysis page, I recommend
you spend some time with it (hint, un-check the "verbose" checkbox).
As you play with various combinations of tokenizers and filters it'll
give you a much better understanding of what the effects of various
combinations are.

If only human language followed strict rules ;)

Professor: "In English, two negatives are allowed and mean a positive,
but two positives don't mean a negative."
Bored voice from the back: "Yeah, right".

Erick

Re: Solr search engine configuration

PeterKerk
Sorry for this lengthy post, but I wanted to be complete.

The only occurrence of edismax in solrconfig.xml is this one:

<requestHandler name="/scoresearch" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="qf">double_score</str>
    <str name="debug">false</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
       
I don't have a requestHandler named "/select".


Also, removing the gramming definitely helped! :-)

I tried to simplify my setup first and then expand, so what I have now is
this:

       
<fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp" protected="protwords_nl.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp" protected="protwords_nl.txt"/>
  </analyzer>
</fieldType>

<field name="title_search_global" type="searchtext_nl" indexed="true" stored="true"/>
       
In my database I have these 4 values for "title" that populate
"title_search_global"
       
"Hi there dier something else"
"Hi there dieren zaak something else"
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

ps. "dier" is singular of plural "dieren".

Using this query:
http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true

These results are found:
"Hi there dier something else"
"Hi there dieren zaak something else"

And these are NOT:
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

I'd expect it to be fairly easy (although I don't know how) to also
include the result "dierenzaak" by compounding the two query terms. And yes,
you are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak".
I'm not sure what logic would also include "dierzaak".

Regarding your question: yes, I do consider "dieren zaak somethingelse" an
exact match of "dieren zaak".
So I also checked the usage of pf parameters with edismax (based on these
links:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
And also for dismax:
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter

But I can't find any examples of how to actually use these parameters.


The search results, including debug info, are here:


<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">7</int>
        <lst name="params">
            <str name="q">title_search_global:(dieren zaak)</str>
            <str name="defType">edismax</str>
            <str name="debug">true</str>
            <str name="indent">true</str>
            <str name="qf">title_search_global</str>
            <str name="fl">id,title</str>
            <str name="fq">(lang:"nl" OR lang:"all")</str>
            <str name="wt">xml</str>
            <str name="lowercaseOperators">true</str>
            <str name="stopwords">true</str>
        </lst>
    </lst>
    <result name="response" numFound="2" start="0">
        <doc>
            <str name="title">dieren zaak</str>
            <str name="id">115_3699638</str>
        </doc>
        <doc>
            <str name="title">dier</str>
            <str name="id">115_3699637</str>
        </doc>
    </result>
    <lst name="debug">
        <str name="rawquerystring">title_search_global:(dieren zaak)</str>
        <str name="querystring">title_search_global:(dieren zaak)</str>
        <str name="parsedquery">
(+(title_search_global:dier title_search_global:zaak))/no_coord
</str>
        <str name="parsedquery_toString">
+(title_search_global:dier title_search_global:zaak)
</str>
        <lst name="explain">
            <str name="115_3699638">
5.489122 = (MATCH) sum of: 2.4387078 = (MATCH)
weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of: 0.66654336 =
queryWeight, product of: 5.8539815 = idf(docFreq=3, maxDocs=513) 0.113861546
= queryNorm 3.6587384 = fieldWeight in 51, product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 5.8539815 = idf(docFreq=3, maxDocs=513)
0.625 = fieldNorm(doc=51) 3.050414 = (MATCH) weight(title_search_global:zaak
in 51) [DefaultSimilarity], result of: 3.050414 = score(doc=51,freq=1.0 =
termFreq=1.0 ), product of: 0.7454662 = queryWeight, product of: 6.5471287 =
idf(docFreq=1, maxDocs=513) 0.113861546 = queryNorm 4.091955 = fieldWeight
in 51, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
6.5471287 = idf(docFreq=1, maxDocs=513) 0.625 = fieldNorm(doc=51)
</str>
            <str name="115_3699637">
1.9509662 = (MATCH) product of: 3.9019325 = (MATCH) sum of: 3.9019325 =
(MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result
of: 3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0 ), product of:
0.66654336 = queryWeight, product of: 5.8539815 = idf(docFreq=3,
maxDocs=513) 0.113861546 = queryNorm 5.8539815 = fieldWeight in 50, product
of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.8539815 =
idf(docFreq=3, maxDocs=513) 1.0 = fieldNorm(doc=50) 0.5 = coord(1/2)
</str>
            <str name="110_141">
0.9754831 = (MATCH) product of: 1.9509662 = (MATCH) sum of: 1.9509662 =
(MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result
of: 1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0 ), product of:
0.66654336 = queryWeight, product of: 5.8539815 = idf(docFreq=3,
maxDocs=513) 0.113861546 = queryNorm 2.9269907 = fieldWeight in 132, product
of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.8539815 =
idf(docFreq=3, maxDocs=513) 0.5 = fieldNorm(doc=132) 0.5 = coord(1/2)
</str>
        </lst>
        <str name="QParser">ExtendedDismaxQParser</str>
        <null name="altquerystring" />
        <null name="boost_queries" />
        <arr name="parsed_boost_queries" />
        <null name="boostfuncs" />
        <arr name="filter_queries">
            <str>(lang:"nl" OR lang:"all")</str>
        </arr>
        <arr name="parsed_filter_queries">
            <str>lang:nl lang:all</str>
        </arr>
        <lst name="timing">
            <double name="time">7.0</double>
            <lst name="prepare">
                <double name="time">4.0</double>
                <lst name="query">
                    <double name="time">4.0</double>
                </lst>
                <lst name="facet">
                    <double name="time">0.0</double>
                </lst>
                <lst name="mlt">
                    <double name="time">0.0</double>
                </lst>
                <lst name="highlight">
                    <double name="time">0.0</double>
                </lst>
                <lst name="stats">
                    <double name="time">0.0</double>
                </lst>
                <lst name="debug">
                    <double name="time">0.0</double>
                </lst>
            </lst>
            <lst name="process">
                <double name="time">3.0</double>
                <lst name="query">
                    <double name="time">0.0</double>
                </lst>
                <lst name="facet">
                    <double name="time">0.0</double>
                </lst>
                <lst name="mlt">
                    <double name="time">0.0</double>
                </lst>
                <lst name="highlight">
                    <double name="time">0.0</double>
                </lst>
                <lst name="stats">
                    <double name="time">0.0</double>
                </lst>
                <lst name="debug">
                    <double name="time">3.0</double>
                </lst>
            </lst>
        </lst>
    </lst>
</response>


PS. had to laugh out loud about that professor joke :-D



RE: Solr search engine configuration

Markus Jelsma-2
Hi,

Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to solve all problems either. For example, molens => molen, but molen => mool, and there are many more like that. You can solve this by adding manual rules to StemmerOverrideFilter, but due to the compound nature of words, you would need to add rules for all the mills.
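
The idea of an override table can be sketched in a few lines of Python. This is not the Kraaij-Pohlmann algorithm: toy_stem is a made-up stand-in that only reproduces the two outputs quoted above and assumes compounds ending in "molen" behave the same way, and the override entries are hypothetical.

```python
def toy_stem(token: str) -> str:
    """Stand-in stemmer: molens -> molen, but molen -> mool (as quoted above)."""
    if token.endswith("molens"):
        return token[:-1]
    if token.endswith("molen"):
        return token[:-len("molen")] + "mool"
    return token


# Hypothetical override table: word -> stem that must be kept as-is.
OVERRIDES = {"molen": "molen", "windmolen": "windmolen"}


def stem_with_overrides(token: str) -> str:
    """Overrides win; every other token still goes through the stemmer."""
    if token in OVERRIDES:
        return OVERRIDES[token]
    return toy_stem(token)


for word in ("molens", "molen", "windmolen", "watermolen"):
    print(word, "->", stem_with_overrides(word))
```

"watermolen" has no entry in the table and still comes out as "watermool", which is the maintenance burden described above: every compound needs its own rule.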

Regarding the compounds, Dutch is (more or less) just another Germanic language and uses compounds just like German, Swedish etc. To deal with that you can try the vanilla HyphenationCompoundWordTokenFilter (or something like that). Be sure not to set minWordSize too low, or you'll get plenty of bad results. The major drawback of this token filter is that it emits overlapping terms, and it may not always work with compounds of which the first part is a plural, such as dierenzaak or scholierenkorting.
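
To show what "emits overlapping terms" means, here is a rough Python sketch of decompounding. It uses a plain dictionary lookup rather than the hyphenation patterns the filter named above relies on, so it is a simplification. The word list and the size limits are assumptions for illustration.

```python
# Hypothetical subword dictionary; a real setup would load a Dutch word list.
SUBWORDS = {"dier", "dieren", "zaak", "huis", "scholieren", "korting"}


def decompound(token: str, min_word_size: int = 8,
               min_subword_size: int = 4) -> list[str]:
    """Return the token itself plus every dictionary word found inside it."""
    tokens = [token]
    if len(token) < min_word_size:
        return tokens  # short words are never split
    for start in range(len(token)):
        for end in range(start + min_subword_size, len(token) + 1):
            part = token[start:end]
            if part in SUBWORDS and part != token:
                tokens.append(part)
    return tokens


print(decompound("dierenzaak"))
print(decompound("huisdier"))
print(decompound("dieren"))
```

"dierenzaak" yields both "dier" and "dieren" next to "zaak", so a query for any of them will hit the compound. Lowering the size limits makes such overlaps, and false matches, more frequent.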

Also add an accent-folding filter (ASCIIFoldingFilterFactory, for example) or an ICU normalizer to get rid of accents, or you may have trouble finding a café.
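
What accent folding does to a token can be shown with Python's standard library. This is only an approximation of the Solr filters: it decomposes each character and drops the combining marks, which covers common Western European accents.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("café", "Curaçao", "dieren"):
    print(word, "->", fold_accents(word))
```

Applying the same folding at index time and at query time is what lets a user who types "cafe" find a document that says "café".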

Regards,
Markus
 
-----Original message-----

> From:PeterKerk <[hidden email]>
> Sent: Sunday 11th March 2018 23:55
> To: [hidden email]
> Subject: Re: Solr search engine configuration
>
> Sorry for this lengthy post, but I wanted to be complete.
>
> The only occurence of edismax in solrconfig.xml is this one:
>
> <requestHandler name="/scoresearch" class="solr.SearchHandler"
> default="true">
>  
> <lst name="defaults">
>  <str name="defType">edismax</str>    
>  <str name="echoParams">explicit</str>
>  <int name="rows">10</int>
>
>  <str name="qf">double_score</str>
>  <str name="debug">false</str>
>  <str name="q.alt">*:*</str>
> </lst>
> </requestHandler>
>
> I don't have a requestHandler named "/select".
>
>
> Also, removing the gramming definitely helped! :-)
>
> I tried to simplify my setup first and then expand, so what I have now is
> this:
>
>
> <fieldType name="searchtext_nl" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>      
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_nl.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" language="Kp"
> protected="protwords_nl.txt"></filter>
>
>
>       </analyzer>
>       <analyzer type="query">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>  
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_nl.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" language="Kp"
> protected="protwords_nl.txt"></filter>
>
>
>       </analyzer>
>     </fieldType>
>
> <field name="title_search_global" type="searchtext_nl" indexed="true"
> stored="true"/>
>
> In my database I have these 4 values for "title" that populate
> "title_search_global"
>
> "Hi there dier something else"
> "Hi there dieren zaak something else"
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"
>
> ps. "dier" is singular of plural "dieren".
>
> Using this query:
> http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true
>
> These results are found:
> "Hi there dier something else"
> "Hi there dieren zaak something else"
>
> And these are NOT:
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"
>
> I'd expect it should be fairly easy (although I don't know how) to also
> include result "dierenzaak", by compounding the 2 query values. And yes you
> are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not
> sure what logic would also include "dierzaak"
>
> Regarding your question: yes, I do consider "dieren zaak soemthingelse" an
> exact match of "dieren zaak"
> So I also checked the usage of pf parameters with edismax (based on these
> links:
> https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
> http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
> And also for dismax:
> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter
>
> But I can't find any examples how to actually use these parameters?
>
>
> The search results, including debug info is here:
>
>
> <response>
>     <lst name="responseHeader">
>         <int name="status">0</int>
>         <int name="QTime">7</int>
>         <lst name="params">
>             <str name="q">title_search_global:(dieren zaak)</str>
>             <str name="defType">edismax</str>
>             <str name="debug">true</str>
>             <str name="indent">true</str>
>             <str name="qf">title_search_global</str>
>             <str name="fl">id,title</str>
>             <str name="fq">(lang:"nl" OR lang:"all")</str>
>             <str name="wt">xml</str>
>             <str name="lowercaseOperators">true</str>
>             <str name="stopwords">true</str>
>         </lst>
>     </lst>
>     <result name="response" numFound="2" start="0">
>         <doc>
>             <str name="title">dieren zaak</str>
>             <str name="id">115_3699638</str>
>         </doc>
>         <doc>
>             <str name="title">dier</str>
>             <str name="id">115_3699637</str>
>         </doc>
>     </result>
>     <lst name="debug">
>         <str name="rawquerystring">title_search_global:(dieren zaak)</str>
>         <str name="querystring">title_search_global:(dieren zaak)</str>
>         <str name="parsedquery">
> (+(title_search_global:dier title_search_global:zaak))/no_coord
> </str>
>         <str name="parsedquery_toString">
> +(title_search_global:dier title_search_global:zaak)
> </str>
>         <lst name="explain">
>             <str name="115_3699638">
> 5.489122 = (MATCH) sum of: 2.4387078 = (MATCH)
> weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
> 2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of: 0.66654336 =
> queryWeight, product of: 5.8539815 = idf(docFreq=3, maxDocs=513) 0.113861546
> = queryNorm 3.6587384 = fieldWeight in 51, product of: 1.0 = tf(freq=1.0),
> with freq of: 1.0 = termFreq=1.0 5.8539815 = idf(docFreq=3, maxDocs=513)
> 0.625 = fieldNorm(doc=51) 3.050414 = (MATCH) weight(title_search_global:zaak
> in 51) [DefaultSimilarity], result of: 3.050414 = score(doc=51,freq=1.0 =
> termFreq=1.0 ), product of: 0.7454662 = queryWeight, product of: 6.5471287 =
> idf(docFreq=1, maxDocs=513) 0.113861546 = queryNorm 4.091955 = fieldWeight
> in 51, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
> 6.5471287 = idf(docFreq=1, maxDocs=513) 0.625 = fieldNorm(doc=51)
> </str>
>             <str name="115_3699637">
> 1.9509662 = (MATCH) product of: 3.9019325 = (MATCH) sum of: 3.9019325 =
> (MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result
> of: 3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0 ), product of:
> 0.66654336 = queryWeight, product of: 5.8539815 = idf(docFreq=3,
> maxDocs=513) 0.113861546 = queryNorm 5.8539815 = fieldWeight in 50, product
> of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.8539815 =
> idf(docFreq=3, maxDocs=513) 1.0 = fieldNorm(doc=50) 0.5 = coord(1/2)
> </str>
>             <str name="110_141">
> 0.9754831 = (MATCH) product of: 1.9509662 = (MATCH) sum of: 1.9509662 =
> (MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result
> of: 1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0 ), product of:
> 0.66654336 = queryWeight, product of: 5.8539815 = idf(docFreq=3,
> maxDocs=513) 0.113861546 = queryNorm 2.9269907 = fieldWeight in 132, product
> of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.8539815 =
> idf(docFreq=3, maxDocs=513) 0.5 = fieldNorm(doc=132) 0.5 = coord(1/2)
> </str>
>         </lst>
>         <str name="QParser">ExtendedDismaxQParser</str>
>         <null name="altquerystring" />
>         <null name="boost_queries" />
>         <arr name="parsed_boost_queries" />
>         <null name="boostfuncs" />
>         <arr name="filter_queries">
>             <str>(lang:"nl" OR lang:"all")</str>
>         </arr>
>         <arr name="parsed_filter_queries">
>             <str>lang:nl lang:all</str>
>         </arr>
>         <lst name="timing">
>             <double name="time">7.0</double>
>             <lst name="prepare">
>                 <double name="time">4.0</double>
>                 <lst name="query">
>                     <double name="time">4.0</double>
>                 </lst>
>                 <lst name="facet">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="mlt">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="highlight">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="stats">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="debug">
>                     <double name="time">0.0</double>
>                 </lst>
>             </lst>
>             <lst name="process">
>                 <double name="time">3.0</double>
>                 <lst name="query">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="facet">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="mlt">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="highlight">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="stats">
>                     <double name="time">0.0</double>
>                 </lst>
>                 <lst name="debug">
>                     <double name="time">3.0</double>
>                 </lst>
>             </lst>
>         </lst>
>     </lst>
> </response>
>
>
> PS. had to laugh out loud about that professor joke :-D
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>

Re: Solr search engine configuration

Erick Erickson
Peter:

bq: I don't have a requestHandler named "/select".

Right, that was just an example of a request handler. Your
"/scoresearch" handler _does_ have edismax as its default "defType",
so assuming you're using that one, it makes no difference at all
whether you specify &defType=edismax on the URL or not. You'd see a
difference if you specified "&defType=any_parser_other_than_edismax"
though ;)

As for the rest, I'll leave you in the much more capable hands of
Markus since he has, you know, real knowledge in this area rather than
my generalities....

Best,
Erick
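
For anyone wondering what such a request looks like once the parameters are spelled out, here is a minimal sketch in Python. It only builds the URL. The core URL is copied from the queries quoted in this thread, and the pf boost and ps value are illustrative assumptions, not recommendations.

```python
from urllib.parse import urlencode

# Core URL as used in the queries quoted in this thread (adjust to your setup).
base_url = "http://localhost:8983/solr/search-global/select"

params = {
    "q": "dieren zaak",              # plain user terms; edismax spreads them over qf
    "defType": "edismax",            # redundant if the handler already defaults to edismax
    "qf": "title_search_global",     # fields searched term by term
    "pf": "title_search_global^10",  # assumed boost when the terms occur as a phrase
    "ps": "1",                       # assumed phrase slop applied to pf
}

# urlencode percent-escapes the values, so the "^" in pf is sent safely.
url = base_url + "?" + urlencode(params)
print(url)
```

Dropping the "defType" entry should give the same results when the handler's defaults already set edismax, which is the point made above.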

On Mon, Mar 12, 2018 at 1:33 AM, Markus Jelsma
<[hidden email]> wrote:

> Hi,
>
> Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to solve all problems either. For example, molens => molen, but molen => mool, and there are many more like that. You can solve this by adding manual rules to StemmerOverrideFilter, but due to the compound nature of words, you would need to add it for all the mills.
>
> Regarding the compounds, Dutch is (more or less) just another Germanic language and uses compounds just like German, Swedish etc. To deal with that you can try the vanilla HyphenationCompoundWordTokenFilter (or something like that). Be sure not to set minWordLength too low, or you'll get plenty of bad results. The major drawback of this token filter is that it emits overlapping terms, and it may not always work with compounds whose head is a plural, such as dierenzaak or scholierenkorting.
>
> Also add an AccentFoldingFilter or ICUNormalizer to get rid of accents, or you may have trouble finding a café.
>
> Regards,
> Markus
>

RE: Solr search engine configuration

PeterKerk
In reply to this post by Markus Jelsma-2
@Erick: thank you for clarifying!

@Markus:
I feel like I'm not (or at least should not be :-)) the first person to run
into these challenges.

"You can solve this by adding manual rules to StemmerOverrideFilter, but due
to the compound nature of words, you would need to add it for all the mills"

After Googling I found this:
https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb
and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt
as stemdict_nl.txt

My new fieldType definition now is:

        <fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Kp" protected="protwords_nl.txt"></filter>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Kp" protected="protwords_nl.txt"></filter>
          </analyzer>
        </fieldType>
       
I trimmed stemdict_nl.txt for testing to just this:

aachen                        aach
aachener                      aachener

But on full-import it throws an HTTP 500 error:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at
org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66)

Is my stemdict_nl.txt format incorrect?

And do you have examples of the HyphenationCompoundWordTokenFilter or
AccentFoldingFilter? I can't find any.

I use Solr 4.3.1 btw, not sure if that matters.





RE: Solr search engine configuration

Markus Jelsma-2
In reply to this post by PeterKerk
Hello Peter,

StemmerOverride wants \t separated fields, that is probably the cause of the AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a proper example listed. I recommend putting a decompounder before a stemmer, and have an accent (or ICU) folder as one of the last filters.

About the diff, it looks like KP output: it has the same issues with whether or not a word needs double or single vowels in the root. It also shows issues with strong verbs/nouns (beveel/bevool). Having this list is much the same as having KP configured, so you should drop it and only list exceptions to KP rules in the dict file. This is not easy, so I recommend staying within your domain's vocabulary.

Also, unless you have a very specific need for it, drop the StopFilter. Nobody these days should want a StopFilter unless they can justify it. We use them too, but only for very specific reasons, never for text search. You might also want to have a WordDelimiterFilter as your first filter; look it up, you probably want to have it.

Markus

[1] https://lucene.apache.org/core/7_1_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html

 
 

Re: Solr search engine configuration

Shawn Heisey-2
In reply to this post by PeterKerk
On 3/12/2018 4:15 PM, PeterKerk wrote:
> I trimmed stemdict_nl.txt for testing to just this:
>
> aachen                        aach
> aachener                      aachener

According to the example here:

https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/stemdict.txt

The lines need to be tab separated.

I'm betting that you're running into this bug, which is still unresolved:

https://issues.apache.org/jira/browse/LUCENE-4545

The source file you have referenced uses spaces.  If those are still in
your file, it isn't going to work.  It appears that the way the code is
written (and is STILL written even in master, which will one day be
version 8.0), the separator must be a SINGLE tab.  I have confirmed that
multiple tabs or any number of spaces isn't going to work properly.

I will see what I can do about getting the bug fixed, but for now you're
going to have to fix all the separators in your dictionary file.

Thanks,
Shawn
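
If the dictionary is large, fixing the separators by hand gets tedious. The following is a minimal Python sketch of one way to do it, assuming every useful line holds exactly two whitespace-separated fields (word, then stem). The function name is made up for illustration and is not part of Solr.

```python
def normalize_stemdict(lines):
    """Rewrite 'word<any whitespace>stem' lines as 'word<TAB>stem'.

    Hypothetical helper: blank lines are dropped, and any line that does
    not hold exactly two fields raises an error instead of being guessed at.
    """
    fixed = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        parts = stripped.split()  # splits on runs of spaces and/or tabs
        if len(parts) != 2:
            raise ValueError(f"expected exactly two fields, got: {line!r}")
        fixed.append("\t".join(parts))  # exactly one tab between the fields
    return fixed


sample = [
    "aachen                        aach",
    "aachener                      aachener",
]
for entry in normalize_stemdict(sample):
    print(repr(entry))
```

A line such as "dieren zaken dierenzaak" has three fields and is rejected by this sketch, so multi-word entries show up as errors instead of silently breaking the import.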


RE: Solr search engine configuration

PeterKerk
In reply to this post by Markus Jelsma-2
Markus,

Thanks again. Ok, 1 by 1:

StemmerOverride wants \t separated fields, that is probably the cause of the
AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
proper example listed. I recommend putting a decompounder before a stemmer,
and have an accent (or ICU) folder as one of the last filters.

PVK COMMENT:
Looking for Decompounders and found a few links, btw a lot of the pages
these are linked to don't work.

https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages

http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
        https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
                https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory
               
my stemdict_nl.txt now contains (words separated by a single tab):
aachen aach
aachener aachener
aalmoezen aalmoes
beveel bevool
dierenzaken dierenzaak

The problem before was indeed like @Shawn indicates that I had words in
there with a space like so:
dieren zaken dierenzaak


       
About the diff, it looks like KP output, it has the same issues with whether
or not a word needs double or single vowels in the root. It also shows
issues with strong verbs/nouns (beveel/bevool). Having this list seems like
having KP configured so you should drop it, and only list exceptions to KP
rules in the dict file. This is not easy, so i recommend to stay in to your
domain's vocabulary.

PVK COMMENT:
That's what I now did above right?


Also, unless you have a very specific need for it, drop the StopFilter.
Nobody in these days should want a StopFilter unless they can justify it. We
use them too, but only for very specific reasons, but never for text search.
You might also want to have a WordDelimiterFilter as your first filter, look
it up, you probably want to have it.

PVK COMMENT:
But without a StopFilter, won't stopwords be included in searches? I thought
that for example Google excluded these words in their algorithms?




This is what I have now:

        <fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                    generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                    catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                    dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
                    maxSubwordSize="15" onlyLongestMatch="true"/>
            <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                    generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                    catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                    dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
                    maxSubwordSize="15" onlyLongestMatch="true"/>
            <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
          </analyzer>
        </fieldType>

       
Now for both this query
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true       
and this one:
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
       
This result is found:
"Hi there dieren zaak something else"

And these are NOT:
"Hi there dier something else"
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

What else do you recommend I try?
       




RE: Solr search engine configuration

Markus Jelsma-2
In reply to this post by PeterKerk

 
-----Original message-----

> From:PeterKerk <[hidden email]>
> Sent: Tuesday 13th March 2018 14:24
> To: [hidden email]
> Subject: RE: Solr search engine configuration
>
> Markus,
>
> Thanks again. Ok, 1 by 1:
>
> StemmerOverride wants \t separated fields, that is probably the cause of the
> AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
> proper example listed. I recommend putting a decompounder before a stemmer,
> and have an accent (or ICU) folder as one of the last filters.
>
> PVK COMMENT:
> Looking for Decompounders and found a few links, btw a lot of the pages
> these are linked to don't work.
>
> https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages
>
> http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
> https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory

Stick to the Javadoc, where the examples are good, or to the reference guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

>
> my stemdict_nl.txt now contains (words separated by a single tab):
> aachen aach
> aachener aachener
> aalmoezen aalmoes
> beveel bevool
> dierenzaken dierenzaak
>
> The problem before was indeed like @Shawn indicates that I had words in
> there with a space like so:
> dieren zaken dierenzaak
>
>
>
> About the diff, it looks like KP output, it has the same issues with whether
> or not a word needs double or single vowels in the root. It also shows
> issues with strong verbs/nouns (beveel/bevool). Having this list seems like
> having KP configured so you should drop it, and only list exceptions to KP
> rules in the dict file. This is not easy, so i recommend to stay in to your
> domain's vocabulary.
>
> PVK COMMENT:
> That's what I now did above right?

Almost. zaken -> zaak is already KP output, so there is no need to list what the stemmer will do for you.

>
>
> Also, unless you have a very specific need for it, drop the StopFilter.
> Nobody in these days should want a StopFilter unless they can justify it. We
> use them too, but only for very specific reasons, but never for text search.
> You might also want to have a WordDelimiterFilter as your first filter, look
> it up, you probably want to have it.
>
> PVK COMMENT:
> But without a Stopfilter, wont stopwords be included in searches? I though
> that for example Google excluded these words in their algorithms?
>

Yes, stopwords will be included in searches, and that is good: keep them in the index! I am glad Google doesn't just strip stopwords.

>
>
> This is what I have now:
>
> <fieldType name="searchtext_nl" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>      
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
>
> <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
> dictionary="compounds_nl.txt"
>          minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
>
> <filter class="solr.StemmerOverrideFilterFactory"
> dictionary="stemdict_nl.txt"/>
>
>
> <filter class="solr.ASCIIFoldingFilterFactory"/>
>
>       </analyzer>
>       <analyzer type="query">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>      
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
>
> <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
> dictionary="compounds_nl.txt"
>          minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
>
> <filter class="solr.StemmerOverrideFilterFactory"
> dictionary="stemdict_nl.txt"/>
>
>
> <filter class="solr.ASCIIFoldingFilterFactory"/>
>       </analyzer>
>     </fieldType>

That looks fine, but you have now omitted the stemmer (Snowball). Put it after StemmerOverrideFilter and before ASCIIFolding.

>
>
> Now for both this query
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true       
> and this one:
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
>
> This result is found:
> "Hi there dieren zaak something else"
>
> And these are NOT:
> "Hi there dier something else"
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"

This is because the decompounder doesn't split dierenzaak. You must test this in the Solr admin analysis screen before reindexing or querying. Once the decompounder splits dierenzaak, and a stemmer is in place, all except 'dier' will be found, depending on your mm setting.

And did you reindex?
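
To make the dictionary dependence concrete, here is a toy Python sketch of dictionary-based decompounding. It is not Lucene's implementation, just a brute-force illustration under the assumption that a subword is emitted only when it appears in the dictionary. The function name and the example dictionaries are made up. The size limits mirror the minWordSize, minSubwordSize and maxSubwordSize attributes from the schema above.

```python
def decompound(token, dictionary, min_word_size=5,
               min_subword_size=2, max_subword_size=15):
    """Return every dictionary word found inside token (toy, brute force)."""
    if len(token) < min_word_size:
        return []  # short tokens are left alone
    parts = []
    for start in range(len(token) - min_subword_size + 1):
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(token):
                break  # candidate would run past the end of the token
            candidate = token[start:start + length]
            if candidate in dictionary:
                parts.append(candidate)
    return parts


# With the parts in the dictionary, the compound is split into subwords.
print(decompound("dierenzaak", {"dier", "dieren", "zaak"}))
# With a dictionary that lacks the parts, nothing is emitted.
print(decompound("dierenzaak", {"molen"}))
```

In this sketch a compound yields subwords only if those subwords are listed, which is one plausible reason why dierenzaak stays unsplit when compounds_nl.txt does not contain its parts.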

>
> What else do you recommend I try?

Re: Solr search engine configuration

Shawn Heisey-2
In reply to this post by PeterKerk
On 3/13/2018 7:24 AM, PeterKerk wrote:
> PVK COMMENT:
> But without a Stopfilter, wont stopwords be included in searches? I though
> that for example Google excluded these words in their algorithms?

I just did a google search for "to be or not to be".  It worked flawlessly.

If Google were removing stopwords, that search would have returned
nothing.  The four words in that search are among the most frequent
words found in English prose.  This is a typical stopword list for English:

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

To explain why the frequent responders on this list recommend not using
stopwords, and why the biggest search engine on the planet doesn't use
them, you need a small history lesson -- you have to know why stopword
filters were invented in the first place.

A search engine works by creating an inverted index. This means for a
typical full-text index that there is a big list of words, and for each
of those words, there is a list that identifies the document, field
name, and text offset of where that word is found.  Without a stopword
filter, the biggest entry in an index for English is probably "the" ...
in a corpus of a few million documents, "the" might appear *billions* of
times.  So the list is BIG.  And when the search has to deal with a big
entry in the inverted index, it's slower than normal.

Back in the annals of history (80s, 90s, etc) servers didn't have nearly
as much memory and CPU resources as they do now.  Eliminating these
giant entries in the index made a HUGE difference in search
performance.  A search that might take several seconds with the
stopwords included could be sped up to less than one second without them.

Even back then, the people who built stopword filters KNEW that they
were impacting search results.  The reason they implemented them anyway
was to greatly improve search *performance*.  They knew that a search
for "to be or not to be" or "the who" or any number of other similar
searches wouldn't work properly.  But the vast majority of searches were
not really affected by the stopword removal, and users got their results
really fast.
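
A small Python sketch of that trade-off, assuming a plain whitespace analyzer and only a subset of the stopword list above (the `analyze` helper is hypothetical, not a Solr API):

```python
# Subset of the stopword list quoted earlier in this message.
STOPWORDS = {"a", "an", "and", "be", "not", "or", "the", "to"}

def analyze(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]

print(analyze("to be or not to be", STOPWORDS))  # []
print(analyze("the who", STOPWORDS))             # ['who']
print(analyze("solr search engine", STOPWORDS))  # ['solr', 'search', 'engine']
```

The first query is left with no terms at all, the second loses half its meaning, and the third, like most everyday queries, is untouched.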

Today, with modern hardware, search engines are much less bothered by
having enormous entries in the inverted index.  When stopwords are NOT
removed, you can get more accurate search results.  Yes, the index is
substantially bigger.  But modern hardware is easy to load up with a lot
of disk space, memory, and CPU capacity, and search with stopwords is
fast enough.

Thanks,
Shawn


RE: Solr search engine configuration

PeterKerk
In reply to this post by Markus Jelsma-2
> You must stay in the Javadoc section, there the examples are good, or the
> reference guide:
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

PVK COMMENT 1:
This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on the
radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm doing
now severely degrade my result quality as opposed to
HyphenationCompoundWordTokenFilterFactory?

> Almost, zaken -> zaak is already KP output, no need to input what the
> stemmer will do for you.

PVK COMMENT 2:
How do you know zaken -> zaak is already KP output? Is there a list
somewhere?

PVK COMMENT 3:
I now have:

<fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
            maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"
            protected="protwords_nl.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
            maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"
            protected="protwords_nl.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

I tested in admin UI (and yes, I restart Solr and reindex every time I make
a change):
       
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
returns:
"hi there dieren zaak something else"
"hi there dier something else"

http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
returns
"hi there dierenzaak something else"

So I added "dieren" to compounds_nl.txt

Now on "title_search_global:(dieren zaak)" it returns:
<doc>
    <str name="title">hi there dieren zaak something else</str>
    <str name="id">115_3699638</str>
</doc>
<doc>
    <str name="title">hi there dier something else</str>
    <str name="id">115_3699637</str>
</doc>
<doc>
    <str name="title">hi there dierenzaak something else</str>
    <str name="id">115_3699639</str>
</doc>

So it's starting to look good! :-)

What I want to know, how can I have Solr consider "dierenzaak" to be of
higher importance than just "dier" in the above results?

Also I'm still not 100% sure what my addition of "dieren" to
compounds_nl.txt actually does...I assume
DictionaryCompoundWordTokenFilterFactory just looks for that exact string
and if it finds it, considers that a separate word? Correct?

Thanks again!




RE: Solr search engine configuration

Markus Jelsma-2
In reply to this post by PeterKerk
Inline, cheers.

-----Original message-----

> From:PeterKerk <[hidden email]>
> Sent: Tuesday 13th March 2018 18:53
> To: [hidden email]
> Subject: RE: Solr search engine configuration
>
> You must stay in the Javadoc section, there the examples are good, or the
> reference guide:
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions
>
> PVK COMMENT 1:
> This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on the
> radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm doing
> now severely degrade my result quality as opposed to
> HyphenationCompoundWordTokenFilterFactory?

Just change version number, most filters are already quite old:
https://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

Dictionary vs. Hyphenation: using Dictionary won't severely degrade results, and it can be easier to use if you need to add words. I prefer the hyphenator myself, but it can bite. Stick to Dictionary, you are fine. Both (iirc) suffer from the same problems, though: overlapping words, and subwords that do not entirely make up the full compound (minus genitives or plural forms). This is a real issue.

>
>
> Almost, zaken -> zaak is already KP output, no need to input what the
> stemmer will do for you.
>
> PVK COMMENT 2:
> How do you know zaken -> zaak is already KP output? Is there a list
> somewhere?

I know because I've seen KP's output a million times by now. You should really use Solr's analysis GUI; it shows what each filter emits, and it is really helpful.

>
> PVK COMMENT 3:
> I now have:
>
> <fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>             dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
>             maxSubwordSize="15" onlyLongestMatch="true"/>
>     <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>             protected="protwords_nl.txt"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>             dictionary="compounds_nl.txt" minWordSize="5" minSubwordSize="2"
>             maxSubwordSize="15" onlyLongestMatch="true"/>
>     <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>             protected="protwords_nl.txt"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>   </analyzer>
> </fieldType>

Please increase minWordSize and minSubwordSize. There are no compounds with that few characters. minSubwordSize should be at least 4, or you will get a lot of crazy output due to the problems stated above.
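
To see why a small minSubwordSize produces noise, here is a simplified Python sketch of dictionary-based decompounding. It is only loosely modelled on what the dictionary decompounder does, it is not Lucene's implementation, and the dictionary contents are made up for illustration. The parameter names mirror the filter attributes.

```python
def decompound(token, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=True):
    """Return the dictionary subwords found inside token (simplified model)."""
    if len(token) < min_word_size:
        return []
    found = []
    for start in range(len(token) - min_subword_size + 1):
        longest = None
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(token):
                break
            candidate = token[start:start + length]
            if candidate in dictionary:
                if only_longest_match:
                    # Keep only the longest hit starting at this offset.
                    if longest is None or len(candidate) > len(longest):
                        longest = candidate
                else:
                    found.append(candidate)
        if longest is not None:
            found.append(longest)
    return found

# Hypothetical dictionary; short entries such as "er" and "en" cause the noise.
dictionary = {"die", "dier", "dieren", "zaak", "er", "en"}

print(decompound("dierenzaak", dictionary, min_subword_size=2))
print(decompound("dierenzaak", dictionary, min_subword_size=4))
```

With a minimum of 2 the short entries "er" and "en" match in the middle of the compound; with a minimum of 4 only "dieren" and "zaak" remain. Solr's analysis GUI shows the real filter output for your own dictionary.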

>
> I tested in admin UI (and yes, I restart Solr and reindex every time I make
> a change):
>
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
> returns:
> "hi there dieren zaak something else"
> "hi there dier something else"
>
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
> returns
> "hi there dierenzaak something else"
>
> So I added "dieren" to compounds_nl.txt
>
> Now on "title_search_global:(dieren zaak)" it returns:
> <doc>
>     <str name="title">hi there dieren zaak something else</str>
>     <str name="id">115_3699638</str>
> </doc>
> <doc>
>     <str name="title">hi there dier something else</str>
>     <str name="id">115_3699637</str>
> </doc>
> <doc>
>     <str name="title">hi there dierenzaak something else</str>
>     <str name="id">115_3699639</str>
> </doc>
>
> So it's starting to look good! :-)
>
> What I want to know, how can I have Solr consider "dierenzaak" to be of
> higher importance than just "dier" in the above results?

Does the decompounder support emitting the compound word as well? If so, enable it. It should help scoring compounds higher via IDF as they are less common.
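
As a rough illustration of the IDF effect, here is a Python sketch using the classic Lucene TF-IDF inverse document frequency, which to my knowledge is 1 + ln(numDocs / (docFreq + 1)). The corpus size and document frequencies below are invented numbers, not measurements from any real index.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in classic Lucene TF-IDF scoring."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

num_docs = 100_000  # hypothetical corpus size
# Hypothetical document frequencies: the stem is common, the compound is rare.
for term, doc_freq in [("dier", 5000), ("dierenzaak", 40)]:
    print(term, round(classic_idf(doc_freq, num_docs), 2))
```

The rarer compound gets the larger IDF, so a document matching the indexed compound token scores higher on that term than one matching only the common stem.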

>
> Also I'm still not 100% sure what my addition of "dieren" to
> compounds_nl.txt actually does...I assume
> DictionaryCompoundWordTokenFilterFactory just looks for that exact string
> and if it finds it, considers that a separate word? Correct?

Just check in analysis GUI, it will answer all these questions.

>
> Thanks again!

RE: Solr search engine configuration

PeterKerk
Cool, will do some more digging around in the analysis GUI first.

One last thing then on this comment of yours:
"Does the decompounder support emitting the compound word as well? If so,
enable it. It should help scoring compounds higher via IDF as they are less
common."

So I checked the Javadoc:
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
To be sure I also checked the Javadoc for the alternative:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html,
but nothing there on emitting either.

Where can I see whether DictionaryCompoundWordTokenFilterFactory supports
emitting the compound word and how to enable it?

Thanks again! :-)




RE: Solr search engine configuration

Markus Jelsma-2
In reply to this post by PeterKerk
Hi - In that case you need the KeywordRepeat and RemoveDuplicates filters as well; I'd suggest reading their Javadocs. With the docs and the analysis GUI, you can probably figure out their respective places in the analyzer chain yourself.
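
A toy Python model of how those two filters behave around a stemmer, as I understand them: the repeat filter emits every token twice, once flagged as a keyword that the stemmer leaves alone, and the duplicate filter drops identical tokens at the same position. The `TOY_STEMS` lookup is a hypothetical stand-in for a real stemmer, and none of these function names are Solr APIs.

```python
# Hypothetical stand-in for a real stemmer such as Kp.
TOY_STEMS = {"dieren": "dier", "zaken": "zaak"}

def keyword_repeat(tokens):
    """Emit each (text, position) twice: once flagged as keyword, once not."""
    for text, pos in tokens:
        yield text, pos, True
        yield text, pos, False

def stem(tokens):
    """Stem only the tokens that are not flagged as keywords."""
    for text, pos, is_keyword in tokens:
        yield (text if is_keyword else TOY_STEMS.get(text, text)), pos

def remove_duplicates(tokens):
    """Drop tokens whose text and position were already emitted."""
    seen = set()
    for text, pos in tokens:
        if (text, pos) not in seen:
            seen.add((text, pos))
            yield text, pos

def analyze(words):
    tokens = [(w, i) for i, w in enumerate(words)]
    return list(remove_duplicates(stem(keyword_repeat(tokens))))

print(analyze(["dieren", "zaak"]))
# [('dieren', 0), ('dier', 0), ('zaak', 1)]
```

Both the original and the stemmed form end up at the same position, and words the stemmer does not change are indexed only once.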

Relying on IDF is not really a finely controlled boosting mechanism, but it should work more or less. We use payloads everywhere for finely controlled scoring, but that involves a lot of code.

Cheers,
Markus


RE: Solr search engine configuration

PeterKerk
Thanks, will look into all that :-)


