indexing two words, searching single word


indexing two words, searching single word

Clemens Wyss DEV
Sounds like a rather simple issue:
if I index "sound stage" and search for "soundstage" I get no hits

What am I doing wrong
a) when indexing
b) when searching
?

Thx in advance
- Clemens

RE: indexing two words, searching single word

Markus Jelsma-2
Hello,

If your content is English, you could use synonyms to work around the problem, since the language has relatively few compound words. However, if you are dealing with a Germanic compounding language, the HyphenationCompoundWordTokenFilter [1] or DictionaryCompoundWordTokenFilter is a better choice. The former is much more flexible but has its drawbacks.

Regards,
Markus

[1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
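
As a rough illustration of the synonym route mentioned above, a query-time analyzer could map the closed compound to its two-word form. This is only a sketch under assumptions: the field type name `text_syn` and the file name `synonyms.txt` are placeholders that do not come from this thread, and whether multi-word synonyms behave as expected at query time also depends on the query parser settings.

```xml
<!-- Hypothetical field type; "text_syn" and "synonyms.txt" are placeholder names -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- synonyms.txt would contain a line such as:  soundstage, sound stage -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
  </analyzer>
</fieldType>
```

Every compound has to be listed by hand in the synonyms file, which is why this approach mainly suits a language with few compounds.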

 
 
-----Original message-----

> From:Clemens Wyss DEV <[hidden email]>
> Sent: Friday 3rd August 2018 12:22
> To: [hidden email]
> Subject: indexing two words, searching single word

AW: indexing two words, searching single word

Clemens Wyss DEV
Hi Markus,
thanks for the quick answer.

"sound stage" was just an example. We are looking for a generic solution ...

Is it "ok" to apply an NGramFilter for query analysis?
<analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15" />
</analyzer>

I guess that (besides the performance impact) this reduces the accuracy of the search results?

-Clemens

-----Original message-----
From: Markus Jelsma <[hidden email]>
Sent: Friday, 3 August 2018 12:43
To: [hidden email]
Subject: RE: indexing two words, searching single word


Re: indexing two words, searching single word

Alexandre Rafalovitch
But what is your generic problem, then? You are probably not looking
for "andthe" kind of tokens.

However, a shingle filter plus a regex to remove whitespace can give you
"anytwo wordstogether smooshed" tokens in the index.

Regards,
     Alex
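
A minimal sketch of what the "shingle plus regex" idea could look like as an index-time chain. It is an assumption about how the suggestion might be wired up, not a configuration posted in this thread: the shingle filter joins adjacent words with its default space separator, and a pattern-replace filter then strips that whitespace.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <!-- emit single words plus every pair of adjacent words, e.g. "sound stage" -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" />
  <!-- strip the space that joins the pair, turning "sound stage" into "soundstage" -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all" />
</analyzer>
```

The Analysis screen in the Solr admin UI is a convenient place to check which tokens such a chain actually emits for a given input.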


On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV, <[hidden email]> wrote:


AW: indexing two words, searching single word

Clemens Wyss DEV
>Because you probably are not looking for "andthe" kind of tokens
(unfortunately) I guess I am, as we don't know what people enter...

> a shingle plus regex to remove whitespace
Sounds interesting. What would that filter chain look like? Would that be a type="index" analyzer?
I guess we could shingle after stop-word filtering, and I guess maxShingleSize="2" would suffice.

-----Original message-----
From: Alexandre Rafalovitch <[hidden email]>
Sent: Friday, 3 August 2018 13:33
To: solr-user <[hidden email]>
Subject: Re: indexing two words, searching single word


AW: indexing two words, searching single word

Clemens Wyss DEV
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <!-- here we go! -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator="" />
</analyzer>

seems to "work"
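
For the input "sound stage", the index analyzer above should produce the tokens `sound`, `soundstage` and `stage`. A matching query-side analyzer is not shown in the thread; the following is a sketch under the assumption that the query side simply omits the shingle filter, so that a query for "soundstage" stays a single token and can hit the shingled index token.

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
```

Note that this only covers the direction asked about (two words indexed, one word searched). The reverse case, where "soundstage" is indexed and "sound stage" is searched, would need separate handling.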

-----Original message-----
From: Clemens Wyss DEV <[hidden email]>
Sent: Friday, 3 August 2018 13:46
To: [hidden email]
Subject: AW: indexing two words, searching single word


Re: indexing two words, searching single word

Susheel Kumar-3
And, as you suggested, apply the stop-word filter before the shingle filter...
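
A sketch of the index analyzer with stop-word removal placed ahead of the shingle filter, as suggested. The file name `stopwords.txt` is an assumption; use whatever stop-word list the schema already references.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <!-- remove stop words first so that pairs like "andthe" are not produced -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator="" />
</analyzer>
```

One thing worth verifying in the Analysis screen: removed stop words leave position gaps, and the shingle filter may fill such gaps with a placeholder (controlled by its `fillerToken` attribute), so tokens containing that placeholder can show up in the index.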

On Fri, Aug 3, 2018 at 8:10 AM, Clemens Wyss DEV <[hidden email]>
wrote:


AW: indexing two words, searching single word

Clemens Wyss DEV
+1 ;)

-----Original message-----
From: Susheel Kumar <[hidden email]>
Sent: Friday, 3 August 2018 14:40
To: [hidden email]
Subject: Re: indexing two words, searching single word
