Shingles behavior

Radu Gheorghe
Hello Solr users,

I’m quite puzzled about how shingles work. The way tokens are analysed looks fine to me, but the query seems too restrictive.

Here’s the sample use-case. I have three documents:

mona lisa smile
mona lisa
mona

I have a shingle filter set up like this (both index- and query-time):

> <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"/>

When I query for “Mona Lisa smile” (no quotes), I expect to get all three documents back, in that order, because the first document matches all the terms:

mona
mona lisa
mona lisa smile
lisa
lisa smile
smile

The second document matches only some of them, and the third matches only one.
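
To make the expectation concrete, here is a rough sketch in plain Python. It is not Solr's analysis chain: it assumes whitespace tokenization, lowercasing, and a shingle filter that also emits unigrams (which the term list above implies). The function name `shingles` is made up for the illustration.

```python
def shingles(text, min_size=2, max_size=4, output_unigrams=True):
    """Return unigrams plus word n-grams, roughly like a shingle filter would."""
    tokens = text.lower().split()  # assumption: whitespace tokenizer + lowercase
    terms = []
    for start in range(len(tokens)):
        if output_unigrams:
            terms.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            terms.append(" ".join(tokens[start:start + size]))
    return terms


docs = ["mona lisa smile", "mona lisa", "mona"]
query_terms = set(shingles("Mona Lisa smile"))

# Number of query terms each document shares with the query.
overlap = {doc: len(query_terms & set(shingles(doc))) for doc in docs}

for doc, count in overlap.items():
    print(f"{doc!r}: {count} matching terms")
```

With "any term may match" semantics, all three documents would be returned, ranked by how many shingles they share with the query.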

Instead, I only get the first document back. That’s because the query expects all the “words” to match:

> "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona +shingle_field:lisa +shingle_field:smile) (+shingle_field:mona +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) shingle_field:mona lisa smile)))",

The query above is generated by the eDisMax query parser, with “shingle_field” as “df”.
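
A toy model of that boolean structure shows why only the first document comes back. This is just a sketch of the logic, not how Lucene evaluates the query. It treats every clause as a clause on shingle_field, and the indexed terms per document are written out by hand on the assumption that unigrams are indexed alongside the shingles.

```python
# Indexed terms per document (unigrams plus shingles), written out by hand.
indexed = {
    "mona lisa smile": {"mona", "lisa", "smile", "mona lisa", "lisa smile", "mona lisa smile"},
    "mona lisa": {"mona", "lisa", "mona lisa"},
    "mona": {"mona"},
}

# The four alternatives from the parsed query. Within one alternative,
# every term is required (the "+" clauses).
paths = [
    ["mona", "lisa", "smile"],
    ["mona", "lisa smile"],
    ["mona lisa", "smile"],
    ["mona lisa smile"],
]


def matches_graph_query(terms):
    """True if at least one alternative has all of its terms in the document."""
    return any(all(term in terms for term in path) for path in paths)


def matches_any_term(terms):
    """True if the document contains at least one query term (plain OR)."""
    query_terms = {term for path in paths for term in path}
    return bool(query_terms & terms)


graph_hits = [doc for doc, terms in indexed.items() if matches_graph_query(terms)]
any_hits = [doc for doc, terms in indexed.items() if matches_any_term(terms)]

print("graph-style query:", graph_hits)
print("any-term query:   ", any_hits)
```

Every alternative covers the whole phrase “mona lisa smile”, so a document that lacks “smile” cannot satisfy any of them.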

Is there a way to get “any of the words” to match? I’ve tried all the options I can think of:
- different query parsers
- q.op=OR
- mm=0 (or 1 or 0% or 10% or…)

Nothing seems to change the parsed query from the above.
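
For anyone who wants to reproduce this, the attempts can be scripted roughly as below. The sketch only builds the request URLs and does not send them. The host, port and collection name are placeholders, not my actual setup. debugQuery=true is what makes Solr include the parsedquery in the response.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: adjust host, port and collection name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def build_debug_url(mm, q_op="OR"):
    """Build a select URL for one mm value, with query debugging enabled."""
    params = {
        "defType": "edismax",
        "q": "Mona Lisa smile",
        "df": "shingle_field",
        "q.op": q_op,
        "mm": mm,
        "debugQuery": "true",  # response then carries the parsedquery
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


for mm in ["0", "1", "0%", "10%"]:
    print(build_debug_url(mm))
```

Comparing the parsedquery across these URLs is how I concluded that none of the settings changes it.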

I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by default, and minimum_should_match works as expected. The only difference I see between the two, on the analysis side, is that token positions start at 0 in Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see that the default “text_en”, for example, also starts at position 1.

Is it just a bug that mm doesn’t work in the context of shingles? Or is there a workaround?

Thanks and best regards,
Radu

Re: Shingles behavior

Alexandre Rafalovitch
Did you try it with the 'sow' parameter set both ways? I am not sure I fully
understand the question, especially with shingling at both index and query
time rather than just at index time. But at least it is something to try, and
it is one of the areas where Solr and ES differ.

Regards,
   Alex.

On Tue, 19 May 2020 at 05:59, Radu Gheorghe <[hidden email]> wrote:


Re: Shingles behavior

Radu Gheorghe
Hi Alex, long time no see :)

I tried with sow, and that basically invalidates query-time shingles (it
only matches mona OR lisa OR smile).

I'm using shingles at both index and query time as a substitute for pf2 and
pf3: the more shingles I match, the more relevant the document. Also,
higher order shingles naturally get lower frequencies, meaning they get a
"natural" boost.

Best regards,
Radu

On Thu, 21 May 2020, 00:28 Alexandre Rafalovitch <[hidden email]> wrote:


Re: Shingles behavior

Radu Gheorghe
Turns out, it’s down to setting enableGraphQueries=false in the field type definition. I completely missed that :(
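
For reference, a minimal sketch of what such a field type could look like, written as a payload for the Schema API's replace-field-type command. Only the ShingleFilterFactory settings and the enableGraphQueries flag come from this thread. The type name, tokenizer and lowercase filter are assumptions for illustration. The script just prints the JSON; it would still have to be POSTed to the collection's /schema endpoint.

```python
import json

# Hypothetical name: the thread never says which type backs shingle_field.
FIELD_TYPE_NAME = "text_shingle"


def shingle_field_type(name=FIELD_TYPE_NAME):
    """Field type with shingles at index and query time, graph queries off."""
    analyzer = {
        "tokenizer": {"class": "solr.StandardTokenizerFactory"},  # assumed
        "filters": [
            {"class": "solr.LowerCaseFilterFactory"},  # assumed
            {
                "class": "solr.ShingleFilterFactory",
                "minShingleSize": "2",
                "maxShingleSize": "4",
            },
        ],
    }
    return {
        "name": name,
        "class": "solr.TextField",
        # The setting from this thread, given as a string like the XML attribute.
        "enableGraphQueries": "false",
        # A single analyzer applies to both index and query time.
        "analyzer": analyzer,
    }


payload = json.dumps({"replace-field-type": shingle_field_type()}, indent=2)
print(payload)
```

As far as I can tell from the Reference Guide, enableGraphQueries only matters when sow=false, and shingle filters are the kind of case where false is suggested, because documents should still match when some tokens are missing.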

> On 21 May 2020, at 07:49, Radu Gheorghe <[hidden email]> wrote: