Prefix + Suffix Wildcards in Searches

8 messages

Prefix + Suffix Wildcards in Searches

Chris Dempsey
Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
I'm looking into options for optimizing something like this:

> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*

It's probably not a surprise that we're seeing performance issues with
queries like this. My understanding is that a wildcard on both ends
forces a scan of the full term index, and a query like the above can't
take advantage of the ReversedWildcardFilter either. I believe
constructing `n-grams` is an option (*at the expense of index size*), but
is there anything I'm overlooking as a possible avenue to look into?
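To make the cost difference concrete, here's a toy Python sketch (an illustration of the general idea, not Solr internals; the tag values are made up). Lucene keeps terms sorted, so a trailing wildcard can seek straight to a range, while a leading wildcard has to visit every term:

```python
import bisect

# Toy model of Lucene's term dictionary: terms are stored sorted, so a
# trailing wildcard (tag:ms-reply-paid*) can seek straight to a range.
terms = sorted([
    "credit-ms-reply-unpaid", "invoice-paid", "ms-reply-paid-2019",
    "ms-reply-paid-2020", "ms-reply-unpaid-2019", "paid",
])

def prefix_match(prefix):
    # O(log n) seek via the sort order, then read only the matching range
    i = bisect.bisect_left(terms, prefix)
    out = []
    while i < len(terms) and terms[i].startswith(prefix):
        out.append(terms[i])
        i += 1
    return out

def contains_match(sub):
    # A leading wildcard (tag:*paid*) can't use the sort order at all:
    # every term in the dictionary gets visited.
    return [t for t in terms if sub in t]

print(prefix_match("ms-reply-paid"))   # ['ms-reply-paid-2019', 'ms-reply-paid-2020']
print(contains_match("ms-reply-unpaid"))
```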

Re: Prefix + Suffix Wildcards in Searches

Erick Erickson
How regular are your patterns? Are they arbitrary?
What I’m wondering is if you could shift your work to the
indexing end, perhaps even in an auxiliary field. Could you,
say, just index “paid”, “ms-reply-unpaid” etc.? Then there
are no wildcards at all. This is akin to “concept search”.

Otherwise ngramming is your best bet.

What’s the field type anyway? Is this field tokenized?

There are lots of options, but soooo much depends on whether
you can process the data such that you won’t need wildcards.

Best,
Erick


Re: Prefix + Suffix Wildcards in Searches

Chris Dempsey
First off, thanks for taking a look, Erick! I see you helping lots of folks
out here and I've learned a lot from your answers. Much appreciated!

> How regular are your patterns? Are they arbitrary?

Good question. :) That's data that I should have included in the initial
post but both the values in the `tag` field and the search query itself are
totally arbitrary (*i.e. user entered values*). I see where you're going if
the set of either part was limited.

> What’s the field type anyway? Is this field tokenized?

<field name="tag" type="text_kwt_fd_lc" indexed="true" stored="true"
multiValued="true"/>

<fieldType name="text_kwt_fd_lc" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"
preserveOriginal="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory"
withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
maxFractionAsterisk="0"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


Re: Prefix + Suffix Wildcards in Searches

Erick Erickson
I was afraid of “totally arbitrary”.

OK, this field type is going to surprise the heck out of you. Whitespace
tokenizer is really stupid. It’ll include punctuation for instance. Take
a look at the admin UI/analysis page and pick your field and put some
creative entries in and you’ll see what I mean.
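A quick Python illustration of the whitespace-only split (made-up input; the authoritative view is the admin UI/analysis page):

```python
# WhitespaceTokenizerFactory splits on whitespace only, so punctuation
# stays glued to the tokens -- which then breaks exact term matches.
text = 'Tags: "invoice-paid", ms-reply-unpaid-2019!'
tokens = text.split()  # whitespace-only split, like the Whitespace tokenizer
print(tokens)  # ['Tags:', '"invoice-paid",', 'ms-reply-unpaid-2019!']

# A simple term query for the bare tag will then miss:
print("invoice-paid" in [t.lower() for t in tokens])  # False
```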

So let’s get some use-cases in place. Can users enter tags like
blahms-reply-unpaidnonsense and expect to find it with *ms-reply-unpaid*?
Or is the entry something like
“my dog has ms-reply-unpaid and is mangy”?
If the latter, simple token searching will work fine; there’s no need for
wildcards at all.

FWIW,
Erick


Re: Prefix + Suffix Wildcards in Searches

Mikhail Khludnev-2
In reply to this post by Chris Dempsey
Hello, Chris.
I suppose index time analysis can yield these terms:
"paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
expensive wildcard queries. Here's why it's worth avoiding them:
https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam


--
Sincerely yours
Mikhail Khludnev

Re: Prefix + Suffix Wildcards in Searches

Chris Dempsey
@Erick,

You've got the idea. Basically the users can attach zero or more tags (*that
they create*) to a document. So as an example say they've created the tags
(this example is just a small subset of the total tags):

   - paid
   - invoice-paid
   - ms-reply-unpaid-2019
   - credit-ms-reply-unpaid
   - ms-reply-paid-2019
   - ms-reply-paid-2020

and attached them in various combinations to documents. They then want to
find all documents by tag that don't contain the characters "paid" anywhere
in the tag, don't contain tags with the characters "ms-reply-unpaid", but
do include documents tagged with the characters "ms-reply-paid".

The obvious suggestion would be to have the users just use the entire tag
(i.e. don't let them do a "contains") as a condition to eliminate the
wildcards - which would work - but unfortunately we have customers with (*not
joking*) over 100K different tags (*why they have a taxonomy like that is a
different issue*). I'm willing to accept that in our scenario n-grams might
be the Solr-based answer (the other being to change what "contains" means
within our application) but thought I'd check I hadn't overlooked any other
options. :)
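For reference, the n-gram idea reduces arbitrary-substring "contains" to exact term lookups. A toy Python sketch (made-up data; simplified postings, with a verification pass to drop candidates whose grams match but aren't contiguous):

```python
N = 3  # gram size -- an assumption for this sketch

def grams(s):
    return {s[i:i + N] for i in range(len(s) - N + 1)}

tags = ["paid", "invoice-paid", "ms-reply-unpaid-2019",
        "credit-ms-reply-unpaid", "ms-reply-paid-2019"]

# Index time: every character trigram of every tag becomes a term
postings = {}
for t in tags:
    for g in grams(t):
        postings.setdefault(g, set()).add(t)

def contains(sub):
    # Query time: intersect the postings of the query's grams (each an
    # exact term lookup, no wildcard scan), then verify the substring.
    cand = set(tags)
    for g in grams(sub):
        cand &= postings.get(g, set())
    return {t for t in cand if sub in t}

print(sorted(contains("ms-reply-unpaid")))
# ['credit-ms-reply-unpaid', 'ms-reply-unpaid-2019']
```

The index-size cost is visible here too: every tag contributes roughly one term per character.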


Re: Prefix + Suffix Wildcards in Searches

Chris Dempsey
@Mikhail

Thanks for the link! I'll read through that.


Re: Prefix + Suffix Wildcards in Searches

Erick Erickson
That’s not quite the question I was asking.

Let’s take “…that don’t contain the characters ‘paid’”.

Start with the fact that no matter what the mechanics of
implementing pre-and-post wildcards, something like

*:* -tag:*paid*

would exclude a doc with a tag of “credit-ms-reply-unpaid” or
“ms-reply-unpaid-2019”. I really think this is an XY problem:
you’re assuming that the solution is pre-and-post wildcards
without a precise definition of the problem you’re trying to solve.

Do they want to exclude things with the characters ‘ia’ or ‘id’? Or
is their “unit of exclusion” the _entire_ word ‘paid’? Or can we
define it so? Because if we can, what I wrote yesterday about
using proper tokenization and phrase queries will work.

If you break up all your tags in your example into individual
tokens on non-alphanumerics, then your problem is much simpler,
excluding “*paid*” becomes

-tag:paid

excluding “*ms-reply*” becomes

-tag:"ms reply"

trying to exclude “*ms-unpaid*”

would _not_ exclude the doc with the tag “credit-ms-reply-unpaid”
because “ms” and “unpaid” are not sequential.

_Including_ is the same argument.

BTW, this is where “positionIncrementGap” comes in. If they can
define multiple tags in each document, phrase searching with
a gap greater than 1 (100 is the usual default) _and_ each tag
is an entry in a multiValued field, you can prevent matching
across tags with phrase searches. Consider two tags “ms-tag1”
and “paid-2019”. You don’t want “*tag1-paid*” to exclude this
doc I’d imagine. The positionIncrementGap takes care of this in the
phrase case. Remember that in this solution, the dashes aren’t
included in each token.
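A small Python model of the positions involved (illustrative only; the gap value and the split-on-non-alphanumerics tokenization are assumptions mirroring the discussion). A phrase matches only at consecutive positions, so the gap keeps phrases from spanning two tags:

```python
import re

GAP = 100  # like positionIncrementGap between multiValued entries

def index_positions(tags):
    # token -> list of positions; each tag's tokens start GAP apart
    pos, p = {}, 0
    for tag in tags:
        for tok in re.split(r"[^a-z0-9]+", tag.lower()):
            if tok:
                pos.setdefault(tok, []).append(p)
                p += 1
        p += GAP
    return pos

def phrase_match(pos, words):
    # the phrase matches only if the words sit at consecutive positions
    return any(all(start + k in pos.get(w, [])
                   for k, w in enumerate(words))
               for start in pos.get(words[0], []))

doc = index_positions(["ms-tag1", "paid-2019"])
print(phrase_match(doc, ["ms", "tag1"]))    # True: adjacent, same tag
print(phrase_match(doc, ["tag1", "paid"]))  # False: the gap keeps tags apart
```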

Prefix-only or suffix-only would be a little tricky; one idea would be
to copyField into an _untokenized_ field and search
there in those cases. But even here, you need to determine precisely
what you expect. What would “*d-2019” return? Would it return
something ending in “ms-reply-paid-2019”?

Alternatively, you wouldn’t need a copyField if you introduced
special tokens before and after each tag, so indexing “invoice-paid”
would index tokens:
specialbegintoken invoice paid specialendtoken
and searching for

*paid

becomes tag:"paid specialendtoken"
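A toy Python check of the sentinel-token idea (using the hypothetical token names above; note this gives whole-token suffix semantics, not arbitrary character suffixes):

```python
def tokens_with_sentinels(tag):
    # Bracket each tag's tokens with begin/end markers so a suffix
    # search can be expressed as an anchored phrase query.
    return ["specialbegintoken"] + tag.split("-") + ["specialendtoken"]

def ends_with_token(tag, word):
    # phrase [word, "specialendtoken"] == last real token is `word`
    toks = tokens_with_sentinels(tag)
    return any(toks[i] == word and toks[i + 1] == "specialendtoken"
               for i in range(len(toks) - 1))

print(ends_with_token("invoice-paid", "paid"))          # True
print(ends_with_token("ms-reply-unpaid-2019", "paid"))  # False
```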

Best,
Erick
