Dealing with multi-word keywords and SOW=true

Ashwin Ramesh
Hi everybody,

I am using the edismax parser and have noticed a very specific behaviour
with how sow=true (default) handles multiword keywords.

We have a field called 'keywords', which uses KeywordTokenizerFactory, so
each value is indexed as a single token. There are also other text fields
such as title and description.

When we index a document with a keyword "ice cream", for example, we know
it gets indexed into that field as "ice cream".

However, at query time, I noticed that if we run an Edismax query:
q=ice cream
qf=keywords

I do not get that document back as a match. This is due to sow=true
splitting the user's query and the final tokens not being present in the
keywords field.
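
A toy illustration of the mismatch described above, in plain Python rather than Solr. The `indexed_terms` set is a hypothetical stand-in for the `keywords` field, where KeywordTokenizerFactory keeps each value as one term, and `split_on_whitespace` only mimics what sow=true does to the query text. This is a sketch of the idea, not of how Solr is implemented.

```python
from typing import List

# Hypothetical contents of the 'keywords' field: one term per keyword value.
indexed_terms = {"ice cream", "chocolate"}


def split_on_whitespace(query: str) -> List[str]:
    """Mimic sow=true: each whitespace-separated word is handled on its own."""
    return query.split()


def matches(query_terms: List[str]) -> bool:
    """True if any query term is exactly equal to an indexed term."""
    return any(term in indexed_terms for term in query_terms)


sow_true_terms = split_on_whitespace("ice cream")  # ['ice', 'cream']
whole_query_terms = ["ice cream"]                  # the query kept as one unit

print(matches(sow_true_terms))     # False
print(matches(whole_query_terms))  # True
```

Neither "ice" nor "cream" equals the single indexed term "ice cream", which is why the split query finds nothing in this model.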

I was wondering what the best practice around this is. Some thoughts I
have had:

1. Index multi-word keywords with hyphens or something similar. E.g. "ice
cream" -> "ice-cream"
2. Additionally index the separate words as keywords. E.g. "ice cream"
-> "ice cream", "ice", "cream". However, this method will result in the loss
of intent (q=ice would return this document).
3. Add a boost query which is an edismax query where we explicitly set
sow=false and add a huge boost. E.g. bq={!edismax qf=keywords^1000
sow=false bq="" boost="" pf="" tie=1.00 v="ice cream"}
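
Options 1 and 2 are index-time transformations of the keyword values, and could be sketched roughly as below. The function names `hyphenate` and `expand` are made up for illustration and are not part of Solr; the code only shows what would be sent to the index, assuming keywords are plain whitespace-separated strings.

```python
from typing import List


def hyphenate(keyword: str) -> str:
    """Option 1: join the words of a multi-word keyword with hyphens."""
    return "-".join(keyword.split())


def expand(keyword: str) -> List[str]:
    """Option 2: keep the full keyword and also add each word on its own."""
    words = keyword.split()
    if len(words) > 1:
        return [keyword] + words
    return [keyword]


print(hyphenate("ice cream"))  # ice-cream
print(expand("ice cream"))     # ['ice cream', 'ice', 'cream']
```

With option 1 the user's query would presumably need the same rewriting before it is sent, since a query of ice cream would still be split into two words that do not equal "ice-cream".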

Is there an industry-standard solution for this type of problem? Keep
in mind that the other text fields may also include these terms. E.g.
title="This is ice cream", which would match the query. This specific
problem affects the keywords field for the obvious reason that the indexing
pipeline does not tokenize keywords.

Thank you for all your amazing help,

Regards,

Ash

--
P.S. We've launched a new blog to share the latest ideas and case studies
from our team. Check it out here: product.canva.com
<https://product.canva.com/>.
<https://www.canva.com/> Empowering the world to design
Also, we're hiring. Apply here! <https://about.canva.com/careers/>
<https://twitter.com/canva> <https://facebook.com/canva>
<https://au.linkedin.com/company/canva> <https://instagram.com/canva>

Re: Dealing with multi-word keywords and SOW=true

Erick Erickson
Have you tried taking your keywords field out of the "qf" param and adding it explicitly, as keywords:"ice cream"?

Best,
Erick
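
One way the suggestion above might be wired into a request, as a minimal sketch. The helper name `build_params`, the choice of `title description` for qf, and the decision to append the fielded phrase to the user's text are all assumptions for illustration; how the clauses combine will depend on settings such as mm and q.op, which the thread does not discuss.

```python
from urllib.parse import urlencode


def build_params(user_query: str) -> dict:
    """Build edismax params with 'keywords' queried as an explicit phrase."""
    # Escape backslashes and double quotes so the text stays inside the phrase.
    phrase = user_query.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "defType": "edismax",
        # The free-text part is searched via qf; keywords gets the whole phrase.
        "q": f'{user_query} keywords:"{phrase}"',
        # Assumed text fields; 'keywords' is deliberately left out of qf.
        "qf": "title description",
    }


params = build_params("ice cream")
print(params["q"])        # ice cream keywords:"ice cream"
print(urlencode(params))  # URL-encoded string for the request
```

Because the phrase is quoted, the two words reach the field's analyzer together instead of being split first. The free-text part of `q` is passed through unescaped here, so real user input would need more care.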

> On Sep 30, 2019, at 5:27 AM, Ashwin Ramesh <[hidden email]> wrote:

Re: Dealing with multi-word keywords and SOW=true

Ashwin Ramesh
Thanks Erick, that seems to work!

Should I leave it in qf also? For example the query "blue dog" may be
represented as separate tokens in the keyword index.



On Mon, Sep 30, 2019 at 9:32 PM Erick Erickson <[hidden email]>
wrote:

Re: Dealing with multi-word keywords and SOW=true

Erick Erickson
You should not leave it in the qf parameter. You're getting confused by the difference between query _parsing_ and the analysis chain. The parsing turns your top-level query of "ice cream" (assuming it is sent without quotes) into something like

f1:ice f1:cream f2:ice f2:cream

This happens well before analysis takes over. What you need is for "ice" and "cream" to be passed to the analysis chain as a single unit, and if you rely on the qf parameter that won't happen.

Best,
Erick
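
The parsing step described above can be imitated with a few lines of Python. This is only a toy reproduction of the flat form written in the message, with a hypothetical function name `expand_across_fields`; Solr's actual parsed query is more structured than a flat string.

```python
from typing import List


def expand_across_fields(query: str, fields: List[str]) -> List[str]:
    """Pair every whitespace-separated word with every field, field by field."""
    return [f"{field}:{word}" for field in fields for word in query.split()]


clauses = expand_across_fields("ice cream", ["f1", "f2"])
print(" ".join(clauses))  # f1:ice f1:cream f2:ice f2:cream
```

Each field only ever sees one word at a time, so a field whose analyzer expects the whole string "ice cream" never gets the chance to produce that single token.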

> On Sep 30, 2019, at 7:24 PM, Ashwin Ramesh <[hidden email]> wrote: