Use of Payloads

Use of Payloads

Alan Woodward-3
Hi all,

The new intervals queries are now nearly at feature parity with Spans; the implementations still outstanding are all to do with using payloads.  Currently, span queries allow you to filter out spans based on the payloads of the matching terms, and also allow you to modify the score of the query as a whole based on those payloads.  I’d like to get some idea of how people are actually using these functions.

In terms of filtering, adding an IntervalsSource that wraps a simple term and filters it out based on the payload will be simple enough.  Adding this for compound intervals is more complicated, and I think trickier to reason about, so I’d like to avoid doing this if possible - feedback on actual use-cases would be helpful here.

For scoring, intervals use a completely different scoring mechanism to Spans, just returning a scaled score between 0 and [boost].  To include term weighting as well, users should combine the Intervals query with a boolean query consisting of all terms used in the IntervalsSource.  This doesn’t mix so well with payloads, but an alternative option here could be to add a PayloadTermQuery that can adjust the term frequency of a term on a particular document via a payload function.
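
A minimal sketch of the payload-function idea, in plain Java rather than against the Lucene API. The class, the method names and the "first byte is a weight in tenths" payload layout are all assumptions made for illustration; no such PayloadTermQuery is described in detail here.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: not a Lucene class. Payloads are modelled as raw byte arrays,
// one entry per position of the term in a single document.
final class PayloadTermFreqSketch {

    // Sums a weight decoded from each position's payload. A position without
    // a payload (null) counts as 1.0, like an ordinary occurrence.
    static double adjustedFreq(List<byte[]> payloadsPerPosition,
                               ToDoubleFunction<byte[]> payloadFunction) {
        double freq = 0.0;
        for (byte[] payload : payloadsPerPosition) {
            freq += (payload == null) ? 1.0 : payloadFunction.applyAsDouble(payload);
        }
        return freq;
    }

    // Example payload function (an assumption): the first byte is a weight in tenths.
    static double firstByteAsTenths(byte[] payload) {
        return (payload[0] & 0xFF) / 10.0;
    }
}
```

The adjusted frequency would then be fed to the similarity in place of the raw term frequency, so the term weighting itself stays unchanged.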

What do people think?  Are there cases that I’ve missed, or other possible uses here?

- Alan
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Use of Payloads

david.w.smiley@gmail.com
Hi Alan,

I've built custom SpanQuery derivatives that filter matches based on information encoded in the payload.  Section/Page/Paragraph/Sentence IDs can be put here, and the case of the original surface form can be encoded here as well.  I'm sure others can come up with creative uses.  My uses of this only needed to occur at the SpanTermQuery level, so there were no aggregation concerns.
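
As an illustration of this kind of encoding, here is a small sketch in plain Java. The 5-byte layout (a 4-byte sentence ID followed by one case flag) and all names are assumptions chosen for the example, not the layout actually used in the message above.

```java
import java.nio.ByteBuffer;

// Hypothetical payload layout (an assumption, not from the thread):
// 4 bytes big-endian sentence ID, then 1 flag byte (1 = capitalised in the original text).
final class PositionPayloadSketch {

    static byte[] encode(int sentenceId, boolean capitalised) {
        return ByteBuffer.allocate(5)
                .putInt(sentenceId)
                .put((byte) (capitalised ? 1 : 0))
                .array();
    }

    static int sentenceId(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    static boolean capitalised(byte[] payload) {
        return payload[4] == 1;
    }

    // A per-term match filter: accept a position only if it lies in the wanted sentence.
    static boolean inSentence(byte[] payload, int wantedSentenceId) {
        return sentenceId(payload) == wantedSentenceId;
    }
}
```

Because the check only looks at one term's payload at a time, it needs no aggregation across sub-spans.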

I've certainly heard of influencing the score based on a payload, but I don't recall that I've had to do it personally.  Erik Hatcher, how about you?

~ David

On Thu, Feb 7, 2019 at 4:26 AM Alan Woodward <[hidden email]> wrote:
--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker

RE: Use of Payloads

Uwe Schindler
In reply to this post by Alan Woodward-3
Hi,

I think the main reason there are payload implementations inside Spans is that payloads are stored together with the positions in the postings. For performance reasons, back at that time, the processing of payloads was put into the span query family, because then you can score by payload and do position-based work in a single pass.

I agree that adding that to the IntervalsSource API is hard, because IntervalsSource does not know anything about payloads, so a combination of different queries won't work. And as you said, the scoring is separated.

Payloads are mostly used for scoring, but I don't remember any use case I had in the last 5 years that made use of this - it was just too slow. And term-level boosts are seldom used. In most cases people stick with document-level boosts (docvalues). Nowadays I'd also recommend FeatureField for term/keyword/category-level scoring.

One thing payloads were used for is NLP features such as word-type annotations and filtering based on them, which of course requires support in spans. But in most cases the better way to do this is to add the annotation into the term text and do simple term queries (like terms called "lucene#propernoun").
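
A rough sketch of the term-text approach, in plain Java. The helper names are hypothetical, and emitting the plain term alongside the annotated one is an assumption about how one might index; only the "lucene#propernoun" form comes from the paragraph above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: folds a word-type annotation into the indexed term text.
final class AnnotatedTermSketch {

    static final char SEPARATOR = '#';

    static String annotate(String term, String wordType) {
        return term.toLowerCase(Locale.ROOT) + SEPARATOR + wordType.toLowerCase(Locale.ROOT);
    }

    // Emits both the plain term and the annotated variant, so that plain term
    // queries and annotation-restricted term queries can both match.
    static List<String> indexedTerms(String term, String wordType) {
        List<String> terms = new ArrayList<>();
        terms.add(term.toLowerCase(Locale.ROOT));
        terms.add(annotate(term, wordType));
        return terms;
    }
}
```

In an analysis chain the two variants would typically be emitted at the same position, so positional queries keep working on either form.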

IMHO, adding a PayloadTermQuery-like type to change the term frequency based on a function of payload is fine, but can easily be modelled with FeatureField, too.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [hidden email]



Re: Use of Payloads

Michael Gibney
Hi Alan,

At the moment I'm using Payloads (exposed via TermSpans) to store positionLength in the index per-position (definitely in Uwe's category of "[because] payloads are stored together with the positions in the postings"). I'm using the positionLength for precise SpanNearQuery phrase matching with index-time synonyms/token-graphs.

I'm not sure how directly relevant positionLength would be to IntervalsSource. But more generally, I can say that I really appreciate having access to Payloads as a generic framework for implementation of experimental features that rely on per-position indexed attributes.

Michael

On Wed, Feb 13, 2019 at 3:27 AM Uwe Schindler <[hidden email]> wrote:


Re: Use of Payloads

Alan Woodward-3
Hey Michael, that’s a really interesting use case.  This should be possible using intervals as well - you could write an IntervalsSource that reads payloads and overrides end() to increment it by the encoded length.
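
The end() adjustment could be pictured roughly as follows. This is a plain-Java stand-in, not a real IntervalsSource; it assumes the payload's first byte holds the positionLength and that end() reports an inclusive position, so a token of length n ends at start + n - 1.

```java
// Hypothetical sketch: a stand-in for an interval over a single term occurrence.
// Assumption: the payload's first byte holds the token's positionLength (>= 1).
final class PositionLengthIntervalSketch {

    private final int position;
    private final byte[] payload;

    PositionLengthIntervalSketch(int position, byte[] payload) {
        this.position = position;
        this.payload = payload;
    }

    int start() {
        return position;
    }

    // A token with positionLength n covers positions start .. start + n - 1,
    // so the inclusive end is pushed out by n - 1. Without a payload the
    // token is treated as length 1 and end() equals start().
    int end() {
        int positionLength = (payload == null || payload.length == 0) ? 1 : (payload[0] & 0xFF);
        return position + Math.max(positionLength, 1) - 1;
    }
}
```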

For the filtering cases, it should be easy to add a new factory method that takes a Term and a Predicate<BytesRef> that allows you to filter out particular terms based on their payloads.
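
The filtering logic behind such a factory method might look like this. It is only a sketch under stated assumptions: byte[] stands in for Lucene's BytesRef, the parallel arrays stand in for one term's postings in one document, and the method name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: plain Java, no Lucene types.
final class PayloadFilterSketch {

    // Returns the positions whose payload passes the predicate.
    // Positions without a payload (null) are dropped.
    static List<Integer> filterPositions(int[] positions, byte[][] payloads,
                                         Predicate<byte[]> accept) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < positions.length; i++) {
            if (payloads[i] != null && accept.test(payloads[i])) {
                kept.add(positions[i]);
            }
        }
        return kept;
    }
}
```

Whether positions without a payload should be dropped or kept is a design choice; this sketch drops them.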

I’m still interested in seeing how people are using it for scoring as well, so please keep replying to the thread.

On 13 Feb 2019, at 15:21, Michael Gibney <[hidden email]> wrote:



Re: Use of Payloads

Erick Erickson
I've seen payloads used in an interesting use-case. Storing different
values in the _same_ term's payload to be used in A/B testing. I.e. I
might index
a-123|b-234|c-456
and then store it as a blob. Now when I attach "&experiment=a" to a
query, custom code extracts "123" from the payload and uses that in
scoring calculations.
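
A small sketch of how such a blob could be decoded at query time. The "name-value" entries separated by "|" follow the example above; the class and method names are hypothetical, and the UTF-8 encoding of the blob is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.OptionalInt;

// Hypothetical parser for a payload blob of the form "a-123|b-234|c-456",
// where each entry is <experiment>-<value>.
final class ExperimentPayloadSketch {

    // Returns the value stored for the given experiment, or empty if absent.
    static OptionalInt valueFor(byte[] payload, String experiment) {
        String blob = new String(payload, StandardCharsets.UTF_8);
        for (String entry : blob.split("\\|")) {
            int dash = entry.indexOf('-');
            if (dash > 0 && entry.substring(0, dash).equals(experiment)) {
                return OptionalInt.of(Integer.parseInt(entry.substring(dash + 1)));
            }
        }
        return OptionalInt.empty();
    }
}
```

With the blob above, asking for experiment "a" yields 123, which custom scoring code can then use.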

FWIW,
Erick

On Wed, Feb 13, 2019 at 8:59 AM Alan Woodward <[hidden email]> wrote:
