Lucene 7.x custom Scorer on point values


Dominik Safaric
I recently implemented a custom Query that scores documents with a custom
Scorer based on long point values. The associated field is multi-valued and
has doc values enabled. To retrieve the multi-valued longs I call
LeafReader.document() inside the Scorer implementation. However, that call
has to be made for every matching document, which may degrade performance.

Hence my question is, what would be the most efficient implementation of a
custom Scorer that computes scores based on the value of a multi valued
long points field?

Thanks in advance,
Dominik

RE: Lucene 7.x custom Scorer on point values

Uwe Schindler
Hi,

You need to index the values as numeric docvalues. Just add another field of type numeric docvalues, with the same or a different name, and use the LeafReader's docvalues accessors to fetch the values. But doing all of that by hand is far more work than necessary. You can create function queries without hassle using the function queries package. Or, much better, use the Lucene expressions module: it lets you write the scoring formula as a JavaScript expression and use any docvalues field in your documents to calculate the final score.

In both cases there is no need to create a custom scorer, and everything works efficiently. Writing your own scorer just for this is far too complicated and not recommended. It leads to usage errors like the ones you have run into: slow stored fields, misuse of the docvalues APIs (those are iterators, too), or other problems.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [hidden email]


Re: Lucene 7.x custom Scorer on point values

Dominik Safaric
Thanks Uwe for the clarification.

The values are already indexed as numeric docvalues, i.e. as numeric points
plus docvalues. Either way, whether I implement a custom scorer or a function
query, I need to access the values for the matched (hit) documents. How can I
obtain them given a DocIdSetIterator (the subset of hit document IDs) and a
LeafReaderContext? Calling getSortedNumericDocValues("field") does return the
longs in question, but they come back sorted by Long.compare, whereas in my
case the original order of the values within the field matters.

Kind regards,
Dominik


RE: Lucene 7.x custom Scorer on point values

Uwe Schindler
Hi,

If you have multiple docvalues for the same field in the same document, the order is undefined. The original order is not preserved, sorry. How many values per document do you have? If it’s a fixed or small number, I'd go with single-valued fields.

If you really need multi-valued docvalues whose order is preserved, you can use a binary docvalues field instead and encode your values into it. But that is much more expensive to use during scoring (decoding overhead, ...).
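
As a rough illustration of the binary encoding idea, here is a minimal JDK-only sketch that packs the longs into a fixed-width byte array in their original order and reads them back. The class and method names (OrderedLongCodec, encode, decode) are made up for this example and are not part of Lucene. The assumption is that the resulting byte array would be wrapped in a BytesRef and indexed in a BinaryDocValuesField, and that decode would be called with the bytes, offset and length of the BytesRef read back at scoring time.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not a Lucene class: keeps the values in insertion order.
class OrderedLongCodec {

    /** Packs the values into 8 bytes each (big-endian), preserving their order. */
    public static byte[] encode(long[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long value : values) {
            buffer.putLong(value);
        }
        return buffer.array();
    }

    /** Reads the values back in the order they were written. */
    public static long[] decode(byte[] bytes, int offset, int length) {
        if (length % Long.BYTES != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes, offset, length);
        long[] values = new long[length / Long.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getLong();
        }
        return values;
    }

    public static void main(String[] args) {
        long[] original = {42L, -7L, 13L, 42L};
        byte[] encoded = encode(original);
        long[] decoded = decode(encoded, 0, encoded.length);
        // Order and duplicates survive the round trip: [42, -7, 13, 42]
        System.out.println(Arrays.toString(decoded));
    }
}
```

With 47 longs per document this comes to 47*8 = 376 bytes per document, and every scored hit pays for the decoding loop, which is the overhead mentioned above.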

Uwe


Re: Lucene 7.x custom Scorer on point values

Dominik Safaric
The number of values per document per field is equal to 47.

Unfortunately, binary fields are not an option because a binary field
is not searchable. However, a keyword field that holds the array of long
values as a hex-encoded byte array, which I would later read back as
binary data, might do the trick. Before I try that, could
you please explain how keyword fields are stored within Lucene? I'm asking
because I haven't found any information about it online.

Thanks,
Dominik


RE: Lucene 7.x custom Scorer on point values

Uwe Schindler
Hi,

I was talking about a purely binary DocValues field. It does not need to be searchable or stored. It is a completely separate field that holds the values in their original order in binary form (e.g. 47*4 bytes if they are ints or floats), just for scoring. DocValues fields other than numeric are binary by default!

But for _exactly_ 47 values I'd use 47 separate numeric docvalues-only fields like "value01", "value02", "value03". The searchable field stays multi-valued and is just "value". However, reading 47 numeric fields at scoring time is a lot. Is there no way to combine all those values into fewer fields, used solely for scoring (e.g. 2 values, such as a linear factor and a quadratic factor, or whatever)? It's hard to imagine that you need all the values while scoring!
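
To make the per-position field idea concrete, here is a small JDK-only sketch that maps each value to a field name derived from its position. PositionalFields, fieldName and toFields are hypothetical names for this example. The assumption is that each map entry would then be added to the document as a single-valued NumericDocValuesField, next to the multi-valued searchable "value" field.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class PositionalFields {

    /** Field name for a 1-based position, e.g. 1 -> "value01", 47 -> "value47". */
    public static String fieldName(int position) {
        return String.format(Locale.ROOT, "value%02d", position);
    }

    /** Maps every value to the field name of its position, in the original order. */
    public static Map<String, Long> toFields(long[] values) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            fields.put(fieldName(i + 1), values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints {value01=9, value02=3, value03=5}
        System.out.println(toFields(new long[] {9L, 3L, 5L}));
    }
}
```

Because the position is carried by the field name, each field holds a single value per document and the original order no longer depends on how multi-valued docvalues are stored.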

Uwe
