additional term meta data


John Wang-9
Hi folks:

We would like to propose a feature that adds additional per-term metadata to the term dictionary.

Currently, the TermsEnum API exposes only term statistics (docFreq and totalTermFreq) as per-term metadata. We needed a way to quickly get the first and last doc id in the postings without having to scan through the entire postings list.
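
To make the cost concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class and method names (FirstLastDocIdSketch, TermMeta, deltaEncode, scan, metaAtWriteTime) are hypothetical, and it assumes postings are stored as gaps between sorted doc ids, which is how postings are typically delta-coded. Under that assumption the last doc id can only be recovered by decoding the whole list, unless it is recorded next to docFreq at write time.

```java
/** Toy model of per-term metadata; NOT Lucene's API. All names are hypothetical. */
final class FirstLastDocIdSketch {

    /** Per-term metadata: docFreq plus the first and last doc id (-1 when the term has no docs). */
    static final class TermMeta {
        final int docFreq;
        final int firstDocId;
        final int lastDocId;

        TermMeta(int docFreq, int firstDocId, int lastDocId) {
            this.docFreq = docFreq;
            this.firstDocId = firstDocId;
            this.lastDocId = lastDocId;
        }
    }

    /** Encodes sorted doc ids as gaps from the previous doc id. */
    static int[] deltaEncode(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return deltas;
    }

    /** Baseline: recover first and last doc id by decoding every gap, O(docFreq). */
    static TermMeta scan(int[] deltas) {
        int first = -1;
        int docId = 0;
        for (int i = 0; i < deltas.length; i++) {
            docId += deltas[i];
            if (i == 0) {
                first = docId;
            }
        }
        int last = deltas.length == 0 ? -1 : docId;
        return new TermMeta(deltas.length, first, last);
    }

    /** Idea from the thread: record first and last while writing, so readers never decode the list. */
    static TermMeta metaAtWriteTime(int[] sortedDocIds) {
        int n = sortedDocIds.length;
        if (n == 0) {
            return new TermMeta(0, -1, -1);
        }
        return new TermMeta(n, sortedDocIds[0], sortedDocIds[n - 1]);
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 42, 1000};
        TermMeta scanned = scan(deltaEncode(docIds));
        TermMeta stored = metaAtWriteTime(docIds);
        System.out.println("scanned: " + scanned.firstDocId + ".." + scanned.lastDocId);
        System.out.println("stored:  " + stored.firstDocId + ".." + stored.lastDocId);
    }
}
```

Both paths yield the same values; the difference is that the stored variant costs a constant-time read per term, at the price of extra bytes in the term dictionary.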

We have created a PR on our own fork and would like to contribute it back to the community. Please let us know whether this is useful and/or fits Lucene's roadmap; if so, we would be happy to submit a patch.


Thank you

-John
Re: additional term meta data

Martin Gainty
How do we access the first and last doc id?
Which Lucene version will you be targeting for your merge?

Request: please submit a test case to show proper operation.

Thanks John!
martin-


From: John Wang <[hidden email]>
Sent: Tuesday, January 5, 2021 8:19 PM
To: [hidden email] <[hidden email]>
Subject: additional term meta data
 
Hi folks:

We would like to propose a feature that adds additional per-term metadata to the term dictionary.

Currently, the TermsEnum API exposes only term statistics (docFreq and totalTermFreq) as per-term metadata. We needed a way to quickly get the first and last doc id in the postings without having to scan through the entire postings list.

We have created a PR on our own fork and would like to contribute it back to the community. Please let us know whether this is useful and/or fits Lucene's roadmap; if so, we would be happy to submit a patch.


Thank you

-John
Re: additional term meta data

Martin Gainty
In reply to this post by John Wang-9
How do we access the first and last doc id?
Which version will you be merging into?


Re: additional term meta data

John Wang-9
Hey Martin:

There is a test case in the PR we created on our own fork: https://github.com/dashbase/lucene-solr/pull/1. The PR description also contains example code showing how to access the first and last doc ids.

I am not sure which version this should be applied to; currently it is based on master as of a few days ago. We intend to patch 8.7 for our own environment.

Any advice or feedback is much appreciated.

Thank you!

-John

On Wed, Jan 6, 2021 at 3:28 AM Martin Gainty <[hidden email]> wrote:
how to access first and last?
which version will you be merging 


Re: additional term meta data

Martin Gainty
It appears you are targeting 9.0 for your code:
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java
(Lucene90FieldInfosFormat.java is not contained in either the 8.4 or the 8.7 distro)

<RANT>
someone had the bright idea to nuke ant 8.x build.xml without consulting anyone
not a fan of ant but the execution model of gradle is woefully inflexible in comparison to maven
</RANT>

I will try the 9.0 distro to get codecs/lucene90/Lucene90FieldInfosFormat, recompile, and hopefully your TestLucene84PostingsFormat will run without failures or errors.

Thx
martin-


From: John Wang <[hidden email]>
Sent: Wednesday, January 6, 2021 10:15 AM
To: [hidden email] <[hidden email]>
Subject: Re: additional term meta data
 
Hey Martin:

There is a test case in the PR we created on our own fork: https://github.com/dashbase/lucene-solr/pull/1, which also contains some example code on how to access in the PR description.


I am not sure which version this should be applied to, currently, it was based on master as of a few days ago. We intend to patch 8.7 for our own environment.

Any advice or feedback is much appreciated.

Thank you!

-John

Re: additional term meta data

John Wang-9
Thank you, Martin!

You can apply the patch to the 8.7 build by just ignoring the changes to Lucene90xxx. Appreciate the help and guidance!

-John


On Wed, Jan 6, 2021 at 10:36 AM Martin Gainty <[hidden email]> wrote:
It appears you are targeting 9.0 for your code:
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java
(Lucene90FieldInfosFormat.java is not contained in either the 8.4 or the 8.7 distro)

<RANT>
someone had the bright idea to nuke ant 8.x build.xml without consulting anyone
not a fan of ant but the execution model of gradle is woefully inflexible in comparison to maven
</RANT>

i will try with 90 distro to get the codecs/lucene90/Lucene90FieldInfosFormat and recompile and hopefully your TestLucene84PostingsFormat will run w/o fail or error

Thx
martin-


Re: additional term meta data

Simon Willnauer-4
John, can you explain what the use case for such a new API is? I don't
see a user of the API in your code. Is there a query you can optimize
with this, or what is the reasoning behind this change? I personally
think it's quite invasive to add this information, and there must be a
good reason to add it to the TermsEnum. I also don't think we should
have an option on the field for this if we add it, but if we don't do
that it's quite a heavy change, so I am on the fence about whether we
should even consider this.
I wonder if you can use the TermsEnum#attributes() API (which returns
an AttributeSource) instead and add this as a dedicated attribute that
is present if the info is stored. That way you can build your own
PostingsFormat that stores this information.
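
The "attribute that is present only if the info is stored" pattern can be sketched as follows. This is plain Java and deliberately does not use Lucene classes: AttributeBag is a simplified, hypothetical stand-in for Lucene's AttributeSource, and DocIdRangeAttribute is an invented name for the extra metadata. The point it illustrates is that consumers probe for the attribute and fall back when a postings format did not store it, so the core TermsEnum contract stays unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of optional per-term attributes; NOT Lucene's API. All names are hypothetical. */
final class OptionalTermAttributeSketch {

    /** Hypothetical attribute carrying the first and last doc id of the current term. */
    static final class DocIdRangeAttribute {
        final int firstDocId;
        final int lastDocId;

        DocIdRangeAttribute(int firstDocId, int lastDocId) {
            this.firstDocId = firstDocId;
            this.lastDocId = lastDocId;
        }
    }

    /** Simplified stand-in for an attribute source: a bag of attributes keyed by their type. */
    static final class AttributeBag {
        private final Map<Class<?>, Object> attributes = new HashMap<>();

        <T> void put(Class<T> type, T value) {
            attributes.put(type, value);
        }

        <T> Optional<T> get(Class<T> type) {
            // Class.cast(null) returns null, so a missing attribute becomes Optional.empty().
            return Optional.ofNullable(type.cast(attributes.get(type)));
        }
    }

    /** Consumer side: use the range when the custom format stored it, otherwise fall back. */
    static String describe(AttributeBag termAttributes) {
        return termAttributes.get(DocIdRangeAttribute.class)
                .map(range -> "range " + range.firstDocId + ".." + range.lastDocId)
                .orElse("no range stored; fall back to scanning postings");
    }

    public static void main(String[] args) {
        AttributeBag defaultFormat = new AttributeBag();
        AttributeBag customFormat = new AttributeBag();
        customFormat.put(DocIdRangeAttribute.class, new DocIdRangeAttribute(3, 1000));

        System.out.println(describe(defaultFormat));
        System.out.println(describe(customFormat));
    }
}
```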

simon

On Wed, Jan 6, 2021 at 8:06 PM John Wang <[hidden email]> wrote:

>
> Thank you, Martin!
>
> You can apply the patch to the 8.7 build by just ignoring the changes to Lucene90xxx. Appreciate the help and guidance!
>
> -John
>
>
> On Wed, Jan 6, 2021 at 10:36 AM Martin Gainty <[hidden email]> wrote:
>>
>> appears you are targeting 9.0 for your code
>> lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java
>> (Lucene90FIeldInfosFormat.java is not contained in either 8.4 or 8.7 distros)
>>
>> <RANT>
>> someone had the bright idea to nuke ant 8.x build.xml without consulting anyone
>> not a fan of ant but the execution model of gradle is woefully inflexible in comparison to maven
>> </RANT>
>>
>> i will try with 90 distro to get the codecs/lucene90/Lucene90FieldInfosFormat and recompile and hopefully your TestLucene84PostingsFormat will run w/o fail or error
>>
>> Thx
>> martin-
>>
>> ________________________________
>> From: John Wang <[hidden email]>
>> Sent: Wednesday, January 6, 2021 10:15 AM
>> To: [hidden email] <[hidden email]>
>> Subject: Re: additional term meta data
>>
>> Hey Martin:
>>
>> There is a test case in the PR we created on our own fork: https://github.com/dashbase/lucene-solr/pull/1, which also contains some example code on how to access in the PR description.
>>
>> Here is the link to the beginning of the tests: https://github.com/dashbase/lucene-solr/blob/posting-last-docid/lucene/core/src/test/org/apache/lucene/codecs/lucene84/TestLucene84PostingsFormat.java#L142
>>
>> I am not sure which version this should be applied to, currently, it was based on master as of a few days ago. We intend to patch 8.7 for our own environment.
>>
>> Any advice or feedback is much appreciated.
>>
>> Thank you!
>>
>> -John
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: additional term meta data

John Wang-9
Hi Simon:

This might be specific to us; it makes sense not to make such core changes if they are not needed.

Here is our use case anyway:

We first sort the index in time order, so docids can be used as a proxy for time.

In the VoIP world, we are using Lucene to stitch call flows, which is similar to the APM/tracing use case. To get the range of a transaction efficiently, the first and last docids are enough, with no need to traverse the postings list.
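
As a rough illustration of that use case, the sketch below is plain Java rather than Lucene code. It assumes the index is sorted by time, so a hypothetical timestampByDocId array is non-decreasing, and that the first and last doc ids for a call identifier are already known; the class and method names are made up for the example.

```java
/** Toy model of mapping a term's doc id range to a time range; NOT Lucene's API. Names are hypothetical. */
final class CallFlowRangeSketch {

    /**
     * Returns {startTime, endTime} for a term, given its first and last doc id.
     * Relies on the index being sorted by time, so doc id order matches timestamp order.
     */
    static long[] timeRange(long[] timestampByDocId, int firstDocId, int lastDocId) {
        if (firstDocId < 0 || lastDocId < firstDocId || lastDocId >= timestampByDocId.length) {
            throw new IllegalArgumentException("invalid doc id range: " + firstDocId + ".." + lastDocId);
        }
        return new long[] {timestampByDocId[firstDocId], timestampByDocId[lastDocId]};
    }

    public static void main(String[] args) {
        // Doc ids 0..5 in time order; the values are arbitrary example timestamps.
        long[] timestampByDocId = {100L, 105L, 110L, 120L, 130L, 150L};

        // Suppose one call id occurs in docs 1, 3 and 4: only the endpoints matter for the range.
        long[] range = timeRange(timestampByDocId, 1, 4);
        System.out.println("call spans " + range[0] + " to " + range[1]);
    }
}
```

Only two lookups are needed per call, regardless of how many documents the call touches.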

It would be ideal for us not to have to modify Lucene, and it would be great to understand how the AttributeSource approach helps in this case. Let me spend some time learning about it.

Thank you for the suggestion!

-John




On Fri, Jan 8, 2021 at 11:19 PM Simon Willnauer <[hidden email]> wrote:
John, can you explain what the use case for such a new API is? I don't
see a user of the API in your code. Is there a query you can optimize
with this, or what is the reasoning behind this change? I personally
think it's quite invasive to add this information, and there must be a
good reason to add it to the TermsEnum. I also don't think we should
have an option on the field for this if we add it, but if we don't do
that it's quite a heavy change, so I am on the fence about whether we
should even consider this.
I wonder if you can use the TermsEnum#attributes() API (which returns
an AttributeSource) instead and add this as a dedicated attribute that
is present if the info is stored. That way you can build your own
PostingsFormat that stores this information.

simon


Re: additional term meta data

Martin Gainty
In reply to this post by Martin Gainty
I am close to finishing testing, but I need help finding this test case:
RamUsageTester

Any ideas?

Thanks John!
martin


From: Martin Gainty <[hidden email]>
Sent: Wednesday, January 6, 2021 6:28 AM
To: [hidden email] <[hidden email]>
Subject: Re: additional term meta data
 
how to access first and last?
which version will you be merging 

