RFC: N-2 compatibility for file formats

RFC: N-2 compatibility for file formats

Simon Willnauer-4
Hello all,

Currently Lucene supports reading and writing indices that have been
created with the current or previous (N-1) version of Lucene. Lucene
refuses to open an index created by N-2 or earlier versions.
I would like to propose that Lucene adds support for opening indices
created by version N-2 in read-only mode. Here's what I have in mind:

- Read-only support. You can open a reader on an index created by
version N-2, but you cannot open an IndexWriter on it, meaning that
you cannot delete, update, or add documents, nor force-merge N-2
indices.

- File-format compatibility only. File-format compatibility enables
reading the content of old indices, but nothing more. Anything built
on top of file formats, such as analysis or the encoding of length
normalization factors, is not guaranteed and is supported only on a
best-effort basis.
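
The two limits above amount to a simple version gate. The following is a minimal sketch in plain Java, not Lucene's actual implementation: the `CompatGate` class, the `Access` enum, and the `accessFor` method are hypothetical names used only for illustration, and the rejection of indices newer than the running library is an assumption added for completeness.

```java
/** Hypothetical sketch of the proposed policy; none of these names exist in Lucene. */
final class CompatGate {
    /** What the proposal would allow for an index, based on the major version that created it. */
    enum Access { READ_WRITE, READ_ONLY, UNSUPPORTED }

    /**
     * @param currentMajor major version N of the running library
     * @param createdMajor major version that created the index
     */
    static Access accessFor(int currentMajor, int createdMajor) {
        if (createdMajor > currentMajor) {
            return Access.UNSUPPORTED;  // assumption: an index from a newer release is rejected
        }
        int age = currentMajor - createdMajor;
        if (age <= 1) {
            return Access.READ_WRITE;   // N and N-1: readers and IndexWriter both allowed
        }
        if (age == 2) {
            return Access.READ_ONLY;    // N-2: readers only, no deletes, updates, adds or force-merge
        }
        return Access.UNSUPPORTED;      // N-3 and older stay rejected
    }
}
```

With N = 9, `CompatGate.accessFor(9, 7)` yields `READ_ONLY`, while `CompatGate.accessFor(9, 6)` is still `UNSUPPORTED`.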

The reason I came up with these limitations is that I wanted to
make the scope minimal in order to retain Lucene's ability to move
forward. If there is consensus to move forward with this, I would like
to target Lucene 9.0 with this change.

Simon

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: RFC: N-2 compatibility for file formats

Ishan Chattopadhyaya
Sounds great, +1


Re: RFC: N-2 compatibility for file formats

David Smiley
+1 -- Lucene should not _prevent_ this.

I forget where things stood in the past conversations about this subject... I think it was most recently raised by Erick Erickson.  I recall that we don't want to maintain the code to read older indices, which I sympathize with, but I also recall there is code that actively *blocks* you (the end user) from reading N-2, which I think goes too far: it _forces_ you to fork Lucene to work around it.  At least a user should be able to read indices however far back they like if they maintain their own codecs (as I do at work).

~ David Smiley
Apache Lucene/Solr Search Developer



Re: RFC: N-2 compatibility for file formats

Michael Sokolov-4
In reply to this post by Simon Willnauer-4
In practice what would this mean? We relax the restriction that David
mentions, and we keep old codecs around in backwards-codecs for two
major releases instead of one? Are there other implications? Suppose
we had a Query that relied on a specific index format, which gets
retired. We keep the index format code around - do we also need to
remember to maintain the old Query?

-Mike


Re: RFC: N-2 compatibility for file formats

Dawid Weiss-2

I see a more difficult problem in the opposite direction: say, a new Query that requires something from
the index that older indexes (codecs) don't have. Running such a query would then result in, I assume,
an exception? Things get awkward when you have existing systems that wish to upgrade gradually,
so that some segments are in older codecs and newer segments are in newer codecs.

But in general I'm quite ok with keeping N-2 compatibility if it's not too much trouble.

D.


Re: RFC: N-2 compatibility for file formats

Yonik Seeley
In reply to this post by Simon Willnauer-4
On Wed, Jan 6, 2021 at 4:40 AM Simon Willnauer <[hidden email]> wrote:
> You can open a reader on an index created by
> version N-2, but you cannot open an IndexWriter on it

+1
There should definitely be more consideration given to back compat in general... it's caused a ton of pain to users over time.

-Yonik

 

Re: RFC: N-2 compatibility for file formats

jim ferenczi
The proposal is only about keeping the ability to read file formats back to N-2. Everything that is done on top of the file format is not guaranteed and should be supported on a best-effort basis.
That's an important aspect if we don't want to block innovation. So in practice it means that queries that require a specific file format, or analyzers that change behavior between major versions, would not be part of the extended guarantee.



Re: RFC: N-2 compatibility for file formats

Simon Willnauer-4
I can provide some examples of BWC issues and what we would do if they
happened in the future:

- negative offsets: in this case it would be best effort to add a
wrapper around the older formats that checks on the read side whether
the offsets go backwards and throws an exception, to protect consumers
that assume offsets only go forward from failing or going OOM etc.
- norms encoding: in this case it would be best effort in the older
norms formats to convert to the newer encodings.
- the removal of numeric field queries would not fall under the
promises we make with N-2 compatibility, and it would be the
responsibility of the user to keep around the code that understands
the value of a field.
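
The first example, a read-side wrapper that rejects backward offsets, could look roughly like the following. This is a self-contained sketch, not Lucene code: the `OffsetSource` interface and the `ForwardOnlyOffsets` class are hypothetical stand-ins for whatever an older postings format would expose, and the choice of `IllegalStateException` is an assumption.

```java
/** Hypothetical stand-in for an old-format reader that yields start offsets one at a time. */
interface OffsetSource {
    boolean hasNext();

    int nextStartOffset();

    /** Array-backed source, handy for trying the wrapper out. */
    static OffsetSource of(int... offsets) {
        return new OffsetSource() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < offsets.length;
            }

            @Override
            public int nextStartOffset() {
                return offsets[next++];
            }
        };
    }
}

/** Wrapper that fails fast when an offset is negative or smaller than the previous one. */
final class ForwardOnlyOffsets implements OffsetSource {
    private final OffsetSource delegate;
    private int lastOffset = 0; // starting at 0 also rejects negative offsets

    ForwardOnlyOffsets(OffsetSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public int nextStartOffset() {
        int offset = delegate.nextStartOffset();
        if (offset < lastOffset) {
            throw new IllegalStateException(
                "offset " + offset + " goes backwards (previous was " + lastOffset + ")");
        }
        lastOffset = offset;
        return offset;
    }
}
```

Wrapping `OffsetSource.of(0, 5, 3)` returns 0 and 5 and then throws on the third call, so a consumer sees an explicit error instead of silently working with a backward offset.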

I hope this clarifies some of the aspects?

We would only do all this on the read side; for writing, we would
reject indices that are older than N-1.

simon



Re: RFC: N-2 compatibility for file formats

Michael McCandless-2
+1, I like the idea in general.

We will have to work out the details in practice as we come across "index breaking" changes, and where/how to draw the line of "best effort".  But I think this is an improvement for our users over the hard check we now have for "only N-1", and likely not so much development effort?

I think where it might get interesting is if we want to make a Codec API change, maybe to optimize an interesting use case, and then we must do some development to fix the N-2 BWC codec (as well as the N-1 BWC codec that we already must fix in such a case today).

Some users seem to keep their indices alive for a very long time!


Re: RFC: N-2 compatibility for file formats

Adrien Grand
+1, this strikes me as a good balance between increasing backward compatibility guarantees and still keeping room for innovation.

David, actually I would like to advocate in favor of still disallowing opening N-2 indices by default, as they might not match Lucene's current expectations (e.g. using a different encoding for norms due to LUCENE-7730), and using Lucene's current analyzers/similarities/queries might trigger surprising behavior. My preference would be to expose the ability to open N-2 indices behind an expert API/flag that documents limitations with N-2 indices.
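
One way to picture the "disallowed by default, allowed behind an expert flag" idea is the sketch below. It is purely illustrative: `ExpertOpen`, `checkReadable`, and the `allowN2` parameter are hypothetical names, not an existing Lucene API, and indices newer than the running library are deliberately left out of scope.

```java
/** Hypothetical sketch of an opt-in read gate; not an existing Lucene API. */
final class ExpertOpen {
    /**
     * Returns normally if an index created by createdMajor may be opened for reading
     * and throws otherwise. N-2 indices are accepted only when allowN2 is set.
     */
    static void checkReadable(int currentMajor, int createdMajor, boolean allowN2) {
        // Default floor is N-1; the expert flag lowers it by one major version.
        int oldestAllowed = allowN2 ? currentMajor - 2 : currentMajor - 1;
        if (createdMajor < oldestAllowed) {
            throw new IllegalArgumentException(
                "index created by major version " + createdMajor
                    + " is older than the oldest readable version " + oldestAllowed);
        }
    }
}
```

Keeping the default at N-1 means existing callers see no change in behavior, and anyone who passes the flag has had a chance to read the documented limitations first.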

Mike, I wondered about this question too. As you pointed out, I think that we will generally be ok given that the N-2 compatibility layer will very likely be the same as the N-1 compatibility layer that we need to develop anyway. I tried to think of examples where that wouldn't work but couldn't find any (which doesn't mean that there are none, but hopefully they would be rare).



--
Adrien

Re: RFC: N-2 compatibility for file formats

Simon Willnauer-4
thanks for all the feedback, I opened
https://issues.apache.org/jira/browse/LUCENE-9669 to address this
further.
