Sequence IDs for NRT deletes

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Sequence IDs for NRT deletes

Jason Rutherglen
Michael B and I have been discussing the per segment doc writers
and RT patches/branch. A small improvement we can add to trunk
from this is the sequence IDs for deletes, which would improve
the existing NRT system by avoiding the cloning of bit vectors.
Implementing segment deleted docs via sequence IDs would
additionally provide a path way for the future RT branch merge
into trunk. It could be best to break up the RT patches as much
as possible as they touch on many parts of the Lucene
IndexWriter subsystem.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Michael McCandless-2
Breaking up RT patches into baby steps would be great :)  Actually is
the RT branch active (I haven't seen commits going in).

Eg, is per-segment DocWriter separable from the RT changes (seems like
it should/could be)?

The biggest downside of sequence IDs is increase RAM usage right?  Ie,
today each deletion takes 1 bit, but with sequence IDs it's 32X bigger
(an int), I think?  Are there other downsides?

Then, checking if a doc is deleted becomes an int compare instead of a
bit lookup, right?  And, we don't have to clone the deletions during
reopen.

So this is an appropriate tradeoff for apps that need to reopen after
every change to the index.  But for apps reopening less often (eg
maybe up to 10X per second), this may not be a good tradeoff (ie they
are willing to spend more time in the reopen if it reduces RAM
footprint).  Maybe the deletes impl should be pluggable and apps can
pick...

Mike

On Tue, Jul 20, 2010 at 12:33 PM, Jason Rutherglen
<[hidden email]> wrote:

> Michael B and I have been discussing the per segment doc writers
> and RT patches/branch. A small improvement we can add to trunk
> from this is the sequence IDs for deletes, which would improve
> the existing NRT system by avoiding the cloning of bit vectors.
> Implementing segment deleted docs via sequence IDs would
> additionally provide a path way for the future RT branch merge
> into trunk. It could be best to break up the RT patches as much
> as possible as they touch on many parts of the Lucene
> IndexWriter subsystem.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Sequence IDs for NRT deletes

Uwe Schindler
> The biggest downside of sequence IDs is increase RAM usage right?  Ie,
today
> each deletion takes 1 bit, but with sequence IDs it's 32X bigger (an int),
I
> think?  Are there other downsides?

It only takes this much space for in-ram deletes of in-ram segments. On disk
the deletes get bits again and also for already committed segments. That is
what Michael told me in Berlin.

> Then, checking if a doc is deleted becomes an int compare instead of a bit
> lookup, right?  And, we don't have to clone the deletions during reopen.
>
> So this is an appropriate tradeoff for apps that need to reopen after
every
> change to the index.  But for apps reopening less often (eg maybe up to
10X

> per second), this may not be a good tradeoff (ie they are willing to spend
> more time in the reopen if it reduces RAM footprint).  Maybe the deletes
> impl should be pluggable and apps can pick...
>
> Mike
>
> On Tue, Jul 20, 2010 at 12:33 PM, Jason Rutherglen
> <[hidden email]> wrote:
> > Michael B and I have been discussing the per segment doc writers and
> > RT patches/branch. A small improvement we can add to trunk from this
> > is the sequence IDs for deletes, which would improve the existing NRT
> > system by avoiding the cloning of bit vectors.
> > Implementing segment deleted docs via sequence IDs would additionally
> > provide a path way for the future RT branch merge into trunk. It could
> > be best to break up the RT patches as much as possible as they touch
> > on many parts of the Lucene IndexWriter subsystem.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email] For
> > additional commands, e-mail: [hidden email]
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email] For additional
> commands, e-mail: [hidden email]



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Jason Rutherglen
> On disk the deletes get bits again and also for already committed segments.

I was thinking we'd use sequence ids for the on disk segments, as they
were deleted from, and yes, that'd imply that the sequence id deletes
would be converted back to BVs when written to disk.  That's probably
not a big deal.

On Tue, Jul 20, 2010 at 10:09 AM, Uwe Schindler <[hidden email]> wrote:

>> The biggest downside of sequence IDs is increase RAM usage right?  Ie,
> today
>> each deletion takes 1 bit, but with sequence IDs it's 32X bigger (an int),
> I
>> think?  Are there other downsides?
>
> It only takes this much space for in-ram deletes of in-ram segments. On disk
> the deletes get bits again and also for already committed segments. That is
> what Michael told me in Berlin.
>
>> Then, checking if a doc is deleted becomes an int compare instead of a bit
>> lookup, right?  And, we don't have to clone the deletions during reopen.
>>
>> So this is an appropriate tradeoff for apps that need to reopen after
> every
>> change to the index.  But for apps reopening less often (eg maybe up to
> 10X
>> per second), this may not be a good tradeoff (ie they are willing to spend
>> more time in the reopen if it reduces RAM footprint).  Maybe the deletes
>> impl should be pluggable and apps can pick...
>>
>> Mike
>>
>> On Tue, Jul 20, 2010 at 12:33 PM, Jason Rutherglen
>> <[hidden email]> wrote:
>> > Michael B and I have been discussing the per segment doc writers and
>> > RT patches/branch. A small improvement we can add to trunk from this
>> > is the sequence IDs for deletes, which would improve the existing NRT
>> > system by avoiding the cloning of bit vectors.
>> > Implementing segment deleted docs via sequence IDs would additionally
>> > provide a path way for the future RT branch merge into trunk. It could
>> > be best to break up the RT patches as much as possible as they touch
>> > on many parts of the Lucene IndexWriter subsystem.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email] For
>> > additional commands, e-mail: [hidden email]
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email] For additional
>> commands, e-mail: [hidden email]
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Jason Rutherglen
In reply to this post by Michael McCandless-2
> Breaking up RT patches into baby steps would be great :)
> Actually is the RT branch active (I haven't seen commits going
> in).

From what we discussed, my impression is that the RT changes
will be substantial and the sequence ids seem to be something
that can be implemented now in trunk, then at least a small
piece of RT would be implemented and tested. A small isolated
improvement.

> The biggest downside of sequence IDs is increase RAM usage
> right?

Yes, however the garbage collection would decrease. We should
make seq id deletes pluggable.

> Then, checking if a doc is deleted becomes an int compare
> instead of a bit lookup, right?

Right it'd be an int compare, I think this'd be ok?

On Tue, Jul 20, 2010 at 10:03 AM, Michael McCandless
<[hidden email]> wrote:

> Breaking up RT patches into baby steps would be great :)  Actually is
> the RT branch active (I haven't seen commits going in).
>
> Eg, is per-segment DocWriter separable from the RT changes (seems like
> it should/could be)?
>
> The biggest downside of sequence IDs is increase RAM usage right?  Ie,
> today each deletion takes 1 bit, but with sequence IDs it's 32X bigger
> (an int), I think?  Are there other downsides?
>
> Then, checking if a doc is deleted becomes an int compare instead of a
> bit lookup, right?  And, we don't have to clone the deletions during
> reopen.
>
> So this is an appropriate tradeoff for apps that need to reopen after
> every change to the index.  But for apps reopening less often (eg
> maybe up to 10X per second), this may not be a good tradeoff (ie they
> are willing to spend more time in the reopen if it reduces RAM
> footprint).  Maybe the deletes impl should be pluggable and apps can
> pick...
>
> Mike
>
> On Tue, Jul 20, 2010 at 12:33 PM, Jason Rutherglen
> <[hidden email]> wrote:
>> Michael B and I have been discussing the per segment doc writers
>> and RT patches/branch. A small improvement we can add to trunk
>> from this is the sequence IDs for deletes, which would improve
>> the existing NRT system by avoiding the cloning of bit vectors.
>> Implementing segment deleted docs via sequence IDs would
>> additionally provide a path way for the future RT branch merge
>> into trunk. It could be best to break up the RT patches as much
>> as possible as they touch on many parts of the Lucene
>> IndexWriter subsystem.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Michael McCandless-2
On Tue, Jul 20, 2010 at 1:44 PM, Jason Rutherglen
<[hidden email]> wrote:
>> Breaking up RT patches into baby steps would be great :)
>> Actually is the RT branch active (I haven't seen commits going
>> in).
>
> From what we discussed, my impression is that the RT changes
> will be substantial and the sequence ids seem to be something
> that can be implemented now in trunk, then at least a small
> piece of RT would be implemented and tested. A small isolated
> improvement.

I agree.

>> The biggest downside of sequence IDs is increase RAM usage
>> right?
>
> Yes, however the garbage collection would decrease.

Right, much less GC if app frequently reopens.  But a 32X increase in
RAM usage is not trivial; I think we shouldn't enable it by default?

> We should make seq id deletes pluggable.

+1 -- make the deletes impl pluggable (ie one impl is sequence IDs;
another is BitVector).

>> Then, checking if a doc is deleted becomes an int compare
>> instead of a bit lookup, right?
>
> Right it'd be an int compare, I think this'd be ok?

Yeah maybe even more than OK; could be this is faster than looking up
a bit!  (Though it's much more traffic on the memory bus).  Apps may
want to use this even outside of NRT.  Have you tested?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Jason Rutherglen
> Right, much less GC if app frequently reopens.  But a 32X increase in
> RAM usage is not trivial; I think we shouldn't enable it by default?

Right, the RAM usage is quite high!  Is there a more compact
representation we could use?  Ah well, either way for good RT
performance, there are some users who may want to use this option.

> Have you tested?

The test would be a basic benchmark of queries against BV vs. an int[]
of deletes?

On Tue, Jul 20, 2010 at 12:17 PM, Michael McCandless
<[hidden email]> wrote:

> On Tue, Jul 20, 2010 at 1:44 PM, Jason Rutherglen
> <[hidden email]> wrote:
>>> Breaking up RT patches into baby steps would be great :)
>>> Actually is the RT branch active (I haven't seen commits going
>>> in).
>>
>> From what we discussed, my impression is that the RT changes
>> will be substantial and the sequence ids seem to be something
>> that can be implemented now in trunk, then at least a small
>> piece of RT would be implemented and tested. A small isolated
>> improvement.
>
> I agree.
>
>>> The biggest downside of sequence IDs is increase RAM usage
>>> right?
>>
>> Yes, however the garbage collection would decrease.
>
> Right, much less GC if app frequently reopens.  But a 32X increase in
> RAM usage is not trivial; I think we shouldn't enable it by default?
>
>> We should make seq id deletes pluggable.
>
> +1 -- make the deletes impl pluggable (ie one impl is sequence IDs;
> another is BitVector).
>
>>> Then, checking if a doc is deleted becomes an int compare
>>> instead of a bit lookup, right?
>>
>> Right it'd be an int compare, I think this'd be ok?
>
> Yeah maybe even more than OK; could be this is faster than looking up
> a bit!  (Though it's much more traffic on the memory bus).  Apps may
> want to use this even outside of NRT.  Have you tested?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Michael McCandless-2
On Tue, Jul 20, 2010 at 4:21 PM, Jason Rutherglen
<[hidden email]> wrote:
>> Right, much less GC if app frequently reopens.  But a 32X increase in
>> RAM usage is not trivial; I think we shouldn't enable it by default?
>
> Right, the RAM usage is quite high!  Is there a more compact
> representation we could use?  Ah well, either way for good RT
> performance, there are some users who may want to use this option.

Well, packed ints are more compact, but the decode cost would probably
be catastrophic :)

Maybe you could also use a smaller type (byte[], short[]) for sequence
ids, but, you'd then have to handle wraparound/overflow.  (In fact
even w/ int[] you have to handle wraparound?  long[] is probably safe
:) )  EG on overflow, you'd have to allocate all new (zero'd) arrays
for the next re-opened reader?

>> Have you tested?
>
> The test would be a basic benchmark of queries against BV vs. an int[]
> of deletes?

Yes, in a normal reader  (ie, not testing NRT -- just testing cost of
applying deletes via int cmp instead of BV lookup).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Sequence IDs for NRT deletes

Jason Rutherglen
> long[] is probably safe

Yeah it's safe for most things...

> short[]

That could be a much better option for minimizing RAM usage, and then
implement wraparound.

On Wed, Jul 21, 2010 at 3:12 AM, Michael McCandless
<[hidden email]> wrote:

> On Tue, Jul 20, 2010 at 4:21 PM, Jason Rutherglen
> <[hidden email]> wrote:
>>> Right, much less GC if app frequently reopens.  But a 32X increase in
>>> RAM usage is not trivial; I think we shouldn't enable it by default?
>>
>> Right, the RAM usage is quite high!  Is there a more compact
>> representation we could use?  Ah well, either way for good RT
>> performance, there are some users who may want to use this option.
>
> Well, packed ints are more compact, but the decode cost would probably
> be catastrophic :)
>
> Maybe you could also use a smaller type (byte[], short[]) for sequence
> ids, but, you'd then have to handle wraparound/overflow.  (In fact
> even w/ int[] you have to handle wraparound?  long[] is probably safe
> :) )  EG on overflow, you'd have to allocate all new (zero'd) arrays
> for the next re-opened reader?
>
>>> Have you tested?
>>
>> The test would be a basic benchmark of queries against BV vs. an int[]
>> of deletes?
>
> Yes, in a normal reader  (ie, not testing NRT -- just testing cost of
> applying deletes via int cmp instead of BV lookup).
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

unsubscribe

Peter Bruhn Andersen
In reply to this post by Jason Rutherglen



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]