Lock-less commits

Lock-less commits

Michael McCandless-2

I think it's possible to modify Lucene's commit process so that it
does not require any commit locking at all.

This would be a big win because it would prevent all the various messy
errors (FileNotFound exceptions on instantiating an IndexReader,
Access Denied errors on renaming X.new -> X, Lock obtain timed out
from leftover lock files, etc.) that Lucene users keep coming across.

Also, indices against remote (NFS, Samba) filesystems, where current
locking has known issues that users seem to hit fairly often, would
then be fine.

I'd like to get feedback on this idea (am I missing something?) and if
there are no objections I can submit a full patch.

I have an initial implementation that passes all unit tests.  It also
runs fine with a writer/searcher stress test: the writer adding docs
to an index stored on NFS, and a multi-threaded reader on a separate
(Windows XP, mounted over Samba) machine continuously re-instantiating
an IndexSearcher and doing a search against the same index.

The basic idea is to change all commits (from SegmentReader or
IndexWriter) so that we never write to an existing file that a reader
could be reading from.  Instead, always write to a new file name using
sequentially numbered files.  For example, for "segments", on every
commit, write to the sequence: segments.1, segments.2, segments.3,
etc.  Likewise for the *.del and *.fN (norms) files that
SegmentReaders write to.
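
As a rough sketch (none of this is the actual patch; the class and method
names here are just illustrative), finding the most recent commit then
becomes a simple directory scan for the highest-numbered segments.N:

    import java.io.File;

    // Sketch: locate the most recent commit by scanning for "segments.N"
    // files and taking the highest N.  Illustrative names only.
    public class LatestSegmentsSketch {
      public static long latestGeneration(File indexDir) {
        long max = -1;                 // -1 means no segments file was found
        String[] names = indexDir.list();
        if (names == null) {
          return max;
        }
        for (String name : names) {
          if (name.startsWith("segments.")) {
            try {
              max = Math.max(max,
                  Long.parseLong(name.substring("segments.".length())));
            } catch (NumberFormatException e) {
              // not one of our numbered files; ignore it
            }
          }
        }
        return max;
      }
    }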

Disk usage should be the same, even temporarily when merging, because
we still remove the old segments after merging.

We can also get rid of the "deletable" file (and associated errors
renaming deletable.new -> deletable) because we can compute what's
deletable according to "what's not referenced by current segments
file."
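
For example, roughly (sketch only; the "referenced" set would come from
parsing the current segments.N, and a real version must also leave the
newest segments file and the write lock alone):

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: a file is deletable if the current segments file does not
    // reference it.  Names here are illustrative only.
    public class DeletableSketch {
      public static Set<String> findDeletable(File indexDir,
                                              Set<String> referenced,
                                              String currentSegmentsFile) {
        Set<String> deletable = new HashSet<String>();
        String[] names = indexDir.list();
        if (names == null) {
          return deletable;
        }
        for (String name : names) {
          if (!referenced.contains(name)
              && !name.equals(currentSegmentsFile)
              && !name.endsWith(".lock")) {    // never touch the write lock
            deletable.add(name);
          }
        }
        return deletable;
      }
    }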

This means IndexReader, on opening an index, finds the most recent
segments file and loads it.  If, when loading the segments, it hits a
FileNotFound exception, and a newer segments file has appeared, it
re-tries against the new one.

This does entail small changes to the index file format.
Specifically, file names are different (they have new .N suffixes),
and the contents of the segments file are expanded to contain details
about which del/norm files are current for each segment.

Note that the write lock is still needed to catch people accidentally
creating two writers on one index.  But since this lock file isn't
obtained/released as frequently as the current commit lock, I would
expect fewer issues from it.

This change should be fully backwards compatible, meaning the new code
would read the old index format and I believe existing APIs should not
change.  But, if there are applications (maybe Solr?) that peek inside
the index files expecting (for example) a file named "segments" to be
there then such cases would need to be fixed.

Mike

Re: Lock-less commits

Yonik Seeley-2
> The basic idea is to change all commits (from SegmentReader or
> IndexWriter) so that we never write to an existing file that a reader
> could be reading from.  Instead, always write to a new file name using
> sequentially numbered files.  For example, for "segments", on every
> commit, write to the sequence: segments.1, segments.2, segments.3,
> etc.  Likewise for the *.del and *.fN (norms) files that
> SegmentReaders write to.

Interesting idea...
How do you get around races between opening and deleting?

I assume for the writer, you would
  1) write new segments
  2) write new 'segments.3'
  3) delete unused segments (those referenced by 'segments.2')

But what happens when a reader comes along at point 1.5, say, opens
the latest 'segments.2' file, and then tries to open some of the
segments files at 3.5?
I guess the reader could retry... checking for a new segments file.
This could happen more than once (hopefully it wouldn't lead to
starvation... that would be unlikely).
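
To make the ordering in steps 1-3 above concrete, here's the kind of thing
I'm picturing on the writer side (a toy sketch against a plain directory,
not actual Lucene code):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Toy sketch of the assumed commit ordering: 1) write segment data,
    // 2) publish it via a new segments.N, 3) only then delete what the
    // previous generation referenced.
    public class CommitOrderSketch {
      public static void commit(File dir, long newGen, String[] newSegmentFiles,
                                String[] oldFilesToDelete) throws IOException {
        for (String name : newSegmentFiles) {       // 1) new segment data first
          touch(new File(dir, name));
        }
        touch(new File(dir, "segments." + newGen)); // 2) then the new segments.N
        for (String name : oldFilesToDelete) {      // 3) finally the old files
          new File(dir, name).delete();   // may fail on Windows if still open
        }
      }

      private static void touch(File f) throws IOException {
        new FileOutputStream(f).close();            // just create an empty file
      }
    }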

> We can also get rid of the "deletable" file (and associated errors
> renaming deletable.new -> deletable) because we can compute what's
> deletable according to "what's not referenced by current segments
> file."

If the segments file is written last, how does an asynchronous deleter
tell what will be part of a future index?  I guess it's doable if all
file types have sequence numbers...

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Lock-less commits

Michael McCandless-2

>> The basic idea is to change all commits (from SegmentReader or
>> IndexWriter) so that we never write to an existing file that a reader
>> could be reading from.  Instead, always write to a new file name using
>> sequentially numbered files.  For example, for "segments", on every
>> commit, write to the sequence: segments.1, segments.2, segments.3,
>> etc.  Likewise for the *.del and *.fN (norms) files that
>> SegmentReaders write to.
>
> Interesting idea...
> How do you get around races between opening and deleting?
>
> I assume for the writer, you would
>  1) write new segments
>  2) write new 'segments.3'
>  3) delete unused segments (those referenced by 'segments.2')
>
> But what happens when a reader comes along at point 1.5, say, opens
> the latest 'segments.2' file, and then tries to open some of the
> segments files at 3.5?
> I guess the reader could retry... checking for a new segments file.
> This could happen more than once (hopefully it wouldn't lead to
> starvation... that would be unlikely).

Yes, exactly.

And specifically, the reader only retries if, on hitting a FileNotFound
exception, it then checks & sees that a newer segments file is
available.  This way if there is a "true" FileNotFound exception due to
some sort of index corruption or something, we will [correctly] throw it.
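
In code form, the policy is roughly this (a sketch; openSegments() and
latestGeneration() stand in for the real loading code and the directory
scan, and neither name exists in the patch):

    import java.io.FileNotFoundException;
    import java.io.IOException;

    // Sketch of the retry policy: a FileNotFoundException is retried only
    // when a newer segments.N has appeared in the meantime; otherwise it is
    // rethrown as a genuine error.
    public abstract class OpenWithRetrySketch {
      protected abstract long latestGeneration();   // highest N among segments.N
      protected abstract Object openSegments(long gen) throws IOException;

      public Object open() throws IOException {
        long gen = latestGeneration();
        while (true) {
          try {
            return openSegments(gen);
          } catch (FileNotFoundException e) {
            long newer = latestGeneration();
            if (newer <= gen) {
              throw e;     // nothing newer appeared: a real missing file
            }
            gen = newer;   // a commit happened underneath us; retry on it
          }
        }
      }
    }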

It could in theory lead to starvation but this should be rare in
practice unless you have an IndexWriter that's constantly committing.

Also note that this should be no worse than what we have today, where
you would also likely hit starvation and get a "Lock obtain timed out"
exception thrown (eg see http://issues.apache.org/jira/browse/LUCENE-307).

In my stress test (shared index with writer accessing it over NFS and 3
reader threads doing "open IndexSearcher; search" over and over, via
Samba share) the IndexSearchers do retry but so far never more than
once.  Of course this will depend heavily on details of the use case ...

>> We can also get rid of the "deletable" file (and associated errors
>> renaming deletable.new -> deletable) because we can compute what's
>> deletable according to "what's not referenced by current segments
>> file."
>
> If the segments file is written last, how does an asynchronous deleter
> tell what will be part of a future index?  I guess it's doable if all
> file types have sequence numbers...

Well, in my current implementation I don't have a truly asynchronous
deleter.  If I did have that then you're right I'd need to not delete
the "new and in progress" files.  We could consider something like that
in the future ...

Instead, I still do all deletes [synchronously] in the same places as
the current code, with the write lock held.  For example, during a
commit, we delete old segments immediately after writing the new
segments file, and then again after creating a compound file (if index
is using compound files).  Likewise when a SegmentReader commits new
deletes/norms.

Also one neat possibility this could lead to in the future is to
explicitly keep "virtual snapshots" at points in time, but within a
single index (vs eg the hard-link snapshots that Solr does).

For example if you want to index a bunch of docs, but not make them
visible yet for searching, with the current code, you have to make sure
never to restart an IndexSearcher.  But if your app server goes down
(say), then all IndexSearchers will come back up and make your indexed
docs visible.

But with this new approach (plus some additional code that I'm not
planning on doing for starters), it would be possible for an
IndexSearcher to explicitly say "I'd like to re-open the snapshot of the
index as of 3 days ago", for example.  This would require more smarts in
the reclaiming of old files ... but at least this could be a first step
towards that.

Mike

Re: Lock-less commits

Yonik Seeley-2
On 8/18/06, Michael McCandless <[hidden email]> wrote:
> It could in theory lead to starvation but this should be rare in
> practice unless you have an IndexWriter that's constantly committing.

An index with a small mergeFactor (say 2) and a small maxBufferedDocs
(default 10), would have segments deleted every
mergeFactor*maxBufferedDocs when rapidly adding documents.  It might
help to start opening segments with the *last* segment, where segment
deletions are most likely to happen.

Also, when loading a .del file, how would one tell if it didn't exist
or if it was just deleted?
I guess one would always need to write a .del file even if no docs
were deleted.  Or, one could just order the deletes (delete optional
files in a segment last).

One would also have to worry about partially deleted segments on
Windows... while removing a segment, some of the files might fail to
delete (due to still being open) and some might succeed.

> Well, in my current implementation I don't have a truly asynchronous
> deleter.

Yeah, that's not really needed I guess.

This idea is worth kicking around more for the future (maybe for when
the index format changes again), but it's probably too much change for
right now (Lucene 2.0.x), right?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: Lock-less commits

Michael McCandless-2

>> It could in theory lead to starvation but this should be rare in
>> practice unless you have an IndexWriter that's constantly committing.
>
> An index with a small mergeFactor (say 2) and a small maxBufferedDocs
> (default 10), would have segments deleted every
> mergeFactor*maxBufferedDocs when rapidly adding documents.  It might
> help to start opening segments with the *last* segment, where segment
> deletions are most likely to happen.

That is true.  I like the idea of opening last segments first -- I'll do
that.

> Also, when loading a .del file, how would one tell if it didn't exist
> or if it was just deleted?
> I guess one would always need to write a .del file even if no docs
> were deleted.  Or, one could just order the deletes (delete optional
> files in a segment last).

Right, in order to handle this, I've modified the segments file to
also contain the current "generation" (the .N suffix) of each
segment's .del & norms files.  This way, when SegmentReader reads
the segment, it knows exactly which del/norms files it's supposed to
find.  For "doUndeleteAll()" I write a zero-length .del.N+1 file.
SegmentReader is already writing a new segments file when it commits
(in today's code).
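
Concretely, the per-segment entry in the new segments file would carry
something like this (sketch only; the field and file-name conventions here
are illustrative, not the final format):

    // Sketch of the extra per-segment bookkeeping in the new segments file.
    // A generation of -1 means "no such file"; otherwise the reader opens
    // exactly that .del.N / norms file and nothing else.
    public class SegmentEntrySketch {
      public final String name;      // e.g. "_3"
      public final int docCount;
      public final long delGen;      // generation of the current .del file, or -1
      public final long[] normGen;   // per-field generation of norms files, or -1

      public SegmentEntrySketch(String name, int docCount,
                                long delGen, long[] normGen) {
        this.name = name;
        this.docCount = docCount;
        this.delGen = delGen;
        this.normGen = normGen;
      }

      // Which .del file (if any) a SegmentReader should open for this segment.
      public String delFileName() {
        return delGen == -1 ? null : name + ".del." + delGen;
      }
    }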

> One would also have to worry about partially deleted segments on
> Windows... while removing a segment, some of the files might fail to
> delete (due to still being open) and some might succeed.

Yes, I think this case is handled correctly.  Once all searchers using
those old segments are closed, then the next commit that runs will
remove those files (just like it does today).

Not having to read/write the deletable file should make things more
robust (there was a thread recently on users list about hitting an
exception because deletable.new couldn't be deleted on Windows).

> This idea is worth kicking around more for the future (maybe for when
> the index format changes again), but it's probably too much change for
> right now (Lucene 2.0.x), right?

Yes I don't think this should go in for a 2.0.x point release.  Maybe
for a 2.1.x?  Or I guess whenever we next have a major enough release
to allow changing of the index format.

I do think the benefits are sizable, though, so we should not wait
too long :) The number of poor people who post to the users list with
errant Access Denied, FileNotFound, lock obtain timed out, etc.,
exceptions is quite large.  There was just one today that I'm going to
go try to respond to next.  Plus the prospect of working just fine on
remote filesystems is great!

OK I will keep working through this & running stress tests on it to
see if I can uncover any issues...

Mike

Re: Lock-less commits

Robert Engels
I don't think these changes are going to work. With multiple writers
and/or readers doing deletes, without serializing the writes you will
have inconsistencies - and the del files will need to be unioned.

That is:

station A opens the index
station B opens the index
station A deletes some documents creating segment.del1
station B deletes some documents creating segment.del2

when station C opens the index (or when the segment is merged) del1  
and del2 need to be merged.

The locking enforces that writers are serialized - you cannot remove  
this restriction unless you merge the writes when reading.


Re: Lock-less commits

Michael McCandless-2

> I don't think these changes are going to work. With multiple writers
> and/or readers doing deletes, without serializing the writes you will have
> inconsistencies - and the del files will need to be unioned.
>
> That is:
>
> station A opens the index
> station B opens the index
> station A deletes some documents creating segment.del1
> station B deletes some documents creating segment.del2
>
> when station C opens the index (or when the segment is merged) del1 and
> del2 need to be merged.
>
> The locking enforces that writers are serialized - you cannot remove
> this restriction unless you merge the writes when reading.

Sorry, I should be very clear: I am not proposing we remove the write
lock.  The write lock must definitely remain (for the reasons /
examples you list above).  Only one writer can be open at a time
against the index.

The commit lock, which is used to ensure that when an IndexReader
opens the index, no writer is changing it at that moment (and v/v), is
I think the more problematic of the two.

The reason is, the write lock is really a safety net: it's up to you
to use Lucene in such a way that you never try to create two writers
at the same time.  You can use IndexModifier.  Or you can do your own
switching between IndexReader/IndexWriter.  Or you can use the patch
in LUCENE-565 so that IndexWriter is able to delete documents.  But in
all these cases, the write lock is really just a safety net: it
catches you if you accidentally violate this constraint and then you
go and fix your code accordingly.  You would typically catch this in
development / testing because it's a coding / design error.

The commit lock is more troublesome because it really serves an active
purpose in typical Lucene apps when there's otherwise no app level
logic to synchronize opening an IndexReader vs when a writer is
committing.  The writers can commit whenever they want to (well
IndexWriter at least).  And an IndexReader initialization is often
unpredictable (whenever you restart your app server instance, etc.).
So the timing of these events does require active serialization as
things stand now.

Because of this, an index stored on a remote store (eg, NFS, Samba),
where our current locking implementation is known [silently] not to
work, will eventually cause an errant FileNotFound or an Access Denied
exception.  And this is insidious because it may work fine during
initial development and testing only to strike after some time in
production.  This is why I'd like to change commits to not require
locking at all (by never re-using the same file name), while keeping
the write locking.

Mike

Re: Lock-less commits

Robert Engels
I am betting that if your remote locking has issues, you will have
similar problems (since your new code requires accurate reading
of the directory to determine the "latest" files). I also believe
that directory reads like this are VERY inefficient in most cases.

I think these proposed changes are invalid. I suggest using a plug-in
lock provider that uses the OS level lock methods available with
FileChannel in order to assure lock consistency. If your OS is not
honoring these, you probably need the changes to be performed there
(and not in Lucene).
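
For reference, the raw JDK calls I mean are along these lines (plain JDK
only, nothing Lucene-specific; whether this actually works over NFS/Samba
depends entirely on the OS and server configuration):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    // Sketch of an OS-level lock via FileChannel, the kind of thing a
    // plug-in lock provider could wrap.
    public class NativeLockSketch {
      public static boolean withLock(File lockFile, Runnable criticalSection)
          throws IOException {
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
        try {
          FileChannel channel = raf.getChannel();
          FileLock lock = channel.tryLock();  // null if another process holds it
          if (lock == null) {
            return false;
          }
          try {
            criticalSection.run();
            return true;
          } finally {
            lock.release();
          }
        } finally {
          raf.close();
        }
      }
    }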


Re: Lock-less commits

Robert Engels
In reply to this post by Michael McCandless-2
Also, the commit lock is there to allow the merge process to remove  
unused segments. Without it, a reader might get half way through  
reading the segments, only to find some missing, and then have to  
restart reading again. In a highly interactive environment this would  
be too inefficient.

Re: Lock-less commits

Michael McCandless-2

> Also, the commit lock is there to allow the merge process to remove
> unused segments. Without it, a reader might get half way through reading
> the segments, only to find some missing, and then have to restart
> reading again. In a highly interactive environment this would be too
> inefficient.

OK this is a good point.  I will test the added cost of my changes
(over the normal costs of instantiating an IndexSearcher) when I run
benchmarks ...

Mike

Re: Lock-less commits

Robert Engels
You also have to make sure you test this on non-Windows systems.
A delete on Windows is prevented when the file is open, but non-Windows
systems do not have this limitation, so there is a far greater
chance you will have an inconsistent index.


Re: Lock-less commits

Michael McCandless-2
In reply to this post by Robert Engels

> I am betting that if your remote locking has issues, you will have
> similar problems (since your new code requires accurate reading of the
> directory to determine the "latest" files). I also believe that
> directory reads like this are VERY inefficient in most cases.

OK, I will test the cost with benchmarks...

> I think these proposed changes are invalid. I suggest using a plug-in
> lock provider that uses the OS level lock methods available with
> FileChannel in order to assure lock consistency. If your OS is not
> honoring these, you probably need the changes to be performed there (and
> not in Lucene).

Yes I agree, and this is in process:

   http://issues.apache.org/jira/browse/LUCENE-635

I think even if we can do lock-less commits, we would still want to use
native locks for the write locks.

I'm also working on an OS level locking implementation that subclasses
LockFactory.  However on an initial test I found that my test NFS server
(just a default Ubuntu 6.06 install) does not have locking enabled
(though it is an option if I reconfigure it, run it in kernel mode,
etc.).  Then there was this spooky attempt in the past to use OS level
locking over NFS:

   http://marc2.theaimsgroup.com/?l=lucene-dev&m=108322303929090&w=2

Hopefully that particular failure was from bugs in the JVM.

Anyway, the conclusion I generally come to when working with file locks
is that there are always many system level nuances / issues /
challenges to getting them to work properly.  And so if we can use a
lock-less protocol for our commits we can prevent all the corresponding
problems.

Mike

Re: Lock-less commits

Michael McCandless-2
In reply to this post by Robert Engels

> You also have to make sure you test this on non-Windows systems. A
> delete on Windows is prevented when the file is open, but non-Windows
> systems do not have this limitation, so there is a far greater chance you
> will have an inconsistent index.

Excellent point, will do.

I'm now testing a writer on Linux & 3 reader threads (opening searcher,
doing one search, closing it, over and over) on Windows sharing a
filesystem over NFS (from Linux) and over Samba (from Windows) with no
issues.

I'll test reversing these two, running both readers & writers on only
Linux & only Windows, many other interesting tests, etc.

Mike

Re: Lock-less commits

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On 8/18/06, Michael McCandless <[hidden email]> wrote:
> > One would also have to worry about partially deleted segments on
> > Windows... while removing a segment, some of the files might fail to
> > delete (due to still being open) and some might succeed.
>
> Yes, I think this case is handled correctly.  Once all searchers using
> those old segments are closed, then the next commit that runs will
> remove those files (just like it does today).

Unix systems don't have to worry about this.
Windows systems use "deletable" to track what they should try to delete later.
How are you handling it?  Get a full directory listing and try to
remove any older segment files?

I agree with the benefits of not requiring locking to open an
IndexReader, but I wonder at the performance cost of the directory
listing needed to "garbage collect" old segment files and to find the
newest "segments.xxx"

-Yonik

Re: Lock-less commits

Michael McCandless-2
Yonik Seeley wrote:

> On 8/18/06, Michael McCandless <[hidden email]> wrote:
>> > One would also have to worry about partially deleted segments on
>> > Windows... while removing a segment, some of the files might fail to
>> > delete (due to still being open) and some might succeed.
>>
>> Yes, I think this case is handled correctly.  Once all searchers using
>> those old segments are closed, then the next commit that runs will
>> remove those files (just like it does today).
>
> Unix systems don't have to worry about this.
> Windows systems use "deletable" to track what they should try to delete
> later.
> How are you handling it?  Get a full directory listing and try to
> remove any older segment files?
>
> I agree with the benefits of not requiring locking to open an
> IndexReader, but I wonder at the performance cost of the directory
> listing needed to "garbage collect" old segment files and to find the
> newest "segments.xxx"

Agreed, it looks like the big tradeoff here is possible performance
loss (due to directory listings & IndexReader having to retry) vs.
better robustness with a lock-less design.

I'm planning on doing benchmarks to measure net system throughput
(#docs indexed, #queries run) across different OS's, remote/local
filesystems, and across "high frequency" to "low frequency" of
re-opening the IndexSearchers.

The good news (in my testing so far) is that the lock-less design is
functionally correct.

On deletable: yes, I'm currently GC'ing unused segments by doing a
full directory listing.  But, this is actually a separable change:
reading/writing the deletable file only requires the write lock, so I
could keep the current approach.  I wanted to avoid the
File.renameTo of deletable.new -> deletable, which has hit "Access Denied"
on Windows in the past.

Mike

Re: Lock-less commits

Yonik Seeley-2
On 8/20/06, Michael McCandless <[hidden email]> wrote:
> On deletable: yes, I'm currently GC'ing unused segments by doing a
> full directory listing.

Actually, you could get a full directory listing once per IndexWriter
and keep the results up-to-date in memory (including deletes that
fail).  No need for a "deletable" file, and the directory-listing hit
is only taken once per IndexWriter instance, not once per merge.
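
In other words, something like this (a sketch, not real Lucene classes;
the "keep" argument is whatever set of files the current segments.N
references, plus the lock file and the segments file itself):

    import java.io.File;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    // Sketch of per-IndexWriter deletion bookkeeping: list the directory
    // once at writer open, then keep an in-memory "pending" set of files to
    // delete, retrying any that fail (e.g. still open on Windows).
    public class FileDeleterSketch {
      private final File dir;
      private final Set<String> pending = new HashSet<String>();

      public FileDeleterSketch(File dir, Set<String> keep) {
        this.dir = dir;
        String[] names = dir.list();          // the one listing per writer
        if (names != null) {
          for (String name : names) {
            if (!keep.contains(name)) {
              pending.add(name);              // orphan from an earlier failure
            }
          }
        }
        retryPending();
      }

      public void delete(String name) {       // called as segments become unused
        pending.add(name);
        retryPending();
      }

      private void retryPending() {
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
          File f = new File(dir, it.next());
          if (!f.exists() || f.delete()) {
            it.remove();                      // gone; stop tracking it
          }                                   // else keep it and retry later
        }
      }
    }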

IndexWriters also need to open IndexReaders (SegmentReaders) for
merging... I don't know if you needed to modify SegmentReader in a way
that reduces performance, but if so it might be possible to make a
special package protected factory method for use by IndexWriter that
regains any performance loss by making certain assumptions.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: Lock-less commits

Michael McCandless-2
Yonik Seeley wrote:
> On 8/20/06, Michael McCandless <[hidden email]> wrote:
>> On deletable: yes, I'm currently GC'ing unused segments by doing a
>> full directory listing.
>
> Actually, you could get a full directory listing once per IndexWriter
> and keep the results up-to-date in memory (including deletes that
> fail).  No need for a "deletable" file, and the directory-listing hit
> is only taken once per IndexWriter instance, not once per merge.

Excellent!  I will take this approach.

> IndexWriters also need to open IndexReaders (SegmentReaders) for
> merging... I don't know if you needed to modify SegmentReader in a way
> that reduces performance, but if so it might be possible to make a
> special package protected factory method for use by IndexWriter that
> regains any performance loss by making certain assumptions.

So far, I believe my mods to SegmentReader should not affect
performance.  It's just when instantiating the SegmentInfos (well,
SegmentInfos.read()) that I do a directory listing to find the latest
"generation" segments.N file.  When IndexWriter creates the
SegmentMerger, since it uses its own SegmentInfos to get() each
SegmentReader, all the necessary details (which .del.N and norms files
are "current") are in the SegmentInfo and so SegmentReader doesn't
need to do any extra "work".  Still this is a good suggestion to
remember for future work.

Thanks for all the feedback!

Mike

Re: Lock-less commits

Robert Engels
I don't think you can do this. If two different writers are opened
for the same index, you always need to read the directory since the
other may have created new segments.

Re: Lock-less commits

Michael McCandless-2
robert engels wrote:
> I don't think you can do this. If two different writers are opened for
> the same index, you always need to read the directory since the other
> may have created new segments.

This case should be OK.  You have to close one IndexWriter before
opening the other (only 1 writer at a time per index), so when the other
one is opened it would refresh its "deletable" list.

Mike

Re: Lock-less commits

Robert Engels
Then keeping the segments in memory is not helpful, as every open of  
the writer needs to traverse the directory (since another writer  
still could have created segments).

For example,

Computer A opens writer, modifies index, closes writer.
Computer B opens writer (this must read the directory)....

No reason to keep the filename/segment infos around...

