[lucy-user] C library, how to check index is healthy

serkanmulayim@gmail.com
Hi guys,

I have a few questions about maintaining the health of the index in the C library.

1- How do we check that the index is healthy for SEARCHING (e.g. creating a searcher) without a crash? As I see there is no problem in creating a Searcher even if there is a lock (write.lock or merge.lock)

2- How do we check that the index is healthy for INDEXING (e.g. creating a new indexer)? I believe that if the index is healthy (the answer to the first question) and there is no lock file (e.g. write.lock or merge.lock), then we can assume that the index is healthy and we can create a new indexer, right? (Assuming that there are no write permission or disk space issues.)

3- What are the lock types? As far as I can see there are only write.lock and merge.lock. Are there any others? If we close the application calling Lucy before the indexer is destroyed, is there an index recovery strategy? What would the implications of simply deleting write.lock and merge.lock be?

Thanks,
Serkan

Re: [lucy-user] C library, how to check index is healthy

Nick Wellnhofer
On 13/02/2017 20:44, Serkan Mulayim wrote:
> 1- How do we check that the index is healthy for SEARCHING (e.g. creating a searcher) without a crash? As I see there is no problem in creating a Searcher even if there is a lock (write.lock or merge.lock)

First of all, Lucy should never "crash" in the sense of a segfault. If it
does, this is a bug that should be reported.

Unless your index is on a shared volume like NFS, it can always be searched.

> 2- How do we check that the index is healthy for INDEXING (e.g. creating a new indexer)? I believe that if the index is healthy (the answer to the first question) and there is no lock file (e.g. write.lock or merge.lock), then we can assume that the index is healthy and we can create a new indexer, right? (Assuming that there are no write permission or disk space issues.)

You can always create a new Indexer. The worst that can happen is that a
LockErr exception is thrown after the Indexer failed to acquire a lock. Note
that by default, Indexer keeps retrying to acquire the lock for 1000 ms (one
second). This timeout can be configured with IndexManager:

     https://lucy.apache.org/docs/c/Lucy/Index/IndexManager.html

> 3- What are the lock types? As far as I can see there are only write.lock and merge.lock. Are there any others?

This is explained in the documentation:

     https://lucy.apache.org/docs/c/Lucy/Docs/FileLocking.html

> If we close the application calling Lucy before the indexer is destroyed, is there an index recovery strategy?

Lucy uses an atomic rename operation when committing data so a crashing
Indexer should never corrupt the index.

> What would the implications of simply deleting write.lock and merge.lock be?

In most cases, this shouldn't be necessary. Lucy stores the PID of the process
that created a lock and tries to clear stale lock files from crashed
processes. But this won't work if another process reuses the PID. If you're
absolutely sure that a lock doesn't belong to an active Indexer, you can
delete the lock directories manually.

Side note: This could be improved by supporting locking mechanisms that
release locks automatically if a process crashes. But these are OS-dependent
and aren't guaranteed to work reliably over NFS:

- `fcntl(F_SETLK)` or `lockf` on POSIX (unsuitable for multi-threaded
   operation).
- `flock` on BSD, Linux.
- `CreateFile` with a 0 sharing mode on Windows.

Nick



Re: [lucy-user] C library, how to check index is healthy

Peter Karman
Nick Wellnhofer wrote on 2/14/17 9:03 AM:

> On 13/02/2017 20:44, Serkan Mulayim wrote:
>> 1- How do we check that the index is healthy for SEARCHING (e.g. creating a
>> searcher) without a crash? As I see there is no problem in creating a Searcher
>> even if there is a lock (write.lock or merge.lock)
>
> First of all, Lucy should never "crash" in the sense of a segfault. If it does,
> this is a bug that should be reported.
>
> Unless your index is on a shared volume like NFS, it can always be searched.
>

One trick to keep in mind is that if the index underlying a Searcher changes (as
through indexing or document deletion), you must detect that change and open a
new Searcher. Because of mmap it's very fast to spawn a new Searcher, but
sometimes you'll see stale results if you persist one too long.

An example of how Dezi does that is here:
https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Searcher.pm#L406

tl;dr is that Dezi writes its own index metadata header that includes a UUID and
timestamp for the last time the index was updated, and checks that UUID against
the current Searcher to know if it is stale and needs to be re-created.
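
A minimal C sketch of the same idea, for illustration only. It assumes the
indexing process writes a fresh token (a UUID, say) as the first line of a
small side file after every commit, and that the search process remembers the
token it saw when it opened its Searcher. The file layout and the names
`read_index_token` and `searcher_is_stale` are hypothetical; they are not part
of Lucy or Dezi.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOKEN_MAX 64

/* Read the first line of `path` into `token`, stripping the line ending.
 * Returns 0 on success, -1 if the file is missing or empty. */
static int
read_index_token(const char *path, char token[TOKEN_MAX]) {
    assert(path != NULL && token != NULL);

    FILE *fh = fopen(path, "r");
    if (fh == NULL) { return -1; }
    char *line = fgets(token, TOKEN_MAX, fh);
    fclose(fh);
    if (line == NULL) { return -1; }
    token[strcspn(token, "\r\n")] = '\0';
    return 0;
}

/* Return 1 if a Searcher opened when the token was `seen_token` should be
 * replaced, 0 if it is still current. An unreadable token file counts as
 * stale so that the caller reopens and re-checks. */
static int
searcher_is_stale(const char *path, const char *seen_token) {
    char current[TOKEN_MAX];
    if (read_index_token(path, current) != 0) { return 1; }
    return strcmp(current, seen_token) != 0;
}
```

When `searcher_is_stale` returns 1, the application would discard the old
Searcher, open a new one, and remember the new token.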


--
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] C library, how to check index is healthy

Tilghman Lesher
In reply to this post by Nick Wellnhofer
On Tue, Feb 14, 2017 at 9:03 AM, Nick Wellnhofer <[hidden email]> wrote:

> On 13/02/2017 20:44, Serkan Mulayim wrote:
>> What would the implications of simply deleting write.lock and merge.lock
>> be?
>
>
> In most cases, this shouldn't be necessary. Lucy stores the PID of the
> process that created a lock and tries to clear stale lock files from crashed
> processes. But this won't work if another process reuses the PID. If
> you're absolutely sure that a lock doesn't belong to an active Indexer, you
> can delete the lock directories manually.
>
> Side note: This could be improved by supporting locking mechanisms that
> release locks automatically if a process crashes. But these are OS-dependent
> and aren't guaranteed to work reliably over NFS:
>
> - `fcntl(F_SETLK)` or `lockf` on POSIX (unsuitable for multi-threaded
>   operation).
> - `flock` on BSD, Linux.
> - `CreateFile` with a 0 sharing mode on Windows.

As another sidenote, there are techniques for reliable exclusive
locking when the datastore is NFS.  Namely, instead of using the
default locking mechanisms in Unix, you can use the link(2) system
interface (which is an atomic operation on NFS) with an agreed-upon
name for your lock.  For example, if your shared volume was "/shared",
then you could create a temporary file using mkstemp on the volume,
then attempt to link(2) the temporary file to that known lockfile
name, "/shared/lock".  If the link succeeds, you have the lock, but if
the operation fails, another process obtained the lock.  This method
does require that your processes clean up (i.e. delete) the file when
you want to release the lock, however.

When it comes to rebuilding the index, we typically build the index
under a temporary directory name, then swap out the directories to the
production path using a forced-symlink (ln -sf).  As long as the old
index is kept for the maximum length of time of a searcher process,
there's no danger.  In other words:

1. Build to /shared/index_123/ (number could also be the PID of the
index-building process).
2. Delete /shared/index_old/.
3. Use readlink(2) to grab the current (real) pathname of the index
(/shared/index_122)
4. cd /shared ; ln -sf index_123/ /shared/index (production path)
5. Rename the previous index (/shared/index_122) to /shared/index_old/.

By building the index under a temporary directory name, then swapping
out the directory when we want to put the new index into production,
we avoid the locking problems between readers and writers entirely.

--
Tilghman

Re: [lucy-user] C library, how to check index is healthy

Marvin Humphrey
On Tue, Feb 14, 2017 at 7:43 AM, Tilghman Lesher <[hidden email]> wrote:

> As another sidenote, there are techniques for reliable exclusive
> locking when the datastore is NFS.  Namely, instead of using the
> default locking mechanisms in Unix, you can use the link(2) system
> interface (which is an atomic operation on NFS) with an agreed-upon
> name for your lock.  For example, if your shared volume was "/shared",
> then you could create a temporary file using mkstemp on the volume,
> then attempt to link(2) the temporary file to that known lockfile
> name, "/shared/lock".  If the link succeeds, you have the lock, but if
> the operation fails, another process obtained the lock.

That is, in fact, what Lucy does internally.

    https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Store/Lock.c#L188

    // Write to a temporary file, then use the creation of a hard link to
    // ensure atomic but non-destructive creation of the lockfile with its
    // complete contents.

> This method
> does require that your processes clean up (i.e. delete) the file when
> you want to release the lock, however.

Right, and we have some logic to clean up the lockfile automatically.  The
lock file contains a host name and a PID; if the host name matches AND the pid
is not active, we assume that the lockfile can be deleted.
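
As a rough illustration of that rule (not Lucy's actual code), the check can
be written with `kill(pid, 0)`, which tests for the existence of a process
without sending a signal. The function name `lock_looks_stale` and its
parameters are hypothetical; reading the host name and PID out of the lock
file is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if a lock recorded as (lock_host, lock_pid) looks stale from the
 * point of view of the machine named `local_host`, 0 otherwise. */
static int
lock_looks_stale(const char *lock_host, pid_t lock_pid,
                 const char *local_host) {
    assert(lock_host != NULL && local_host != NULL);

    /* A lock taken on another host cannot be judged from here. */
    if (strcmp(lock_host, local_host) != 0) { return 0; }
    if (lock_pid <= 0) { return 0; }

    /* Signal 0 performs the existence check without delivering anything. */
    if (kill(lock_pid, 0) == 0) { return 0; }

    /* ESRCH means no such process. EPERM means the process exists but
     * belongs to someone else, so the lock is treated as live. */
    return errno == ESRCH;
}
```

As noted earlier in the thread, a check like this is fooled when an unrelated
process has reused the PID: the lock then looks live even though its owner is
gone.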

This default behavior works pretty well for "typical" use on normal local
volumes -- it deletes many stale lockfiles automatically and generally spares
users from having to evaluate whether they need to do it themselves.  The
price is that on NFS and the like you typically need to override the default:
to be safe when you have multiple machines trying to write to an index on a
shared volume, you must ensure that each Indexer is associated with the proper
host name (via IndexManager).

(For more info, see https://lucy.apache.org/docs/c/Lucy/Docs/FileLocking.html .)

> 1. Build to /shared/index_123/ (number could also be the PID of the
> index-building process).
> 2. Delete /shared/index_old/.
> 3. Use readlink(2) to grab the current (real) pathname of the index
> (/shared/index_122)
> 4. cd /shared ; ln -sf index_123/ /shared/index (production path)
> 5. Rename the previous index (/shared/index_122) to /shared/index_old/.
>
> By building the index under a temporary directory name, then swapping
> out the directory when we want to put the new index into production,
> we avoid the locking problems between readers and writers entirely.

I can see how this works, though it is costly if you're building the indexes
from scratch each time rather than taking advantage of Lucy's incremental
indexing.

To speed things up, Lucy could supply a way to copy an entire index
near-instantaneously using hard links.  (This works because index files, once
committed, are never modified -- index content only changes through the
addition of new files and eventual deletion of obsolete files.)  The interface
could look something like this:

    lucy_Backup *backup = lucy_Backup_new("/path/to/index");
    cfish_String *snapshot_name = Lucy_Backup_Get_Snapshot_Name(backup);
    cfish_String *backup_path
        = cfish_String_newf("/backupdir/backup_%o", snapshot_name);
    Lucy_Backup_Hard_Link_Dupe(backup, backup_path);

Then the following workflow becomes possible:

1. Use `hard_link_dupe` to create a duplicate index.
2. Add new content to the duped index
3. ln -sf /shared/index_123 /shared/index (production path)
4. Remove old indexes after some timeout.  (All searchers must be refreshed
   on a schedule which guarantees they do not access deleted content, or
   you'll see `Stale NFS filehandle` exceptions.)

Marvin Humphrey

Re: [lucy-user] C library, how to check index is healthy

serkanmulayim@gmail.com
In reply to this post by Nick Wellnhofer
Thanks guys very much for your comments. And sorry for my late response.

Nick, I have a few follow up questions regarding your comments.

So as I see it:
1- When we index into an existing index, a new segment is created, and it
is not added to the index until it is committed. When it is committed, the
segment is kept separately and the snapshot.json file is updated to include
the new segment.
2- Lock files are generated and kept separate based on the PID (no shared
FS adjustments).

From the documentation about Indexer: "In general, only one Indexer at a
time may write to an index safely. If a write lock cannot be secured, new()
will throw an exception."

What I would like to do is index thousands of documents in batches with
asynchronous calls to the library. The asynchronous calls will try to
update the newly created segment, which would be written to by different
calls. If the PIDs are the same, it seems like the system will crash
because write.lock contains the PID. Do you think there is a way to make
this work with calls from different PIDs, with the addition of a
commit.lock file? I hope this makes sense :( :)

One more question: when I index documents and commit each time (let's say
5000 batches of commits, synchronously), I see that the indexing works
fine. How are the segments handled? I do not see 5000 different segments
being created. Is it because after a certain number of segments (say 32),
the segments are merged and optimized?

Thanks in advance.
Serkan


Re: [lucy-user] C library, how to check index is healthy

Nick Wellnhofer
On 28/02/2017 20:17, Serkan Mulayim wrote:
> So as I see it:
> 1- When we index into an existing index, a new segment is created, and it
> is not added to the index until it is committed. When it is committed, the
> segment is kept separately and the snapshot.json file is updated to include
> the new segment.

That's right, but segments are merged occasionally.

> 2- Lock files are generated and kept separate based on the PID (no shared
> FS adjustments).

> What I would like to do is index thousands of documents in batches with
> asynchronous calls to the library. The asynchronous calls will try to
> update the newly created segment, which would be written to by different
> calls. If the PIDs are the same, it seems like the system will crash
> because write.lock contains the PID.

This has nothing to do with PIDs (they're only used to remove stale lock
files). You'll receive a LockErr exception if an Indexer can't acquire the
write lock after several retries regardless of the process ID.

> Do you think there is a way to make this work with calls from different
> PIDs, with the addition of a commit.lock file? I hope this makes
> sense :( :)

Parallel indexing isn't supported by Lucy. We only support background merging
which is mostly geared towards interactive applications that only index a few
documents at a time. Non-interactive batch jobs that index thousands of
documents in parallel aren't handled well by Lucy, although this could
probably be improved. Your only options right now are:

- If it's OK for your indexing processes to potentially wait for a long
   time, increase the write lock timeout to a huge value or catch LockErrs
   and implement your own retry logic.

- Implement your own document queue where multiple processes can add
   documents and a single indexing process removes them.
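
For the first option, the retry logic might look something like the sketch
below. It is deliberately independent of Lucy: the callback type
`index_attempt_fn` is hypothetical and stands for "create an Indexer, add the
batch, commit", returning nonzero when the write lock could not be acquired.
How the LockErr is trapped inside that callback is left out here.
`fake_attempt` only exists to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns 0 once the batch was indexed, nonzero if
 * the write lock was busy. */
typedef int (*index_attempt_fn)(void *context);

/* Call `attempt` up to `max_tries` times, sleeping between failures and
 * doubling the delay each time (capped at `max_delay_ms`).
 * Returns the number of attempts used on success, or -1 if all failed. */
static int
retry_with_backoff(index_attempt_fn attempt, void *context,
                   int max_tries, long initial_delay_ms, long max_delay_ms) {
    assert(attempt != NULL && max_tries > 0);

    long delay_ms = initial_delay_ms;
    for (int i = 1; i <= max_tries; i++) {
        if (attempt(context) == 0) { return i; }
        if (i == max_tries) { break; }

        struct timespec ts;
        ts.tv_sec  = delay_ms / 1000;
        ts.tv_nsec = (delay_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);

        delay_ms *= 2;
        if (delay_ms > max_delay_ms) { delay_ms = max_delay_ms; }
    }
    return -1;
}

/* Stand-in for real indexing work: fails until *remaining reaches zero. */
static int
fake_attempt(void *context) {
    int *remaining = context;
    if (*remaining > 0) {
        (*remaining)--;
        return 1;
    }
    return 0;
}
```

The second option, a single indexing process fed by a queue, avoids lock
contention altogether and is probably the better fit when many producers
submit documents at once.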

> One more question: when I index documents and commit each time (let's say
> 5000 batches of commits, synchronously), I see that the indexing works
> fine. How are the segments handled? I do not see 5000 different segments
> being created. Is it because after a certain number of segments (say 32),
> the segments are merged and optimized?

Yes, that's how it works. The FastUpdates cookbook entry contains more details:

     https://lucy.apache.org/docs/c/Lucy/Docs/Cookbook/FastUpdates.html

But I don't think background merging would help much in your case.

Nick