IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

William Moss
We're using Lucene 5.2.0 (I know it's old, we're in the process of
upgrading) to handle searching over our listings here at Airbnb. I've been
digging into our realtime indexing code and how we use Lucene and I wanted
to check a few assumptions around synchronization, since we see some
periodic exceptions[1] that I can't quite explain.

First, a tiny bit of background
1. We use facets and therefore are writing realtime updates using both
a IndexWriter and DirectoryTaxonomyWriter.
2. We have multiple update threads, consuming messages (from Kafka) and
updating the index.
3. Once we process a batch of messages, we call commit (first on
DirectoryTaxonomyWriter then on IndexWriter).
4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
5. We periodically call forceMerge on our IndexWriter (to improve
performance).

So, now to a few questions:
1. My understand is the right way to handle a DirectoryTaxonomyWriter and
an IndexWriter is to call commit on DirectoryTaxonomyWriter before
IndexWriter. Is this correct? Since we're using multiple threads, we need
to synchronize these calls within the process regardless, but curious to
understand the design.

2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
need to be synchronized with the commit calls to either IndexWriter or
DirectoryTaxonomyWriter? Do we need to call it after ever time we call
commit? The comment suggests we call it "periodically," but I'm not clear
on how often that should be or what conditions trigger the index to change
in way that this would be required.

3. Lastly, what about forceMerge? Is there any worry there or can that just
safely happen in the background? Is there any need to call commit
afterward? Or does forceMerge effectively do that? Presumably, we would not
see the new index until maybeRefresh was called the next time?

Sorry, that was a lot of questions, would love help on any and all of them.

Thanks!
Will

[1] When calling maybeRefresh, we've seen error that look like:
java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Michael McCandless-2
Hitting NoSuchFileException is no good!  Something serious is wrong.
Can you include the full stack trace?

Responses inlined below:

On Tue, Sep 27, 2016 at 2:08 AM, William Moss
<[hidden email]> wrote:
> We're using Lucene 5.2.0 (I know it's old, we're in the process of
> upgrading) to handle searching over our listings here at Airbnb.

6.2.1 is a compelling upgrade because of more efficient indexing and
searching of numerics (among many other things!)...

> I've been
> digging into our realtime indexing code and how we use Lucene and I wanted
> to check a few assumptions around synchronization, since we see some
> periodic exceptions[1] that I can't quite explain.
>
> First, a tiny bit of background
> 1. We use facets and therefore are writing realtime updates using both
> a IndexWriter and DirectoryTaxonomyWriter.
> 2. We have multiple update threads, consuming messages (from Kafka) and
> updating the index.
> 3. Once we process a batch of messages, we call commit (first on
> DirectoryTaxonomyWriter then on IndexWriter).

I see TaxonomyWriter's javadocs say that is the correct order, but I
would have expected the opposite, if you are concurrently indexing
documents.

> 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
> 5. We periodically call forceMerge on our IndexWriter (to improve
> performance).

This is dubious: if your index continues to receive changes, you
should skip forceMerge and let Lucene's natural merging run at
defaults.  forceMerge is an incredibly costly operation and it's
unclear you get that much speedup at search time.

> So, now to a few questions:
> 1. My understand is the right way to handle a DirectoryTaxonomyWriter and
> an IndexWriter is to call commit on DirectoryTaxonomyWriter before
> IndexWriter. Is this correct? Since we're using multiple threads, we need
> to synchronize these calls within the process regardless, but curious to
> understand the design.

You should not have to block index updates while committing, if you
don't need/want to.

If you don't block updates, I would think you need to commit the
DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
tree, referenced by the main index, are guaranteed to be present in
the DirectoryTaxonomyWriter's commit.

Maybe Shai can shed some more light here...

> 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
> need to be synchronized with the commit calls to either IndexWriter or
> DirectoryTaxonomyWriter?

No.

Commit can be a costly, slow operation (calling fsync on N files), and
it's designed internally in IndexWriter to not block operations like
merging and refreshing.

> Do we need to call it after ever time we call
> commit?  The comment suggests we call it "periodically," but I'm not clear
> on how often that should be or what conditions trigger the index to change
> in way that this would be required.

You don't have to call refresh on every commit.  When you call it is
entirely up to you.

Commit makes changes durable on disk, so an OS crash, power loss,
etc., won't lose those changes (a bad disk WILL lose them of course).

Refresh makes changes visible for searching.

The two ops are entirely separate.

Some apps call commit periodically and never refresh, others call
refresh periodically and never commit :)  It's your call.

> 3. Lastly, what about forceMerge? Is there any worry there or can that just
> safely happen in the background? Is there any need to call commit
> afterward? Or does forceMerge effectively do that?

Force merge does not call commit itself.

If you do force merge, then it is a good idea to both commit and
refresh afterwards, as this will let Lucene free up resources (files,
file descriptors) with the old un-merged segments.

> Presumably, we would not
> see the new index until maybeRefresh was called the next time?

Exactly.

> Sorry, that was a lot of questions, would love help on any and all of them.

No worries, keep them coming!

> [1] When calling maybeRefresh, we've seen error that look like:
> java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe

Need the full stack trace / context here to understand what's happening...

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Shai Erera
Hmm ... the commit part of the two indexes is always tricky. The javadocs
are correct because the order of indexing is as follows: when you index a
document with facets, the facets are first added to the taxonomy index and
only then the document is indexed in IW.

Therefore if you concurrently index and commit, then committing TIW first
ensures that all "known" facets up to this point are committed. Then when
you commit IW, the documents in there are guaranteed to have their facet
ordinals already in the committed TIW (which may at this point include more
facets than are indexed in IW, but that's OK).

When you then refresh the searchers, you can count on the refreshed IR to
have all its ordinals in the refreshed TIR (which again, may include more
ordinals, but that's OK).

Hope this helps!

Shai

On Tue, Sep 27, 2016 at 6:57 AM Michael McCandless <
[hidden email]> wrote:

> Hitting NoSuchFileException is no good!  Something serious is wrong.
> Can you include the full stack trace?
>
> Responses inlined below:
>
> On Tue, Sep 27, 2016 at 2:08 AM, William Moss
> <[hidden email]> wrote:
> > We're using Lucene 5.2.0 (I know it's old, we're in the process of
> > upgrading) to handle searching over our listings here at Airbnb.
>
> 6.2.1 is a compelling upgrade because of more efficient indexing and
> searching of numerics (among many other things!)...
>
> > I've been
> > digging into our realtime indexing code and how we use Lucene and I
> wanted
> > to check a few assumptions around synchronization, since we see some
> > periodic exceptions[1] that I can't quite explain.
> >
> > First, a tiny bit of background
> > 1. We use facets and therefore are writing realtime updates using both
> > a IndexWriter and DirectoryTaxonomyWriter.
> > 2. We have multiple update threads, consuming messages (from Kafka) and
> > updating the index.
> > 3. Once we process a batch of messages, we call commit (first on
> > DirectoryTaxonomyWriter then on IndexWriter).
>
> I see TaxonomyWriter's javadocs say that is the correct order, but I
> would have expected the opposite, if you are concurrently indexing
> documents.
>
> > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
> > 5. We periodically call forceMerge on our IndexWriter (to improve
> > performance).
>
> This is dubious: if your index continues to receive changes, you
> should skip forceMerge and let Lucene's natural merging run at
> defaults.  forceMerge is an incredibly costly operation and it's
> unclear you get that much speedup at search time.
>
> > So, now to a few questions:
> > 1. My understand is the right way to handle a DirectoryTaxonomyWriter and
> > an IndexWriter is to call commit on DirectoryTaxonomyWriter before
> > IndexWriter. Is this correct? Since we're using multiple threads, we need
> > to synchronize these calls within the process regardless, but curious to
> > understand the design.
>
> You should not have to block index updates while committing, if you
> don't need/want to.
>
> If you don't block updates, I would think you need to commit the
> DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
> tree, referenced by the main index, are guaranteed to be present in
> the DirectoryTaxonomyWriter's commit.
>
> Maybe Shai can shed some more light here...
>
> > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
> > need to be synchronized with the commit calls to either IndexWriter or
> > DirectoryTaxonomyWriter?
>
> No.
>
> Commit can be a costly, slow operation (calling fsync on N files), and
> it's designed internally in IndexWriter to not block operations like
> merging and refreshing.
>
> > Do we need to call it after ever time we call
> > commit?  The comment suggests we call it "periodically," but I'm not
> clear
> > on how often that should be or what conditions trigger the index to
> change
> > in way that this would be required.
>
> You don't have to call refresh on every commit.  When you call it is
> entirely up to you.
>
> Commit makes changes durable on disk, so an OS crash, power loss,
> etc., won't lose those changes (a bad disk WILL lose them of course).
>
> Refresh makes changes visible for searching.
>
> The two ops are entirely separate.
>
> Some apps call commit periodically and never refresh, others call
> refresh periodically and never commit :)  It's your call.
>
> > 3. Lastly, what about forceMerge? Is there any worry there or can that
> just
> > safely happen in the background? Is there any need to call commit
> > afterward? Or does forceMerge effectively do that?
>
> Force merge does not call commit itself.
>
> If you do force merge, then it is a good idea to both commit and
> refresh afterwards, as this will let Lucene free up resources (files,
> file descriptors) with the old un-merged segments.
>
> > Presumably, we would not
> > see the new index until maybeRefresh was called the next time?
>
> Exactly.
>
> > Sorry, that was a lot of questions, would love help on any and all of
> them.
>
> No worries, keep them coming!
>
> > [1] When calling maybeRefresh, we've seen error that look like:
> > java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
>
> Need the full stack trace / context here to understand what's happening...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

William Moss
Thank you both for your quick reply!

Responding to a few of Michael's points first:
* We actually tried the upgrade to 6.0 a few months back (when that was the
newest) and were getting similar errors to the ones I'm seeing now. We were
not able to track them down, which is part of the motivation for me asking
all these questions. We'll get there though :-)

* The last time we tested this (which I think was still post
ConcurrentMergePolicy) we saw that the read speed would slowly degrade over
time. My understanding was that forceMerge was very expensive, but would
make reads faster once complete. Is this not correct? Also, we never
attempted to tune the MergePolicy at all, so while were on the subject, is
there good documentation on how to do that? I'm much prefer to get away
from calling forceMerge. If it's useful information, we've got a relatively
small corpus, only ~2+M documents.

* We want to be able to ensure that if a machine or JVM crashes we are in a
coherent state. To that end, we need to call commit on Lucene and then
commit back what we've read so far to Kafka. Calling commit is the only way
to ensure this, right?

* To make sure I understand how maybeRefresh works, ignoring whether or not
we commit for a second, if I add a document via IndexWriter, it will not be
reflected in IndexSearchers I get by calling acquire on SearcherAndTaxonomy
until I call maybeRefresh?

Now, on to the concurrency issue. I was thinking a little more about this
and I think the fundamental issue is that while IndexWriter and
DirectoryTaxonomyWriter are each thread safe, them together are not. As
suggested by the documentation, we use one instance each of IndexWriter,
DirectoryTaxonomyWriter and SearcherTaxonomyManager. Imagine the following
scenario:
[Thread 1] Add document to DirectoryTaxonomyWriter
[Thread 1] Add document to IndexWriter
[Thread 1] Call commit on DirectoryTaxonomyWriter
[Thread 2] Add document to DirectoryTaxonomyWriter
[Thread 2] Add document to IndexWriter
[Thread 1] Call commit on IndexWriter
The on disc representation now should contain things in the IndexWriter
that are not contained in the DirectoryTaxonomyWriter, right?

Assuming maybeRefresh looks at the state on disk when it's doing it's
update (if this not true I don't understand why it was throwing
NoSuchFileException) then it can be out of sync as well?

I apparently never made a full copy of the stack trace. I'll attempt to
regenerate it and post it here once I have it.

Thanks again!
Will


On Mon, Sep 26, 2016 at 10:05 PM Shai Erera <[hidden email]> wrote:

> Hmm ... the commit part of the two indexes is always tricky. The javadocs
> are correct because the order of indexing is as follows: when you index a
> document with facets, the facets are first added to the taxonomy index and
> only then the document is indexed in IW.
>
> Therefore if you concurrently index and commit, then committing TIW first
> ensures that all "known" facets up to this point are committed. Then when
> you commit IW, the documents in there are guaranteed to have their facet
> ordinals already in the committed TIW (which may at this point include more
> facets than are indexed in IW, but that's OK).
>
> When you then refresh the searchers, you can count on the refreshed IR to
> have all its ordinals in the refreshed TIR (which again, may include more
> ordinals, but that's OK).
>
> Hope this helps!
>
> Shai
>
> On Tue, Sep 27, 2016 at 6:57 AM Michael McCandless <
> [hidden email]> wrote:
>
> > Hitting NoSuchFileException is no good!  Something serious is wrong.
> > Can you include the full stack trace?
> >
> > Responses inlined below:
> >
> > On Tue, Sep 27, 2016 at 2:08 AM, William Moss
> > <[hidden email]> wrote:
> > > We're using Lucene 5.2.0 (I know it's old, we're in the process of
> > > upgrading) to handle searching over our listings here at Airbnb.
> >
> > 6.2.1 is a compelling upgrade because of more efficient indexing and
> > searching of numerics (among many other things!)...
> >
> > > I've been
> > > digging into our realtime indexing code and how we use Lucene and I
> > wanted
> > > to check a few assumptions around synchronization, since we see some
> > > periodic exceptions[1] that I can't quite explain.
> > >
> > > First, a tiny bit of background
> > > 1. We use facets and therefore are writing realtime updates using both
> > > a IndexWriter and DirectoryTaxonomyWriter.
> > > 2. We have multiple update threads, consuming messages (from Kafka) and
> > > updating the index.
> > > 3. Once we process a batch of messages, we call commit (first on
> > > DirectoryTaxonomyWriter then on IndexWriter).
> >
> > I see TaxonomyWriter's javadocs say that is the correct order, but I
> > would have expected the opposite, if you are concurrently indexing
> > documents.
> >
> > > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
> > > 5. We periodically call forceMerge on our IndexWriter (to improve
> > > performance).
> >
> > This is dubious: if your index continues to receive changes, you
> > should skip forceMerge and let Lucene's natural merging run at
> > defaults.  forceMerge is an incredibly costly operation and it's
> > unclear you get that much speedup at search time.
> >
> > > So, now to a few questions:
> > > 1. My understand is the right way to handle a DirectoryTaxonomyWriter
> and
> > > an IndexWriter is to call commit on DirectoryTaxonomyWriter before
> > > IndexWriter. Is this correct? Since we're using multiple threads, we
> need
> > > to synchronize these calls within the process regardless, but curious
> to
> > > understand the design.
> >
> > You should not have to block index updates while committing, if you
> > don't need/want to.
> >
> > If you don't block updates, I would think you need to commit the
> > DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
> > tree, referenced by the main index, are guaranteed to be present in
> > the DirectoryTaxonomyWriter's commit.
> >
> > Maybe Shai can shed some more light here...
> >
> > > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do
> those
> > > need to be synchronized with the commit calls to either IndexWriter or
> > > DirectoryTaxonomyWriter?
> >
> > No.
> >
> > Commit can be a costly, slow operation (calling fsync on N files), and
> > it's designed internally in IndexWriter to not block operations like
> > merging and refreshing.
> >
> > > Do we need to call it after ever time we call
> > > commit?  The comment suggests we call it "periodically," but I'm not
> > clear
> > > on how often that should be or what conditions trigger the index to
> > change
> > > in way that this would be required.
> >
> > You don't have to call refresh on every commit.  When you call it is
> > entirely up to you.
> >
> > Commit makes changes durable on disk, so an OS crash, power loss,
> > etc., won't lose those changes (a bad disk WILL lose them of course).
> >
> > Refresh makes changes visible for searching.
> >
> > The two ops are entirely separate.
> >
> > Some apps call commit periodically and never refresh, others call
> > refresh periodically and never commit :)  It's your call.
> >
> > > 3. Lastly, what about forceMerge? Is there any worry there or can that
> > just
> > > safely happen in the background? Is there any need to call commit
> > > afterward? Or does forceMerge effectively do that?
> >
> > Force merge does not call commit itself.
> >
> > If you do force merge, then it is a good idea to both commit and
> > refresh afterwards, as this will let Lucene free up resources (files,
> > file descriptors) with the old un-merged segments.
> >
> > > Presumably, we would not
> > > see the new index until maybeRefresh was called the next time?
> >
> > Exactly.
> >
> > > Sorry, that was a lot of questions, would love help on any and all of
> > them.
> >
> > No worries, keep them coming!
> >
> > > [1] When calling maybeRefresh, we've seen error that look like:
> > > java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
> >
> > Need the full stack trace / context here to understand what's
> happening...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Michael McCandless-2
In reply to this post by Shai Erera
On Tue, Sep 27, 2016 at 7:05 AM, Shai Erera <[hidden email]> wrote:

> Hmm ... the commit part of the two indexes is always tricky. The javadocs
> are correct because the order of indexing is as follows: when you index a
> document with facets, the facets are first added to the taxonomy index and
> only then the document is indexed in IW.
>
> Therefore if you concurrently index and commit, then committing TIW first
> ensures that all "known" facets up to this point are committed. Then when
> you commit IW, the documents in there are guaranteed to have their facet
> ordinals already in the committed TIW (which may at this point include more
> facets than are indexed in IW, but that's OK).

Hmm but if you commit TIW first, then IW after, isn't it possible that
after TIW commit finishes that I index a few more documents into IW
that added new taxonomy nodes/labels/ordinals and then when I call
IW.commit those last few documents are now referencing taxonomy nodes
that do not exist in the TIW commit point?

Mike McCandless

http://blog.mikemccandless.com

>> On Tue, Sep 27, 2016 at 2:08 AM, William Moss
>> <[hidden email]> wrote:
>> > We're using Lucene 5.2.0 (I know it's old, we're in the process of
>> > upgrading) to handle searching over our listings here at Airbnb.
>>
>> 6.2.1 is a compelling upgrade because of more efficient indexing and
>> searching of numerics (among many other things!)...
>>
>> > I've been
>> > digging into our realtime indexing code and how we use Lucene and I
>> wanted
>> > to check a few assumptions around synchronization, since we see some
>> > periodic exceptions[1] that I can't quite explain.
>> >
>> > First, a tiny bit of background
>> > 1. We use facets and therefore are writing realtime updates using both
>> > a IndexWriter and DirectoryTaxonomyWriter.
>> > 2. We have multiple update threads, consuming messages (from Kafka) and
>> > updating the index.
>> > 3. Once we process a batch of messages, we call commit (first on
>> > DirectoryTaxonomyWriter then on IndexWriter).
>>
>> I see TaxonomyWriter's javadocs say that is the correct order, but I
>> would have expected the opposite, if you are concurrently indexing
>> documents.
>>
>> > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
>> > 5. We periodically call forceMerge on our IndexWriter (to improve
>> > performance).
>>
>> This is dubious: if your index continues to receive changes, you
>> should skip forceMerge and let Lucene's natural merging run at
>> defaults.  forceMerge is an incredibly costly operation and it's
>> unclear you get that much speedup at search time.
>>
>> > So, now to a few questions:
>> > 1. My understand is the right way to handle a DirectoryTaxonomyWriter and
>> > an IndexWriter is to call commit on DirectoryTaxonomyWriter before
>> > IndexWriter. Is this correct? Since we're using multiple threads, we need
>> > to synchronize these calls within the process regardless, but curious to
>> > understand the design.
>>
>> You should not have to block index updates while committing, if you
>> don't need/want to.
>>
>> If you don't block updates, I would think you need to commit the
>> DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
>> tree, referenced by the main index, are guaranteed to be present in
>> the DirectoryTaxonomyWriter's commit.
>>
>> Maybe Shai can shed some more light here...
>>
>> > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
>> > need to be synchronized with the commit calls to either IndexWriter or
>> > DirectoryTaxonomyWriter?
>>
>> No.
>>
>> Commit can be a costly, slow operation (calling fsync on N files), and
>> it's designed internally in IndexWriter to not block operations like
>> merging and refreshing.
>>
>> > Do we need to call it after ever time we call
>> > commit?  The comment suggests we call it "periodically," but I'm not
>> clear
>> > on how often that should be or what conditions trigger the index to
>> change
>> > in way that this would be required.
>>
>> You don't have to call refresh on every commit.  When you call it is
>> entirely up to you.
>>
>> Commit makes changes durable on disk, so an OS crash, power loss,
>> etc., won't lose those changes (a bad disk WILL lose them of course).
>>
>> Refresh makes changes visible for searching.
>>
>> The two ops are entirely separate.
>>
>> Some apps call commit periodically and never refresh, others call
>> refresh periodically and never commit :)  It's your call.
>>
>> > 3. Lastly, what about forceMerge? Is there any worry there or can that
>> just
>> > safely happen in the background? Is there any need to call commit
>> > afterward? Or does forceMerge effectively do that?
>>
>> Force merge does not call commit itself.
>>
>> If you do force merge, then it is a good idea to both commit and
>> refresh afterwards, as this will let Lucene free up resources (files,
>> file descriptors) with the old un-merged segments.
>>
>> > Presumably, we would not
>> > see the new index until maybeRefresh was called the next time?
>>
>> Exactly.
>>
>> > Sorry, that was a lot of questions, would love help on any and all of
>> them.
>>
>> No worries, keep them coming!
>>
>> > [1] When calling maybeRefresh, we've seen error that look like:
>> > java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
>>
>> Need the full stack trace / context here to understand what's happening...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Michael McCandless-2
In reply to this post by William Moss
On Wed, Sep 28, 2016 at 3:05 AM, William Moss
<[hidden email]> wrote:
> Thank you both for your quick reply!

You're welcome!

> * We actually tried the upgrade to 6.0 a few months back (when that was the
> newest) and were getting similar errors to the ones I'm seeing now. We were
> not able to track them down, which is part of the motivation for me asking
> all these questions. We'll get there though :-)

OK, we gotta get to the root cause.  Sounds like it happens in either version...

> * The last time we tested this (which I think was still post
> ConcurrentMergePolicy) we saw that the read speed would slowly degrade over
> time. My understanding was that forceMerge was very expensive, but would
> make reads faster once complete. Is this not correct?

It really depends on what queries you are running.  Really you should
test in your use case and be certain that the massive expense of force
merge is worthwhile / necessary.  In general it's not worth it, ever
if searches are a bit faster, except for indices that will never
change again.

> Also, we never
> attempted to tune the MergePolicy at all, so while were on the subject, is
> there good documentation on how to do that? I'm much prefer to get away
> from calling forceMerge. If it's useful information, we've got a relatively
> small corpus, only ~2+M documents.

Just use the defaults :)  Tuning those settings is dangerous unless
you have a very specific problem to fix.

> * We want to be able to ensure that if a machine or JVM crashes we are in a
> coherent state. To that end, we need to call commit on Lucene and then
> commit back what we've read so far to Kafka. Calling commit is the only way
> to ensure this, right?

Correct: commit in Lucene, then notify Kafka what offset you had
indexed just before you called IW.commit.

But you may want to replicate the index across machines if you don't
want to have a single point of failure.  We recently added
near-real-time replication to Lucene for this use case ...

> * To make sure I understand how maybeRefresh works, ignoring whether or not
> we commit for a second, if I add a document via IndexWriter, it will not be
> reflected in IndexSearchers I get by calling acquire on SearcherAndTaxonomy
> until I call maybeRefresh?

Correct.

> Now, on to the concurrency issue. I was thinking a little more about this
> and I think the fundamental issue is that while IndexWriter and
> DirectoryTaxonomyWriter are each thread safe, them together are not. As
> suggested by the documentation, we use one instance each of IndexWriter,
> DirectoryTaxonomyWriter and SearcherTaxonomyManager. Imagine the following
> scenario:
> [Thread 1] Add document to DirectoryTaxonomyWriter
> [Thread 1] Add document to IndexWriter
> [Thread 1] Call commit on DirectoryTaxonomyWriter
> [Thread 2] Add document to DirectoryTaxonomyWriter
> [Thread 2] Add document to IndexWriter
> [Thread 1] Call commit on IndexWriter
> The on disc representation now should contain things in the IndexWriter
> that are not contained in the DirectoryTaxonomyWriter, right?

Correct, I'm also confused about the commit order for this reason.
Let's see what Shai says.

However, that should not lead to NSFE.  At worst it should lead to
"ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.

> Assuming maybeRefresh looks at the state on disk when it's doing it's
> update (if this not true I don't understand why it was throwing
> NoSuchFileException) then it can be out of sync as well?

maybeRefresh (assuming new docs were indexed since you last called it)
will write new index files holding those indexed docs, and then open
them to do searching over them.

> I apparently never made a full copy of the stack trace. I'll attempt to
> regenerate it and post it here once I have it.

OK, we need to understand that.  It should not be happening ;)

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Shai Erera
*> However, that should not lead to NSFE.  At worst it should lead to>
"ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.*

That is correct, this interleaving indexing case can potentially result in
an AIOOBE like exception during faceted search, when the facets that are in
the "sneaked-in-docs" will be found be a search, but resolving the ordinals
to their labels will fail because the labels will be unknown to the
taxonomy.

I wonder if committing the opposite order solves this problem. So in the
above use case, IW.commit() commits all the new docs with their facets,
then if more indexing happens before TIW.commit(), then the commit to the
taxonomy index results in more facets than are known to the search index,
but that's ok.

I'm just not sure if that covers all concurrency cases though. I remember
this was discussed several times in the past, and we eventually reached a
conclusion, but clearly if it was the latter, it wasn't clarified in the
javadocs. I can't think of a use case that breaks this commit order though (
IW.commit() followed by TIW.commit()). This feels safe to me ... can you
try to think of a use case that breaks it? Assuming that each doc-indexing
does addTaxo() followed by addDoc().

Maybe we should have a helper which takes an IW and TIW and exposes
commit() APIs that will do it in the correct order?

Now I'm thinking about SearcherTaxoManager -- it reopens the readers by
first re-opening IR, then TIR. It does so under the assumption of first
committing to TIW then to IW. Now if we reverse the order, then you need to
be more careful in when you commit changes to the two writers, and when you
re-open the readers. If you always do that from the same thread, then you
should be fine, the order of re-opens doesn't really matter.

But you re-open from a different thread than the one you commit, I am not
sure that committing to IW first then TIW can play well with any re-open
order? I.e. one case which breaks it is you commit to IW, then re-open both
IR and TIR, before you commit to TIW, and you have a search which may find
ordinals that are unknown you to the TIR?

So I'd say that if you refresh() from the same thread that you do commit(),
then commit to IW first then TIW, and use SearcherTaxoManager as it's
currently implemented. But I'd like to hear your thoughts about it.

Shai


On Wed, Sep 28, 2016 at 1:26 PM Michael McCandless <
[hidden email]> wrote:

> On Wed, Sep 28, 2016 at 3:05 AM, William Moss
> <[hidden email]> wrote:
> > Thank you both for your quick reply!
>
> You're welcome!
>
> > * We actually tried the upgrade to 6.0 a few months back (when that was
> the
> > newest) and were getting similar errors to the ones I'm seeing now. We
> were
> > not able to track them down, which is part of the motivation for me
> asking
> > all these questions. We'll get there though :-)
>
> OK, we gotta get to the root cause.  Sounds like it happens in either
> version...
>
> > * The last time we tested this (which I think was still post
> > ConcurrentMergePolicy) we saw that the read speed would slowly degrade
> over
> > time. My understanding was that forceMerge was very expensive, but would
> > make reads faster once complete. Is this not correct?
>
> It really depends on what queries you are running.  Really you should
> test in your use case and be certain that the massive expense of force
> merge is worthwhile / necessary.  In general it's not worth it, ever
> if searches are a bit faster, except for indices that will never
> change again.
>
> > Also, we never
> > attempted to tune the MergePolicy at all, so while were on the subject,
> is
> > there good documentation on how to do that? I'm much prefer to get away
> > from calling forceMerge. If it's useful information, we've got a
> relatively
> > small corpus, only ~2+M documents.
>
> Just use the defaults :)  Tuning those settings is dangerous unless
> you have a very specific problem to fix.
>
> > * We want to be able to ensure that if a machine or JVM crashes we are
> in a
> > coherent state. To that end, we need to call commit on Lucene and then
> > commit back what we've read so far to Kafka. Calling commit is the only
> way
> > to ensure this, right?
>
> Correct: commit in Lucene, then notify Kafka what offset you had
> indexed just before you called IW.commit.
>
> But you may want to replicate the index across machines if you don't
> want to have a single point of failure.  We recently added
> near-real-time replication to Lucene for this use case ...
>
> > * To make sure I understand how maybeRefresh works, ignoring whether or
> not
> > we commit for a second, if I add a document via IndexWriter, it will not
> be
> > reflected in IndexSearchers I get by calling acquire on
> SearcherAndTaxonomy
> > until I call maybeRefresh?
>
> Correct.
>
> > Now, on to the concurrency issue. I was thinking a little more about this
> > and I think the fundamental issue is that while IndexWriter and
> > DirectoryTaxonomyWriter are each thread safe, them together are not. As
> > suggested by the documentation, we use one instance each of IndexWriter,
> > DirectoryTaxonomyWriter and SearcherTaxonomyManager. Imagine the
> following
> > scenario:
> > [Thread 1] Add document to DirectoryTaxonomyWriter
> > [Thread 1] Add document to IndexWriter
> > [Thread 1] Call commit on DirectoryTaxonomyWriter
> > [Thread 2] Add document to DirectoryTaxonomyWriter
> > [Thread 2] Add document to IndexWriter
> > [Thread 1] Call commit on IndexWriter
> > The on disc representation now should contain things in the IndexWriter
> > that are not contained in the DirectoryTaxonomyWriter, right?
>
> Correct, I'm also confused about the commit order for this reason.
> Let's see what Shai says.
>
> However, that should not lead to NSFE.  At worst it should lead to
> "ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.
>
> > Assuming maybeRefresh looks at the state on disk when it's doing it's
> > update (if this not true I don't understand why it was throwing
> > NoSuchFileException) then it can be out of sync as well?
>
> maybeRefresh (assuming new docs were indexed since you last called it)
> will write new index files holding those indexed docs, and then open
> them to do searching over them.
>
> > I apparently never made a full copy of the stack trace. I'll attempt to
> > regenerate it and post it here once I have it.
>
> OK, we need to understand that.  It should not be happening ;)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

William Moss
Confusingly, I can't for the life of me seem to replicate this error (it
used to be quite easy). I wish I still had the logs somewhere :-(

One important thing I left out is that we're also calling delete a lot.
Most of the time when we get an update from Kafka it's an update to an
existing document, so we call deleteDocument on IW (side note, why is there
no delete for DTW?), then FacetsConfig.build, then addDocument. This,
presumably, further exacerbates the concurrency issue we've been discussing.

To summarize what I'm taking away from all this though, it seems like
having multiple threads calling these functions is asking for trouble. To
that end, I'm thinking of redesigning our system as follows:
1. Add a read/write lock for use when indexing.
2. Acquire the read lock when calling deleteDocument, FacetsConfig.build
and addDocument. Continue calling these from multiple threads.
3. Create a separate thread that, every N seconds, acquires the write lock,
calls commit on both IW and DTW, update our offsets in Kafka, then
maybeRefresh on STR.

I'll also investigate removing forceMerge, but that's relatively
independent of all this.

Thanks again!
Will


On Wed, Sep 28, 2016 at 4:19 AM Shai Erera <[hidden email]> wrote:

> *> However, that should not lead to NSFE.  At worst it should lead to>
> "ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.*
>
> That is correct, this interleaving indexing case can potentially result in
> an AIOOBE like exception during faceted search, when the facets that are in
> the "sneaked-in-docs" will be found be a search, but resolving the ordinals
> to their labels will fail because the labels will be unknown to the
> taxonomy.
>
> I wonder if committing the opposite order solves this problem. So in the
> above use case, IW.commit() commits all the new docs with their facets,
> then if more indexing happens before TIW.commit(), then the commit to the
> taxonomy index results in more facets than are known to the search index,
> but that's ok.
>
> I'm just not sure if that covers all concurrency cases though. I remember
> this was discussed several times in the past, and we eventually reached a
> conclusion, but clearly if it was the latter, it wasn't clarified in the
> javadocs. I can't think of a use case that breaks this commit order though
> (
> IW.commit() followed by TIW.commit()). This feels safe to me ... can you
> try to think of a use case that breaks it? Assuming that each doc-indexing
> does addTaxo() followed by addDoc().
>
> Maybe we should have a helper which takes an IW and TIW and exposes
> commit() APIs that will do it in the correct order?
>
> Now I'm thinking about SearcherTaxoManager -- it reopens the readers by
> first re-opening IR, then TIR. It does so under the assumption of first
> committing to TIW then to IW. Now if we reverse the order, then you need to
> be more careful in when you commit changes to the two writers, and when you
> re-open the readers. If you always do that from the same thread, then you
> should be fine, the order of re-opens doesn't really matter.
>
> But you re-open from a different thread than the one you commit, I am not
> sure that committing to IW first then TIW can play well with any re-open
> order? I.e. one case which breaks it is you commit to IW, then re-open both
> IR and TIR, before you commit to TIW, and you have a search which may find
> ordinals that are unknown you to the TIR?
>
> So I'd say that if you refresh() from the same thread that you do commit(),
> then commit to IW first then TIW, and use SearcherTaxoManager as it's
> currently implemented. But I'd like to hear your thoughts about it.
>
> Shai
>
>
> On Wed, Sep 28, 2016 at 1:26 PM Michael McCandless <
> [hidden email]> wrote:
>
> > On Wed, Sep 28, 2016 at 3:05 AM, William Moss
> > <[hidden email]> wrote:
> > > Thank you both for your quick reply!
> >
> > You're welcome!
> >
> > > * We actually tried the upgrade to 6.0 a few months back (when that was
> > the
> > > newest) and were getting similar errors to the ones I'm seeing now. We
> > were
> > > not able to track them down, which is part of the motivation for me
> > asking
> > > all these questions. We'll get there though :-)
> >
> > OK, we gotta get to the root cause.  Sounds like it happens in either
> > version...
> >
> > > * The last time we tested this (which I think was still post
> > > ConcurrentMergePolicy) we saw that the read speed would slowly degrade
> > over
> > > time. My understanding was that forceMerge was very expensive, but
> would
> > > make reads faster once complete. Is this not correct?
> >
> > It really depends on what queries you are running.  Really you should
> > test in your use case and be certain that the massive expense of force
> > merge is worthwhile / necessary.  In general it's not worth it, ever
> > if searches are a bit faster, except for indices that will never
> > change again.
> >
> > > Also, we never
> > > attempted to tune the MergePolicy at all, so while were on the subject,
> > is
> > > there good documentation on how to do that? I'm much prefer to get away
> > > from calling forceMerge. If it's useful information, we've got a
> > relatively
> > > small corpus, only ~2+M documents.
> >
> > Just use the defaults :)  Tuning those settings is dangerous unless
> > you have a very specific problem to fix.
> >
> > > * We want to be able to ensure that if a machine or JVM crashes we are
> > in a
> > > coherent state. To that end, we need to call commit on Lucene and then
> > > commit back what we've read so far to Kafka. Calling commit is the only
> > way
> > > to ensure this, right?
> >
> > Correct: commit in Lucene, then notify Kafka what offset you had
> > indexed just before you called IW.commit.
> >
> > But you may want to replicate the index across machines if you don't
> > want to have a single point of failure.  We recently added
> > near-real-time replication to Lucene for this use case ...
> >
> > > * To make sure I understand how maybeRefresh works, ignoring whether or
> > not
> > > we commit for a second, if I add a document via IndexWriter, it will
> not
> > be
> > > reflected in IndexSearchers I get by calling acquire on
> > SearcherAndTaxonomy
> > > until I call maybeRefresh?
> >
> > Correct.
> >
> > > Now, on to the concurrency issue. I was thinking a little more about
> this
> > > and I think the fundamental issue is that while IndexWriter and
> > > DirectoryTaxonomyWriter are each thread safe, them together are not. As
> > > suggested by the documentation, we use one instance each of
> IndexWriter,
> > > DirectoryTaxonomyWriter and SearcherTaxonomyManager. Imagine the
> > following
> > > scenario:
> > > [Thread 1] Add document to DirectoryTaxonomyWriter
> > > [Thread 1] Add document to IndexWriter
> > > [Thread 1] Call commit on DirectoryTaxonomyWriter
> > > [Thread 2] Add document to DirectoryTaxonomyWriter
> > > [Thread 2] Add document to IndexWriter
> > > [Thread 1] Call commit on IndexWriter
> > > The on disc representation now should contain things in the IndexWriter
> > > that are not contained in the DirectoryTaxonomyWriter, right?
> >
> > Correct, I'm also confused about the commit order for this reason.
> > Let's see what Shai says.
> >
> > However, that should not lead to NSFE.  At worst it should lead to
> > "ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.
> >
> > > Assuming maybeRefresh looks at the state on disk when it's doing it's
> > > update (if this not true I don't understand why it was throwing
> > > NoSuchFileException) then it can be out of sync as well?
> >
> > maybeRefresh (assuming new docs were indexed since you last called it)
> > will write new index files holding those indexed docs, and then open
> > them to do searching over them.
> >
> > > I apparently never made a full copy of the stack trace. I'll attempt to
> > > regenerate it and post it here once I have it.
> >
> > OK, we need to understand that.  It should not be happening ;)
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>