Recommended Update Batch Size?


Recommended Update Batch Size?

Walter Underwood, Netflix
What is a good size for batching updates? My xml update docs are
around 600-700 bytes each right now.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Recommended Update Batch Size?

Mike Klaas
On 10/31/06, Walter Underwood <[hidden email]> wrote:
> What is a good size for batching updates? My xml update docs are
> around 600-700 bytes each right now.

When I think of "batches" I think of documents sent before a
<commit/>, but it seems like you are talking about the number of
documents sent in a single HTTP POST.  For the latter, there isn't a
huge advantage to making gigantic batches.  I usually tune mine to be
about 50-100 kB (which for me is only about ten documents).  It is
definitely advantageous to run multiple threads, however (even on a
single-processor machine).

-Mike
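
For concreteness, here is a minimal sketch of that kind of client: a few
threads, each POSTing small XML batches to the /update handler.  The URL,
thread count, batch size, and field names are illustrative placeholders,
not values from this thread.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Minimal sketch of a multithreaded batching client.  The update URL,
    // batch size, and field names are placeholders -- tune for your setup.
    public class BatchIndexer implements Runnable {
        private static final String UPDATE_URL = "http://localhost:8983/solr/update";
        private static final int DOCS_PER_POST = 20;   // roughly 10-100 small docs per request
        private final int shard;                        // only used to keep ids unique per thread

        BatchIndexer(int shard) { this.shard = shard; }

        public void run() {
            StringBuilder batch = new StringBuilder("<add>");
            for (int i = 0; i < DOCS_PER_POST; i++) {
                batch.append("<doc><field name=\"id\">doc-").append(shard).append('-').append(i)
                     .append("</field><field name=\"title\">example</field></doc>");
            }
            batch.append("</add>");
            post(batch.toString());
        }

        private void post(String body) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(UPDATE_URL).openConnection();
                conn.setDoOutput(true);
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
                OutputStream out = conn.getOutputStream();
                out.write(body.getBytes(StandardCharsets.UTF_8));
                out.close();
                conn.getInputStream().close();          // read and discard the response
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            for (int t = 0; t < 4; t++) {               // a handful of threads is usually plenty
                new Thread(new BatchIndexer(t)).start();
            }
        }
    }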

Re: Recommended Update Batch Size?

Walter Underwood, Netflix
On 10/31/06 12:54 PM, "Mike Klaas" <[hidden email]> wrote:

> On 10/31/06, Walter Underwood <[hidden email]> wrote:
>> What is a good size for batching updates? My xml update docs are
>> around 600-700 bytes each right now.
>
> When I think of "batches" I think of documents sent before a
> <commit/>, but it seems like you are talking about the number of
> documents sent in a single HTTP POST.  For the latter, there isn't a
> huge advantage to making gigantic batches.  I usually tune mine to be
> about 50-100 kB (which for me is only about ten documents).  It is
> definitely advantageous to run multiple threads, however (even on a
> single-processor machine).

Right, I meant per HTTP POST. I was wondering about parallel
update requests, so thanks for that info. --wunder


Re: Recommended Update Batch Size?

Yonik Seeley-2
In reply to this post by Walter Underwood, Netflix
On 10/31/06, Walter Underwood <[hidden email]> wrote:
> What is a good size for batching updates? My xml update docs are
> around 600-700 bytes each right now.

There are two types of batches... documents per request (I wouldn't go
too big here) and documents added before a commit.

Bigger batches before a commit will be more efficient in general...
the only state that Solr keeps around before a commit is a
HashTable<String,Integer> entry per unique id deleted or overwritten.
You might be able to do your entire collection.

If you have a multi-CPU server, you could increase indexing
performance by using a multithreaded client to keep all the CPUs on
the server busy.

-Yonik

Re: Recommended Update Batch Size?

Chris Hostetter-3
In reply to this post by Walter Underwood, Netflix

: Right, I meant per HTTP POST. I was wondering about parallel
: update requests, so thanks for that info. --wunder

FYI: the last time I looked into it, there really wasn't any benefit in
sending multiple docs in a single /update POST request compared to using
Keep-Alive.



-Hoss
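
For what it's worth, a minimal sketch of the Keep-Alive approach Hoss
mentions: one document per POST, but over a reused connection.
java.net.HttpURLConnection quietly reuses the socket as long as each
response body is read and closed; the URL and document payloads below are
placeholders.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    // Sketch: one document per POST, relying on HTTP keep-alive rather than
    // batching.  The connection is reused as long as every response stream is
    // drained and closed.  URL and document XML are placeholders.
    public class KeepAlivePoster {
        static void postAll(List<String> docXmls) throws Exception {
            URL update = new URL("http://localhost:8983/solr/update");
            for (String docXml : docXmls) {
                HttpURLConnection conn = (HttpURLConnection) update.openConnection();
                conn.setDoOutput(true);
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
                conn.getOutputStream().write(("<add>" + docXml + "</add>").getBytes("UTF-8"));
                conn.getOutputStream().close();
                conn.getInputStream().close();       // drain so the socket can be reused
            }
        }
    }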


Re: Re: Recommended Update Batch Size?

Mike Klaas
In reply to this post by Yonik Seeley-2
On 10/31/06, Yonik Seeley <[hidden email]> wrote:

> Bigger batches before a commit will be more efficient in general...
> the only state that Solr keeps around before a commit is a
> HashTable<String,Integer> entry per unique id deleted or overwritten.
> You might be able to do your entire collection.

Note that _some_ care should be taken here as well.  I recently tried
to commit 3.9 million documents in one go to an index that already
contained every document (thus needing to delete them all) and ended up
in a strange situation where the CPU was spinning for over a day with
the Java heap maxed out (1.1 GB).  If you attempt less insane feats it
will go better.

DUH2.doDeletions() would also benefit greatly from sorting the id terms
before looking them up in these kinds of cases (as it would trigger
optimizations in Lucene as well as being kinder to the OS's read-ahead
buffers).

> If you have a multi-CPU server, you could increase indexing
> performance by using a multithreaded client to keep all the CPUs on
> the server busy.

I thought so, too, but it turns out that there isn't a huge amount of
concurrent updating that can occur, if I am reading the code
correctly.  DUH2.addDoc() calls exactly one of addConditionally,
overwriteBoth, or allowDups, each of which adds the document in a
synchronized(this) block.

This shouldn't be too hard to fix.  I'm going to take a look at doing so.

-Mike
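
As a rough picture of the bottleneck being described (a simplified
stand-in, not the actual DirectUpdateHandler2 code): every add path ends
up inside a single lock on the handler, so extra client threads mostly
just queue behind it.

    // Simplified stand-in for the pattern described above -- not the real
    // DirectUpdateHandler2 code.  All three add paths funnel into a
    // synchronized(this) block, so only one document is indexed at a time
    // no matter how many client threads are posting.
    public class UpdateHandlerSketch {
        public void addDoc(String id, String body, boolean allowDups, boolean overwrite) {
            if (!allowDups) {
                addConditionally(id, body);
            } else if (overwrite) {
                overwriteBoth(id, body);
            } else {
                allowDups(id, body);
            }
        }

        private void addConditionally(String id, String body) {
            synchronized (this) { index(id, body); }   // serialized on the handler
        }

        private void overwriteBoth(String id, String body) {
            synchronized (this) { index(id, body); }
        }

        private void allowDups(String id, String body) {
            synchronized (this) { index(id, body); }
        }

        private void index(String id, String body) {
            // stand-in for the actual IndexWriter.addDocument(...) call
        }
    }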

Re: Recommended Update Batch Size?

Walter Underwood, Netflix
A quick update on my experiments with update rate:

* 20 docs/sec using one wget call per POST
* 170 docs/sec using single doc POST over a persistent HTTP connection
* 250 docs/sec using 20 doc batches over persistent HTTP
* 250 docs/sec using 100 doc batches over persistent HTTP

The latter three used a commit every 2000 docs (not batches)
and an optimize every 10,000 docs.

Each submitted document is between 200 and 700 bytes, pretty small.

I didn't try parallel connections, since this speed is just
fine.

This is using the default settings for merge factor, max buffered docs,
and so on.

wunder
--
Walter Underwood
Search Guru, Netflix
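
A minimal sketch of that commit/optimize cadence, independent of how many
documents go into each POST; post() here is an assumed helper that sends an
XML body to /update (as in the earlier sketches), not a real Solr client API.

    import java.util.List;

    // Sketch of the cadence reported above: commit every 2,000 docs and
    // optimize every 10,000.  post() is an assumed helper, not defined here.
    public abstract class CommitCadence {
        protected abstract void post(String xmlBody);

        public void indexAll(List<List<String>> batches) {
            int sinceCommit = 0, sinceOptimize = 0;
            for (List<String> batch : batches) {
                post("<add>" + String.join("", batch) + "</add>");
                sinceCommit += batch.size();
                sinceOptimize += batch.size();
                if (sinceCommit >= 2000)    { post("<commit/>");   sinceCommit = 0; }
                if (sinceOptimize >= 10000) { post("<optimize/>"); sinceOptimize = 0; }
            }
        }
    }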




Re: Re: Recommended Update Batch Size?

Yonik Seeley-2
In reply to this post by Mike Klaas
On 11/1/06, Mike Klaas <[hidden email]> wrote:
> DUH2.doDeletions() would also benefit greatly from sorting the id terms
> before looking them up in these kinds of cases (as it would trigger
> optimizations in Lucene as well as being kinder to the OS's read-ahead
> buffers).

Hmmm, good point.  I wonder how simply using a TreeMap instead of a
HashMap would work.
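
A tiny illustration of what that would buy (made-up ids, not Solr's actual
bookkeeping): with a TreeMap the pending ids come back in sorted term order
with no separate sort step.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative sketch only: the same per-unique-id entries, but a TreeMap
    // iterates them in sorted term order, which is what the delete pass wants.
    public class SortedDeleteSketch {
        public static void main(String[] args) {
            Map<String, Integer> pending = new TreeMap<String, Integer>(); // vs. a HashMap
            pending.put("doc-0042", 1);
            pending.put("doc-0007", 1);
            pending.put("doc-0100", 1);
            // Iteration order: doc-0007, doc-0042, doc-0100 -- sorted, so term
            // lookups walk the index in order instead of jumping around.
            for (Map.Entry<String, Integer> e : pending.entrySet()) {
                System.out.println(e.getKey());
            }
        }
    }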

> > If you have a multi-CPU server, you could increase indexing
> > performance by using a multithreaded client to keep all the CPUs on
> > the server busy.
>
> I thought so, too, but it turns out that there isn't a huge amount of
> concurrent updating that can occur, if I am reading the code
> correctly.  DUH2.addDoc() calls exactly one of addConditionally,
> overwriteBoth, or allowDups, each of which adds the document in a
> synchronized(this) block.

Good catch.
And with the way that deletes are deferred, moving the add outside of
the sync block should work OK, I think... then the analysis of
documents can be done in parallel.

Hmmm, but it may not work well in a mixed-overwriting environment.
Thread 1 overwrites doc 100, Thread 2 adds doc 100 (allowing duplicates).
With add synchronization the index has two possible states:
   Index contains doc_from_thread1  OR index contains both docs
Without sync around the adds, an additional possible state is added:
  Index contains doc_from_thread2

Even though synchronized behavior != unsynchronized behavior, this is
only a problem if someone actually desires to mix overwriting &
non-overwriting on the same document ids, and is OK with the two
possible states in the synchronized case.

I'm tempted to say "mixing overwriting & non-overwriting adds for the
same documents has undefined behavior".  Thoughts?

-Yonik

Re: Re: Re: Recommended Update Batch Size?

Mike Klaas
On 11/2/06, Yonik Seeley <[hidden email]> wrote:
> On 11/1/06, Mike Klaas <[hidden email]> wrote:
> > DUH2.doDeletions() would also benefit greatly from sorting the id terms
> > before looking them up in these kinds of cases (as it would trigger
> > optimizations in Lucene as well as being kinder to the OS's read-ahead
> > buffers).
>
> Hmmm, good point.  I wonder how simply using a TreeMap instead of a
> HashMap would work.

Definitely.

> > I thought so, too, but it turns out that there isn't a huge amount of
> > concurrent updating that can occur, if I am reading the code
> > correctly.  DUH2.addDoc() calls exactly one of addConditionally,
> > overwriteBoth, or allowDups, each of which adds the document in a
> > synchronized(this) block.
>
> Good catch.
> And with the way that deletes are deferred, moving the add outside of
> the sync block should work OK, I think... then the analysis of
> documents can be done in parallel.

The one thing I'm worried about is closing the writer while documents
are being added to it. IndexWriter is nominally thread-safe, but I'm
not sure what happens to documents that are being added at the time.
Looking at IndexWriter.java, it seems that if addDocument() has been
entered but hasn't reached the synchronized block when close() is
called, the document could be lost or an exception raised.

> I'm tempted to say "mixing overwriting & non-overwriting adds for the
> same documents has undefined behavior".  Thoughts?

I believe that is reasonable.

I was going to try to put in some basic autoCommit logic while I was
mucking about here.  One question: did you intend for maxCommitTime to
trigger deterministically (regardless of any events occurring or not)?
 I had in mind checking these constraints only when documents are
added, but this could result in maxCommitTime elapsing without a
commit.

regards,
-Mike

Re: Re: Re: Recommended Update Batch Size?

Yonik Seeley-2
On 11/2/06, Mike Klaas <[hidden email]> wrote:
> The one thing I'm worried about is closing the writer while documents
> are being added to it. IndexWriter is nominally thread-safe, but I'm
> not sure what happens to documents that are being added at the time.
> Looking at IndexWriter.java, it seems that if addDocument() has been
> entered but hasn't reached the synchronized block when close() is
> called, the document could be lost or an exception raised.

This seems harder to address in "user code" and still maintain parallelism.
Perhaps a Lucene patch would be more appropriate?

Perhaps IndexWriter should have a close flag, and addDocument should
return a boolean indicating if the document was added or not.  Then we
could move addDocument() outside the sync block, and put a big do
while(!addDocument()) loop around the whole thing.
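
Roughly, the idea sketched out (a hypothetical wrapper, not Lucene's
actual IndexWriter API):

    // Hypothetical sketch of the close-flag idea -- not Lucene's real API.
    // add() returns false once close() has been called, so an unsynchronized
    // caller can fetch a fresh writer and retry.
    class GuardedWriter {
        private volatile boolean closed = false;

        boolean add(String docId) {
            if (closed) return false;      // writer went away; caller should retry
            // ... IndexWriter.addDocument(...) would happen here ...
            return !closed;                // still a window for races; sketch only
        }

        void close() { closed = true; }
    }

    // Caller, outside the big sync block:
    //   do { writer = currentWriter(); } while (!writer.add(id));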

There is still another case to consider: if a commit happens between
adding the id to the pset and adding the document to the index, and
the add succeeds, the id will no longer be in the pset so we will end
up with a duplicate after the next commit.

> I was going to try to put in some basic autoCommit logic while I was
> mucking about here.  One question: did you intend for maxCommitTime to
> trigger deterministically (regardless of any events occurring or not)?

I hadn't thought through the whole thing, but it seems like it should
only trigger if it would make a difference.

>  I had in mind checking these constraints only when documents are
> added, but this could result in maxCommitTime elapsing without a
> commit.

If there is nothing to commit, that should be fine.
I think the type of guarantee we should make is that if you add a
document, it will be committed within a certain period of time
(leaving out variances for autowarming time, etc).

-Yonik
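
One minimal way to picture that guarantee (illustrative only, not what
Solr's autoCommit implementation actually looks like): arm a timer when the
first uncommitted document arrives, so every add is committed within
maxCommitTime, and nothing fires while there is nothing pending.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch of the guarantee discussed above: a commit is
    // scheduled when the first pending document shows up, so adds are
    // committed within maxCommitTime, and no timer fires when nothing is
    // pending.  The timeout value is a placeholder.
    public class AutoCommitSketch {
        private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        private final long maxCommitTimeMs = 60_000;   // placeholder value
        private boolean commitPending = false;

        public synchronized void onDocAdded() {
            if (!commitPending) {
                commitPending = true;
                timer.schedule(this::commit, maxCommitTimeMs, TimeUnit.MILLISECONDS);
            }
        }

        public synchronized void commit() {
            if (!commitPending) return;
            commitPending = false;
            // ... issue the actual <commit/> here ...
        }
    }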

Re: Re: Re: Re: Recommended Update Batch Size?

Mike Klaas
On 11/2/06, Yonik Seeley <[hidden email]> wrote:

> On 11/2/06, Mike Klaas <[hidden email]> wrote:
> > The one thing I'm worried about is closing the writer while documents
> > are being added to it. IndexWriter is nominally thread-safe, but I'm
> > not sure what happens to documents that are being added at the time.
> > Looking at IndexWriter.java, it seems that if addDocument() has been
> > entered but hasn't reached the synchronized block when close() is
> > called, the document could be lost or an exception raised.
>
> This seems harder to address in "user code" and still maintain parallelism.
> Perhaps a Lucene patch would be more appropriate?
>
> Perhaps IndexWriter should have a close flag, and addDocument should
> return a boolean indicating if the document was added or not.  Then we
> could move addDocument() outside the sync block, and put a big do
> while(!addDocument()) loop around the whole thing.
>
> There is still another case to consider: if a commit happens between
> adding the id to the pset and adding the document to the index, and
> the add succeeds, the id will no longer be in the pset so we will end
> up with a duplicate after the next commit.

I think that I've come up with a new locking strategy that circumvents
all these issues... stay tuned.

> > I was going to try to put in some basic autoCommit logic while I was
> > mucking about here.  One question: did you intend for maxCommitTime to
> > trigger deterministically (regardless of any events occurring or not)?
>
> I hadn't thought through the whole thing, but it seems like it should
> only trigger if it would make a difference.

Right--I was more concerned with whether it would fire on its own, or
whether it was a condition that would only trigger when checked (and
found to be true).

> >  I had in mind checking these constraints only when documents are
> > added, but this could result in maxCommitTime elapsing without a
> > commit.
>
> If there is nothing to commit, that should be fine.
> I think the type of guarantee we should make is that if you add a
> document, it will be committed within a certain period of time
> (leaving out variances for autowarming time, etc).

That's the condition I was wondering about.  I may leave that out of
the patch for the time being.

-Mike