Index documents in async way

Cao Mạnh Đạt
Hi guys,

First of all, it seems I have used the term "async" a lot recently :D.
I have been thinking a lot about changing Solr's indexing model from the current synchronous one (the user submits an update request and waits for the response). What about switching to an asynchronous model, where nodes only persist the update into the tlog and then return immediately, much like what the tlog does today? A dedicated executor would then read from the tlog and do the actual indexing (a producer/consumer model, with the tlog acting as the queue).
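
To make the model concrete, here is a minimal sketch; the class, the in-memory queue standing in for the tlog, and the version stand-in are all illustrative, not existing Solr code. The producer side only persists and returns, and a single consumer drains the queue and indexes in batches through Lucene's IndexWriter.addDocuments.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

/** Illustrative only: a queue stands in for the tlog, one consumer drains it in batches. */
class AsyncIndexer implements AutoCloseable {
  private final BlockingQueue<Document> queue = new LinkedBlockingQueue<>();
  private final ExecutorService consumer = Executors.newSingleThreadExecutor();
  private final IndexWriter writer;
  private volatile boolean closed;

  AsyncIndexer(IndexWriter writer) {
    this.writer = writer;
    consumer.submit(this::drainLoop);
  }

  /** "Producer" side: persist (here: just enqueue) and return immediately. */
  long submit(Document doc) throws InterruptedException {
    queue.put(doc);            // in the real proposal this would be a durable tlog append
    return System.nanoTime();  // stand-in for the update version / tracking id
  }

  /** "Consumer" side: drain whatever is queued and index it in one Lucene call. */
  private void drainLoop() {
    List<Document> batch = new ArrayList<>();
    while (!closed) {
      try {
        Document first = queue.poll(100, TimeUnit.MILLISECONDS);
        if (first == null) continue;
        batch.add(first);
        queue.drainTo(batch, 999);    // pick up anything else queued, up to a batch limit
        writer.addDocuments(batch);   // one addDocuments call for the whole batch
      } catch (Exception e) {
        // a real implementation would record the failure against the versions in this batch
      } finally {
        batch.clear();
      }
    }
  }

  @Override
  public void close() throws Exception {
    closed = true;
    consumer.shutdown();
    consumer.awaitTermination(10, TimeUnit.SECONDS);
    writer.close();
  }
}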

I see several big benefits to this approach:
  • We can batch updates in a single call. Right now we do not use the writer.addDocuments(...) API from Lucene; batching updates this way should boost indexing performance.
  • One common problem with Solr today is that many threads do indexing concurrently, which can end up producing many small segments. With this model we get bigger segments and therefore lower merge cost.
  • Another huge reason is that after switching to this model, we can remove the tlog and use a distributed queue like Kafka or Pulsar. Since the purpose of the leader in SolrCloud today is ordering updates, and the distributed queue already orders updates for us, there is no need for a dedicated leader. And that is just the beginning of what we could do once we use a distributed queue.
What do you guys think about this? I just want to hear from you before going deep into this rabbit hole.

Thanks!


Re: Index documents in async way

Mike Drob-3
Interesting idea! Can you explain a little more on how this would impact durability of updates? What does a failure look like, and how does that information get propagated back to the client app?

Mike 


Re: Index documents in async way

Cao Mạnh Đạt
> Can you explain a little more on how this would impact durability of updates?
Since we persist updates into the tlog, I do not think this will be an issue.

> What does a failure look like, and how does that information get propagated back to the client app?
I have not been able to do much research yet, but I think this would work the same way as our current asyncId mechanism. In this case the asyncId would be the version of an update (with a distributed queue it would be the offset). Failed updates would be put into a time-to-live map so users can query for failures; for successes we can skip that by leveraging the max succeeded version so far.


Re: Index documents in async way

Ishan Chattopadhyaya
Can there be a situation where the index writer fails after the document was added to the tlog and a success has already been sent to the user? I think we want to avoid such a situation, don't we?


Re: Index documents in async way

Erick Erickson
I suppose failures would be returned to the client on the async response?

How would one keep the tlog from growing forever if the actual indexing took a long time?

I'm guessing that this would be optional...


Re: Index documents in async way

Joel Bernstein
I think this model has a lot of potential. 

I'd like to add another wrinkle to this, which is to store the information about each batch as a record in the index. Each batch record would contain a fingerprint for the batch. This solves lots of problems and allows us to confirm the integrity of the batch. It also means we can compare indexes by comparing the batch fingerprints rather than building a fingerprint from the entire index.
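
Roughly, a batch record could be built like this (field names and hashing are purely illustrative, not an existing implementation):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

// Illustrative: build a per-batch "record" document carrying a fingerprint of the batch contents.
class BatchFingerprint {
  static Document batchRecord(long batchId, List<String> serializedDocs) throws Exception {
    MessageDigest sha = MessageDigest.getInstance("SHA-256");
    for (String doc : serializedDocs) {
      sha.update(doc.getBytes(StandardCharsets.UTF_8));
    }
    // hex-encode the digest (leading zeros dropped; fine for a sketch)
    String fingerprint = new BigInteger(1, sha.digest()).toString(16);

    Document record = new Document();
    record.add(new StringField("type", "batch_record", Field.Store.YES));
    record.add(new StringField("batch_id", Long.toString(batchId), Field.Store.YES));
    record.add(new StoredField("fingerprint", fingerprint));
    // comparing two replicas then means comparing their batch_id -> fingerprint pairs,
    // instead of fingerprinting the whole index
    return record;
  }
}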


Re: Index documents in async way

caomanhdat
> Can there be a situation where the index writer fails after the document was added to the tlog and a success has already been sent to the user? I think we want to avoid such a situation, don't we?
> I suppose failures would be returned to the client on the async response?
To make things clearer, the response for an async update would be something like this:
{ "trackId" : "<update_version>" }
The user would then call another endpoint to track the update, like GET status_updates?trackId=<update_version>; the response tells whether the update is in_queue, processing, succeeded or failed. Currently we are also adding to the tlog first and then calling writer.addDoc later.
Later we can convert the current sync operations to wait until the update gets processed before returning to the user.
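
A rough sketch of the tracking side (names are made up for illustration; a real implementation would also expire failure entries after a TTL):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative: track async update status by version. Only failures are remembered;
// successes are implied by the highest version known to have been indexed successfully.
class UpdateStatusTracker {
  enum Status { IN_QUEUE, FAILED, SUCCEEDED }

  private final Map<Long, String> failuresByVersion = new ConcurrentHashMap<>(); // version -> error
  private final AtomicLong maxSucceededVersion = new AtomicLong(-1);

  void markSucceeded(long version) {
    maxSucceededVersion.accumulateAndGet(version, Math::max);
  }

  void markFailed(long version, String error) {
    failuresByVersion.put(version, error); // a real impl would evict entries after a TTL
  }

  /** Backs something like GET status_updates?trackId=<version>. */
  Status status(long version) {
    if (failuresByVersion.containsKey(version)) return Status.FAILED;
    if (version <= maxSucceededVersion.get()) return Status.SUCCEEDED;
    return Status.IN_QUEUE; // still waiting for (or being handled by) the indexing executor
  }
}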

> How would one keep the tlog from growing forever if the actual indexing took a long time?
I think it won't be very different from what we have now, since on commit (the producer threads do the commit) we rotate to a new tlog.

> I'd like to add another wrinkle to this, which is to store the information about each batch as a record in the index. Each batch record would contain a fingerprint for the batch. This solves lots of problems and allows us to confirm the integrity of the batch. It also means we can compare indexes by comparing the batch fingerprints rather than building a fingerprint from the entire index.
Thank you, that adds another pro to this model :P


--
Best regards,
Cao Mạnh Đạt

Re: Index documents in async way

Tomás Fernández Löbbe
Interesting idea, Đạt. The first questions/comments that come to my mind:
* Atomic updates, can those be supported? I guess yes, if we can guarantee that messages are read once and only once.
* I'm guessing we'd need to read messages in an ordered way, so it'd be a single Kafka partition per Solr shard, right? (I don't know Pulsar.)
* It may be difficult to determine what replicas should do after a document update failure. Do they continue processing (which means that if it was a transient error they'll become inconsistent), or do they stop? Maybe try to recover from other active replicas? But if none of the replicas could process the document, would they all go into recovery?

> The user would then call another endpoint to track the update, like GET status_updates?trackId=<update_version>
Maybe we could have a way to stream those responses out (i.e. via another queue)? Maybe with an option to only stream out errors or something.

> Currently we are also adding to the tlog first and then calling writer.addDoc later
I don't think that's correct? See DUH2.doNormalUpdate.

> I think it won't be very different from what we have now, since on commit (the producer threads do the commit) we rotate to a new tlog.
How would this work in your mind with one of the distributed queues?

I think this is a great idea, something that needs to be thought through deeply, but it could bring big improvements. Thanks for bringing this up, Đạt.


Re: Index documents in async way

Cao Mạnh Đạt
Thank you Tomás.

> Atomic updates, can those be supported? I guess yes, if we can guarantee that messages are read once and only once.
It won't be straightforward, since we have multiple consumers on the tlog queue, but it is possible with appropriate locking.

> I'm guessing we'd need to read messages in an ordered way, so it'd be a single Kafka partition per Solr shard, right? (I don't know Pulsar.)
That will likely be the case, but like I said, async updates are the first piece; switching to Kafka is another area to look at.

> It may be difficult to determine what replicas should do after a document update failure. Do they continue processing (which means that if it was a transient error they'll become inconsistent), or do they stop? But if none of the replicas could process the document, would they all go into recovery?
Good question, I had not thought about this, but I think the current model of SolrCloud needs to answer this question too, e.g. the leader failed the update while other replicas succeeded.

> Maybe try to recover from other active replicas?
I think that is totally possible.

> Maybe we could have a way to stream those responses out (i.e. via another queue)? Maybe with an option to only stream out errors or something.
It could be, but for REST users that would be difficult.

> I don't think that's correct? See DUH2.doNormalUpdate.
You're right, we actually run the update first and write to the tlog later.

> How would this work in your mind with one of the distributed queues?
For a distributed queue, basically for every commit we need to store the latest consumed offset corresponding to that commit. An easy solution here could be to block everything and then do the commit; the commit data would store the latest consumed offset.
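
For illustration, Lucene's commit user data could carry that offset, roughly like this (the "consumed_offset" key is just a made-up name for the sketch):

import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

// Illustrative: tie the latest consumed queue offset to a Lucene commit via commit user data,
// so a restarted consumer knows where to resume.
class OffsetCheckpoint {

  /** Attach the offset to the next commit, then commit. */
  static void commitWithOffset(IndexWriter writer, long consumedOffset) throws Exception {
    writer.setLiveCommitData(Map.of("consumed_offset", Long.toString(consumedOffset)).entrySet());
    writer.commit();
  }

  /** On startup: read the offset stored with the last commit and resume consuming from there. */
  static long lastCommittedOffset(Directory dir) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      String value = reader.getIndexCommit().getUserData().get("consumed_offset");
      return value == null ? -1L : Long.parseLong(value);
    }
  }
}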




--
Best regards,
Cao Mạnh Đạt

Re: Index documents in async way

Ilan Ginzburg
I like the idea.

Two (main) points are not clear to me:
- Order of updates: if the current leader fails (its tlog becoming inaccessible) and another leader is elected and indexes some more, what happens when the first leader comes back? What does it do with its tlog, and how does it know which part still needs to be indexed and which part should not be (more recent versions of the same docs having already been indexed by the other leader)?
- Durability: if the tlog only exists on a single node before the client gets an ACK and that node fails "forever", the update is lost. We need a way to even detect that an update was lost (which might not be obvious).

It's unclear to me whether your previous answers address these points.

Ilan


Re: Index documents in async way

David Smiley
In reply to this post by Cao Mạnh Đạt


On Thu, Oct 8, 2020 at 10:21 AM Cao Mạnh Đạt <[hidden email]> wrote:
> What about switching to an asynchronous model, where nodes only persist the update into the tlog and then return immediately? A dedicated executor would then read from the tlog and do the actual indexing (a producer/consumer model, with the tlog acting as the queue).

The biggest problem I have with this is that the client doesn't know about indexing problems without awkward callbacks later to see if something went wrong. Even simple stuff like a schema problem (e.g. an undefined field). It's a useful *option*, anyway.

> We can batch updates in a single call. Right now we do not use the writer.addDocuments(...) API from Lucene; batching updates this way should boost indexing performance.

I'm a bit skeptical that this would boost indexing performance. Please understand that the intent of that API is transactionality (atomic add) and ensuring all docs go into the same segment. Solr *does* use that API for nested / parent-child documents, because it has to. If it were called for normal docs, I could see the configured indexing buffer RAM or doc limit being exceeded substantially. Perhaps not a big deal. You could test your performance theory on a hacked Solr without much modification, I think? Just buffer, then send in bulk.

> One common problem with Solr today is that many threads do indexing concurrently, which can end up producing many small segments. With this model we get bigger segments and therefore lower merge cost.

This is app/use-case dependent, of course. If you observe the segment count to be high, I think it's more likely due to a sub-optimal commit strategy. Many users should put more thought into this. If update requests to Solr contain a small number of docs and explicitly issue a commit (commitWithin, on the other hand, is fine), that is a recipe for small segments and is generally expensive as well (commits are kinda expensive). Many apps would do well to either pass commitWithin or rely on a configured commitWithin, accomplishing the same thing instead of an explicit commit. Apps that can't do that (e.g. they need to immediately read their own writes, or for other reasons, as where I work) can't use async anyway.

An idea I've had to help throughput for this case would be for commits that are about to happen concurrently with other indexing to voluntarily wait a short period of time (500ms?) in an attempt to coalesce the commit needs of both concurrent indexing requests. Very opt-in, probably a solrconfig value, and probably a wait time in the hundreds of milliseconds. An ideal implementation would be conservative and skip the wait when there is no concurrent indexing request that started before the current one, or none that also requires a commit.
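
A very rough sketch of that coalescing idea, just to illustrate the shape of it (nothing here is existing Solr code, and the wait window would be the solrconfig knob mentioned above):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.lucene.index.IndexWriter;

// Illustrative: concurrent commit requests arriving within a short window share one commit.
class CoalescingCommitter {
  private final IndexWriter writer;
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final AtomicReference<CompletableFuture<Void>> pending = new AtomicReference<>();
  private final long waitMillis;

  CoalescingCommitter(IndexWriter writer, long waitMillis) {
    this.writer = writer;
    this.waitMillis = waitMillis;
  }

  /** Returns a future completed once a commit covering this request has run. */
  CompletableFuture<Void> requestCommit() {
    CompletableFuture<Void> fresh = new CompletableFuture<>();
    CompletableFuture<Void> existing = pending.compareAndExchange(null, fresh);
    if (existing != null) {
      return existing; // piggy-back on the commit already scheduled
    }
    scheduler.schedule(() -> {
      pending.set(null); // requests arriving after this point get the next scheduled commit
      try {
        writer.commit();
        fresh.complete(null);
      } catch (Exception e) {
        fresh.completeExceptionally(e);
      }
    }, waitMillis, TimeUnit.MILLISECONDS);
    return fresh;
  }
}

A request thread that needs read-your-write semantics would then call committer.requestCommit().join() instead of committing directly.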

If your goal is fewer segments, then definitely check out the recent improvements to Lucene to do lightweight merge-on-commit. The Solr-side hook is SOLR-14582 and it requires a custom MergePolicy. I contributed such a MergePolicy here: https://github.com/apache/lucene-solr/pull/1222, although it needs to be updated in light of Lucene changes that have occurred since then. We've been using that MergePolicy at Salesforce for a couple of years and it has cut our segment count in half! Of course, if you can generate fewer segments in the first place, that's preferable and more to your point.

> Another huge reason is that after switching to this model, we can remove the tlog and use a distributed queue like Kafka or Pulsar. Since the purpose of the leader in SolrCloud today is ordering updates, and the distributed queue already orders updates for us, there is no need for a dedicated leader.

I think that's the real thrust of your motivations, and it sounds good to me! Also, please see https://issues.apache.org/jira/browse/SOLR-14778 about making the optionality of the updateLog a better supported option in SolrCloud.
 


Re: Index documents in async way

Gus Heck
This is interesting, though it opens a few cans of worms IMHO.
  1. Currently we guarantee that if Solr sends you an OK response, the document WILL eventually become searchable without further action. Maintaining that guarantee becomes impossible if we haven't verified that the data is formatted correctly (i.e. dates are in ISO format, etc.). This may be an acceptable cost for those opting into async indexing, but it may be very hard for some folks to swallow if it became the only option.
  2. In the case of errors we need to hold the error message indefinitely for later discovery by the client, and this must not accumulate forever. Thus:
    1. We have a timed cleanup, leasing, or some other self-limiting pattern... possibly by indexing the failures in a TRA with autodelete so that clients can efficiently find the status of the particular document(s) they sent; obviously there's at least an async id involved, probably the uniqueKey (where available), and timestamps for received and processed as well.
    2. We log more simply with a sequential id and let clients keep track of what they have seen... This can lead us down the path of re-inventing Kafka, or making Kafka a required dependency.
    3. We provide a push-oriented connection (websocket? HTTP/2?) that clients that care about failures can listen to, and store nothing. A less appetizing variant is to publish errors to a message bus.
  3. If we have more than one thread picking up the submitted documents and writing them, we need a state machine that identifies in-progress documents, to prevent multiple pickups and to reset processing to "new" on startup, so we don't index the same document twice and don't lose things that were in flight on power loss.
  4. Backpressure/throttling: if we're continuously losing ground on the submissions because indexing is heavier than accepting documents, we may fill up the disk. Of course the index itself can do that too, but we need to think about whether this makes it worse.
A big plus to this, however, is that batches with errors could optionally just omit the (one or two?) errored document(s) and publish the error for each errored document rather than failing the whole batch, meaning that the indexing infrastructure submitting in batches doesn't have to leave several hundred docs unprocessed, or alternately do a slow doc-at-a-time resubmit to weed out the offenders.
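
As a sketch of that per-document tolerance (illustrative only, and note it gives up the atomic addDocuments call in exchange for isolating bad documents):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Illustrative: index a batch document-by-document, skipping and recording the documents that
// fail instead of rejecting the whole batch. The "id" field name is an assumption for the sketch.
class TolerantBatchIndexer {
  /** Returns a map of failed doc ids -> error message; everything else was indexed. */
  static Map<String, String> indexBatch(IndexWriter writer, List<Document> batch) {
    Map<String, String> errors = new LinkedHashMap<>();
    for (Document doc : batch) {
      try {
        writer.addDocument(doc);
      } catch (Exception e) {
        String id = doc.get("id") == null ? "<unknown>" : doc.get("id");
        errors.put(id, e.getMessage()); // publish these per-document errors back to the client/queue
      }
    }
    return errors;
  }
}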

Certainly the involvement of Kafka sounds interesting. If one persists to an externally addressable location like a Kafka queue, one might leave the option for the write-on-receipt queue to be different from the read-to-actually-index queue, and put a pipeline behind Solr instead of in front of it... possibly atomic updates could then be given identical processing to initial indexing...

On Sat, Oct 10, 2020 at 12:41 AM David Smiley <[hidden email]> wrote:


On Thu, Oct 8, 2020 at 10:21 AM Cao Mạnh Đạt <[hidden email]> wrote:
Hi guys,

First of all it seems that I used the term async a lot recently :D. 
Recently I have been thinking a lot about changing the current indexing model of Solr from sync way like currently (user submit an update request waiting for response). What about changing it to async model, where nodes will only persist the update into tlog then return immediately much like what tlog is doing now. Then we have a dedicated executor which reads from tlog to do indexing (producer consumer model with tlog acting like the queue). 

The biggest problem I have with this is that the client doesn't know about indexing problems without awkward callbacks later to see if something went wrong.  Even simple stuff like a schema problem (e.g. undefined field).  It's a useful *option*, any way.
 

I do see several big benefits of this approach
  • We can batching updates in a single call, right now we do not use writer.add(documents) api from lucene, by batching updates this gonna boost the performance of indexing
I'm a bit skeptical that would boost indexing performance.  Please understand the intent of that API is about transactionality (atomic add) and ensuring all docs go in the same segment.  Solr *does* use that API for nested / parent-child documents, and because it has to.  If that API were to get called for normal docs, I could see the configured indexing buffer RAM or doc limit could be exceeded substantially.  Perhaps not a big deal.  You could test your performance theory on a hacked Solr without much modifications, I think?  Just buffer then send in bulk.
  • One common problems with Solr now is we have lot of threads doing indexing so that can ends up with many small segments. Using this model we can have bigger segments so less merge cost
This is app/use-case dependent of course.  If you observe the segment count to be high, I think it's more likely due to a sub-optimal commit strategy.  Many users should put more thought into this.  If update requests to Solr have a small number of docs and it explicitly gives a commit (commitWithin on the other hand is fine), then this is a recipe for small segments and is generally expensive as well (commits are kinda expensive).  Many apps would do well to either pass commitWithin or rely on a configured commitWithin, accomplishing the same instead of commit.  For apps that can't do that (e.g. need to immediately read-your-write or for other reasons where I work), then such an app can't use async any way.  

An idea I've had to help throughput for this case would be for commits that are about to happen concurrently with other indexing to voluntarily wait a short period of time (500ms?) in an attempt to coalesce the commit needs of both concurrent indexing requests.  Very opt-in, probably a solrconfig value, and probably a wait time in the 100s of milliseconds range.  An ideal implementation would be conservative to avoid this waiting if there is no concurrent indexing request that did not start after the current request or that which doesn't require a commit as well.

If your goal is fewer segments, then definitely check out the recent improvements to Lucene to do some lightweight merge-on-commit.  The Solr-side hook is SOLR-14582 and it requires a custom MergePolicy.  I contributed such a MergePolicy policy here: https://github.com/apache/lucene-solr/pull/1222 although it needs to be updated in light of Lucene changes that occured since then.  We've been using that MergePolicy at Salesforce for a couple years and it has cut our segment count in half!  Of course if you can generate fewer segments in the first place, that's preferable and is more to your point.
  • Another huge reason here is that after switching to this model, we can remove the tlog and use a distributed queue like Kafka or Pulsar. Since the purpose of the leader in SolrCloud now is ordering updates, and the distributed queue already orders updates for us, there is no need to have a dedicated leader. That is just the beginning of what we can do once we use a distributed queue.
I think that's the real thrust of your motivations, and sounds good to me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778 for making the optionality of the updateLog a better-supported option in SolrCloud.
 
What do you guys think about this? I just want to hear from you before going deep into this rabbit hole.

Thanks!



--
Reply | Threaded
Open this post in threaded view
|

Re: Index documents in async way

caomanhdat
> The biggest problem I have with this is that the client doesn't know about indexing problems without awkward callbacks later to see if something went wrong.  Even simple stuff like a schema problem (e.g. an undefined field).  It's a useful *option*, anyway.
> Currently we guarantee that if Solr sends you an OK response the document WILL eventually become searchable without further action. Maintaining that guarantee becomes impossible if we haven't verified that the data is formatted correctly (i.e. dates are in ISO format, etc.). This may be an acceptable cost for those opting for async indexing, but it may be very hard for some folks to swallow if it became the only option.

I don't mean that we're going to replace the sync way. Besides, the sync path can be changed to leverage the async machinery: for example, the sync thread simply waits until the indexer threads finish indexing that update. Note that all the validations in the update processors would still happen in a *sync* way in this model; only the indexing into Lucene is executed asynchronously.
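
To make that concrete, here is a rough sketch of the producer/consumer shape, with the tlog role played by an in-memory queue for brevity (all names are illustrative; none of this is existing Solr code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class AsyncIndexer implements Runnable {
  private final BlockingQueue<Update> queue = new LinkedBlockingQueue<>();
  private final IndexWriter writer;

  public AsyncIndexer(IndexWriter writer) { this.writer = writer; }

  static final class Update {
    final List<Document> docs;
    final CompletableFuture<Void> done = new CompletableFuture<>();
    Update(List<Document> docs) { this.docs = docs; }
  }

  // Producer side: validation / update processors have already run (sync).
  public CompletableFuture<Void> submit(List<Document> docs) {
    Update u = new Update(docs);
    queue.add(u);
    return u.done;   // async caller returns now; sync caller calls join()
  }

  // Consumer side: one dedicated thread drains the queue and indexes in bulk.
  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        List<Update> batch = new ArrayList<>();
        batch.add(queue.take());          // block for at least one update
        queue.drainTo(batch);             // then grab whatever else is pending
        for (Update u : batch) {
          try {
            writer.addDocuments(u.docs);  // whole client batch applied together
            u.done.complete(null);
          } catch (Exception e) {
            u.done.completeExceptionally(e);
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

A sync request would simply call submit(docs).join(), while an async request would return as soon as the update is durable (in the real thing, once it is written to the tlog rather than an in-memory queue).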

> I'm a bit skeptical that would boost indexing performance.  Please understand the intent of that API is transactionality (an atomic add) and ensuring all docs go into the same segment.  Solr *does* use that API for nested / parent-child documents, because it has to.  If that API were called for normal docs, I could see the configured indexing buffer RAM or doc limit being exceeded substantially.  Perhaps not a big deal.  You could test your performance theory on a hacked Solr without much modification, I think?  Just buffer, then send in bulk.
    What I mean here is that right now, when we send a batch of documents to Solr, we still process them as discrete, unrelated documents, indexing them one by one. If indexing the fifth document causes an error, that won't affect the four documents already indexed. Using this model we can index the batch in an atomic way.

    > I think that's the real thrust of your motivations, and sounds good to me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778 for making the optionality of the updateLog a better-supported option in SolrCloud.

    Thank you for bringing it up, I will take a look.

    All the other parts that did not get quoted are very valuable as well, and I really appreciate them.

    On Wed, Oct 14, 2020 at 12:06 AM Gus Heck <[hidden email]> wrote:
    This is interesting, though it opens a few cans of worms IMHO. 
    1. Currently we guarantee that if Solr sends you an OK response the document WILL eventually become searchable without further action. Maintaining that guarantee becomes impossible if we haven't verified that the data is formatted correctly (i.e. dates are in ISO format, etc.). This may be an acceptable cost for those opting for async indexing, but it may be very hard for some folks to swallow if it became the only option.
    2. In the case of errors we need to hold the error message for later discovery by the client, and this storage must not accumulate forever. Thus:
      1. We have a timed cleanup, leasing, or some other self-limiting pattern... possibly by indexing the failures in a TRA with autodelete so that clients can efficiently find the status of the particular document(s) they sent; obviously there's at least an async id involved, probably the uniqueKey (where available) and timestamps for received and processed as well.
      2. We log more simply with a sequential id and let clients keep track of what they have seen... This can lead us down the path of re-inventing Kafka, or making Kafka a required dependency.
      3. We provide a push-oriented connection (WebSocket? HTTP/2?) that clients that care about failures can listen to, and store nothing ourselves. A less appetizing variant is to publish errors to a message bus.
    3. If we have more than one thread picking up the submitted documents and writing them, we need a state machine that identifies in-progress documents to prevent multiple pickups, and that resets in-progress documents back to new on startup, to ensure we don't index the same document twice and don't lose things that were in flight on power loss.
    4. Backpressure/throttling. If we're continuously losing ground on the submissions because indexing is heavier than accepting documents, we may fill up the disk. Of course the index itself can do that, but we need to think about whether this makes it worse.
    A big plus to this, however, is that batches with errors could optionally just omit the (one or two?) errored document(s) and publish the error for each errored document rather than failing the whole batch, meaning that the indexing infrastructure submitting in batches doesn't have to leave several hundred docs unprocessed, or alternatively do a slow doc-at-a-time resubmit to weed out the offenders.

    Certainly the involvement of Kafka sounds interesting. If one persists to an externally addressable location like a Kafka topic, one might leave open the option for the write-on-receipt queue to be different from the read-to-actually-index queue, and put a pipeline behind Solr instead of in front of it... possibly atomic updates could then be given the same processing as initial indexing....
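
    For concreteness, a rough sketch of how the persist-on-receipt path and the read-to-actually-index path could split once Kafka stands in for the tlog (the topic name and String serialization are made up; this is plain Kafka client code, not Solr):

    import java.time.Duration;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaTlogSketch {
      // Receive path: ack the client only once the update is durably in the topic.
      static void persistUpdate(KafkaProducer<String, String> producer,
                                String shardTopic, String docId, String updateJson) throws Exception {
        producer.send(new ProducerRecord<>(shardTopic, docId, updateJson)).get(); // wait for broker ack
      }

      // Index path: a dedicated consumer replays the topic in order and feeds the indexer.
      static void indexLoop(KafkaConsumer<String, String> consumer, String shardTopic) {
        consumer.subscribe(Collections.singletonList(shardTopic));
        while (true) {
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
          for (ConsumerRecord<String, String> r : records) {
            // parse r.value() and hand it to the Lucene indexing path
          }
          consumer.commitSync(); // record how far this replica has indexed
        }
      }
    }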

    --
    Best regards,
    Cao Mạnh Đạt
    Reply | Threaded
    Open this post in threaded view
    |

    Re: Index documents in async way

    David Smiley
    What I mean here is that right now, when we send a batch of documents to Solr, we still process them as discrete, unrelated documents, indexing them one by one. If indexing the fifth document causes an error, that won't affect the four documents already indexed. Using this model we can index the batch in an atomic way.

    It sounds like you are suggesting an alternative version of TolerantUpdateProcessorFactory (would look nothing like it though) that sends each document forward in its own thread concurrently instead of serially/synchronously?  If true, I'm quite supportive of this -- Solr should just work this way IMO.  It could speed up some use-cases tremendously and also cap abuse of too many client indexing requests (people who unwittingly send batches asynchronously/concurrently from a queue, possibly flooding Solr).  I remember I discussed this proposal with Mikhail many years ago at a Lucene/Solr Revolution conference in Washington DC (the first one held there?), as a way to speed up all indexing, not just DIH (which was his curiosity at the time).  He wasn't a fan.  I'll CC him to see if it jogs his memory :-)
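
    Purely as an illustrative sketch of that shape (this is not TolerantUpdateProcessorFactory, and every name below is made up): fan the batch out to a bounded pool, collect per-document failures, and report them together instead of failing the whole request at the first bad document.

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentBatchDispatcher {
  private final ExecutorService pool = Executors.newFixedThreadPool(8); // caps indexing concurrency

  public Map<String, Exception> dispatch(List<SolrInputDocument> docs,
                                         Consumer<SolrInputDocument> indexOne) {
    Map<String, Exception> failures = new ConcurrentHashMap<>();
    CompletableFuture<?>[] futures = docs.stream()
        .map(doc -> CompletableFuture.runAsync(() -> {
          try {
            indexOne.accept(doc);                              // the rest of the update chain
          } catch (Exception e) {
            failures.put(String.valueOf(doc.getFieldValue("id")), e);
          }
        }, pool))
        .toArray(CompletableFuture[]::new);
    CompletableFuture.allOf(futures).join();                   // request returns once all docs settle
    return failures;                                           // empty map == clean batch
  }
}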

    ~ David Smiley
    Apache Lucene/Solr Search Developer

