Multiple collections for a write-alias

S G
Hi,

We have a use-case where we need to re-create a Solr collection by re-ingesting
everything, without any downtime while that is happening.

We are using the collection alias feature to point to the new collection once
it has been fully re-ingested.

However, re-ingestion takes several hours to complete, and during that time
the customer has to write to both collections: the previous one and the one
being bootstrapped. This dual-write is hard to do from the client side
(the client needs retry logic to ensure an update does not succeed in one
collection while failing in the other, a consistency problem), and it would
be a very welcome addition if collection aliasing could support this.

Proposal:
If the write alias could be enhanced to point to multiple collections, such
that any update sent to the alias is written to all the collections it points
to, the client could avoid dual writes and issue just a single HTTP call
instead of several. It would also reduce the retry logic inside the client
code used to keep the collections consistent.
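In pseudocode, the client-side dual write this proposal would eliminate looks roughly like the sketch below (`send_update` is a hypothetical stand-in for a real SolrJ/HTTP call):

```python
# Sketch of client-side dual-write with per-collection retries.
# send_update(collection, doc) -> bool is a hypothetical stand-in for
# a real Solr client call; all names here are illustrative.

def dual_write(doc, collections, send_update, max_retries=3):
    """Write doc to every collection, retrying each one independently so a
    transient failure does not leave the collections inconsistent."""
    failed = []
    for coll in collections:
        for _attempt in range(max_retries):
            if send_update(coll, doc):
                break
        else:
            failed.append(coll)  # still inconsistent after all retries
    return failed  # non-empty => caller must reconcile somehow
```

Even with retries, a non-empty `failed` list leaves the caller holding the consistency problem, which is exactly the complexity a server-side multi-write alias would absorb.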


Thanks
SG

Re: Multiple collections for a write-alias

Erick Erickson
Aliases can already point to multiple collections; have you tried that?

I'm not totally sure what the behavior would be, but nothing you've written
indicates you tried it, so I thought I'd point it out.

It's not clear to me how useful this is, though, or what failure messages
are returned, how you would figure out which collection failed, or how you
would take remedial action.
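For reference, pointing one alias at several collections is done through the Collections API's CREATEALIAS action; a sketch of building that request (host and collection names are placeholders):

```python
from urllib.parse import urlencode

# Sketch: build the Collections API URL that points one alias at two
# collections. The base URL and collection names are placeholders.
def create_alias_url(base, name, collections):
    params = urlencode({"action": "CREATEALIAS",
                        "name": name,
                        "collections": ",".join(collections)})
    return f"{base}/admin/collections?{params}"

# e.g. create_alias_url("http://localhost:8983/solr", "customers",
#                       ["customers_old", "customers_new"])
```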

Best,
Erick


On Thu, Nov 9, 2017 at 10:09 AM, S G <[hidden email]> wrote:


Re: Multiple collections for a write-alias

Shawn Heisey-2
In reply to this post by S G
On 11/9/2017 11:09 AM, S G wrote:
> However, re-ingestion takes several hours to complete and during that time,
> the customer has to write to both the collections - previous collection and
> the one being bootstrapped.
> This dual-write is harder to do from the client side (because client needs
> to have a retry logic to ensure any update does not succeed in one
> collection and fails in another - consistency problem) and it would be a
> real welcome addition if collection aliasing can support this.

Let me explain how I handle this situation.  I'm not running in cloud
mode, but I use the "swap" feature of CoreAdmin to do much the same
thing you're describing with collection aliases.

My source data (mysql database) has a way to track the last new document
that was added, as well as track which deletes have been applied, and
which documents need to be reinserted.  I use these pointers to decide
what data to retrieve on each indexing cycle, and then I update them to
new positions when the indexing cycle completes successfully.

When I do a full rebuild, I grab the current positions for new docs,
deletes, and reinserts, and store that information in a special place. 
Then I start building indexes in the "build" cores.  In the meantime, I
am continuing to update all the "live" cores, so users are unaware that
anything special is happening.

When the rebuild finishes (which can take a day or more), I go to that
special place where I stored all the position information, and proceed
to run a "catchup" indexing process on the build cores -- all the
deletes, new documents, and reinserts that happened since the time the
full rebuild started.  When that completes, I swap the build cores with
the live cores, and resume normal operation.

Doing it this way, I do not need to worry about the normal indexing
cycle handling writes to both the old index and the new index -- the
ongoing cycle just updates the current live cores.
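The rebuild-then-catch-up cycle above can be sketched as follows (every callable is a hypothetical stand-in for the real MySQL/Solr plumbing):

```python
# Sketch of the rebuild-then-catch-up flow. Each callable is a hypothetical
# stand-in for real database/Solr plumbing; only the ordering matters.

def full_rebuild(read_positions, build_index, catch_up, swap_cores):
    saved = read_positions()   # remember where live indexing is right now
    build_index()              # long full rebuild into the "build" cores;
                               # live cores keep receiving normal updates
    catch_up(saved)            # replay adds/deletes/reinserts since `saved`
    swap_cores()               # build cores become live atomically
```

The key design choice is that the normal indexing cycle never has to know a rebuild is in progress; only the rebuild process replays the backlog.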

> Proposal:
> If can enhance the write alias to point to multiple collections such that
> any update to the alias is written to all the collections it points to, it
> would help the client to avoid dual writes and also issue just a single
> http call from the client instead of multiple. It would also reduce the
> retry logic inside the client code used to keep the collections consistent.

Imagine an index with time-series data, where there is an alias called
"today" that includes up to 24 hourly collections.  If you were to write
to that alias with the idea you've proposed, the data would end up in
the wrong places and would in fact get incorrectly duplicated many times
... but the way it currently works, the writes would only go to the
FIRST collection in the alias, which can be arranged to always be the
"current" collection.
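That split between reads and writes can be modeled with a toy function (this is an illustration of the behavior, not Solr's actual routing code):

```python
# Toy model of current alias behavior: searches fan out to every
# collection, while an update sent to the alias goes only to the FIRST
# collection in its list.

def resolve_write(alias, aliases):
    return aliases[alias][0]

def resolve_search(alias, aliases):
    return list(aliases[alias])

# Keeping the "current" hourly collection first means writes always land
# in the right place:
# aliases = {"today": ["hour_02", "hour_01", "hour_00"]}
```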

Your proposal is an interesting idea, but would require some development
work.  Errors during indexing could be a major source of headaches,
especially those errors that don't affect all collections in the alias
equally.  So as to not change how users expect Solr to work currently,
aliases would need a special flag to indicate that writes *should* be
duplicated to all collections in the alias, or maybe there would need to
be two different kinds of aliases.  Since such a feature is probably not
going to happen quickly even if it is something that we agree to work
on, would you be able to use something like the method that I outlined
above?

Thanks,
Shawn


Re: Multiple collections for a write-alias

Emir Arnautović
In reply to this post by S G
This approach would only work for an append-only index. If you have updates or deletes, you have to process them in order, otherwise you will get incorrect results. I suspect that is one of the reasons this is not supported: it would not be very useful.
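The ordering problem can be seen with a toy replay (a hypothetical sketch, with dicts standing in for collections):

```python
# Toy illustration: replaying the same operations in a different order
# leaves two "collections" (plain dicts here) with different contents.

def apply_ops(ops):
    docs = {}
    for op, doc_id, body in ops:
        if op == "add":
            docs[doc_id] = body
        elif op == "delete":
            docs.pop(doc_id, None)
    return docs

in_order  = [("add", "1", "v1"), ("delete", "1", None)]
reordered = [("delete", "1", None), ("add", "1", "v1")]
# apply_ops(in_order) ends empty; apply_ops(reordered) keeps doc "1"
```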

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 9 Nov 2017, at 19:09, S G <[hidden email]> wrote:


Re: Multiple collections for a write-alias

S G
We are actually very close to doing what Shawn has suggested.

Emir has a good point about the new collection failing on deletes/updates of
older documents that are not present in it. But even if this feature were
implemented only for an append-only log, it would still be a good feature, IMO.

The use-case for re-indexing everything is generally a schema change, such as
enabling "indexed" or "docValues" on a field, or adding a new field to the
schema. While the reading client code sits behind a flag before it starts
using the new attribute/field, we have to re-index all the data without
stopping older-format reads. Currently, we have to either do dual writes to
the new collection or play catch-up after a bootstrap.

Note that catch-up-after-a-bootstrap is not very easy either (it is very
similar to the approach described by Shawn). If the "special place" is Kafka
or a table in the DB, then we have to do dual writes to both the regular
source-of-truth and that special place. Dual writes to the DB and Kafka are
transaction-less (and thus lack consistency), while dual writes to the DB
increase the load on the DB.

Having created_date / modified_date fields and querying the DB to find
live-traffic documents has its own problems, and again taxes the DB.

Dual writes directly to multiple Solr collections are the simplest for a
client to implement, and that is exactly what this new feature could be.
With a dual-write collection alias, the client would not need to implement
any of the above, provided the alias does the following:

- Deletes of documents missing from the new collection are simply ignored.
- Incremental (atomic) updates throw an error, as unsupported on a
multi-write collection alias.
- Regular updates (i.e. delete-then-insert) work fine, because each document
is treated as brand new, and versioning strategies can take care of
out-of-order updates.
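Those proposed semantics could be sketched like this (entirely hypothetical; this is not an existing Solr feature, and dicts stand in for collections):

```python
# Sketch of the proposed dual-write alias semantics. Entirely hypothetical:
# 'add' replaces the whole document, 'delete' on a missing doc is a no-op,
# and atomic (partial) updates are rejected outright.

def alias_update(op, doc, collections):
    if op == "atomic":
        raise ValueError("atomic updates not supported on a dual-write alias")
    for coll in collections:
        if op == "delete":
            coll.pop(doc["id"], None)     # missing doc: silently ignored
        elif op == "add":
            coll[doc["id"]] = doc         # full replace; versioning handles
                                          # out-of-order arrivals
```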


SG


On Fri, Nov 10, 2017 at 6:33 AM, Emir Arnautović <
[hidden email]> wrote:
