Allow Join over two sharded collection

mganeshs
All,

Any idea when this ticket will be addressed?

https://issues.apache.org/jira/browse/SOLR-8297

One of the comments says it will be done by Solr 7.0. Can we expect it in 7.0?

Regards,

Re: Allow Join over two sharded collection

Erick Erickson
Probably won't be in 7.0. In fact it appears to have lost momentum so
I don't know if it'll ever be committed. Don't know that it _won't_,
but there's no way to say.

There's been a lot of work in the Solr Streaming world to do joins and
it's quite possible that that'll do what you need.
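
As a rough illustration of what a Streaming join looks like, the sketch below builds an innerJoin streaming expression as a plain string. The collection names (docs, acls) and field names (id, title, doc_id, acl_users) are assumptions made up for this example, not taken from the thread. Both inner streams are sorted on the join key, which innerJoin expects.

```python
def search_stream(collection, query, fields, sort):
    """Build a search() stream that reads sorted tuples from the /export handler."""
    return (
        f'search({collection}, q="{query}", fl="{fields}", '
        f'sort="{sort}", qt="/export")'
    )


def inner_join(left, right, on):
    """Wrap two sorted streams in an innerJoin() streaming expression."""
    return f'innerJoin({left}, {right}, on="{on}")'


# Hypothetical collections and fields: real documents joined to their ACL documents.
docs = search_stream("docs", "*:*", "id,title", "id asc")
acls = search_stream("acls", "acl_users:alice", "doc_id", "doc_id asc")
expr = inner_join(docs, acls, on="id=doc_id")
print(expr)
```

The resulting string would be sent as the expr parameter to a collection's /stream handler. The /export handler only returns fields that have docValues enabled.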

Best,
Erick


Re: Allow Join over two sharded collection

mganeshs
Hi Erick,

Initially I also thought of using Streaming for joins. But it looks like joins with Streaming are not meant for high-QPS queries, and that's my use case.
Currently the normal join works fine for us because we have only one shard. But in the coming days the number of documents to be indexed is going to increase drastically, so we will need to split shards. Once I split shards, I can't use joins.

We thought of using implicit routing for sharding. But with implicit routing, indexing will not be distributed evenly, so one shard could get more load than the others, which we don't want.
So we badly need joins to work with default routing.
As I have posted in other questions in this forum (and you have replied to them too), our joins are between real documents and their ACL documents. An ACL document has a multi-valued field whose values are users or groups. Why do we keep the ACL separately instead of in the real document itself? Because an ACL can grow to 1 lakh (100,000) users or even more, and we don't want to re-index the real document for every change to the ACL or its permissions.

Do you think there is a better alternative? Or is the way we keep ACLs wrong?

Regards,

Re: Allow Join over two sharded collection

Damien Kamerman
Joins will work with shards as long as the docs you're joining from/to are
in the same shard. Why not go with compositeId routing (either
ID=uniqueKey!docId or router.field)? Is there no 'uniqueKey' which will
distribute randomly? You may need to put the same ACL docs in all shards,
depending on your use case.
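
A minimal sketch of the compositeId idea, assuming each ACL document belongs to exactly one real document. The field names (type, acl_doc_id, acl_users) and the ID layout are hypothetical. Using the document's own key as the route key keeps a document and its ACL on the same shard while still spreading different documents across shards.

```python
def composite_id(route_key, doc_id):
    """Prefix the ID with a route key; compositeId uses the part before '!' to pick the shard."""
    return route_key + "!" + doc_id


def join_filter(from_field, to_field, inner_query):
    """Build a filter query for Solr's join query parser."""
    return "{!join from=" + from_field + " to=" + to_field + "}" + inner_query


# Hypothetical layout: the real document and its ACL share the route key "doc42".
doc = {"id": composite_id("doc42", "doc"), "type": "doc"}
acl = {
    "id": composite_id("doc42", "acl"),
    "type": "acl",
    "acl_doc_id": doc["id"],                  # points back at the real document
    "acl_users": ["alice", "group:editors"],  # multi-valued users/groups field
}

# Filter that keeps only documents whose ACL document lists alice.
fq = join_filter("acl_doc_id", "id", "acl_users:alice")
print(doc["id"], acl["id"], fq)
```

If one ACL document is shared by many real documents, this layout does not apply as-is, and the ACL documents would have to be copied to every shard as described above.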


Re: Allow Join over two sharded collection

Mikhail Khludnev-2
In reply to this post by mganeshs
Probably in November or December.

--
Sincerely yours
Mikhail Khludnev

Re: Allow Join over two sharded collection

Susheel Kumar-3
How many documents do you have currently, and how many will you have after
the drastic growth?

Either you can add hardware and keep one shard until cross-shard joins are
fully available,
or
you can shard and distribute using the compositeId router. That's still
better, even though one or more shards may get a higher load, compared to
having a single shard/node take all the load, right?


Re: Allow Join over two sharded collection

mganeshs
Hi Susheel,

Currently we have around 20M documents already, and from now on we expect about 1M new documents every month.
The reason we don't want to go for time-based implicit routing is that all new documents will end up in the most recent shard, so indexing will be heavy for the new shard, whereas older shards will be used only for queries.
With default sharding, the indexing load is distributed across all the shards. That's the reason we would like to stick to default sharding. But join is the issue here when default sharding is used :-(
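
To make the concern concrete, here is a small simulation sketch. The shard count, batch size, and shard naming are assumptions, and CRC32 is only a stand-in for the hash Solr actually uses; the point is just the shape of the distribution.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4      # hypothetical shard count
NEW_DOCS = 10_000   # hypothetical batch of newly indexed documents


def hashed_shard(doc_id):
    """Pick a shard by hashing the ID (CRC32 here as a stand-in for Solr's hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def time_based_shard(month):
    """Pick a shard by month, as a time-based implicit routing scheme would."""
    return "shard_" + month


doc_ids = ["doc-" + str(i) for i in range(NEW_DOCS)]
hashed = Counter(hashed_shard(d) for d in doc_ids)
timed = Counter(time_based_shard("2017_07") for _ in doc_ids)

print("hashed:", dict(hashed))
print("time-based:", dict(timed))
```

With hashing, the new documents are spread over several shards; with time-based routing, every document indexed this month lands on the single current shard.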

Re: Allow Join over two sharded collection

Erick Erickson
1M docs/month shouldn't make Solr break a sweat. If it really worries
you and you're indexing in a big batch, index during off hours. At
very worst, if you're ingesting them all at once you might have to
throttle the indexing a bit.

Frankly, most of the time acquiring the documents from the system of
record is where the bottleneck is and Solr easily handles the indexing
load.

The other advantage is that if you use implicit routing rather than a
composite ID, you can add shards to your collection one at a time as
required. For time-series data, that's an elegant way to "age out" old
documents.
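
A sketch of what adding and retiring shards could look like with the Collections API, which offers CREATESHARD for collections that use the implicit router, and DELETESHARD. The host, the collection name (docs), and the month-based shard names are assumptions for illustration only.

```python
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # hypothetical host


def collections_api_url(action, **params):
    """Build a Collections API request URL for the given action."""
    query = urlencode({"action": action, **params})
    return SOLR_URL + "/admin/collections?" + query


# Add a shard for the coming month and drop one that has aged out.
add_url = collections_api_url("CREATESHARD", collection="docs", shard="2017_08")
drop_url = collections_api_url("DELETESHARD", collection="docs", shard="2016_08")
print(add_url)
print(drop_url)
```

The URLs are only built and printed here; issuing them against a real cluster is left to whatever HTTP client you already use.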

Best,
Erick


Re: Allow Join over two sharded collection

Susheel Kumar-3
As Erick said, 1M docs/month isn't a big deal.  I have 45+ million docs in
one shard, but YMMV depending on other factors.

Also, there is a lot of confusion in the terminology. The default routing is
compositeId routing.  The implicit routing that Erick mentioned is manual
routing.  https://issues.apache.org/jira/browse/SOLR-6630

Which routing are you suggesting to use? Can you clarify again?  Also,
what's your exact use case?  Do you query old documents, or do most or all
of your queries go to the shard with the newer documents?

Thanks,
Susheel


Re: Allow Join over two sharded collection

Susheel Kumar-3
Depending on your use case, people also use collection aliasing for
time-series data.  See below:

https://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
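
A minimal sketch of the aliasing approach, assuming one collection per month. The host, the alias name (docs_recent), and the collection names are hypothetical. CREATEALIAS with the name and collections parameters is the Collections API call involved; calling it again with the same name repoints the alias.

```python
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # hypothetical host


def alias_url(alias, collections):
    """Build a CREATEALIAS request that points the alias at the given collections."""
    query = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": ",".join(collections)},
        safe=",",  # keep the comma-separated collection list readable
    )
    return SOLR_URL + "/admin/collections?" + query


# Hypothetical monthly collections; the alias covers only the two most recent.
monthly = ["docs_2017_05", "docs_2017_06", "docs_2017_07"]
recent_url = alias_url("docs_recent", monthly[-2:])
print(recent_url)
```

Queries then go to the alias, while each month's data lives in its own collection.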


Re: Allow Join over two sharded collection

Glick, David
Unsubscribe

Sent from my iPhone


Re: Allow Join over two sharded collection

mganeshs
In reply to this post by Susheel Kumar-3
Hi Susheel,

To make use of joins, my only option is to go for manual routing. If I go for manual routing based on time, we lose the benefit of distributing the load while indexing. All indexing will end up in the newly created shard, which we feel is not an efficient approach and will degrade indexing performance: we have a lot of JVMs running, yet all indexing would go to a single shard, and we are also expecting 1M+ docs per month in the coming days.

As for your question on whether we will query old documents: mostly we won't. From the query pattern, it's clear we should go for manual routing and creating an alias. But when it comes to indexing, we felt default routing is the best option for distributing the load, and then joins will not work. That's the reason for asking when this feature will be in place.

Regards,

Re: Allow Join over two sharded collection

Susheel Kumar-3
How are you planning to route manually? What key(s) are you thinking of using?

Second, the link I shared was about collection aliasing, and if you use
that, you will end up with multiple collections. I just want to clarify,
since you said above "...manual routing and creating alias".

Again, until the join feature is available across shards, you can still
continue with one shard (and replicas if needed).  20M + 1M per month
shouldn't be a big deal.

Thanks,
Susheel


Re: Allow Join over two sharded collection

mganeshs
All,

Any idea whether this will be addressed in the near future?

https://issues.apache.org/jira/browse/SOLR-8297

Regards,





Re: Allow Join over two sharded collection

Erick Erickson
This doesn't appear to be actively pursued, so it's anybody's guess.

Depending on your use case, the streaming capabilities may be an
out-of-the-box (OOB) solution.

Best,
Erick
