search in solrcloud on replicas

Odysci
Hi,

I have a question about SolrCloud searches across both replicas of an index.
I have a SolrCloud setup with 2 physical machines (let's call them A and
B). My index is divided into 2 shards with 2 replicas each, such that each
machine has a full copy of the index. My ZooKeeper setup uses 3 instances.
The nodes and replicas are as follows:
Machine A:
      core_node3 / shard1_replica_n1
      core_node7 / shard2_replica_n4
Machine B:
      core_node5 / shard1_replica_n2
      core_node8 / shard2_replica_n6

I'm using SolrJ, and I create the Solr client using Http2SolrClient.Builder
with the IP of machine A.

Here is my question:
When I run a search (using SolrJ) and look at the search logs on both
machines, I see that the same search is being executed on both machines.
But if the full index is present on both machines, wouldn't it be enough
to search on just one of the machines?
In fact, if I turn off machine B, the search still returns the correct
results.

Thanks a lot.

Reinaldo

Re: search in solrcloud on replicas

Erick Erickson
The base algorithm for searches picks out one replica from each
shard in a round-robin fashion, without regard to whether it’s on
the same machine or not.

You can alter this behavior, see:
https://lucene.apache.org/solr/guide/8_1/distributed-requests.html

When you say “the exact same search”, it isn’t quite the same: each
sub-request goes to a different shard, as evidenced by &distrib=false being
on the URL (I’d guess you already know that, but…). The top-level
request _may_ be forwarded as is; there’s an internal load balancer
that does this. The theory is that the top-level requests shouldn’t
all be handled by the same Solr instance if a client is directly using
the HTTP address of a single node in the cluster for all requests.
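
To make the replica selection described above concrete, here is a minimal sketch in plain Java. It is a toy model, not Solr's actual code: the class `ReplicaPicker` and its methods are hypothetical, the replica labels simply combine the machine letter with the core names from the original question, and Solr's real selection logic may differ from a strict rotation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only (hypothetical class, not part of Solr or SolrJ).
class ReplicaPicker {
    private final Map<String, List<String>> replicasByShard;
    private final AtomicInteger counter = new AtomicInteger();

    ReplicaPicker(Map<String, List<String>> replicasByShard) {
        this.replicasByShard = replicasByShard;
    }

    // Pick exactly one replica per shard. The choice rotates on every call
    // and never looks at which machine received the top-level request.
    List<String> pickOnePerShard() {
        int turn = counter.getAndIncrement();
        List<String> picked = new ArrayList<>();
        for (List<String> replicas : replicasByShard.values()) {
            picked.add(replicas.get(Math.floorMod(turn, replicas.size())));
        }
        return picked;
    }

    // The two-machine, two-shard layout from the original question.
    static ReplicaPicker threadTopology() {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("shard1", List.of("A/shard1_replica_n1", "B/shard1_replica_n2"));
        m.put("shard2", List.of("A/shard2_replica_n4", "B/shard2_replica_n6"));
        return new ReplicaPicker(m);
    }

    public static void main(String[] args) {
        ReplicaPicker picker = threadTopology();
        for (int i = 0; i < 3; i++) {
            System.out.println("query " + i + " -> " + picker.pickOnePerShard());
        }
    }
}
```

In this model the second query is served entirely by the replicas on machine B, even if the client only ever talks to machine A, which is consistent with seeing search entries in the logs of both machines.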

Best,
Erick




Re: search in solrcloud on replicas

Odysci
Erick,
thanks for the reply.
Your last line puzzled me a bit. You wrote
*"The theory is that all the top-level requests shouldn’t be handled by the
same Solr instance if a client is directly using the http address of a
single node in the cluster for all requests."*

We are using 2 machines (2 different IPs) and 2 shards with 2 replicas each.
We have an application that sends all Solr requests to the same HTTP
address, that of our machine A. I assumed that ZooKeeper would distribute
the requests among the nodes.
Is this not the right thing to do? Should I have the application alternate
which Solr machine it sends requests to?
Thanks

Reinaldo



Re: search in solrcloud on replicas

Erick Erickson
Close. ZooKeeper is not involved in routing requests; assuming it is
is a common misunderstanding. Each Solr node queries ZooKeeper to get
the topology of the cluster, and thereafter ZooKeeper notifies each node
when the topology changes, e.g. when a node goes up or down or a replica
goes into recovery. ZooKeeper does _not_ get involved in each request,
since each Solr node has cached all the information it needs to satisfy
the request.

So, nodeA gets the topology of the cluster, including the IP addresses
of each and every node in the cluster. Now you send a query directly to
nodeA. There is an internal load balancer that routes that request to one
of the nodes in the cluster, perhaps itself, perhaps nodeB, etc. That way,
nodeA doesn’t do all the aggregating.

Aggregating? Well, say a top-level request comes in to nodeB with
rows=10. NodeB must send a sub-request to one replica of every shard
and get the top 10 from each one (just the document IDs and sort values,
not the full documents). It then sorts the merged lists by whatever
the sort criteria are and sends another request to the replicas
queried in the first step to get the actual top 10 docs. Why the 2nd round
trip? Well, imagine there are 100 shards (and I’ve seen more). If the
sub-requests each returned their top 10 full documents, there would be
1,000 documents fetched, 990 of which would be thrown away.
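
The two round trips can be illustrated with a small plain-Java simulation. This is only a sketch under stated assumptions: `TwoPhaseSearch`, `Hit`, and the synthetic scores are all hypothetical and not Solr code. Phase 1 gathers lightweight (id, score) entries from every shard; only the merged winners would then be fetched in full.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Toy model of the two-phase distributed query (hypothetical, not Solr code).
class TwoPhaseSearch {
    // Phase-1 entry: only an id and a sort value, no stored fields.
    static final class Hit {
        final String id;
        final int score;
        Hit(String id, int score) { this.id = id; this.score = score; }
    }

    // Synthetic data. Scores are distinct as long as there are at most
    // 5003 documents in total, because 5003 is prime.
    static List<List<Hit>> buildShards(int shards, int docsPerShard) {
        List<List<Hit>> all = new ArrayList<>();
        for (int s = 0; s < shards; s++) {
            List<Hit> shard = new ArrayList<>();
            for (int d = 0; d < docsPerShard; d++) {
                int n = s * docsPerShard + d;
                shard.add(new Hit("doc" + n, (n * 7919) % 5003));
            }
            all.add(shard);
        }
        return all;
    }

    // Highest scores first, truncated to `rows` entries.
    static List<Hit> top(List<Hit> hits, int rows) {
        return hits.stream()
                .sorted(Comparator.comparingInt((Hit h) -> h.score).reversed())
                .limit(rows)
                .collect(Collectors.toList());
    }

    // Phase 1: every shard reports its own top `rows` (ids and scores only).
    static List<Hit> phaseOne(List<List<Hit>> shards, int rows) {
        List<Hit> gathered = new ArrayList<>();
        for (List<Hit> shard : shards) {
            gathered.addAll(top(shard, rows));
        }
        return gathered;
    }

    static List<String> ids(List<Hit> hits) {
        return hits.stream().map(h -> h.id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int rows = 10;
        List<List<Hit>> shards = buildShards(100, 50);
        List<Hit> gathered = phaseOne(shards, rows);
        // Phase 2 would fetch stored fields for these winners only.
        List<Hit> winners = top(gathered, rows);
        System.out.println("phase 1 entries gathered: " + gathered.size());
        System.out.println("phase 2 documents to fetch: " + ids(winners));
    }
}
```

With 100 shards and rows=10, phase 1 moves 1,000 small entries, while phase 2 needs only the 10 winning documents; merging the per-shard top 10 lists still yields the true global top 10.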

As it stands, your setup has a single point of failure.
Ideally, you have nodeA with one replica of each shard and nodeB also
with one replica of each shard, so either one can go down and your system
can still serve requests. However, since your app is sending all of its
requests to the same node, you don’t have that robustness; if that
node goes down, so does your entire application.

You should be doing one of two things:
1> use a load balancer between your app and your Solr nodes
or
2> have your app use SolrJ and CloudSolrClient. That class is “just
another Solr node” as far as ZooKeeper is concerned. It goes through
the exact same process as a Solr node. When it starts, it gets a snapshot
of the topology of the cluster and “does the right thing” with requests,
including dealing with any changes to the topology, e.g. nodes
stopping/starting, replicas going into recovery, new collections being
added, etc.
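
The behaviour described in option 2 can be sketched without SolrJ or ZooKeeper. The class below is a hypothetical stand-in, not the CloudSolrClient API: it keeps a cached snapshot of live nodes, accepts a callback that plays the role of a ZooKeeper notification, and spreads requests over whatever nodes are currently alive. The node names `machineA` and `machineB` are placeholders.

```java
import java.util.List;

// Toy stand-in for a topology-aware client (hypothetical, not SolrJ).
class TopologyAwareClient {
    // Cached snapshot of the cluster; no lookup happens per request.
    private volatile List<String> liveNodes;
    private int next = 0;

    TopologyAwareClient(List<String> initialSnapshot) {
        this.liveNodes = List.copyOf(initialSnapshot);
    }

    // Plays the role of a topology-change notification.
    void onTopologyChange(List<String> newLiveNodes) {
        this.liveNodes = List.copyOf(newLiveNodes);
    }

    // Choose the node for the next request from the cached snapshot.
    synchronized String nextNode() {
        List<String> nodes = liveNodes;
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no live Solr nodes");
        }
        String node = nodes.get(Math.floorMod(next, nodes.size()));
        next++;
        return node;
    }

    public static void main(String[] args) {
        TopologyAwareClient client =
                new TopologyAwareClient(List.of("machineA", "machineB"));
        System.out.println(client.nextNode());
        System.out.println(client.nextNode());
        // machineA goes down; the client is notified and keeps working.
        client.onTopologyChange(List.of("machineB"));
        System.out.println(client.nextNode());
    }
}
```

A client hard-wired to a single node URL has no equivalent of `onTopologyChange`, which is why losing that one node takes the application down with it.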

HTH,
Erick


Re: search in solrcloud on replicas

Odysci
Erick,
thanks a lot, very clear.

Reinaldo
