Realtime get not always returning existing data

sgaron cse
Hey all,

We're trying to use Solr as our document store and are facing some issues
with the Realtime Get API. Basically, we make an API call from multiple
endpoints to retrieve configuration data. The document we retrieve does
not change at all, but sometimes the API returns a null document
({"doc":null}). I'd say 99.99% of the time we retrieve the document fine,
but once in a blue moon we get the null document. The problem is that for
us, a null from Solr means the document does not exist, and because this
is a document that should always be there, it causes all sorts of
problems in our system.

The API I call is the following:
http://{server_ip}/solr/config/get?id={id}&wt=json&fl=_source_

As far as I understand from the documentation, the Realtime Get API
should return the document no matter what, even if the document has not
yet been committed to the index.
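A minimal client-side sketch of that call (the helper names here are hypothetical, and the server, core, and field names are just our setup): it treats {"doc":null} as possibly transient and retries briefly before giving up.

```python
import json
import time
import urllib.parse
import urllib.request

def rtg_url(server, core, doc_id):
    # Same shape as the URL above: /solr/<core>/get?id=...&wt=json&fl=_source_
    params = urllib.parse.urlencode({"id": doc_id, "wt": "json", "fl": "_source_"})
    return f"http://{server}/solr/{core}/get?{params}"

def fetch_config(server, core, doc_id, attempts=3, delay=0.1):
    # Treat {"doc": null} as possibly transient: retry briefly before
    # concluding the document really does not exist.
    for attempt in range(attempts):
        with urllib.request.urlopen(rtg_url(server, core, doc_id)) as resp:
            doc = json.load(resp).get("doc")
        if doc is not None:
            return doc
        time.sleep(delay * (attempt + 1))
    return None
```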

I see no errors whatsoever in the Solr logs that could help me with this
problem; in fact, there are no errors at all.

As for our setup: because we're still in the testing phase, we only have
two Solr instances running on the same box in cloud mode with
replicationFactor=1, which means the core we run the Realtime Get against
is only present on one of the two instances. Our script randomly chooses
which instance it queries, but as far as I understand, in cloud mode the
API call should be dispatched automatically to the right instance.

Am I missing anything here? Is it possible there is a race condition in
the Realtime Get API that could return null even if the document exists?

Thanks,
Steve

Re: Realtime get not always returning existing data

Erick Erickson
What version of Solr are you running? Mostly that's for curiosity.

Is the doc that's not returned something you've recently indexed?
Here's a possible scenario: you send the doc out to be indexed, and the
leader forwards it to the followers. Before the follower has had a
chance to process it (committing isn't even necessary), you issue an
RTG against that doc and it happens to be routed to a node that hasn't
received it from the leader yet. Does this sound plausible in your
scenario?

Hmmm, I suppose it's not even a requirement that the request gets sent
to a follower; it could just as easily be "in process" on the
leader/primary.

Best,
Erick

Re: Realtime get not always returning existing data

sgaron cse
Hey Erick,

We're using Solr 7.3.1, which is not the latest but not too far back either.

No, the document has not been recently indexed; in fact, I can find the
document with the /search API endpoint. But I need a fast way to find
documents that have not necessarily been indexed yet, so /search is out
of the question. Also, for context: the last time the doc was modified
was 3 days ago, but we are still seeing the occasional doc:null return
from the Realtime Get API.

Steve

Re: Realtime get not always returning existing data

Erick Erickson
Steve:

Thanks. So theoretically I should be able to set up a cluster, index a
bunch of docs to it, and then just hammer RTG calls against those IDs
and sometimes see a failure?

Hmmm, I guess a follow-up question is whether there's any indexing
going on at all when this happens. Or, more specifically, whether
there's ever a time when you see this problem and there's _no_ indexing
going on.

I understand that it's not recently-indexed docs that are failing to be
found, but if there's indexing going on, searchers are being opened,
caches flushed, and the like; so if this happens even when there's no
indexing going on, that would help reproduce and track it down.

Erick

Re: Realtime get not always returning existing data

sgaron cse
So this is a Solr core where we keep configuration data, so it is almost
never written to. The statistics for the core say it was last modified 4
hours ago, yet I got doc:null from the API an hour ago. Also, you don't
need a lot of data in the core: this one has only 11 documents in it.
The document I'm trying to fetch is about 45KB, if that matters.

Other things to note: this SolrCloud deployment is running multiple
cores (9 in total), and some of them are getting completely hammered.
But I figured each core is its own thing; I may be wrong.

BTW, I'm not 100% familiar with SolrCloud, but I see in the Replication
section that Master (Searching) and Master (Replicable) show a
different version / different gen. Not sure if that matters, or what it
means.

Thanks for your help,
Steve

Re: Realtime get not always returning existing data

Shawn Heisey-2
On 9/27/2018 11:48 AM, sgaron cse wrote:
> So this is a SOLR core where we keep configuration data so it is almost
> never written to. The statistics for the core say its been last modified 4
> hours ago, yet I got doc:null from the API an hour ago. And also you don't
> have to have a lot of data into the core. For example, this core has only
> 11 documents in it. The document I'm trying to fetch is about 45KB if that
> matters.

Are there multiple replicas of this collection?  Have you tried sending
requests specifically to the replica cores with distrib=false on the URL
to keep SolrCloud from sending the request elsewhere within the cluster,
to see if maybe the replicas are not as synchronized as they should be? 
Without distrib=false, you cannot control which machine(s) will answer
your query.

Replicas shouldn't get out of sync unless something goes very wrong, but
it has been known to happen.
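Concretely, that distrib=false check might look like this sketch (the hosts and replica core paths are illustrative, not from Steve's setup): query each replica core directly and compare the answers.

```python
import json
import urllib.request

def direct_get_url(core_url, doc_id):
    # distrib=false keeps SolrCloud from forwarding the request to
    # another node, so the answer comes from exactly this core.
    return f"http://{core_url}/get?id={doc_id}&wt=json&distrib=false"

def compare_replicas(core_urls, doc_id):
    # Fetch the doc from each replica core directly; a mix of null and
    # non-null answers would point at replicas being out of sync.
    results = {}
    for core_url in core_urls:
        with urllib.request.urlopen(direct_get_url(core_url, doc_id)) as resp:
            results[core_url] = json.load(resp).get("doc")
    return results

# Example (illustrative core paths):
# compare_replicas(["localhost:8983/solr/config_shard1_replica_n1",
#                   "localhost:8984/solr/config_shard1_replica_n2"], "my-id")
```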

> Other things to note, this SOLR cloud instance is running multiple cores (9
> cores total) and some of them are getting completely hammered. But I
> figured that each core is it's own thing, I may be wrong.
>
> BTW, I'm not 100% familiar with SOLR cloud but I see in the Replication
> section that the Master (Searching) and the Master (Replicable) are running
> different version / different gen. Not sure if that matters, not sure what
> that means.

For normal usage, you can completely ignore the replication master
information when Solr is running in SolrCloud mode. SolrCloud only uses
replication for recovering indexes that get out of sync (in a way that
SolrCloud can detect), and it configures the replication handler on the
fly when it is needed. The information it returns at any other time will
be meaningless. When things are operating normally, the replication
feature will never be used.

Thanks,
Shawn

Re: Realtime get not always returning existing data

Chris Ulicny
I don't think I've much to add that Steve hasn't already covered, but we've
also seen this "null doc" problem in one of our setups.

In one of our Solr Cloud instances in production where the /get handler is
hit very hard in bursts, the /get request will occasionally return "null"
for a document that exists. However, there is very heavy indexing (no
overwrites or deletes) during that time which we assumed was the cause.
This happens on 2 collections which have 10 shards each, replication factor
of 2, spread across 4 hosts. During testing and when we first moved to this
setup in production, we had a replication factor of 1, and still
experienced the same issue of periodic "null" returned for documents, so it
is probably not a replica synchronization issue.

These documents were indexed about 10 minutes prior and had already been
successfully returned by previous /get requests. We haven't been able to
reproduce it with any consistency, but it isn't a particularly critical
issue for our use case.

Best,
Chris

Re: Realtime get not always returning existing data

sgaron cse
Hey Shawn,

because this is a test deployment, the replication factor is set to 1,
so as far as I understand, data will not be replicated for this core.
Basically we have two Solr instances running on the same box: one on
port 8983, the other on port 8984. We have 9 cores in this SolrCloud
deployment, 5 of them on the instance on port 8983 and the other 4 on
port 8984. As far as I can tell, all cores suffer from the occasional
null document, but the one I can most easily see errors from is a config
core where we store configuration data for our system. Since the
configuration data should always be there, we throw exceptions as soon
as we get a null document, which is how I noticed the problem.

Our client code that connects to the APIs randomly chooses between the
different ports because it does not know which instance it should ask.
So no, we did not try sending directly to the instance that has the
data, but since there is no replica, there is no way this should be out
of sync.

To add to what Chris was saying: although the core that is seeing the
issue is not hit very hard, other cores in the setup are. We are
building a clustered environment with auto-scaling, so under heavy load
we can easily have 200-300 clients hitting the Solr instances
simultaneously.

Re: Realtime get not always returning existing data

Shawn Heisey-2
On 9/28/2018 6:09 AM, sgaron cse wrote:
> because this is a test deployment replica is set to 1 so as far as I
> understand, data will not be replicated for this core. Basically we have
> two SOLR instances running on the same box. One on port 8983, the other on
> port 8984. We have 9 cores on this SOLR cloud deployment, 5 of which on the
> instance on port 8983 and the other 4 on port 8984.

A question that isn't really related to the problem you're investigating
now:  Why are you running two Solr instances on the same machine?  9
cores is definitely not too many for one Solr instance.

> As far as I can tell
> all cores suffer from the occasional null document. But the one that I can
> easily see error from is a config core where we store configuration data
> for our system. Since the configuration data should always be there we
> throw exceptions as soon as we get a null document which is why I noticed
> the problem.

When you say "null document" do you mean that you get no results, or
that you get a result with a document, but that document has nothing in
it?  Are there any errors returned or logged by Solr when this happens?

> Our client code that connects to the APIs randomly chooses between all the
> different ports because it does not know which instance it should ask. So
> no, we did not try sending directly to the instance that has the data but
> since there is no replica there is no way that this should get out of sync.

I was suggesting this as a troubleshooting step, not a change to how you
use Solr.  Basically trying to determine what happens if you send a
request directly to the instance and core that contains the document
with distrib=false, to see if it behaves differently than when it's a
more generic collection-directed query.  The idea was to try and narrow
down exactly where to look for a problem.

If you wait a few seconds, does the problem go away?  When using real
time get, a new document must be written to a segment and a new realtime
searcher must be created before you can get that document.  These things
typically happen very quickly, but it's not instantaneous.
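That small visibility window can be measured with a quick probe (a sketch; `fetch` here is a hypothetical stand-in for any function that performs the RTG and returns the parsed "doc" value or None):

```python
import time

def visibility_delay(fetch, timeout=5.0, poll=0.01):
    # Poll the realtime get right after indexing a doc; return the number
    # of seconds until it shows up, or None if it never appears in time.
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if fetch() is not None:
            return time.monotonic() - start
        time.sleep(poll)
    return None
```

In practice `fetch` would wrap the same /get URL used earlier in the thread; a measured delay consistently near zero would suggest the nulls are not simply this write-to-searcher lag.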

> To add up to what Chris was saying, although the core that is seeing the
> issue is not hit very hard, other core in the setup will be. We are
> building a clustering environment that has auto-scaling so if we are under
> heavy load, we can easily have 200-300 client hitting the SOLR instance
> simultaneously.

That much traffic is going to need multiple replicas on separate
hardware, with something in place to do load balancing. Unless your code
is Java and you can use CloudSolrClient, I would recommend an external
load balancer.

Thanks,
Shawn

Re: Realtime get not always returning existing data

Erick Erickson
I've set up a test program on a local machine; we'll see if I can
reproduce it. Here's the setup:

1> created a 2-shard, leader(primary)-only collection
2> added 1M simple docs to it (ids 0-999,999) plus some text
3> re-added 100,000 docs with random ids between 0 and 999,999
     (inclusive) to ensure there were deleted docs. Don't have any clue
     whether that matters.
4> fired up a 16-thread query program doing RTGs on random doc IDs.
     The program stops when it either gets a null response or the
     response isn't the doc asked for.
5> running 7.3.1
6> I'm using the SolrJ RTG code 'cause it was easy
7> all of this runs locally on a Mac Pro, no network involved, which is
     another variable I suppose
8> 7M queries later, no issues
9> there's no indexing going on at all
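Erick's SolrJ harness could be approximated outside Java with a small stdlib-only script along these lines (a sketch: the base URL and collection name are illustrative, and `id` is assumed to be the uniqueKey field):

```python
import json
import random
import threading
import urllib.request

BASE = "http://localhost:8983/solr/test/get"  # illustrative collection
N_DOCS = 1_000_000

def is_failure(doc, doc_id):
    # Mirror the stop condition above: a null response, or a response
    # that isn't the doc that was asked for.
    return doc is None or str(doc.get("id")) != str(doc_id)

def hammer(n_queries, failures):
    for _ in range(n_queries):
        doc_id = random.randrange(N_DOCS)
        with urllib.request.urlopen(f"{BASE}?id={doc_id}&wt=json") as resp:
            doc = json.load(resp).get("doc")
        if is_failure(doc, doc_id):
            failures.append((doc_id, doc))
            return  # stop this thread on the first anomaly

# 16 threads, as in the test above:
# failures = []
# threads = [threading.Thread(target=hammer, args=(500_000, failures))
#            for _ in range(16)]
```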

Steve and Chris:

What about this test setup do you imagine doesn't reflect what your
setups are doing? Things I can think of, in order of what I'd test:

> mimic how y'all are calling RTG more faithfully
> index to this collection, perhaps not at a high rate
> create another collection and actively index to it
> separate the machines running Solr from the one doing any querying
     or indexing
> ???

And, of course, if it reproduces, then run it to death on 7.5 to see if
it's still a problem.

Best,
Erick
On Fri, Sep 28, 2018 at 10:21 AM Shawn Heisey <[hidden email]> wrote:

>
> On 9/28/2018 6:09 AM, sgaron cse wrote:
> > because this is a test deployment replica is set to 1 so as far as I
> > understand, data will not be replicated for this core. Basically we have
> > two SOLR instances running on the same box. One on port 8983, the other on
> > port 8984. We have 9 cores on this SOLR cloud deployment, 5 of which on the
> > instance on port 8983 and the other 4 on port 8984.
>
> A question that isn't really related to the problem you're investigating
> now:  Why are you running two Solr instances on the same machine?  9
> cores is definitely not too many for one Solr instance.
>
> > As far as I can tell
> > all cores suffer from the occasional null document. But the one that I can
> > easily see error from is a config core where we store configuration data
> > for our system. Since the configuration data should always be there we
> > throw exceptions as soon as we get a null document which is why I noticed
> > the problem.
>
> When you say "null document" do you mean that you get no results, or
> that you get a result with a document, but that document has nothing in
> it?  Are there any errors returned or logged by Solr when this happens?
>
> > Our client code that connects to the APIs randomly chooses between all the
> > different ports because it does not know which instance it should ask. So
> > no, we did not try sending directly to the instance that has the data but
> > since there is no replica there is no way that this should get out of sync.
>
> I was suggesting this as a troubleshooting step, not a change to how you
> use Solr.  Basically trying to determine what happens if you send a
> request directly to the instance and core that contains the document
> with distrib=false, to see if it behaves differently than when it's a
> more generic collection-directed query.  The idea was to try and narrow
> down exactly where to look for a problem.
>
> If you wait a few seconds, does the problem go away?  When using real
> time get, a new document must be written to a segment and a new realtime
> searcher must be created before you can get that document.  These things
> typically happen very quickly, but it's not instantaneous.
>
> > To add up to what Chris was saying, although the core that is seeing the
> > issue is not hit very hard, other core in the setup will be. We are
> > building a clustering environment that has auto-scaling so if we are under
> > heavy load, we can easily have 200-300 client hitting the SOLR instance
> > simultaneously.
>
> That much traffic is going to need multiple replicas on separate
> hardware, with something in place to do load balancing. Unless your code
> is Java and you can use CloudSolrClient, I would recommend an external
> load balancer.
>
> Thanks,
> Shawn
>
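Shawn's distrib=false suggestion amounts to addressing one core directly instead of letting SolrCloud route the request. A minimal sketch of building such a request URL (the host, port, and core name below are placeholders, not the thread's actual deployment):

```java
public class DirectRtgUrl {
    // Build a real-time-get URL aimed at one specific core; distrib=false
    // tells Solr not to forward the request to other shards/replicas.
    static String buildUrl(String host, int port, String core, String id) {
        return "http://" + host + ":" + port + "/solr/" + core
                + "/get?id=" + id + "&distrib=false&wt=json";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost", 8983, "config_shard1_replica_n1", "42"));
    }
}
```

If the direct request consistently returns the document while the collection-level request intermittently returns {doc:null}, that narrows the problem down to request routing.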

Re: Realtime get not always returning existing data

Erick Erickson
Well, I flipped indexing on and after another 7 million queries, no fails.
No reason to stop just yet, but not encouraging so far...


On Fri, Sep 28, 2018, 10:58 Erick Erickson <[hidden email]> wrote:

> I've set up a test program on a local machine, we'll see if I can reproduce
> here's the setup:
>
> 1> created a 2-shard, leader(primary) only collection
> 2> added 1M simple docs to it (ids 0-999,999) and some text
> 3> re-added 100_000 docs with a random id between 0 - 999,999 (inclusive)
>      to insure there were deleted docs. Don't have any clue whether
> that matters.
> 4> fired up a 16 thread query program doing RTG on random doc IDs
>      The program will stop when either it gets a null response or the
> response
>      isn't the doc asked for.
> 5> running 7.3.1
> 6> I'm using the SolrJ RTG code 'cause it was easy
> 7> All this is running locally on a Mac Pro, no network involved which is
>      another variable I suppose
> 8> 7M queries later no issues
> 9> there's no indexing going on at all

Re: Realtime get not always returning existing data

sgaron cse
@Shawn
We're running two instances on one machine for two reasons:
1. The box has plenty of resources (48 cores / 256GB RAM) and since I was
reading that it's not recommended to use more than 31GB of heap in SOLR, we
figured 96GB for keeping index data in OS cache + 31GB of heap per
instance was a good idea.
2. We're in a testing phase, so we wanted a SOLR cloud configuration; we will
most likely have a much bigger deployment once going to production. In prod
right now, we currently run a six-machine Riak cluster. Riak is a
key/value document store and has SOLR built-in for search, but we are trying
to push the key/value aspect of Riak inside SOLR. That way we would have
one less piece to worry about in our system.

When I say null document, I mean the /get API returns: {doc: null}

The problem is definitely not always there. We also have large periods of
time (a few hours) where we have no problems. I'm just extremely hesitant
to retry when I get a null document because in some cases, getting a null
document is a valid outcome. Our caching layer heavily relies on this, for
example. If I were to retry every null I'd pay a big penalty in
performance.
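One way to avoid paying that penalty on every lookup is to retry only for ids that are known to exist (such as the config documents), leaving ordinary cache misses untouched. A minimal sketch, with the lookup as a stand-in for an RTG call:

```java
import java.util.function.Supplier;

public class MustExistGet {
    // Retry a lookup a few times, but only for documents that must exist;
    // plain cache lookups (where null is a valid answer) skip this path.
    static <T> T getMustExist(Supplier<T> lookup, int maxAttempts) {
        T doc = null;
        for (int i = 0; i < maxAttempts && doc == null; i++) {
            doc = lookup.get(); // e.g. a SolrJ getById(id) call
        }
        if (doc == null) {
            throw new IllegalStateException("missing after " + maxAttempts + " attempts");
        }
        return doc;
    }

    public static void main(String[] args) {
        // Simulated lookup that returns null once, then succeeds.
        int[] calls = {0};
        String doc = getMustExist(() -> ++calls[0] < 2 ? null : "{doc}", 3);
        System.out.println(doc + " after " + calls[0] + " calls"); // {doc} after 2 calls
    }
}
```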

As for your last comment, part of our testing phase is also testing the
limits. Our framework has auto-scaling built-in, so if we have a burst of
requests, the system will automatically spin up more clients. We're pushing
10% of our production traffic to that test server to see how it handles
it.

@Erick
Thanks a lot for testing; there has to be a variable that I don't
understand for this scenario to happen. Let me try to reproduce it reliably
on my side, and when I do I'll send you instructions on how to reproduce
it. I don't want you to waste your time on this. There might be one thing,
though, that your testing scenario does not account for: while we use the
/get API a lot, we use it sporadically and most likely create a new
connection to the API each time, because calls come from newly spawned
processes. I had no luck reproducing the problem using 50 threads while
keeping a session open and hammering the /get API. I need to find a better
way to reproduce this.

@both Erick and Shawn
I saw something really weird today: there was a mix-up in some of the cores'
data. Basically, one core had data in it that should belong to another
core. Here's my question about this: is it possible that two requests to the
/get API coming in at the same time would get confused and either both get
the same result or have the results inverted? Because that could explain my
{doc:null} problem; our caching layer that looks up IDs in some cores
is usually hit pretty hard.

Anyway, give me time to do more testing on Monday/Tuesday to try to
pinpoint the issue and make it easily reproducible.

Thanks again for helping,
Steve

Re: Realtime get not always returning existing data

Shawn Heisey-2
On 9/28/2018 8:11 PM, sgaron cse wrote:
> @Shawn
> We're running two instances on one machine for two reasons:
> 1. The box has plenty of resources (48 cores / 256GB ram) and since I was
> reading that it's not recommended to use more than 31GB of heap in SOLR we
> figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> instance was a good idea.

Do you know that these Solr instances actually DO need 31 GB of heap, or
are you following advice from somewhere, saying "use one quarter of your
memory as the heap size"?  That advice is not in the Solr documentation,
and never will be.  Figuring out the right heap size requires
experimentation.

https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F

How big (on disk) are each of these nine cores, and how many documents
are in each one?  Which of them is in each Solr instance?  With that
information, we can make a *guess* about how big your heap should be. 
Figuring out whether the guess is correct generally requires careful
analysis of a GC log.

> 2. We're in testing phase so we wanted a SOLR cloud configuration, we will
> most likely have a much bigger deployment once going to production. In prod
> right now, we currently run a six-machine Riak cluster. Riak is a
> key/value document store and has SOLR built-in for search, but we are trying
> to push the key/value aspect of Riak inside SOLR. That way we would have
> one less piece to worry about in our system.

Solr is not a database.  It is not intended to be a data repository. 
All of its optimizations (most of which are actually in Lucene) are
geared towards search.  While technically it can be a key-value store,
that is not what it was MADE for.  Software actually designed for that
role is going to be much better than Solr as a key-value store.

> When I say null document, I mean the /get API returns: {doc: null}
>
> The problem is definitely not always there. We also have large periods of
> time (a few hours) where we have no problems. I'm just extremely hesitant
> to retry when I get a null document because in some cases, getting a null
> document is a valid outcome. Our caching layer heavily relies on this, for
> example. If I were to retry every null I'd pay a big penalty in
> performance.

I've just done a little test with the 7.5.0 techproducts example.  It
looks like returning doc:null actually is how the RTG handler says it
didn't find the document.  This seems very wrong to me, but I didn't
design it, and that response needs SOME kind of format.
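For illustration, a client that wants to treat {"doc":null} as "not found" rather than as an error could inspect the response body; this naive string check is only a sketch, and a real client should use a JSON parser:

```java
public class RtgResponse {
    // The RTG handler signals "not found" with {"doc":null} and a normal
    // HTTP 200, so the body has to be inspected; a missing document does
    // not surface as an HTTP-level error.
    static boolean docFound(String jsonBody) {
        return !jsonBody.replace(" ", "").contains("\"doc\":null");
    }

    public static void main(String[] args) {
        System.out.println(docFound("{\"doc\":null}"));            // false
        System.out.println(docFound("{\"doc\":{\"id\":\"42\"}}")); // true
    }
}
```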

Have you done any testing to see whether the standard searching handler
(typically /select, but many other URL paths are possible) returns
results when RTG doesn't?  Do you know for these failures whether the
document has been committed or not?

> As for your last comment, part of our testing phase is also testing the
> limits. Our framework has auto-scaling built-in so if we have a burst of
> requests, the system will automatically spin up more clients. We're pushing
> 10% of our production system to that Test server to see how it will handle
> it.

To spin up another replica, Solr must copy all its index data from the
leader replica.  Not only can this take a long time if the index is big,
but it will put a lot of extra I/O load on the machine(s) with the
leader roles.  So performance will actually be WORSE before it gets
better when you spin up another replica, and if the index is big, that
condition will persist for quite a while.  Copying the index data will
be constrained by the speed of your network and by the speed of your
disks.  Often the disks are slower than the network, but that is not
always the case.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: Realtime get not always returning existing data

Erick Erickson
Steve:

bq.  Basically, one core had data in it that should belong to another
core. Here's my question about this: is it possible that two requests to the
/get API coming in at the same time would get confused and either both get
the same result or have the results inverted?

Well, that shouldn't be happening, these are all supposed to be thread-safe
calls.... All things are possible of course ;)

If two replicas of the same shard have different documents, that could account
for what you're seeing, meanwhile begging the question of why that is the case
since it should never be true for a quiescent index. Technically there _are_
conditions where this is true on a very temporary basis, commits on the leader
and follower can trigger at different wall-clock times. Say your soft commit
(or hard-commit-with-opensearcher-true) is 10 seconds. It should never be the
case that s1r1 and s1r2 are out of sync 10 seconds after the last update was
sent. This doesn't seem likely from what you've described though...

Hmmmm. I guess one other thing I can set up is to have a bunch of dummy
collections lying around. Currently I have only the active one, and if
there's some code path whereby the RTG request goes to a replica of a
different collection, my test setup wouldn't reproduce it.

Currently I'm running a 2-shard, 1-replica setup, so if there's some way
that the replicas get out of sync, that wouldn't show either.

So I'm starting another run with these changes:
> opening a new connection each query
> switched so the collection I'm querying is 2x2
> added some dummy collections that are empty

One nit: while "core" is exactly correct, when we talk about a core that's
part of a collection we try to use "replica" to be clear we're talking
about a core with some added characteristics, i.e. we're in
SolrCloud-land. No big deal of course....

Best,
Erick

Re: Realtime get not always returning existing data

Erick Erickson
57 million queries later, with constant indexing going on and 9 dummy
collections in the mix and the main collection I'm querying having 2
shards, 2 replicas each, I have no errors.

So unless this code doesn't exercise a path similar to yours, I'm not
sure what more I can test. "It works on my machine" ;)

Here's my querying code; does it look like what you're doing?

      while (Main.allStop.get() == false) {
        // A new connection is opened for every query on purpose.
        try (SolrClient client = new HttpSolrClient.Builder()
            // .withBaseSolrUrl("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")
            .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {

          String lower = Integer.toString(rand.nextInt(1_000_000));
          SolrDocument rsp = client.getById(lower);
          if (rsp == null) {
            System.out.println("Got a null response!");
            Main.allStop.set(true);
          } else if (rsp.get("id").equals(lower) == false) {
            System.out.println("Got an invalid response, looking for "
                + lower + " got: " + rsp.get("id"));
            Main.allStop.set(true);
          }

          long queries = Main.eoeCounter.incrementAndGet();
          if ((queries % 100_000) == 0) {
            long seconds = (System.currentTimeMillis() - Main.start) / 1000;
            System.out.println("Query count: "
                + numFormatter.format(queries) + ", rate is "
                + numFormatter.format(queries / seconds) + " QPS");
          }
        } catch (Exception cle) {
          cle.printStackTrace();
          Main.allStop.set(true);
        }
      }
  }

Re: Realtime get not always returning existing data

Chris Ulicny
In our case, we are heavily indexing in the collection while the /get
requests are happening which is what we assumed was causing this very rare
behavior. However, we have experienced the problem for a collection where
the following happens in sequence with minutes in between them.

1. Document id=1 is indexed
2. Document successfully retrieved with /get?id=1
3. Document failed to be retrieved with /get?id=1
4. Document successfully retrieved with /get?id=1

We haven't looked at the issue in a while, so I don't have the exact
timing of that sequence on hand right now. I'll try to find an actual
example, although I'm relatively certain it was multiple minutes between
each of those requests. However, our autocommit (and soft commit) times are
60s for both collections.

I think the following two are probably the biggest differences for our
setup, besides the version difference (v6.3.0):

> index to this collection, perhaps not at a high rate
> separate the machines running solr from the one doing any querying or
> indexing

The clients are on 3 hosts separate from the solr instances. The total
number of threads that are making updates and making /get requests is
around 120-150. About 40-50 per host. Each of our two collections gets an
average of 500 requests per second constantly for ~5 minutes, and then the
number slowly tapers off to essentially 0 after ~15 minutes.

Every thread attempts to make the same series of requests.

-- Update with "_version_=-1". If successful, no other requests are made.
-- On 409 Conflict failure, it makes a /get request for the id
-- On doc:null failure, the client handles the error and moves on
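
The three-way request series above can be sketched as a small decision function (plain Java, no SolrJ). The 200/409 status codes and the {"doc":null} body shape are the ones named in this thread; the enum and method names are invented for illustration:

```java
public class GetAfterConflict {

    enum Outcome { CREATED, FETCHED, MISSING_AFTER_CONFLICT }

    // updateStatus: HTTP status of the create attempt sent with _version_=-1.
    // getBody: raw JSON body of the follow-up /get (null if none was made).
    static Outcome classify(int updateStatus, String getBody) {
        if (updateStatus == 200) {
            return Outcome.CREATED;   // doc did not exist yet; the create won
        }
        if (updateStatus == 409) {    // version conflict: the doc exists
            boolean miss = getBody != null
                    && getBody.replaceAll("\\s", "").contains("\"doc\":null");
            return miss ? Outcome.MISSING_AFTER_CONFLICT : Outcome.FETCHED;
        }
        throw new IllegalStateException("unexpected status " + updateStatus);
    }
}
```

MISSING_AFTER_CONFLICT is the contradictory case: the 409 proves the document exists, yet the follow-up /get claims it does not.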

Combining this with the previous series of /get requests, we end up with
situations where an update fails as expected, but the subsequent /get
request fails to retrieve the existing document:

1. Thread 1 updates id=1 successfully
2. Thread 2 tries to update id=1, fails (409)
3. Thread 2 tries to get id=1, succeeds.

...Minutes later...

4. Thread 3 tries to update id=1, fails (409)
5. Thread 3 tries to get id=1, fails (doc:null)

...Minutes later...

6. Thread 4 tries to update id=1, fails (409)
7. Thread 4 tries to get id=1, succeeds.

As Steven mentioned, it happens very, very rarely. We tried to recreate it
in a more controlled environment, but ran into the same issue that you are,
Erick. Every simplified situation we ran produced no problems. Since it's
not a large issue for us and happens very rarely, we stopped trying to
recreate it.


On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <[hidden email]>
wrote:

> 57 million queries later, with constant indexing going on and 9 dummy
> collections in the mix and the main collection I'm querying having 2
> shards, 2 replicas each, I have no errors.
>
> So unless the code doesn't look like it exercises any similar path,
> I'm not sure what more I can test. "It works on my machine" ;)
>
> Here's my querying code, does it look like what you're seeing?
>
>       while (Main.allStop.get() == false) {
>         try (SolrClient client = new HttpSolrClient.Builder()
> //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
>             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
>
>           //SolrQuery query = new SolrQuery();
>           String lower = Integer.toString(rand.nextInt(1_000_000));
>           SolrDocument rsp = client.getById(lower);
>           if (rsp == null) {
>             System.out.println("Got a null response!");
>             Main.allStop.set(true);
>           }
>
>           rsp = client.getById(lower);
>
>           if (rsp.get("id").equals(lower) == false) {
>             System.out.println("Got an invalid response, looking for "
> + lower + " got: " + rsp.get("id"));
>             Main.allStop.set(true);
>           }
>           long queries = Main.eoeCounter.incrementAndGet();
>           if ((queries % 100_000) == 0) {
>             long seconds = (System.currentTimeMillis() - Main.start) /
> 1000;
>             System.out.println("Query count: " +
> numFormatter.format(queries) + ", rate is " +
> numFormatter.format(queries / seconds) + " QPS");
>           }
>         } catch (Exception cle) {
>           cle.printStackTrace();
>           Main.allStop.set(true);
>         }
>       }
>   }
> On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> <[hidden email]> wrote:
> >
> > Steve:
> >
> > bq.  Basically, one core had data in it that should belong to another
> > core. Here's my question about this: Is it possible that two request to
> the
> > /get API coming in at the same time would get confused and either both
> get
> > the same result or result get inverted?
> >
> > Well, that shouldn't be happening, these are all supposed to be
> thread-safe
> > calls.... All things are possible of course ;)
> >
> > If two replicas of the same shard have different documents, that could
> account
> > for what you're seeing, meanwhile begging the question of why that is
> the case
> > since it should never be true for a quiescent index. Technically there
> _are_
> > conditions where this is true on a very temporary basis, commits on the
> leader
> > and follower can trigger at different wall-clock times. Say your soft
> commit
> > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> be the
> > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> was
> > sent. This doesn't seem likely from what you've described though...
> >
> > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> dummy
> > collections laying around. Currently I have only the active one, and
> > if there's some
> > code path whereby the RTG request goes to a replica of a different
> > collection, my
> > test setup wouldn't reproduce it.
> >
> > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > way that the replicas
> > get out of sync that wouldn't show either.
> >
> > So I'm starting another run with these changes:
> > > opening a new connection each query
> > > switched so the collection I'm querying is 2x2
> > > added some dummy collections that are empty
> >
> > One nit, while "core" is exactly correct. When we talk about a core
> > that's part of a collection, we try to use "replica" to be clear we're
> > talking about
> > a core with some added characteristics, i.e. we're in SolrCloud-land.
> > No big deal
> > of course....
> >
> > Best,
> > Erick
> > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <[hidden email]>
> wrote:
> > >
> > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > @Shawn
> > > > We're running two instances on one machine for two reasons:
> > > > 1. The box has plenty of resources (48 cores / 256GB RAM) and since
> > > > I was reading that it's not recommended to use more than 31GB of heap
> > > > in SOLR, we figured 96 GB for keeping index data in OS cache + 31 GB
> > > > of heap per instance was a good idea.
> > >
> > > Do you know that these Solr instances actually DO need 31 GB of heap,
> or
> > > are you following advice from somewhere, saying "use one quarter of
> your
> > > memory as the heap size"?  That advice is not in the Solr
> documentation,
> > > and never will be.  Figuring out the right heap size requires
> > > experimentation.
> > >
> > >
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > >
> > > How big (on disk) are each of these nine cores, and how many documents
> > > are in each one?  Which of them is in each Solr instance?  With that
> > > information, we can make a *guess* about how big your heap should be.
> > > Figuring out whether the guess is correct generally requires careful
> > > analysis of a GC log.
> > >
> > > > 2. We're in the testing phase, so we wanted a SOLR cloud
> > > > configuration; we will most likely have a much bigger deployment once
> > > > going to production. In prod right now, we run a six-machine Riak
> > > > cluster. Riak is a key/value document store and has SOLR built-in for
> > > > search, but we are trying to push the key/value aspect of Riak inside
> > > > SOLR. That way we would have one less piece to worry about in our
> > > > system.
> > >
> > > Solr is not a database.  It is not intended to be a data repository.
> > > All of its optimizations (most of which are actually in Lucene) are
> > > geared towards search.  While technically it can be a key-value store,
> > > that is not what it was MADE for.  Software actually designed for that
> > > role is going to be much better than Solr as a key-value store.
> > >
> > > > When I say null document, I mean the /get API returns: {doc: null}
> > > >
> > > > The problem is definitely not always there. We also have large
> > > > periods of time (a few hours) where we have no problems. I'm just
> > > > extremely hesitant to retry when I get a null document because in
> > > > some cases, getting a null document is a valid outcome. Our caching
> > > > layer relies heavily on this, for example. If I were to retry every
> > > > null, I'd pay a big penalty in performance.
> > >
> > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > looks like returning doc:null actually is how the RTG handler says it
> > > didn't find the document.  This seems very wrong to me, but I didn't
> > > design it, and that response needs SOME kind of format.
> > >
> > > Have you done any testing to see whether the standard searching handler
> > > (typically /select, but many other URL paths are possible) returns
> > > results when RTG doesn't?  Do you know for these failures whether the
> > > document has been committed or not?
> > >
> > > > As for your last comment, part of our testing phase is also testing
> the
> > > > limits. Our framework has auto-scaling built-in so if we have a
> burst of
> > > > request, the system will automatically spin up more clients. We're
> pushing
> > > > 10% of our production system to that Test server to see how it will
> handle
> > > > it.
> > >
> > > To spin up another replica, Solr must copy all its index data from the
> > > leader replica.  Not only can this take a long time if the index is
> big,
> > > but it will put a lot of extra I/O load on the machine(s) with the
> > > leader roles.  So performance will actually be WORSE before it gets
> > > better when you spin up another replica, and if the index is big, that
> > > condition will persist for quite a while.  Copying the index data will
> > > be constrained by the speed of your network and by the speed of your
> > > disks.  Often the disks are slower than the network, but that is not
> > > always the case.
> > >
> > > Thanks,
> > > Shawn
> > >
>
Reply | Threaded
Open this post in threaded view
|

Re: Realtime get not always returning existing data

Erick Erickson
Thanks. I'll be away for the rest of the week, so won't be able to try
anything more....
On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <[hidden email]> wrote:

Reply | Threaded
Open this post in threaded view
|

Re: Realtime get not always returning existing data

Erick Erickson
Hmmmm. I wonder if a version conflict or perhaps some other failure can
somehow cause this. It shouldn't be very hard to add that to my test
setup, just randomly add a bogus _version_ field value.

Erick
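
A rough sketch of the randomized bogus-_version_ step described above, assuming the standard Solr JSON update body; the val_s field and the version values are made up for illustration:

```java
import java.util.Random;

public class BogusVersionUpdate {

    // Build a JSON update body carrying an explicit _version_. A positive
    // _version_ asks Solr to apply the update only if the stored version
    // matches, so a random large value should almost always draw a 409.
    static String updateBody(String id, long version) {
        return "[{\"id\":\"" + id + "\",\"val_s\":\"x\",\"_version_\":" + version + "}]";
    }

    public static void main(String[] args) {
        Random rand = new Random();
        String id = Integer.toString(rand.nextInt(1_000_000));
        long bogus = 1_000_000_000L + rand.nextInt(1_000_000);
        System.out.println("POST to /update: " + updateBody(id, bogus));
        // ...send the POST, expect a 409, then immediately /get?id=<id> and
        // check the response for {"doc":null} (see the earlier messages).
    }
}
```

The interesting question is whether the /get issued right after the expected 409 ever comes back {"doc":null}.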
On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <[hidden email]> wrote:

>
> Thanks. I'll be away for the rest of the week, so won't be able to try
> anything more....
> On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <[hidden email]> wrote:
> >
> > In our case, we are heavily indexing in the collection while the /get
> > requests are happening which is what we assumed was causing this very rare
> > behavior. However, we have experienced the problem for a collection where
> > the following happens in sequence with minutes in between them.
> >
> > 1. Document id=1 is indexed
> > 2. Document successfully retrieved with /get?id=1
> > 3. Document failed to be retrieved with /get?id=1
> > 4. Document successfully retrieved with /get?id=1
> >
> > We've haven't looked at the issue in a while, so I don't have the exact
> > timing of that sequence on hand right now. I'll try to find an actual
> > example, although I'm relatively certain it was multiple minutes in between
> > each of those requests. However our autocommit (and soft commit) times are
> > 60s for both collections.
> >
> > I think the following two are probably the biggest differences for our
> > setup, besides the version difference (v6.3.0):
> >
> > > index to this collection, perhaps not at a high rate
> > > separate the machines running solr from the one doing any querying or
> > indexing
> >
> > The clients are on 3 hosts separate from the solr instances. The total
> > number of threads that are making updates and making /get requests is
> > around 120-150. About 40-50 per host. Each of our two collections gets an
> > average of 500 requests per second constantly for ~5 minutes, and then the
> > number slowly tapers off to essentially 0 after ~15 minutes.
> >
> > Every thread attempts to make the same series of requests.
> >
> > -- Update with "_version_=-1". If successful, no other requests are made.
> > -- On 409 Conflict failure, it makes a /get request for the id
> > -- On doc:null failure, the client handles the error and moves on
> >
> > Combining this with the previous series of /get requests, we end up with
> > situations where an update fails as expected, but the subsequent /get
> > request fails to retrieve the existing document:
> >
> > 1. Thread 1 updates id=1 successfully
> > 2. Thread 2 tries to update id=1, fails (409)
> > 3. Thread 2 tries to get id=1 succeeds.
> >
> > ...Minutes later...
> >
> > 4. Thread 3 tries to update id=1, fails (409)
> > 5. Thread 3 tries to get id=1, fails (doc:null)
> >
> > ...Minutes later...
> >
> > 6. Thread 4 tries to update id=1, fails (409)
> > 7. Thread 4 tries to get id=1 succeeds.
> >
> > As Steven mentioned, it happens very, very rarely. We tried to recreate it
> > in a more controlled environment, but ran into the same issue that you are,
> > Erick. Every simplified situation we ran produced no problems. Since it's
> > not a large issue for us and happens very rarely, we stopped trying to
> > recreate it.
> >
> >
> > On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <[hidden email]>
> > wrote:
> >
> > > 57 million queries later, with constant indexing going on and 9 dummy
> > > collections in the mix and the main collection I'm querying having 2
> > > shards, 2 replicas each, I have no errors.
> > >
> > > So unless the code doesn't look like it exercises any similar path,
> > > I'm not sure what more I can test. "It works on my machine" ;)
> > >
> > > Here's my querying code, does it look like it what you're seeing?
> > >
> > >       while (Main.allStop.get() == false) {
> > >         try (SolrClient client = new HttpSolrClient.Builder()
> > > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> > >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
> > >
> > >           //SolrQuery query = new SolrQuery();
> > >           String lower = Integer.toString(rand.nextInt(1_000_000));
> > >           SolrDocument rsp = client.getById(lower);
> > >           if (rsp == null) {
> > >             System.out.println("Got a null response!");
> > >             Main.allStop.set(true);
> > >           }
> > >
> > >           rsp = client.getById(lower);
> > >
> > >           if (rsp.get("id").equals(lower) == false) {
> > >             System.out.println("Got an invalid response, looking for "
> > > + lower + " got: " + rsp.get("id"));
> > >             Main.allStop.set(true);
> > >           }
> > >           long queries = Main.eoeCounter.incrementAndGet();
> > >           if ((queries % 100_000) == 0) {
> > >             long seconds = (System.currentTimeMillis() - Main.start) /
> > > 1000;
> > >             System.out.println("Query count: " +
> > > numFormatter.format(queries) + ", rate is " +
> > > numFormatter.format(queries / seconds) + " QPS");
> > >           }
> > >         } catch (Exception cle) {
> > >           cle.printStackTrace();
> > >           Main.allStop.set(true);
> > >         }
> > >       }
> > >   }On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > > <[hidden email]> wrote:
> > > >
> > > > Steve:
> > > >
> > > > bq.  Basically, one core had data in it that should belong to another
> > > > core. Here's my question about this: Is it possible that two request to
> > > the
> > > > /get API coming in at the same time would get confused and either both
> > > get
> > > > the same result or result get inverted?
> > > >
> > > > Well, that shouldn't be happening, these are all supposed to be
> > > thread-safe
> > > > calls.... All things are possible of course ;)
> > > >
> > > > If two replicas of the same shard have different documents, that could
> > > account
> > > > for what you're seeing, meanwhile begging the question of why that is
> > > the case
> > > > since it should never be true for a quiescent index. Technically there
> > > _are_
> > > > conditions where this is true on a very temporary basis, commits on the
> > > leader
> > > > and follower can trigger at different wall-clock times. Say your soft
> > > commit
> > > > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> > > be the
> > > > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> > > was
> > > > sent. This doesn't seem likely from what you've described though...
> > > >
> > > > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> > > dummy
> > > > collections laying around. Currently I have only the active one, and
> > > > if there's some
> > > > code path whereby the RTG request goes to a replica of a different
> > > > collection, my
> > > > test setup wouldn't reproduce it.
> > > >
> > > > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > > > way that the replicas
> > > > get out of sync that wouldn't show either.
> > > >
> > > > So I'm starting another run with these changes:
> > > > > opening a new connection each query
> > > > > switched so the collection I'm querying is 2x2
> > > > > added some dummy collections that are empty
> > > >
> > > > One nit, while "core" is exactly correct. When we talk about a core
> > > > that's part of a collection, we try to use "replica" to be clear we're
> > > > talking about
> > > > a core with some added characteristics, i.e. we're in SolrCloud-land.
> > > > No big deal
> > > > of course....
> > > >
> > > > Best,
> > > > Erick
> > > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <[hidden email]>
> > > wrote:
> > > > >
> > > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > > @Shawn
> > > > > > We're running two instance on one machine for two reason:
> > > > > > 1. The box has plenty of resources (48 cores / 256GB ram) and since
> > > I was
> > > > > > reading that it's not recommended to use more than 31GB of heap in
> > > SOLR we
> > > > > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > > > > instance was a good idea.
> > > > >
> > > > > Do you know that these Solr instances actually DO need 31 GB of heap,
> > > or
> > > > > are you following advice from somewhere, saying "use one quarter of
> > > your
> > > > > memory as the heap size"?  That advice is not in the Solr
> > > documentation,
> > > > > and never will be.  Figuring out the right heap size requires
> > > > > experimentation.
> > > > >
> > > > >
> > > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > > >
> > > > > How big (on disk) are each of these nine cores, and how many documents
> > > > > are in each one?  Which of them is in each Solr instance?  With that
> > > > > information, we can make a *guess* about how big your heap should be.
> > > > > Figuring out whether the guess is correct generally requires careful
> > > > > analysis of a GC log.
> > > > >
> > > > > > 2. We're in testing phase so we wanted a SOLR cloud configuration,
> > > we will
> > > > > > most likely have a much bigger deployment once going to production.
> > > In prod
> > > > > > right now, we currently to run a six machines Riak cluster. Riak is a
> > > > > > key/value document store an has SOLR built-in for search, but we are
> > > trying
> > > > > > to push the key/value aspect of Riak inside SOLR. That way we would
> > > have
> > > > > > one less piece to worry about in our system.
> > > > >
> > > > > Solr is not a database.  It is not intended to be a data repository.
> > > > > All of its optimizations (most of which are actually in Lucene) are
> > > > > geared towards search.  While technically it can be a key-value store,
> > > > > that is not what it was MADE for.  Software actually designed for that
> > > > > role is going to be much better than Solr as a key-value store.
> > > > >
> > > > > > When I say null document, I mean the /get API returns: {doc: null}
> > > > > >
> > > > > > The problem is definitely not always there. We also have large
> > > period of
> > > > > > time (few hours) were we have no problems. I'm just extremely
> > > hesitant on
> > > > > > retrying when I get a null document because in some case, getting a
> > > null
> > > > > > document is a valid outcome. Our caching layer heavily rely on this
> > > for
> > > > > > example. If I was to retry every nulls I'd pay a big penalty in
> > > > > > performance.
> > > > >
> > > > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > > > looks like returning doc:null actually is how the RTG handler says it
> > > > > didn't find the document.  This seems very wrong to me, but I didn't
> > > > > design it, and that response needs SOME kind of format.
> > > > >
> > > > > Have you done any testing to see whether the standard searching handler
> > > > > (typically /select, but many other URL paths are possible) returns
> > > > > results when RTG doesn't?  Do you know for these failures whether the
> > > > > document has been committed or not?
> > > > >
> > > > > > As for your last comment, part of our testing phase is also testing
> > > the
> > > > > > limits. Our framework has auto-scaling built-in so if we have a
> > > burst of
> > > > > > requests, the system will automatically spin up more clients. We're
> > > pushing
> > > > > > 10% of our production system to that Test server to see how it will
> > > handle
> > > > > > it.
> > > > >
> > > > > To spin up another replica, Solr must copy all its index data from the
> > > > > leader replica.  Not only can this take a long time if the index is
> > > big,
> > > > > but it will put a lot of extra I/O load on the machine(s) with the
> > > > > leader roles.  So performance will actually be WORSE before it gets
> > > > > better when you spin up another replica, and if the index is big, that
> > > > > condition will persist for quite a while.  Copying the index data will
> > > > > be constrained by the speed of your network and by the speed of your
> > > > > disks.  Often the disks are slower than the network, but that is not
> > > > > always the case.
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > >

Re: Realtime get not always returning existing data

Erick Erickson
Well, assigning a bogus version that generates a 409 error and then
immediately doing an RTG on the doc doesn't fail for me either, 18
million tries later. So I'm afraid I haven't a clue where to go from
here. Unless we can somehow find a way to generate this failure, I'm
going to drop it for the foreseeable future.

Erick
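The failing sequence discussed in this thread (a create-only update guarded by _version_=-1, then an RTG on the resulting 409) can be sketched with the HTTP layer stubbed out behind functional interfaces, so the decision logic is explicit and can run without a Solr instance. The class and method names below are illustrative, not SolrJ API:

```java
// Sketch of the client pattern reported in this thread: a create-only update
// guarded by _version_=-1, falling back to a realtime get on a 409 conflict.
// The HTTP calls are injected so the logic can be exercised without Solr.
import java.util.function.Function;
import java.util.function.ToIntFunction;

public class RtgFallback {
    public enum Outcome { CREATED, FETCHED, MISSING }

    public static Outcome insertOrFetch(String id,
                                        ToIntFunction<String> update,   // HTTP status of the _version_=-1 update
                                        Function<String, String> rtg) { // raw JSON body of /get?id=...
        int status = update.applyAsInt(id);
        if (status == 200) {
            return Outcome.CREATED;   // our update won the race; nothing to fetch
        }
        if (status != 409) {
            throw new IllegalStateException("unexpected update status " + status);
        }
        // 409 means the document already exists, so RTG should always find it...
        String body = rtg.apply(id);
        // ...but per this thread, {"doc":null} occasionally comes back anyway.
        return body.contains("\"doc\":null") ? Outcome.MISSING : Outcome.FETCHED;
    }
}
```

The `Outcome.MISSING` branch is exactly the anomaly Chris and Steve describe downthread: a conflict proves the document exists, yet the follow-up /get says it doesn't.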
On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <[hidden email]> wrote:

>
> > Hmmmm. I wonder if a version conflict or perhaps other failure can
> > somehow cause this. It shouldn't be very hard to add that to my test
> > setup, just randomly add a bogus _version_ field value.
>
> Erick
> On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <[hidden email]> wrote:
> >
> > Thanks. I'll be away for the rest of the week, so won't be able to try
> > anything more....
> > On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <[hidden email]> wrote:
> > >
> > > In our case, we are heavily indexing in the collection while the /get
> > > requests are happening which is what we assumed was causing this very rare
> > > behavior. However, we have experienced the problem for a collection where
> > > the following happens in sequence with minutes in between them.
> > >
> > > 1. Document id=1 is indexed
> > > 2. Document successfully retrieved with /get?id=1
> > > 3. Document failed to be retrieved with /get?id=1
> > > 4. Document successfully retrieved with /get?id=1
> > >
> > > We haven't looked at the issue in a while, so I don't have the exact
> > > timing of that sequence on hand right now. I'll try to find an actual
> > > example, although I'm relatively certain it was multiple minutes in between
> > > each of those requests. However our autocommit (and soft commit) times are
> > > 60s for both collections.
> > >
> > > I think the following two are probably the biggest differences for our
> > > setup, besides the version difference (v6.3.0):
> > >
> > > > index to this collection, perhaps not at a high rate
> > > > separate the machines running solr from the one doing any querying or
> > > indexing
> > >
> > > The clients are on 3 hosts separate from the solr instances. The total
> > > number of threads that are making updates and making /get requests is
> > > around 120-150. About 40-50 per host. Each of our two collections gets an
> > > average of 500 requests per second constantly for ~5 minutes, and then the
> > > number slowly tapers off to essentially 0 after ~15 minutes.
> > >
> > > Every thread attempts to make the same series of requests.
> > >
> > > -- Update with "_version_=-1". If successful, no other requests are made.
> > > -- On 409 Conflict failure, it makes a /get request for the id
> > > -- On doc:null failure, the client handles the error and moves on
> > >
> > > Combining this with the previous series of /get requests, we end up with
> > > situations where an update fails as expected, but the subsequent /get
> > > request fails to retrieve the existing document:
> > >
> > > 1. Thread 1 updates id=1 successfully
> > > 2. Thread 2 tries to update id=1, fails (409)
> > > 3. Thread 2 tries to get id=1, succeeds.
> > >
> > > ...Minutes later...
> > >
> > > 4. Thread 3 tries to update id=1, fails (409)
> > > 5. Thread 3 tries to get id=1, fails (doc:null)
> > >
> > > ...Minutes later...
> > >
> > > 6. Thread 4 tries to update id=1, fails (409)
> > > 7. Thread 4 tries to get id=1, succeeds.
> > >
> > > As Steven mentioned, it happens very, very rarely. We tried to recreate it
> > > in a more controlled environment, but ran into the same issue that you did,
> > > Erick: every simplified situation we ran produced no problems. Since it's
> > > not a large issue for us and happens very rarely, we stopped trying to
> > > recreate it.
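For reference, the two HTTP calls each client thread makes in the pattern above can be sketched with the JDK's built-in java.net.http client. The host, port, and collection name are assumptions; only request construction is shown, and the create-only update relies on Solr's optimistic-concurrency `_version_` request parameter:

```java
// Sketch (assumed base URL) of the two requests made per thread: a
// create-only update that 409s if the id exists, and the RTG fallback.
import java.net.URI;
import java.net.http.HttpRequest;

public class SolrRequests {
    // Assumed base URL; substitute your own host, port, and collection.
    static final String BASE = "http://localhost:8983/solr/config";

    /** Create-only add: with _version_=-1, Solr answers 409 Conflict if the id already exists. */
    static HttpRequest createOnly(String docJson) {
        return HttpRequest.newBuilder(URI.create(BASE + "/update?_version_=-1"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("[" + docJson + "]"))
                .build();
    }

    /** Realtime get: should return the document even before it is committed. */
    static HttpRequest rtg(String id) {
        return HttpRequest.newBuilder(URI.create(BASE + "/get?id=" + id + "&wt=json")).build();
    }
}
```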
> > >
> > >
> > > On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <[hidden email]>
> > > wrote:
> > >
> > > > 57 million queries later, with constant indexing going on and 9 dummy
> > > > collections in the mix and the main collection I'm querying having 2
> > > > shards, 2 replicas each, I have no errors.
> > > >
> > > > So unless the code doesn't look like it exercises any similar path,
> > > > I'm not sure what more I can test. "It works on my machine" ;)
> > > >
> > > > Here's my querying code; does it look like what you're seeing?
> > > >
> > > >       while (Main.allStop.get() == false) {
> > > >         try (SolrClient client = new HttpSolrClient.Builder()
> > > > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> > > >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
> > > >
> > > >           //SolrQuery query = new SolrQuery();
> > > >           String lower = Integer.toString(rand.nextInt(1_000_000));
> > > >           SolrDocument rsp = client.getById(lower);
> > > >           if (rsp == null) {
> > > >             System.out.println("Got a null response!");
> > > >             Main.allStop.set(true);
> > > >             continue; // avoid an NPE below; the loop exits on allStop
> > > >           }
> > > >
> > > >           rsp = client.getById(lower);
> > > >
> > > >           if (rsp.get("id").equals(lower) == false) {
> > > >             System.out.println("Got an invalid response, looking for "
> > > > + lower + " got: " + rsp.get("id"));
> > > >             Main.allStop.set(true);
> > > >           }
> > > >           long queries = Main.eoeCounter.incrementAndGet();
> > > >           if ((queries % 100_000) == 0) {
> > > >             long seconds = (System.currentTimeMillis() - Main.start) /
> > > > 1000;
> > > >             System.out.println("Query count: " +
> > > > numFormatter.format(queries) + ", rate is " +
> > > > numFormatter.format(queries / seconds) + " QPS");
> > > >           }
> > > >         } catch (Exception cle) {
> > > >           cle.printStackTrace();
> > > >           Main.allStop.set(true);
> > > >         }
> > > >       }
> > > >   }
> > > >
> > > > On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > > > <[hidden email]> wrote:
> > > > >
> > > > > Steve:
> > > > >
> > > > > bq.  Basically, one core had data in it that should belong to another
> > > > > core. Here's my question about this: Is it possible that two requests
> > > > > to the /get API coming in at the same time would get confused and
> > > > > either both get the same result or get the results inverted?
> > > > >
> > > > > Well, that shouldn't be happening, these are all supposed to be
> > > > thread-safe
> > > > > calls.... All things are possible of course ;)
> > > > >
> > > > > If two replicas of the same shard have different documents, that could
> > > > account
> > > > > for what you're seeing, meanwhile begging the question of why that is
> > > > the case
> > > > > since it should never be true for a quiescent index. Technically there
> > > > _are_
> > > > > conditions where this is true on a very temporary basis, commits on the
> > > > leader
> > > > > and follower can trigger at different wall-clock times. Say your soft
> > > > commit
> > > > > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> > > > be the
> > > > > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> > > > was
> > > > > sent. This doesn't seem likely from what you've described though...
> > > > >
> > > > > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> > > > dummy
> > > > > collections laying around. Currently I have only the active one, and
> > > > > if there's some
> > > > > code path whereby the RTG request goes to a replica of a different
> > > > > collection, my
> > > > > test setup wouldn't reproduce it.
> > > > >
> > > > > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > > > > way that the replicas
> > > > > get out of sync that wouldn't show either.
> > > > >
> > > > > So I'm starting another run with these changes:
> > > > > > opening a new connection each query
> > > > > > switched so the collection I'm querying is 2x2
> > > > > > added some dummy collections that are empty
> > > > >
> > > > > One nit: while "core" is technically correct, when we talk about a core
> > > > > that's part of a collection we try to use "replica", to be clear we're
> > > > > talking about a core with some added characteristics, i.e. we're in
> > > > > SolrCloud-land. No big deal of course....
> > > > >
> > > > > Best,
> > > > > Erick
> > > > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <[hidden email]>
> > > > wrote:
> > > > > >
> > > > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > > > @Shawn
> > > > > > > We're running two instances on one machine for two reasons:
> > > > > > > 1. The box has plenty of resources (48 cores / 256GB ram) and since
> > > > I was
> > > > > > > reading that it's not recommended to use more than 31GB of heap in
> > > > SOLR we
> > > > > > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > > > > > instance was a good idea.
> > > > > >
> > > > > > Do you know that these Solr instances actually DO need 31 GB of heap,
> > > > or
> > > > > > are you following advice from somewhere, saying "use one quarter of
> > > > your
> > > > > > memory as the heap size"?  That advice is not in the Solr
> > > > documentation,
> > > > > > and never will be.  Figuring out the right heap size requires
> > > > > > experimentation.
> > > > > >
> > > > > >
> > > > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > > > >
> > > > > > How big (on disk) are each of these nine cores, and how many documents
> > > > > > are in each one?  Which of them is in each Solr instance?  With that
> > > > > > information, we can make a *guess* about how big your heap should be.
> > > > > > Figuring out whether the guess is correct generally requires careful
> > > > > > analysis of a GC log.
> > > > > >
> > > > > > > 2. We're in testing phase so we wanted a SOLR cloud configuration,
> > > > we will
> > > > > > > most likely have a much bigger deployment once going to production.
> > > > In prod
> > > > > > > right now, we currently run a six-machine Riak cluster. Riak is a
> > > > > > > key/value document store and has SOLR built-in for search, but we are
> > > > trying
> > > > > > > to push the key/value aspect of Riak inside SOLR. That way we would
> > > > have
> > > > > > > one less piece to worry about in our system.
> > > > > >
> > > > > > Solr is not a database.  It is not intended to be a data repository.
> > > > > > All of its optimizations (most of which are actually in Lucene) are
> > > > > > geared towards search.  While technically it can be a key-value store,
> > > > > > that is not what it was MADE for.  Software actually designed for that
> > > > > > role is going to be much better than Solr as a key-value store.
> > > > > >
> > > > > > > When I say null document, I mean the /get API returns: {doc: null}
> > > > > > >
> > > > > > > The problem is definitely not always there. We also have large
> > > > > > > periods of time (a few hours) where we have no problems. I'm just
> > > > > > > extremely hesitant to retry when I get a null document because in
> > > > > > > some cases, getting a null document is a valid outcome. Our caching
> > > > > > > layer heavily relies on this, for example. If I were to retry every
> > > > > > > null I'd pay a big penalty in performance.
> > > > > >
> > > > > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > > > > looks like returning doc:null actually is how the RTG handler says it
> > > > > > didn't find the document.  This seems very wrong to me, but I didn't
> > > > > > design it, and that response needs SOME kind of format.
> > > > > >
> > > > > > Have you done any testing to see whether the standard searching handler
> > > > > > (typically /select, but many other URL paths are possible) returns
> > > > > > results when RTG doesn't?  Do you know for these failures whether the
> > > > > > document has been committed or not?
> > > > > >
> > > > > > > As for your last comment, part of our testing phase is also testing
> > > > the
> > > > > > > limits. Our framework has auto-scaling built-in so if we have a
> > > > burst of
> > > > > > > request, the system will automatically spin up more clients. We're
> > > > pushing
> > > > > > > 10% of our production system to that Test server to see how it will
> > > > handle
> > > > > > > it.
> > > > > >
> > > > > > To spin up another replica, Solr must copy all its index data from the
> > > > > > leader replica.  Not only can this take a long time if the index is
> > > > big,
> > > > > > but it will put a lot of extra I/O load on the machine(s) with the
> > > > > > leader roles.  So performance will actually be WORSE before it gets
> > > > > > better when you spin up another replica, and if the index is big, that
> > > > > > condition will persist for quite a while.  Copying the index data will
> > > > > > be constrained by the speed of your network and by the speed of your
> > > > > > disks.  Often the disks are slower than the network, but that is not
> > > > > > always the case.
> > > > > >
> > > > > > Thanks,
> > > > > > Shawn
> > > > > >
> > > >

Re: Realtime get not always returning existing data

sgaron cse
I haven't found a way to reproduce the problem other than running our
entire set of code. I've also been trying different things to make sure the
problem is not from my end, and so far I haven't managed to fix it by
changing my code. It has to be a race condition somewhere but I just can't
put my finger on it.

I'll message back if I find a way to reproduce.
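One narrow mitigation, consistent with the earlier objection that retrying every null is too expensive: retry only those /get calls where the caller already knows the document exists, e.g. right after a 409 conflict. A minimal sketch with the HTTP call injected so it runs without Solr; the names here are illustrative, not a confirmed fix for the underlying race:

```java
// Bounded retry for an RTG whose document is known to exist. Plain lookups,
// where {"doc":null} is a legitimate answer, should NOT go through this path,
// so the common case pays no retry penalty.
import java.util.function.Supplier;

public class RtgRetry {
    public static String getExpectingDoc(Supplier<String> rtgCall, int maxAttempts) {
        String body = "{\"doc\":null}";
        for (int i = 0; i < maxAttempts; i++) {
            body = rtgCall.get();
            if (!body.contains("\"doc\":null")) {
                return body; // found the document
            }
        }
        return body; // still null after bounded retries: surface it to the caller
    }
}
```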

On Wed, Oct 10, 2018 at 10:48 AM Erick Erickson <[hidden email]>
wrote:

> Well assigning a bogus version that generates a 409 error then
> immediately doing an RTG on the doc doesn't fail for me either 18
> million tries later. So I'm afraid I haven't a clue where to go from
> here. Unless we can somehow find a way to generate this failure I'm
> going to drop it for the foreseeable future.
>
> Erick