Inconsistent results for facet queries

Inconsistent results for facet queries

Chris Ulicny
Hi,

We've run into a strange issue with our deployment of SolrCloud 6.3.0.
Essentially, a standard facet query on a string field usually comes back
empty when it shouldn't. However, every now and again the query actually
returns the correct values. This is only affecting a single shard in our
setup.

The general pattern is that the query works properly when it hasn't been
run recently, then returns nothing once the result appears to have been
cached (< 50 ms QTime). Wait a while and you get the correct result again,
followed by blanks. It doesn't matter which replica of the shard is
queried; the results are the same.

The general query in question looks like
/select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>

The field is defined in the schema as <field name="market" type="string"
docValues="true"/>

There are numerous other fields defined similarly, and they do not exhibit
the same behavior when used as the facet.field value. They consistently
return the right results on the shard in question.

If we add facet.method=enum to the query, we get the correct results every
time (though more slowly). So our assumption is that something only
sporadically works when the fc method is chosen by default.
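
Concretely, the workaround is just the same query with the method forced:

/select?q=*:*&facet=true&facet.field=market&facet.method=enum&rows=0&fq=<some filters>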

A few other notes about the collection. This collection is not freshly
indexed, but it has not had any particularly bad failures beyond follower
replicas going down due to PKIAuthentication timeouts (since fixed). It
has also had a full reindex after a schema change added docValues to some
fields (including the one above), but the collection wasn't emptied first.
We are using the composite router to co-locate documents.

Currently, our plan is just to reindex all of the documents on the affected
shard to see if that fixes the problem. Any ideas on what might be
happening or ways to troubleshoot this are appreciated.

Thanks,
Chris

Re: Inconsistent results for facet queries

Erick Erickson
bq: ...but the collection wasn't emptied first....

This is what I'd suspect is the problem. Here's the issue: Segments
aren't merged identically on all replicas. So at some point you had
this field indexed without docValues, changed that and re-indexed. But
the segment merging could "read" the first segment it's going to merge
and think it knows about docValues for that field, when in fact that
segment had the old (non-DV) definition.

This would not necessarily be the same on all replicas even on the _same_ shard.

This can propagate through all following segment merges IIUC.

So my bet is that if you index into a new collection, everything will
be fine. You can also just delete everything first, but I usually
prefer a new collection so I'm absolutely and positively sure that the
above can't happen.
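
If you want to check whether that's what actually happened, one option is
to look at the per-segment FieldInfos with the Lucene API. A rough sketch,
not anything shipped with Solr; point it at a copy of one core's index
directory (the field name is the one from your schema):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class DocValuesCheck {
  public static void main(String[] args) throws Exception {
    // args[0] = path to a copy of one core's data/index directory
    try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      for (LeafReaderContext ctx : reader.leaves()) {
        // Each leaf is one segment; report what it thinks about the field.
        FieldInfo fi = ctx.reader().getFieldInfos().fieldInfo("market");
        System.out.println("segment=" + ctx.reader()
            + " maxDoc=" + ctx.reader().maxDoc()
            + " docValuesType="
            + (fi == null ? "(field absent)" : fi.getDocValuesType()));
      }
    }
  }
}

If the segments disagree (some report SORTED, some NONE), that's the
mismatch described above.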

Best,
Erick


Re: Inconsistent results for facet queries

Chris Ulicny
I thought that decision would come back to bite us somehow. At the time, we
didn't have enough space available to do a fresh reindex alongside the old
collection, so the only course of action available was to index over the
old one, and the vast majority of its use worked as expected.

We're planning on upgrading to version 7 at some point in the near future
and will have enough space to do a full, clean reindex at that time.

bq: This can propagate through all following segment merges IIUC.

It is exceedingly unfortunate that reindexing only the data on that shard
probably won't end up fixing the problem.

Out of curiosity, are there any good write-ups or documentation on how two
(or more) Lucene segments are merged, or is it just worth looking at the
source code to figure that out?

Thanks,
Chris


Re: Inconsistent results for facet queries

Erick Erickson
If it's _only_ on a particular replica, here's what you could do:
Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
define the "node" parameter on ADDREPLICA to get it back on the same
node. Then the normal replication process would pull the entire index
down from the leader.
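
In Collections API terms that's roughly the following (collection, shard,
replica and node names are placeholders for your own setup):

http://solr_server:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard1&replica=core_node2
http://solr_server:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=solr_server:8983_solr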

My bet, though, is that this wouldn't really fix things. While it fixes the
particular case you've noticed, I'd guess others would pop up. You can
see which replicas return what by firing individual queries at the
particular replica in question with &distrib=false, something like
solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
blah blah
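
With the query from the start of the thread, that per-replica check would
look roughly like this (host and core name are placeholders for your own):

http://solr_server:8983/solr/collection1_shard1_replica1/select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>&distrib=false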


bq: It is exceedingly unfortunate that reindexing the data on that shard only
probably won't end up fixing the problem

Well, we've been working on the DWIM (Do What I Mean) feature for years,
but progress has stalled.

How would that work? You have two segments with vastly different
characteristics for a field. You could change the type, the multiValued-ness,
the analysis chain; there's no end to the things that could go wrong. Fixing
them actually _is_ impossible given how Lucene is structured.

Hmmmm, you've now given me a brainstorm I'll suggest on the JIRA
system after I talk to the dev list....

Consider indexed=true stored=false. After stemming, "running" can be
indexed as "run". At merge time you have no way of knowing that
"running" was the original term so you simply couldn't fix it on merge,
not to mention that the performance penalty would be...er...
severe.

Best,
Erick


Re: Inconsistent results for facet queries

Erick Erickson
Never mind. Anything that didn't merge old segments but just threw them
away when empty (which was my idea) would possibly require as much
disk space as the index currently occupies, so it doesn't help your
disk-constrained situation.

Best,
Erick


Re: Inconsistent results for facet queries

Chris Ulicny
We tested the query on all replicas for the given shard, and they all have
the same issue. So deleting and adding another replica won't fix the
problem since the leader is exhibiting the behavior as well. I believe the
second replica was moved (new one added, old one deleted) between nodes and
so was just a copy of the leader's index after the problematic merge
happened.

bq: Anything that didn't merge old segments, just threw them
away when empty (which was my idea) would possibly require as much
disk space as the index currently occupied, so doesn't help your
disk-constrained situation.

Something like this was originally what I thought might fix the issue. If
we reindex the data for the affected shard, it would possibly delete all
docs from the old segments and just drop them instead of merging them. As
mentioned, you'd expect the problems to persist through subsequent merges.
So I've got two questions:

1) If the problem persists through merges, does it only affect the
segments being merged, so that when Solr goes looking for the values it
comes up empty? As opposed to all segments being affected by a single
merge they weren't a part of.

2) Is it expected that any large tainted segments will eventually merge
with clean segments, resulting in more tainted segments, as enough docs
are deleted from the large segments?

Also, we aren't as disk-constrained as we were previously. Reindexing a
subset of docs is possible, but a full, clean collection reindex isn't.

Thanks,
Chris



Re: Inconsistent results for facet queries

Erick Erickson
(1) It doesn't matter whether it "affects only the segments being merged".
You can't get accurate information if different segments have
different expectations.

(2) I strongly doubt it. The problem is that the "tainted" segments'
meta-data is still read when merging. If the segment consisted of
_only_ deleted documents you'd probably lose it, but it'll be
re-merged long before it consists of exclusively deleted documents.

Really, you have to re-index to be sure; I suspect you can find some
way to do this faster than exploring undefined behavior and hoping.

If you can re-index _anywhere_ to a collection with the same number of
shards, you can get this done. It'll take some tricky dancing, but....

0> copy one index directory from each shard someplace safe.....
1> reindex somewhere, single-replica will do.
2> Delete all replicas except one for your current collection
3> issue an admin API command fetchindex for each replica in old
collection, pulling the index "from the right place" in the new
collection. It's important that there only be a single replica for
each shard active at this point. These two collections do _not_ need to
be part of the same SolrCloud, the fetchindex command just takes a URL
of the core to fetch from.
4> add the replicas back and let them replicate.

Your installation would be unavailable for searching during steps 2-4 of course.
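
For step 3, the fetchindex call against each remaining old replica would
look roughly like this (hosts and core names are placeholders):

http://old_host:8983/solr/collection1_shard1_replica1/replication?command=fetchindex&masterUrl=http://new_host:8983/solr/newcollection_shard1_replica1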

Best,
Erick


Re: Inconsistent results for facet queries

Chris Ulicny
I'm not sure if that method is viable for reindexing and fetching the whole
collection at once for us, but unless there is something inherent in that
process which happens at the collection level, we could do it a few shards
at a time since it is a multi-tenant setup.

I'll see if we can set up a small test in QA for this and test it out. This
facet issue is the only one we've noticed, and it can be worked around,
so we might end up just waiting until we reindex for version 7.x to
permanently fix it.

Thanks
Chris
