Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
Hello

Apologies for the lack of actual detail in this; we're still digging into
it ourselves. I will provide more detail, and maybe some logs, once I have
a better idea of what is actually happening.
But I thought I might as well ask whether anyone knows of changes made in
the Solr 8.3 release that are likely to have caused an issue like this.

We were on Solr 8.1.1 for several months and moved to 8.2.0 for about 2
weeks before moving to 8.3.0 last week.
We didn't see this issue at all on the previous releases. Since moving to
8.3 we have had a consistent (but non-deterministic) set of failing tests,
on Windows and Linux.

The issue we are seeing is that, during updates, the data we have sent is
*sometimes* corrupted, as though a buffer has been used incorrectly. For
example, if the well-formed data we sent was

    'fieldName':"this is a long string"

the error we see from Solr might be

    unknown field 'fieldNamis a long string"

And variations on that kind of behaviour, where part of the data is missing
or corrupted. The data we are indexing does include fields which store
(escaped) serialized JSON strings - if that might have any bearing - but
the error isn't always on those fields.
For example, given a valid document that looks like this when returned with
the json response writer (I've replaced the values by hand, so if the JSON
is messed up here, that's not relevant):

    {
      "id": "abcd",
      "testField": "blah",
      "jsonField": "{\"thing\":{\"abcd\":\"value\",\"xyz\":[\"abc\",\"def\",\"ghi\"],\"nnn\":\"xyz\"},\"stuff\":[{\"qqq\":\"rrr\"}],\"ttt\":0,\"mmm\":\"Some string\",\"someBool\":true}"
    }
We've had errors during indexing like:

    unknown field 'testField:"value","xyz":["abc","def","ghi"],"nnn":"xyz"},"stuff":[{"qqq":"rrr"}],"ttt":0,"mmm":"Some string","someBool":true}���������������������������'

(those � unprintable characters are part of it)

So far we've not been able to reproduce the problem on a collection with a
single shard, so it does seem like the problem is only happening internally
when updates are distributed to the other shards... But that's not been
totally verified.

We've also only encountered the problem on one of the collections we build.
(The data within each collection is generally the same, though. The ids are
slightly different - but still strings. The main difference is that this
problematic index is built by passing an Iterator<SolrInputDocument> to the
solrj method org.apache.solr.client.solrj.SolrClient.add(String,
Iterator<SolrInputDocument>) - the SolrInputDocuments are not being reused
in the client, I checked that - while the other index is built by streaming
CSVs to Solr.)
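
For what it's worth, the indexing path in question looks roughly like the
sketch below. This is not our actual code - the collection name, ZooKeeper
address and field values are made up for illustration - but the
add(String, Iterator<SolrInputDocument>) overload is the one we use:

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IteratorIndexingSketch {
      public static void main(String[] args) throws Exception {
        // Build the documents up front; each one is a fresh instance,
        // never reused between add() calls.
        SolrInputDocument doc1 = new SolrInputDocument();
        doc1.addField("id", "abcd");
        doc1.addField("jsonField", "{\"thing\":{\"abcd\":\"value\"}}");

        SolrInputDocument doc2 = new SolrInputDocument();
        doc2.addField("id", "efgh");
        doc2.addField("jsonField", "{\"stuff\":[{\"qqq\":\"rrr\"}]}");

        Iterator<SolrInputDocument> docs = Arrays.asList(doc1, doc2).iterator();

        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Arrays.asList("localhost:9983"), Optional.empty()).build()) {
          // Hand Solr an iterator rather than a materialized collection
          client.add("my_collection", docs);
          client.commit("my_collection");
        }
      }
    }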


We will look into it further, but if anyone has any ideas of what might
have changed in 8.3 from 8.1 / 8.2 that could cause this, that would be
helpful.

Cheers
Colvin

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
I've identified the change that has caused the problem to materialize,
though the change itself shouldn't cause a problem.

https://github.com/apache/lucene-solr/commit/e45e8127d5c17af4e4b87a0a4eaf0afaf4f9ff4b#diff-7f7f485122d8257bd5d3210c092b967fR52
for https://issues.apache.org/jira/browse/SOLR-13682

In writeMap, the new BiConsumer unwraps the SolrInputField using getValue
rather than getRawValue (which is what the JavaBinCodec itself calls):


    if (o instanceof SolrInputField) {
      o = ((SolrInputField) o).getValue();
    }
As a result the JavaBinCodec will now hit different writer methods based on
the value retrieved from the SolrInputField, rather than just delegating to
org.apache.solr.common.util.JavaBinCodec.writeKnownType(Object) with the
raw value:


    if (val instanceof SolrInputField) {
      return writeKnownType(((SolrInputField) val).getRawValue());
    }
https://github.com/apache/lucene-solr/blob/branch_8_3/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L362

SolrInputField.getValue() uses
org.apache.solr.common.util.ByteArrayUtf8CharSequence.convertCharSeq(Object),
while getRawValue() just returns whatever value the SolrInputField holds, so
the EntryWriter in the JavaBinCodec now hits different paths - one of which
must be non-deterministically producing garbage data when getValue() is
used.
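
To make the difference concrete, here's a minimal, illustrative sketch -
not from the Solr sources; the field name and value are invented:

    import org.apache.solr.common.SolrInputField;
    import org.apache.solr.common.util.ByteArrayUtf8CharSequence;

    public class GetValueVsGetRawValue {
      public static void main(String[] args) {
        // During a distributed update, string values can arrive wrapped
        // as a ByteArrayUtf8CharSequence over the raw UTF-8 bytes.
        SolrInputField field = new SolrInputField("fieldName");
        field.setValue(new ByteArrayUtf8CharSequence("this is a long string"));

        // getRawValue(): the ByteArrayUtf8CharSequence itself, untouched.
        Object raw = field.getRawValue();

        // getValue(): goes through ByteArrayUtf8CharSequence.convertCharSeq(...),
        // which (per the stack traces later in this thread) materializes a
        // String via the codec's shared CharArr.
        Object val = field.getValue();

        System.out.println(raw.getClass() + " vs " + val.getClass());
      }
    }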

Changing getValue() to getRawValue() in the SolrInputDocument's
writeMap() appears to "fix" the problem. (With getValue(), the test I have
reliably fails within 50 iterations of indexing 2500 documents; with
getRawValue(), it succeeds for the 500 iterations I'm running it for.)

I'll see about providing a shareable test that demonstrates the problem,
and see if we can find what is going wrong in the codec...


Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Noble Paul നോബിള്‍  नोब्ळ्
Thanks Colvin, I'll take a look

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Noble Paul നോബിള്‍  नोब्ळ्
Can you share the test, please?

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
Hi, I'll share it when I'm back at work tomorrow.

I've found that the issue appears to not be reproducible using the
solrconfig.xml in the _default example, so it must be something in ours (or
something missing from it) that is making the problem appear, in
combination with the code change.

Thanks


Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Noble Paul നോബിള്‍  नोब्ळ्
Sure, looking forward to that
thanks

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Jason Gerlowski
Very curious what the config change related to reproducing this looks
like. Maybe it's something worth adding test-randomization around? Just
thinking aloud.

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
I've been a bit snowed under, but I've found the difference is because the
_default config has the dynamic schema building in it, which I assume is
pushing it down a different code path.

  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
      default="${update.autoCreateFields:true}"
      processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">

I'm using the vanilla Solr 8.3.0 binary (build 8.3.0
2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22)
with the Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0, and I've checked with
the Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_191
25.191-b12 as well.

I've put a testcase and configsets in Google Drive:
https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U
The configsets are a copy of the _default configset, except the "problem"
configset has autoCreateFields set to false.
I created a collection with 4 shards, replication factor 1 for each
configset. The test case reliably fails on the "problem" collection and
reliably passes against the "no_problem" collection.
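
For reference, the collections were created like the sketch below (a SolrJ
Collections API sketch, equivalent to what I did; the ZooKeeper address is
from my local setup and the configset names match the ones in the Drive
folder):

    import java.util.Arrays;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateCollectionsSketch {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Arrays.asList("localhost:9983"), Optional.empty()).build()) {
          // One collection per configset: 4 shards, replication factor 1
          CollectionAdminRequest.createCollection("problem_collection", "problem", 4, 1)
              .process(client);
          CollectionAdminRequest.createCollection("no_problem_collection", "no_problem", 4, 1)
              .process(client);
        }
      }
    }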

The test (well, it's not actually a @Test, but still) has static data
(though it was originally generated randomly). The data is a bit mad... but
it was easier to reproduce the problem reliably with this data than with
the normal documents we use in our product.
Each document has a different (dynamically named) field to index data into,
but it's the same data in each field.
The problem only appears (or probably is just more likely to appear?) when
the field names in the request are of different lengths. The length / value
of the data doesn't appear to matter, or is less impactful than variations
in the field names. A rough sketch of the shape of the data follows below.
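
It boils down to something like this - simplified by hand from the shared
test, so the field-name lengths, counts and values here are only
indicative:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.solr.common.SolrInputDocument;

    public class ProblemDataSketch {
      public static List<SolrInputDocument> buildDocs() {
        Random rnd = new Random(42);
        List<SolrInputDocument> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          // A dynamically named *_prop_s field whose name length varies
          // between documents - the varying lengths seem to be the trigger.
          StringBuilder name = new StringBuilder("name_");
          int len = 5 + rnd.nextInt(20);
          for (int c = 0; c < len; c++) {
            name.append((char) ('a' + rnd.nextInt(26)));
          }
          // Same value in every field; the value doesn't seem to matter much.
          doc.addField(name.append("_prop_s").toString(), "Somebody");
          docs.add(doc);
        }
        return docs;
      }
    }

An iterator over those documents is then passed to
SolrClient.add(collection, iterator) against the 4-shard collection, once
per iteration of the test.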
If you run the test 10 times you will see a variety of different errors,
i.e. it's not the same error every time.
I've included some examples of the errors in the Drive folder. One of the
most fundamental (and probably points at the root cause) is this:

2019-11-21 17:02:53.720 ERROR (updateExecutor-3-thread-6-processing-x:problem_collection_shard2_replica_n2 r:core_node5 null n:10.0.75.1:8983_solr c:problem_collection s:shard2) [c:problem_collection s:shard2 r:core_node5 x:problem_collection_shard2_replica_n2] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=ForwardNode: http://10.0.75.1:8983/solr/problem_collection_shard3_replica_n4/ to http://10.0.75.1:8983/solr/problem_collection_shard3_replica_n4/ => java.lang.StringIndexOutOfBoundsException
    at java.lang.String.<init>(String.java:668)
java.lang.StringIndexOutOfBoundsException: null
    at java.lang.String.<init>(String.java:668) ~[?:1.8.0_232]
    at org.noggit.CharArr.toString(CharArr.java:182) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.lambda$getStringProvider$1(JavaBinCodec.java:966) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec$$Lambda$668.0000000000000000.apply(Unknown Source) ~[?:?]
    at org.apache.solr.common.util.ByteArrayUtf8CharSequence._getStr(ByteArrayUtf8CharSequence.java:156) ~[?:?]
    at org.apache.solr.common.util.ByteArrayUtf8CharSequence.toString(ByteArrayUtf8CharSequence.java:235) ~[?:?]
    at org.apache.solr.common.util.ByteArrayUtf8CharSequence.convertCharSeq(ByteArrayUtf8CharSequence.java:215) ~[?:?]
    at org.apache.solr.common.SolrInputField.getValue(SolrInputField.java:128) ~[?:?]
    at org.apache.solr.common.SolrInputDocument.lambda$writeMap$0(SolrInputDocument.java:55) ~[?:?]
    at org.apache.solr.common.SolrInputDocument$$Lambda$743.000000002774E7B0.accept(Unknown Source) ~[?:?]
    at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684) ~[?:1.8.0_232]
    at org.apache.solr.common.SolrInputDocument.writeMap(SolrInputDocument.java:59) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:658) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:383) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeMapEntry(JavaBinCodec.java:813) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:411) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:750) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:395) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:248) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
    at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:167) ~[?:?]
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:102) ~[?:?]
    at org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83) ~[?:?]
    at org.apache.solr.client.solrj.impl.Http2SolrClient.send(Http2SolrClient.java:340) ~[?:?]
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:231) ~[?:?]

And:

java.lang.StringIndexOutOfBoundsException: String index out of range: 39
    at java.lang.String.<init>(String.java:205)
    at org.noggit.CharArr.toString(CharArr.java:182)
    at org.apache.solr.common.util.JavaBinCodec._readStr(JavaBinCodec.java:929)
    at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:918)
    at org.apache.solr.common.util.JavaBinCodec.readExternString(JavaBinCodec.java:1194)
    at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:281)
    at org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:625)
    at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:340)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:281)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:321)
    ...

Sometimes the indexing will succeed because of the nature of the dynamic
field, but retrieving the documents shows that the field names have been
corrupted:

    Exception in thread "main" java.lang.IllegalStateException: Doc 224 does not have field name_wmJmiiWghggUHmNiQAg_prop_s it has [id, SomebodymiiWghggUHmNiQAg_prop_s, _version_]

The corrupted name is a concatenation of the data value "Somebody" from
some record and part of the actual field name (name_wmJ +
miiWghggUHmNiQAg_prop_s).


Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
> the difference is because the _default config has the dynamic schema
> building in it, which I assume is pushing it down a different code path

Also, to add to that: I assumed initially that this just meant it was
working because the corrupted field names would simply cause it to create a
field with the dodgy name (since that's the idea of the dynamic schema),
but checking the documents on retrieval showed they all had the right field
names...
So I assume it's a result of going down a different branch of code instead.


On an unrelated matter, I saw this in the logs when running with embedded
zookeeper... I don't think I've seen it mentioned anywhere else, so I will
raise an issue for it:

2019-11-21 17:25:14.292 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983
2019-11-21 17:25:14.792 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=localhost:9983
2019-11-21 17:25:18.833 WARN  (Thread-13) [   ] o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting JettyAdminServer => java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
    at java.lang.Class.forName0(Native Method)
java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
    at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191]
    at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191]
    at org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43) ~[?:?]
    at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136) ~[?:?]
    at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121) ~[?:?]
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.server.Connector
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577) ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191]
    ... 5 more
2019-11-21 17:25:19.365 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2019-11-21 17:25:19.396 INFO  (zkConnectionManagerCallback-7-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2019-11-21 17:25:19.396 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper


Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
I've found the problem. JavaBinCodec has a CharArr, arr, which is modified
in two different places, but only one of them is protected with a
synchronized block.

getStringProvider(), which is used when you call getValue() rather than
getRawValue() on the string-based SolrInputFields, synchronizes:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966
but _readStr() doesn't:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930

Adding a synchronized block into _readStr() fixes the problem. At least as
far as my test goes.
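
To illustrate the race, here's a stripped-down analogue - not the actual
JavaBinCodec source; the class and method names are invented - showing the
shared-buffer pattern with one guarded and one unguarded writer:

    import org.noggit.CharArr;

    public class SharedCharArrRace {
      // One shared, reusable buffer - analogous to JavaBinCodec's arr field.
      private final CharArr arr = new CharArr();

      // Analogue of the getStringProvider() path: holds the lock while it
      // resets, fills and reads the shared buffer.
      public String readGuarded(char[] chars) {
        synchronized (this) {
          arr.reset();
          arr.write(chars, 0, chars.length);
          return arr.toString();
        }
      }

      // Analogue of _readStr() before the fix: no lock, so a concurrent
      // readGuarded() call can reset or overwrite arr mid-read, producing
      // truncated/concatenated strings or a StringIndexOutOfBoundsException.
      public String readUnguarded(char[] chars) {
        arr.reset();
        arr.write(chars, 0, chars.length);
        return arr.toString(); // wrapping this body in synchronized (this) { ... } removes the race
      }
    }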

I'll raise a JIRA issue and can provide a patch with the synchronized
block, but I'm not sure what test(s) should be updated / added to cover
this.



On Thu, 21 Nov 2019 at 18:23, Colvin Cowie <[hidden email]>
wrote:

> *> the difference is because the _default config has the dynamic schema
> building in it, which I assume is pushing it down a different code path. *
>
> Also to add to that, I assumed initially that this just meant that it was
> working because the corrupted field names would just cause it to create a
> field with the dodgy name (since that's the idea for the dynamic schema),
> but checking the documents on retrieval showed they all had the right field
> names...
> So I assume it's a result of going into a different branch of code instead.
>
>
> On an unrelated matter, I saw this in the logs when running with embedded
> zookeeper... I don't think I've seen it mentioned anywhere else, so I will
> raise an issue for it
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *2019-11-21 17:25:14.292 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING
> EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792
> INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper
> client=localhost:99832019-11-21 17:25:18.833 WARN  (Thread-13) [   ]
> o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting
> JettyAdminServer => java.lang.NoClassDefFoundError:
> org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native
> Method)java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
> at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at
> java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at
> org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43)
> ~[?:?] at
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136)
> ~[?:?] at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121)
> ~[?:?]Caused by: java.lang.ClassNotFoundException:
> org.eclipse.jetty.server.Connector at
> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577)
> ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at
> java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191] ... 5
> more2019-11-21 17:25:19.365 INFO  (main) [   ] o.a.s.c.c.ConnectionManager
> Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO
>  (zkConnectionManagerCallback-7-thread-1) [   ] o.a.s.c.c.ConnectionManager
> zkClient has connected2019-11-21 17:25:19.396 INFO  (main) [   ]
> o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper*
>
> On Thu, 21 Nov 2019 at 17:30, Colvin Cowie <[hidden email]>
> wrote:
>
>> I've been a bit snowed under, but I've found the difference is because
>> the _default config has the dynamic schema building in it, which I assume
>> is pushing it down a different code path.
>>
>>   <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
>> default="${update.autoCreateFields:true}"
>>
>>  processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
>>
>> I'm using the vanilla Solr 8.3.0 binary8.3.0
>> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22 with
>> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0
>> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server
>> VM 1.8.0_191 25.191-b12 as well
>>
>> I've put a testcase and configsets in Google Drive:
>> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U
>> The configsets are a copy of the _default configset, except the "problem"
>> configset has autoCreateFields set to false.
>> I created a collection with 4 shards, replication factor 1 for each
>> configset. The test case reliably fails on the "problem" collection and
>> reliably passes against the "no_problem" collection.
>>
>> The test (well it's not actually a @Test but still) has static data
>> (though it was originally generated randomly). The data is a bit mad... but
>> it was easier to reproduce the problem reliably with this data, than with
>> the normal documents we use in our product.
>> Each document has a different (dynamically named) field to index data
>> into, but it's the same data in each field.
>> The problem only appears (or probably is just more likely to appear?)
>> when the field names in the request are of different lengths.
>> The length / value of the data doesn't appear to matter. Or is less
>> impactful than variations in the field names.
>> *If you run the test 10 times you will see a variety of different errors.
>> i.e. it's not the same error every time.*
>> I've included some examples of the errors in the Drive folder. One of the
>> most fundamental (and probably points at the root cause) is this:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *2019-11-21 17:02:53.720 ERROR (updateExecutor-3-thread-6-processing-x:problem_collection_shard2_replica_n2 r:core_node5 null n:10.0.75.1:8983_solr c:problem_collection s:shard2) [c:problem_collection s:shard2 r:core_node5 x:problem_collection_shard2_replica_n2] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=ForwardNode: http://10.0.75.1:8983/solr/problem_collection_shard3_replica_n4/ to http://10.0.75.1:8983/solr/problem_collection_shard3_replica_n4/ => java.lang.StringIndexOutOfBoundsException at java.lang.String.<init>(String.java:668)
>> java.lang.StringIndexOutOfBoundsException: null
>>   at java.lang.String.<init>(String.java:668) ~[?:1.8.0_232]
>>   at org.noggit.CharArr.toString(CharArr.java:182) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.lambda$getStringProvider$1(JavaBinCodec.java:966) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec$$Lambda$668.0000000000000000.apply(Unknown Source) ~[?:?]
>>   at org.apache.solr.common.util.ByteArrayUtf8CharSequence._getStr(ByteArrayUtf8CharSequence.java:156) ~[?:?]
>>   at org.apache.solr.common.util.ByteArrayUtf8CharSequence.toString(ByteArrayUtf8CharSequence.java:235) ~[?:?]
>>   at org.apache.solr.common.util.ByteArrayUtf8CharSequence.convertCharSeq(ByteArrayUtf8CharSequence.java:215) ~[?:?]
>>   at org.apache.solr.common.SolrInputField.getValue(SolrInputField.java:128) ~[?:?]
>>   at org.apache.solr.common.SolrInputDocument.lambda$writeMap$0(SolrInputDocument.java:55) ~[?:?]
>>   at org.apache.solr.common.SolrInputDocument$$Lambda$743.000000002774E7B0.accept(Unknown Source) ~[?:?]
>>   at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684) ~[?:1.8.0_232]
>>   at org.apache.solr.common.SolrInputDocument.writeMap(SolrInputDocument.java:59) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:658) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:383) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeMapEntry(JavaBinCodec.java:813) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:411) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:750) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:395) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:248) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253) ~[?:?]
>>   at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:167) ~[?:?]
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:102) ~[?:?]
>>   at org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83) ~[?:?]
>>   at org.apache.solr.client.solrj.impl.Http2SolrClient.send(Http2SolrClient.java:340) ~[?:?]
>>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:231) ~[?:?]*
>>
>> And
>>
>> *java.lang.StringIndexOutOfBoundsException: String index out of range: 39
>>   at java.lang.String.<init>(String.java:205)
>>   at org.noggit.CharArr.toString(CharArr.java:182)
>>   at org.apache.solr.common.util.JavaBinCodec._readStr(JavaBinCodec.java:929)
>>   at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:918)
>>   at org.apache.solr.common.util.JavaBinCodec.readExternString(JavaBinCodec.java:1194)
>>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303)
>>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:281)
>>   at org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:625)
>>   at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:340)
>>   at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:281)
>>   at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:321)*
>>
>> *...*
>>
>> Sometimes the indexing will succeed because of the nature of the dynamic
>> field, but retrieving the documents shows that the field names have been
>> corrupted:
>> Exception in thread "main" java.lang.IllegalStateException: Doc 224 does
>> not have field *name_wmJmiiWghggUHmNiQAg_prop_s* it has [id,
>> *SomebodymiiWghggUHmNiQAg_prop_s*, _version_]
>> which is a concatenation of the data value "Somebody" from some record
>> and part of the actual field name *name_wmJ**miiWghggUHmNiQAg_prop_s*
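
A minimal sketch of how that kind of concatenation can arise from a shared
scratch buffer (illustrative only: the class, the fill() helper, and the
buffer size below are invented for this example, not taken from Solr):

    // Hypothetical illustration, not Solr code: one shared buffer,
    // two logical reads, and a stale length.
    public class BufferReuseSketch {
        static char[] buf = new char[32];

        // Stands in for decoding a string into the shared buffer.
        static int fill(String s) {
            s.getChars(0, s.length(), buf, 0);
            return s.length();
        }

        public static void main(String[] args) {
            int len1 = fill("name_wmJmiiWghggUHmNiQAg_prop_s"); // 31 chars
            // A second fill overwrites the front of the buffer before
            // the first string is materialized ("Somebody" is 8 chars,
            // exactly the length of "name_wmJ"):
            fill("Somebody");
            // Materializing with the stale length yields the corrupted name:
            System.out.println(new String(buf, 0, len1));
            // -> SomebodymiiWghggUHmNiQAg_prop_s
        }
    }

If the racing value happened to be the same length as the prefix it
overwrote, the result would still look well-formed, which may be why the
problem only shows up when the field names in a request have different
lengths.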
>>
>> On Thu, 21 Nov 2019 at 13:16, Jason Gerlowski <[hidden email]>
>> wrote:
>>
>>> Very curious what the config change that's related to reproducing this
>>> looks like.  Maybe it's something that is worth adding
>>> test-randomization around?  Just thinking aloud.
>>>
>>

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Colvin Cowie
https://issues.apache.org/jira/browse/SOLR-13963

I'll see about modifying the test I have to fit in with the existing tests,
and if there's a better option I'm open to whatever works.
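
In the meantime, here is a minimal SolrJ driver in the same spirit as the
test in the Drive folder (the ZooKeeper address, collection name, field
naming and document count are placeholders, not the actual test's values):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch: many documents, each with a dynamically named string
    // field of varying length, sent as one batch so the update gets
    // distributed across shards.
    public class ReproSketch {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Arrays.asList("localhost:9983"), Optional.empty()).build()) {
                List<SolrInputDocument> docs = new ArrayList<>();
                for (int i = 0; i < 1000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + i);
                    // Field names of different lengths seem to matter:
                    StringBuilder name = new StringBuilder("name_");
                    for (int j = 0; j < i % 30; j++) {
                        name.append('x');
                    }
                    doc.addField(name.append("_prop_s").toString(), "Somebody");
                    docs.add(doc);
                }
                client.add("problem_collection", docs.iterator());
                client.commit("problem_collection");
            }
        }
    }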

On Sat, 23 Nov 2019 at 16:43, Colvin Cowie <[hidden email]>
wrote:

> I've found the problem: JavaBinCodec has a CharArr, *arr*, which is
> modified in two different locations, only one of which is protected
> by a synchronized block.
>
> getStringProvider(), which is used when you call getValue() rather than
> getRawValue() on the string-based SolrInputFields, synchronizes:
>
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966
> but _readStr() doesn't:
>
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930
>
> Adding a synchronized block into _readStr() fixes the problem, at least as
> far as my test goes.
>
> I'll raise a JIRA issue and can provide a patch with the synchronized
> block, but I'm not sure which test(s) should be updated / added to cover this.
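
As a rough sketch of the pattern (the class and methods below are
illustrative stand-ins for the real code, not the actual JavaBinCodec
patch), both paths that touch the shared buffer need to hold the same lock:

    import java.util.function.Supplier;

    // Illustrative stand-in for the shared CharArr scratch buffer.
    public class SharedScratch {
        private final StringBuilder arr = new StringBuilder();

        // Analogous to _readStr(): decodes into the shared buffer.
        // In 8.3 this path was missing the synchronized block.
        public String readStr(char[] src, int len) {
            synchronized (arr) {
                arr.setLength(0);
                arr.append(src, 0, len);
                return arr.toString();
            }
        }

        // Analogous to getStringProvider(): defers materialization,
        // and was already synchronized.
        public Supplier<String> stringProvider(char[] src, int len) {
            return () -> {
                synchronized (arr) {
                    arr.setLength(0);
                    arr.append(src, 0, len);
                    return arr.toString();
                }
            };
        }
    }

Locking only one of the two paths leaves the buffer free to be reset or
rewritten mid-copy by the other, which would account for the variety of
errors above.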
>
> On Thu, 21 Nov 2019 at 18:23, Colvin Cowie <[hidden email]>
> wrote:
>
>> *> the difference is because the _default config has the dynamic schema
>> building in it, which I assume is pushing it down a different code path. *
>>
>> Also to add to that: I assumed initially that this just meant it was
>> working because the corrupted field names would simply cause it to create a
>> field with the dodgy name (since that's the idea of the dynamic schema),
>> but checking the documents on retrieval showed they all had the right field
>> names...
>> So I assume it's a result of going down a different branch of code
>> instead.
>>
>>
>> On an unrelated matter, I saw this in the logs when running with embedded
>> zookeeper... I don't think I've seen it mentioned anywhere else, so I will
>> raise an issue for it:
>>
>> *2019-11-21 17:25:14.292 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983
>> 2019-11-21 17:25:14.792 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=localhost:9983
>> 2019-11-21 17:25:18.833 WARN  (Thread-13) [   ] o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting JettyAdminServer => java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
>>   at java.lang.Class.forName0(Native Method)
>> java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
>>   at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191]
>>   at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191]
>>   at org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43) ~[?:?]
>>   at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136) ~[?:?]
>>   at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121) ~[?:?]
>> Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.server.Connector
>>   at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577) ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610]
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191]
>>   ... 5 more
>> 2019-11-21 17:25:19.365 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
>> 2019-11-21 17:25:19.396 INFO  (zkConnectionManagerCallback-7-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
>> 2019-11-21 17:25:19.396 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper*

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

Noble Paul നോബിള്‍  नोब्ळ्
Thanks Colvin.
Can you share the details in the ticket?

I plan to debug this today.

It's unlikely to be a synchronization issue, because
serialization/deserialization usually happens in a single thread.
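
For what it's worth, one way the single-thread assumption can break down (a
hand-rolled illustration, not the actual Solr code paths): if the decoded
value is materialized lazily via a provider that captures the shared buffer,
the String may only be built later, for example while the document is being
re-serialized for forwarding to another shard, by which time another read
may have reused the buffer:

    import java.util.function.Supplier;

    // Illustration only: a lazily-materialized value capturing a
    // shared, reusable buffer instead of copying the characters out.
    public class LazyValueEscape {
        private static final StringBuilder shared = new StringBuilder();

        static Supplier<String> decode(String wire) {
            shared.setLength(0);
            shared.append(wire);
            int len = wire.length();
            // The Supplier holds the buffer, not a private copy:
            return () -> shared.substring(0, len);
        }

        public static void main(String[] args) throws Exception {
            Supplier<String> first = decode("name_wmJmiiWghggUHmNiQAg_prop_s");
            // A second decode reuses the buffer before the first value
            // is materialized (sequenced here for determinism):
            Thread writer = new Thread(() -> decode("Somebody"));
            writer.start();
            writer.join();
            // Throws StringIndexOutOfBoundsException: the captured
            // length (31) now exceeds the buffer's length (8).
            System.out.println(first.get());
        }
    }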
