Solr 7.7 UpdateRequestProcessor broken

ahubold
Hi,

While trying to upgrade from Solr 7.6 to 7.7, I ran into some unexpected
incompatibilities with UpdateRequestProcessors.

The SolrInputDocument passed to UpdateRequestProcessor#processAdd no
longer returns Strings for string fields but instances of
org.apache.solr.common.util.ByteArrayUtf8CharSequence. I found some
related JIRA issues (SOLR-12983?) but nothing under the "Upgrade Notes"
section.

I can adapt our UpdateRequestProcessor implementations but at least the
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
is broken now as well and needs to be fixed in Solr. It expects String
values and logs messages such as the following now:

2019-02-14 13:14:47.537 WARN  (qtp802600647-19) [   x:studio]
o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Field
name_tokenized not a String value, not including in detection

I wonder what kind of plugins are affected by the change. Does this only
affect UpdateRequestProcessors or more plugins? Do I need to handle
these ByteArrayUtf8CharSequence instances in SolrJ clients now as well?

Cheers,
Andreas


Re: Solr 7.7 UpdateRequestProcessor broken

Jan Høydahl / Cominvent
Hi

This is a subtle change which is not detected by our langid unit tests, as I think it only happens when a document is transferred with SolrJ and the Javabin codec.
It was introduced in https://issues.apache.org/jira/browse/SOLR-12992

Please create a new JIRA issue for langid so we can try to fix it in 7.7.1.

Other SolrInputDocument users assuming String type for strings in SolrInputDocument would also be vulnerable.
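
For custom code in that situation, one defensive pattern is to accept any CharSequence rather than checking for String. The following is only a minimal sketch under stated assumptions: FieldValueText, asText and allText are hypothetical names that do not exist in Solr, and the Collection<Object> argument stands for whatever doc.getFieldValues(fieldName) returned. It relies on ByteArrayUtf8CharSequence being a CharSequence whose toString() yields the text, as the patch in this message also assumes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical helper (not part of Solr): reads text values without assuming java.lang.String. */
final class FieldValueText {
  private FieldValueText() {}

  /** Returns the value as a String cut to at most maxChars, or null if the value is not text. */
  static String asText(Object value, int maxChars) {
    if (!(value instanceof CharSequence)) {
      return null; // numbers, dates etc. are left to the caller
    }
    CharSequence text = (CharSequence) value;
    if (text.length() > maxChars) {
      // CharSequence has no substring(); subSequence() is the equivalent call
      return text.subSequence(0, maxChars).toString();
    }
    return text.toString();
  }

  /** Collects all text values of one field; fieldValues may be null for a missing field. */
  static List<String> allText(Collection<Object> fieldValues, int maxChars) {
    List<String> result = new ArrayList<>();
    if (fieldValues == null) {
      return result;
    }
    for (Object value : fieldValues) {
      String text = asText(value, maxChars);
      if (text != null) {
        result.add(text);
      }
    }
    return result;
  }
}
```

Converting to String once at the boundary keeps the rest of a processor unchanged, at the cost of an extra copy per value.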

I have a patch ready that you could test:

Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
+++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java (date 1550217809000)
@@ -60,12 +60,12 @@
           Collection<Object> fieldValues = doc.getFieldValues(fieldName);
           if (fieldValues != null) {
             for (Object content : fieldValues) {
-              if (content instanceof String) {
-                String stringContent = (String) content;
+              if (content instanceof CharSequence) {
+                CharSequence stringContent = (CharSequence) content;
                 if (stringContent.length() > maxFieldValueChars) {
-                  detector.append(stringContent.substring(0, maxFieldValueChars));
+                  detector.append(stringContent.subSequence(0, maxFieldValueChars).toString());
                 } else {
-                  detector.append(stringContent);
+                  detector.append(stringContent.toString());
                 }
                 detector.append(" ");
               } else {
Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
+++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java (date 1550217691000)
@@ -413,10 +413,10 @@
         Collection<Object> fieldValues = doc.getFieldValues(fieldName);
         if (fieldValues != null) {
           for (Object content : fieldValues) {
-            if (content instanceof String) {
-              String stringContent = (String) content;
+            if (content instanceof CharSequence) {
+              CharSequence stringContent = (CharSequence) content;
               if (stringContent.length() > maxFieldValueChars) {
-                sb.append(stringContent.substring(0, maxFieldValueChars));
+                sb.append(stringContent.subSequence(0, maxFieldValueChars));
               } else {
                 sb.append(stringContent);
               }
@@ -449,8 +449,8 @@
         Collection<Object> contents = doc.getFieldValues(field);
         if (contents != null) {
           for (Object content : contents) {
-            if (content instanceof String) {
-              docSize += Math.min(((String) content).length(), maxFieldValueChars);
+            if (content instanceof CharSequence) {
+              docSize += Math.min(((CharSequence) content).length(), maxFieldValueChars);
             }
           }
 


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 14 Feb 2019, at 16:02, Andreas Hubold <[hidden email]> wrote:

Re: Solr 7.7 UpdateRequestProcessor broken

ahubold
Hi,

Thank you, Jan.

I've created https://issues.apache.org/jira/browse/SOLR-13255. Maybe you
want to add your patch to that ticket. I did not have time to test it yet.

So I guess all SolrJ usages now have to handle CharSequence for string
fields? Well, this really sounds like a major breaking change for custom
code.

Thanks,
Andreas

On 15 Feb 2019 at 09:14, Jan Høydahl wrote:


RE: Solr 7.7 UpdateRequestProcessor broken

Markus Jelsma-2
In reply to this post by ahubold
I stumbled upon this too yesterday and created SOLR-13249. In local unit tests we get String but in distributed unit tests we get a ByteArrayUtf8CharSequence instead.

https://issues.apache.org/jira/browse/SOLR-13249 
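
A related pitfall when the runtime type differs between test setups: String.equals returns false for any argument that is not a String, even if the characters are identical. The sketch below shows a type-agnostic comparison using String.contentEquals. TextEquality and sameText are hypothetical names, and a StringBuilder merely stands in for a non-String CharSequence such as ByteArrayUtf8CharSequence; how that class itself implements equals is not covered here.

```java
/** Hypothetical comparison helper for assertions on field values. */
final class TextEquality {
  private TextEquality() {}

  /** True if fieldValue is any CharSequence with exactly the characters of expected. */
  static boolean sameText(Object fieldValue, String expected) {
    return fieldValue instanceof CharSequence
        && expected.contentEquals((CharSequence) fieldValue);
  }
}
```

With this helper, an assertion such as sameText(doc.getFieldValue("name"), "foo") would not depend on whether the value arrived as a String or as another CharSequence.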

 
 
-----Original message-----

> From:Andreas Hubold <[hidden email]>
> Sent: Friday 15th February 2019 10:10
> To: [hidden email]
> Subject: Re: Solr 7.7 UpdateRequestProcessor broken

Re: Solr 7.7 UpdateRequestProcessor broken

Jan Høydahl / Cominvent
Thanks for chiming in, Markus. Yes, it is the same with the langid tests: they only run locally with manually constructed SolrInputDocument objects.
This bug/breaking change sounds really scary, and we should add an upgrade note somewhere.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 15 Feb 2019, at 10:34, Markus Jelsma <[hidden email]> wrote:

Re: Solr 7.7 UpdateRequestProcessor broken

Jason Gerlowski
Hey all,

I have a proposed update which adds a 7.7 section to our "Upgrade
Notes" ref-guide page.  I put a mention of this in there, but don't
have a ton of context on the issue.  Would appreciate a review from
anyone more familiar.  Check out SOLR-13256 if you get a few minutes.

Best,

Jason

On Mon, Feb 18, 2019 at 9:06 AM Jan Høydahl <[hidden email]> wrote:
