Issues with the handling of NULLs in Streaming Expressions

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Issues with the handling of NULLs in Streaming Expressions

Oleksandr Chornyi
Hi guys!

I'm working on a generic query builder for Streaming Expressions which
allows building various requests containing row level expressions (i.e.
evaluators), aggregations/metrics, sorts, etc. On this way, I bumped into
many issues related to the handling of NULL values by the engine. Here are
the issues in the descending order of their severity (from my standpoint):

1. *There is no way to check if a value in a tuple is NULL* because
*eq* function
fails to accept *null *as an argument:

> *eq(1,null) *

fails with

> "Unable to check eq(...) because a null value was found"

even though the documentation says
<https://lucene.apache.org/solr/guide/7_7/stream-evaluator-reference.html#eq>
that "If any any parameters are null and there is at least one parameter
that is not null then false will be returned."
This issue makes it impossible to evaluate an expression from the *if* function
documentation
<https://lucene.apache.org/solr/guide/7_7/stream-evaluator-reference.html#if>
:

> if(eq(fieldB,null), null, div(fieldA,fieldB)) // if fieldB is null then
> null else fieldA / fieldB

I think that the root cause of the issue is coming from the fact that
*EqualToEvaluator* extends *RecursiveBooleanEvaluator* which checks that
none of the arguments is *null*, but I don't think that's what we want
here. *Can you confirm that what I see is a bug and I should file it?*

2. The fact that *FieldValueEvaluator returns a field name when a value is
null* breaks any evaluator/decorator which otherwise would handle
*nulls*. Consider
these examples (I'm using *cartesianProduct *on an integer array to get
several tuples with integers because I couldn't find a way to do so
directly):

> cartesianProduct(
>     tuple(a=array(1,null,3)),
>     a
> )

returns values preserving *nulls: *

> "docs": [
>   {"a": 1},
>   {"a": null},
>   {"a": 3},
> ...]

If I just execute *add(1, null) *it works as expected and returns *null.* Now,
if I'm trying to apply any stream evaluator which should work fine with
*nulls* to this stream:

> select(
>     cartesianProduct(
>         tuple(a=array(1,null,3)),
>         a
>     ),
>     add(a, 1) as a
> )

it fails to process the second record saying that:

> "docs": [
>   {"a": 2},
>   {
>     "EXCEPTION": "Failed to evaluate expression add(a,val(1)) - Numeric
> value expected but found type java.lang.String for value a",
>     ...
>   }
> ]

It looks even more confusing when running the following query:

> select(
>     cartesianProduct(
>         tuple(a=array(1,null,3)),
>         a
>     ),
>     coalesce(a, 42) as a
> )

produces

> "docs": [
>   {"a": 1},
>   {"a": "a"},
>   {"a": 3},
> ...]

 instead of

> "docs": [
>   {"a": 1},
>   {"a": *42*},
>   {"a": 3},
> ...]

As I mentioned in the issue description, I think the issue lies in these
lines of *FieldValueEvaluator:*

> if(value == null) {
>    return fieldName;
> }

I consider this to be very counterintuitive. *Can you confirm that this is
a bug, rather than a designed feature?*

3. *Most Boolean Stream Evaluators* state that they *don't work with
NULLs.* However,
it's very inconvenient and there is no other way to work around it (see
item #1)*. *I'm talking about the following evaluators: *and, eor, or, gt,
lt, gteq, lteq. *At the moment these evaluators just throw exceptions when
an argument is *null. **Have you considered making their behavior more
SQL-like?* When the behavior is like this:

   - *gt, lt, gteq, lteq *evaluators return *null* if any of the arguments
   is *null*
   - *or(true, null)* returns *true*
   - *and(true, null)* returns *false*
   - *having* decorator treats *null* returned by *booleanEvaluator* as
   *false*

4. Some *inconsistencies in evaluators behavior* and/or documentation:

   - *div(1, null)* fails while *mult(1, null), add(1, null), sub(1, null)*
   return *null*. *Should I file a bug for div?*
   - documentation for *not *says that "The function will fail to execute
   if the parameter is non-boolean or null" however it returns *null*
for *not(null).
   **Should I create a task to fix the doc?*

I know I mixed many questions into one thread, however for me they are all
interrelated. Thank in advance for your help.

--
Best Regards,
Alex Chornyi
Reply | Threaded
Open this post in threaded view
|

Re: Issues with the handling of NULLs in Streaming Expressions

Joel Bernstein
Interesting questions. I suspect we need to beef up our test cases that
deal with nulls and make sure they behave in a consistent manner.

One of the things that likely needs to be looked at more carefully is how
string literals are handled as opposed to nulls. In some cases I believe if
null is encountered it's treated as a string literal and doesn't preserve
the null. So I think it's worth creating a ticket outlining your findings
and we can think about solutions.

Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Jun 6, 2019 at 9:22 AM Oleksandr Chornyi <[hidden email]>
wrote:

> Hi guys!
>
> I'm working on a generic query builder for Streaming Expressions which
> allows building various requests containing row level expressions (i.e.
> evaluators), aggregations/metrics, sorts, etc. On this way, I bumped into
> many issues related to the handling of NULL values by the engine. Here are
> the issues in the descending order of their severity (from my standpoint):
>
> 1. *There is no way to check if a value in a tuple is NULL* because
> *eq* function
> fails to accept *null *as an argument:
>
> > *eq(1,null) *
>
> fails with
>
> > "Unable to check eq(...) because a null value was found"
>
> even though the documentation says
> <
> https://lucene.apache.org/solr/guide/7_7/stream-evaluator-reference.html#eq
> >
> that "If any any parameters are null and there is at least one parameter
> that is not null then false will be returned."
> This issue makes it impossible to evaluate an expression from the *if*
> function
> documentation
> <
> https://lucene.apache.org/solr/guide/7_7/stream-evaluator-reference.html#if
> >
> :
>
> > if(eq(fieldB,null), null, div(fieldA,fieldB)) // if fieldB is null then
> > null else fieldA / fieldB
>
> I think that the root cause of the issue is coming from the fact that
> *EqualToEvaluator* extends *RecursiveBooleanEvaluator* which checks that
> none of the arguments is *null*, but I don't think that's what we want
> here. *Can you confirm that what I see is a bug and I should file it?*
>
> 2. The fact that *FieldValueEvaluator returns a field name when a value is
> null* breaks any evaluator/decorator which otherwise would handle
> *nulls*. Consider
> these examples (I'm using *cartesianProduct *on an integer array to get
> several tuples with integers because I couldn't find a way to do so
> directly):
>
> > cartesianProduct(
> >     tuple(a=array(1,null,3)),
> >     a
> > )
>
> returns values preserving *nulls: *
>
> > "docs": [
> >   {"a": 1},
> >   {"a": null},
> >   {"a": 3},
> > ...]
>
> If I just execute *add(1, null) *it works as expected and returns *null.*
> Now,
> if I'm trying to apply any stream evaluator which should work fine with
> *nulls* to this stream:
>
> > select(
> >     cartesianProduct(
> >         tuple(a=array(1,null,3)),
> >         a
> >     ),
> >     add(a, 1) as a
> > )
>
> it fails to process the second record saying that:
>
> > "docs": [
> >   {"a": 2},
> >   {
> >     "EXCEPTION": "Failed to evaluate expression add(a,val(1)) - Numeric
> > value expected but found type java.lang.String for value a",
> >     ...
> >   }
> > ]
>
> It looks even more confusing when running the following query:
>
> > select(
> >     cartesianProduct(
> >         tuple(a=array(1,null,3)),
> >         a
> >     ),
> >     coalesce(a, 42) as a
> > )
>
> produces
>
> > "docs": [
> >   {"a": 1},
> >   {"a": "a"},
> >   {"a": 3},
> > ...]
>
>  instead of
>
> > "docs": [
> >   {"a": 1},
> >   {"a": *42*},
> >   {"a": 3},
> > ...]
>
> As I mentioned in the issue description, I think the issue lies in these
> lines of *FieldValueEvaluator:*
>
> > if(value == null) {
> >    return fieldName;
> > }
>
> I consider this to be very counterintuitive. *Can you confirm that this is
> a bug, rather than a designed feature?*
>
> 3. *Most Boolean Stream Evaluators* state that they *don't work with
> NULLs.* However,
> it's very inconvenient and there is no other way to work around it (see
> item #1)*. *I'm talking about the following evaluators: *and, eor, or, gt,
> lt, gteq, lteq. *At the moment these evaluators just throw exceptions when
> an argument is *null. **Have you considered making their behavior more
> SQL-like?* When the behavior is like this:
>
>    - *gt, lt, gteq, lteq *evaluators return *null* if any of the arguments
>    is *null*
>    - *or(true, null)* returns *true*
>    - *and(true, null)* returns *false*
>    - *having* decorator treats *null* returned by *booleanEvaluator* as
>    *false*
>
> 4. Some *inconsistencies in evaluators behavior* and/or documentation:
>
>    - *div(1, null)* fails while *mult(1, null), add(1, null), sub(1, null)*
>    return *null*. *Should I file a bug for div?*
>    - documentation for *not *says that "The function will fail to execute
>    if the parameter is non-boolean or null" however it returns *null*
> for *not(null).
>    **Should I create a task to fix the doc?*
>
> I know I mixed many questions into one thread, however for me they are all
> interrelated. Thank in advance for your help.
>
> --
> Best Regards,
> Alex Chornyi
>
Reply | Threaded
Open this post in threaded view
|

Re: Issues with the handling of NULLs in Streaming Expressions

pratik@semandex
I am facing exactly the same issue right now. There is no way to check if a
particular field is not present in tuple or is null.

Was there any development related to this issue? Is there a work around?

In my case, I have an incoming stream of tuples and I want to filter out all
the tuples which do not have certain field set, so I was thinking of
"having" function like this.

having( seed_expr, not(eq(fieldA,null) )

this would result in stream of tuples which definitely have fieldA set and I
can do some operation on it.

Problem is that "eq" evaluator fails with null value.

Is there a related JIRA that I can track?

@Joel is there any way/ workaround  to achieve this? i.e. to know whether
certain field is null or not?


Thanks and Regards,
Pratik



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html