DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

Chantal Ackermann
Hi,

my use case is the following:

In a sub-entity I request rows from a database for an input list of
strings:
<entity name="prog" ...>
        <field name="vip" ...> /* multivalued, not required */
        <entity name="ssc_entry" dataSource="ssc" onError="continue"
                query="select SSC_VALUE from SSC_VALUE
                        where SSC_ATTRIBUTE_ID=1
                          and SSC_VALUE in (${prog.vip})">
                <field column="SSC_VALUE" name="vip_ssc" />
        </entity>
</entity>

The root entity is "prog" and it has an optional multivalued field
called "vip". When the list of "vip" values is empty, the SQL for the
sub-entity above throws an SQLException. (Working with Oracle which does
not allow an empty expression in the "in"-clause.)

Two things:
(A) best would be not to run the query whenever ${prog.vip} is null or
empty.
(B) From the documentation, it is not clear that onError is only checked
in the transformer runs but not checked when the SQL for the entity
throws an exception. (Trunk version JdbcDataSource lines 250pp).

IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
(A) is not easily fixable, making (B) work would be helpful.)

Looking through the code, I've realized that the replacement of the
variables is done in a very generic way. I've not yet seen an
appropriate way to check on those variables in order to stop the
processing of the entity if the variable is empty.
Is there a way to do this? Or maybe there is a completely different way
to get my use case working. Any help most appreciated!

Thanks,
Chantal

Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

MitchK
Hi Chantal,

did you tried to write a custom DIH Function?
If not, I think this will be a solution.
Just check, whether "${prog.vip}" is an empty string or null.
If so, you need to replace it with a value that never can response anything.

So the vip-field will always be empty for such queries.
Maybe that helps?

Hopefully, the variable resolver is able to resolve something like
${dih.functions.getReplacementIfNeeded(prog.vip).

Kind regards,
- Mitch


Chantal Ackermann wrote
Hi,

my use case is the following:

In a sub-entity I request rows from a database for an input list of
strings:
<entity name="prog" ...>
        <field name="vip" ...> /* multivalued, not required */
        <entity name="ssc_entry" dataSource="ssc" onError="continue"
                query="select SSC_VALUE from SSC_VALUE
                        where SSC_ATTRIBUTE_ID=1
                          and SSC_VALUE in (${prog.vip})">
                <field column="SSC_VALUE" name="vip_ssc" />
        </entity>
</entity>

The root entity is "prog" and it has an optional multivalued field
called "vip". When the list of "vip" values is empty, the SQL for the
sub-entity above throws an SQLException. (Working with Oracle which does
not allow an empty expression in the "in"-clause.)

Two things:
(A) best would be not to run the query whenever ${prog.vip} is null or
empty.
(B) From the documentation, it is not clear that onError is only checked
in the transformer runs but not checked when the SQL for the entity
throws an exception. (Trunk version JdbcDataSource lines 250pp).

IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
(A) is not easily fixable, making (B) work would be helpful.)

Looking through the code, I've realized that the replacement of the
variables is done in a very generic way. I've not yet seen an
appropriate way to check on those variables in order to stop the
processing of the entity if the variable is empty.
Is there a way to do this? Or maybe there is a completely different way
to get my use case working. Any help most appreciated!

Thanks,
Chantal
Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

Chantal Ackermann
Hi Mitch,

thanks for that suggestion. I wasn't aware of that. I've already added a
temporary field in my ScriptTransformer that does basically the same.

However, with this approach indexing time went up from 20min to more
than 5 hours.

The new approach is to query the solr index for that other database that
I've already setup. This is only a bit slower than the original query
(20min). (I'm using URLDataSource to be 1.4.1 conform.)

As with the db entity before, for every document a request is sent to
the solr core even if it is useless because the input variable is empty.
It seems that once an entity processor kicks in you cannot avoid the
initial request to its data source?

Thanks,
Chantal

On Mon, 2010-07-26 at 16:22 +0200, MitchK wrote:

> Hi Chantal,
>
> did you tried to write a  http://wiki.apache.org/solr/DIHCustomFunctions
> custom DIH Function ?
> If not, I think this will be a solution.
> Just check, whether "${prog.vip}" is an empty string or null.
> If so, you need to replace it with a value that never can response anything.
>
> So the vip-field will always be empty for such queries.
> Maybe that helps?
>
> Hopefully, the variable resolver is able to resolve something like
> ${dih.functions.getReplacementIfNeeded(prog.vip).
>
> Kind regards,
> - Mitch
>
>
>
> Chantal Ackermann wrote:
> >
> > Hi,
> >
> > my use case is the following:
> >
> > In a sub-entity I request rows from a database for an input list of
> > strings:
> > <entity name="prog" ...>
> > <field name="vip" ...> /* multivalued, not required */
> > <entity name="ssc_entry" dataSource="ssc" onError="continue"
> > query="select SSC_VALUE from SSC_VALUE
> > where SSC_ATTRIBUTE_ID=1
> >  and SSC_VALUE in (${prog.vip})">
> > <field column="SSC_VALUE" name="vip_ssc" />
> > </entity>
> > </entity>
> >
> > The root entity is "prog" and it has an optional multivalued field
> > called "vip". When the list of "vip" values is empty, the SQL for the
> > sub-entity above throws an SQLException. (Working with Oracle which does
> > not allow an empty expression in the "in"-clause.)
> >
> > Two things:
> > (A) best would be not to run the query whenever ${prog.vip} is null or
> > empty.
> > (B) From the documentation, it is not clear that onError is only checked
> > in the transformer runs but not checked when the SQL for the entity
> > throws an exception. (Trunk version JdbcDataSource lines 250pp).
> >
> > IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
> > (A) is not easily fixable, making (B) work would be helpful.)
> >
> > Looking through the code, I've realized that the replacement of the
> > variables is done in a very generic way. I've not yet seen an
> > appropriate way to check on those variables in order to stop the
> > processing of the entity if the variable is empty.
> > Is there a way to do this? Or maybe there is a completely different way
> > to get my use case working. Any help most appreciated!
> >
> > Thanks,
> > Chantal
> >
> >
> >


Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

MitchK
Hi Chantal,

However, with this approach indexing time went up from 20min to more
than 5 hours.
This is 15x slower than the initial solution... wow.
From MySQL I know that IN ()-clauses are the embodiment of endlessness - they perform very, very badly.

New idea:
Create a method which returns the query-string:

returnString(theVIP)
{
       if ( theVIP != null || theVIP != "")
       {
               return "a query-string to find the vip"
       }
       else
       {
               return "SELECT 1" // you need to modify this, so that it matches your field-definition
       }
}

The main-idea is to perform a blazing fast query, instead of a complex IN-clause-query.
Does this sounds like a solution???

The new approach is to query the solr index for that other database that
I've already setup. This is only a bit slower than the original query
(20min). (I'm using URLDataSource to be 1.4.1 conform.)
Unfortunately I can not follow you.
You are querying a solr-index for a database?

Kind regards,
- Mitch
Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

Chantal Ackermann
Hi Mitch,


> New idea:
> Create a method which returns the query-string:
>
> returnString(theVIP)
> {
>        if ( theVIP != null || theVIP != "")
>        {
>                return "a query-string to find the vip"
>        }
>        else
>        {
>                return "SELECT 1" // you need to modify this, so that it
> matches your field-definition
>        }
> }
>
> The main-idea is to perform a blazing fast query, instead of a complex
> IN-clause-query.
> Does this sounds like a solution???

I was using "in" because it's a multiValued input that results in
multiValued output (not necessarily but it's most probable - it's either
empty or multiple values).
I don't understand how I can make your solution work with multivalued
input/output?

> > The new approach is to query the solr index for that other database that
> > I've already setup. This is only a bit slower than the original query
> > (20min). (I'm using URLDataSource to be 1.4.1 conform.)
> >
> Unfortunately I can not follow you.
> You are querying a solr-index for a database?

Yes, because I've already put one up (second core) and used SolrJ to get
what I want later on, but it would be better to compute the relation
between the two indexes at index time instead of at query time. (If it
would have worked with the db entity the second index wouldn't have been
required, anymore.)
But now that it works well with the url entity I'm fine with maintaining
that second index. It's not that much effort.
I've subclassed URLDataSource to add a check whether the list of input
values is empty and only proceed when this is not the case. If realized
that I have to throw an exception and add the onError attribute to the
entity to make that work.

Thanks!
Chantal

Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

MitchK
Hi Chantal,

instead of:

<entity name="prog" ...> 
        <field name="vip" ...> /* multivalued, not required */
        <entity name="ssc_entry" dataSource="ssc" onError="continue"
                query="select SSC_VALUE from SSC_VALUE
                        where SSC_ATTRIBUTE_ID=1
                          and SSC_VALUE in (${prog.vip})"> 
                <field column="SSC_VALUE" name="vip_ssc" /> 
        </entity> 
</entity> 

you do:

<entity name="prog" ...> 
        <field name="vip" ...> /* multivalued, not required */
        <entity name="ssc_entry" dataSource="ssc" onError="continue"
                query="${yourCustomFunctionToReturnAQueryString(prog.vip, ..., ...)}"> 
                <field column="SSC_VALUE" name="vip_ssc" /> 
        </entity> 
</entity> 

The yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2)
{
    if(vip != null && !vip.equals(""))
    {
         StringBuilder sb = new StringBuilder(50);
         sb.append(querystring1); // SELECT SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1
                                               and SSC_VALUE in (
         sb.append(vip);//VIP-value
         sb.append(querystring2);//just the closing ")"
         return sb.toString();
     }
     else
     {
            return "SELECT \"\" AS yourFieldName";
     }
}

I expect that this method is called for every vip-value, if there is one.

Solr DIH uses the returned querystring to query the database. So, if vip-value is empty or null, you can use a different query that is blazing fast (i.e. SELECT "" AS yourFieldName - just an example to show the logic).
This query should return a row with an empty string. So Solr fills the current field with an empty string.

I don't know how to prevent Solr from calling your ssc_entry-entity, when vip is null or empty.
But this would be a solution to handle empty vip-strings as efficient as possible.

If realized
that I have to throw an exception and add the onError attribute to the
entity to make that work.
I am curious:
Can you show how to make a method throwing an exception that is accepted by the onError-attribute?

I hope we do not talk past eachother here. :-)

Kind regards,
- Mitch
Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

Chantal Ackermann
Hi Mitch,

thanks for the code. Currently, I've got a different solution running
but it's always good to have examples.

> > If realized
> > that I have to throw an exception and add the onError attribute to the
> > entity to make that work.
> >
> I am curious:
> Can you show how to make a method throwing an exception that is accepted by
> the onError-attribute?

the catch clause looks for "Exception" so it's actually easy. :-D

Anyway, I've found a "cleaner" way. It is better to subclass the
XPathEntityProcessor and put it in a state that prevents it from calling
"initQuery" which triggers the dataSource.getData() call.
I have overridden the initContext() method setting a go/no go flag that
I am using in the overridden nextRow() to find out whether to delegate
to the superclass or not.

This way I can also avoid the code that fills the tmp field with an
empty value if there is no value to query on.

Cheers,
Chantal

Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

Lance Norskog-2
Should this go into the trunk, or does it only solve problems unique
to your use case?

On Tue, Jul 27, 2010 at 5:49 AM, Chantal Ackermann
<[hidden email]> wrote:

> Hi Mitch,
>
> thanks for the code. Currently, I've got a different solution running
> but it's always good to have examples.
>
>> > If realized
>> > that I have to throw an exception and add the onError attribute to the
>> > entity to make that work.
>> >
>> I am curious:
>> Can you show how to make a method throwing an exception that is accepted by
>> the onError-attribute?
>
> the catch clause looks for "Exception" so it's actually easy. :-D
>
> Anyway, I've found a "cleaner" way. It is better to subclass the
> XPathEntityProcessor and put it in a state that prevents it from calling
> "initQuery" which triggers the dataSource.getData() call.
> I have overridden the initContext() method setting a go/no go flag that
> I am using in the overridden nextRow() to find out whether to delegate
> to the superclass or not.
>
> This way I can also avoid the code that fills the tmp field with an
> empty value if there is no value to query on.
>
> Cheers,
> Chantal
>
>



--
Lance Norskog
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

Chantal Ackermann
Hi Lance!

On Wed, 2010-07-28 at 02:31 +0200, Lance Norskog wrote:
> Should this go into the trunk, or does it only solve problems unique
> to your use case?

The solution is generic but is an extension of XPathEntityProcessor
because I didn't want to touch the solr.war. This way I can deploy the
extension into SOLR_HOME/lib.
The problem that it solves is not one with XPathEntityProcessor but more
general. What it does:

It adds an attribute to the entity that I called "skipIfEmpty" which
takes the variable (it could even take more variables seperated by
whitespace).
On entityProcessor.init() which is called for sub-entities per row of
root entity (:= before every new request to the data source), the value
of the attribute is resolved and if it is null or empty (after
trimming), the entity is not further processed.
This attribute is only allowed on sub-entities.

It would probably be nicer to put that somewhere higher up in the class
hierarchy so that all entity processors could make use of it.
But I don't know how common the use case is - all examples I found where
more or less "joins" on primary keys.

Cheers,
Chantal

Here comes the code==================================

import static
org.apache.solr.handler.dataimport.DataImportHandlerException.SEVERE;

import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.XPathEntityProcessor;

public class OptionalXPathEntityProcessor extends XPathEntityProcessor {
        private Logger log =
Logger.getLogger(OptionalXPathEntityProcessor.class.getName());
        private static final String SKIP_IF_EMPTY = "skipIfEmpty";
        private boolean skip = false;

        @Override
        protected void firstInit(Context context) {
                if (context.isRootEntity()) {
                        throw new DataImportHandlerException(SEVERE,
"OptionalXPathEntityProcessor not allowed for root entities.");
                }
                super.firstInit(context);
        }

        @Override
        public void init(Context context) {
                String value = context.getResolvedEntityAttribute(SKIP_IF_EMPTY);
                if (value == null || value.trim().isEmpty()) {
                        skip = true;
                } else {
                        super.init(context);
                        skip = false;
                }
        }

        @Override
        public Map<String, Object> nextRow() {
                if (skip) return null;
                return super.nextRow();
        }
}