Tika Integration problem with DIH and JDBC


Dan Davis-2

What I want to do is pull a URL out of an Oracle database, and then use TikaEntityProcessor and BinURLDataSource to fetch and process that URL. I'm having a problem that seems general to using JDBC with Tika. I get the following exception:

Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: http://www.cdc.gov/healthypets/pets/wildlife.html Processing Document # 14
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
...

To reproduce the problem:
  • Try it with the XML file and verify that you get two documents and that they contain text (check the text field in the schema browser).
  • Try it with a JDBC sqlite3 dataSource and verify that you get an exception, then advise me what may be wrong in my configuration ...
So far I've tried this three ways:
  • My Oracle database: fails as above.
  • An SQLite3 database, to see whether the problem is Oracle-specific: fails with "Unable to execute query", but the message does not include the URL.
  • An XML file listing two URLs: succeeds without error.

For the SQL attempts, setting onError="skip" lets the data from the database be indexed, but the exception is still logged for each root entity. I can tell that nothing from the text extraction is indexed by browsing the "text" field in the schema browser and seeing how few terms there are. The exceptions also sort of give it away, but it is good to be careful :)
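
For readers following along, the kind of configuration described above looks roughly like the sketch below. It is not the attached dih-smallsqlite.xml. The database path, the topics table, its id and url columns, the Solr field names id and url, and the entity name topic are all assumptions for illustration; only the entity name extract, the "text" field, BinURLDataSource, TikaEntityProcessor and onError="skip" come from this post.

```xml
<dataConfig>
  <!-- JDBC source for the root entity (path and driver are assumptions) -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.sqlite.JDBC"
              url="jdbc:sqlite:/path/to/example.db"/>
  <!-- Binary URL source used by the nested Tika entity -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <document>
    <!-- Root entity: one row per URL; table and columns are hypothetical -->
    <entity name="topic" dataSource="db"
            query="SELECT id, url FROM topics">
      <field column="id" name="id"/>
      <field column="url" name="url"/>
      <!-- Nested entity: fetch the URL from the row and extract text -->
      <entity name="extract" processor="TikaEntityProcessor"
              dataSource="bin" url="${topic.url}"
              format="text" onError="skip">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested entity names its dataSource explicitly, so the URL is meant to be fetched through the binary source rather than through JDBC. Column-name case in ${topic.url} may need adjusting for a given driver; Oracle, for example, typically reports column names in upper case.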

This is using:

  • Tomcat 7.0.55
  • Solr 4.10.1
  • and JDBC drivers
    • ojdbc7.jar
    • sqlite-jdbc-3.7.2.jar

Excerpt of solrconfig.xml:

  <!-- Data Import Handler for Health Topics -->
  <requestHandler name="/dih-healthtopics" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-healthtopics.xml</str>
    </lst>
  </requestHandler>

  <!-- Data Import Handler that imports a single URL via Tika -->
  <requestHandler name="/dih-smallxml" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-smallxml.xml</str>
    </lst>
  </requestHandler>

  <!-- Data Import Handler that imports a single URL via Tika -->
  <requestHandler name="/dih-smallsqlite" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-smallsqlite.xml</str>
    </lst>
  </requestHandler>


The data import handlers and a copy-paste from Solr logging are attached.


Attachments:
  • exception.txt (5K)
  • dih-smallsqlite.xml (2K)
  • dih-healthtopics.xml (2K)
  • simple.xml (574 bytes)
  • dih-smallxml.xml (1K)

Re: Tika Integration problem with DIH and JDBC

Alexandre Rafalovitch
You say "dataSource='bin'" but I don't see you defining that datasource. E.g.:

<dataSource type="BinURLDataSource" name="bin"/>

So, there might be some weird default fallback that just causes
strange problems.

Regards,
    Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 10 October 2014 14:17, Dan Davis <[hidden email]> wrote:

> [...]
Fwd: Tika Integration problem with DIH and JDBC

Alexandre Rafalovitch
I would concentrate on the stack traces and try reading them. They
often provide a lot of clues. For example, your original stack trace
had

org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:283)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
2) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
1) at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)

I added 1) and 2) to mark the important lines. You can see that 1)
your TikaEntityProcessor is calling 2) JdbcDataSource, which is not
what you wanted, since you specified BinURLDataSource. So focus on
that until it is resolved.

Sometimes this happens when the XML file says 'datasource' instead of
'dataSource' (DIH is case-sensitive), but that does not seem to be the
case in your situation.
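
To make the spelling point concrete, here is a small sketch of the attribute on a Tika entity. The entity and the ${parent.url} variable are hypothetical and not taken from the attached files; only the camel-case spelling of dataSource is the point.

```xml
<!-- Hypothetical entity; only the attribute spelling matters here. -->
<!-- Misspelled variant (lowercase s): datasource="bin" -->
<entity name="extract" processor="TikaEntityProcessor"
        dataSource="bin" url="${parent.url}" format="text"/>
```

With the lowercase spelling, the attribute would presumably not be recognized, and the entity would not be tied to the named source.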

Regards,
    Alex.
P.S. If you still haven't figured it out, mention the Solr version in
the next email. Sometimes it makes a difference, though DIH has been
largely unchanged for a while.

---------- Forwarded message ----------
From: Dan Davis <[hidden email]>
Date: 10 October 2014 15:00
Subject: Re: Tika Integration problem with DIH and JDBC
To: Alexandre Rafalovitch <[hidden email]>


The definition of dataSource name="bin" type="BinURLDataSource" is in
each of the dih-*.xml files, but only the XML version has the
definition at the top, above the document.

Moving the dataSource definition to the top does change the behavior.
Now I get the following error for that entity:

Exception in entity :
extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
JDBC URL or JNDI name has to be specified Processing Document # 30

When I changed it to specify url="", it reverted to the earlier form:

Exception in entity :
extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: http://www.cdc.gov/flu/swineflu/ Processing
Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)

It does seem to be a problem resolving the dataSource in some way. So
I double-checked another part of solrconfig.xml. Since the XML example
still works, I guess the required jars must be loaded.

  <lib dir="${solr.solr.home:}/dist/" regex="solr-dataimporthandler-.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-cell-\d.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/clustering/lib/" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-clustering-\d.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/langid/lib/" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-langid-\d.*\.jar" />

  <lib dir="${solr.solr.home:}/contrib/velocity/lib" regex=".*\.jar" />
  <lib dir="${solr.solr.home:}/dist/" regex="solr-velocity-\d.*\.jar" />


On Fri, Oct 10, 2014 at 2:37 PM, Alexandre Rafalovitch
<[hidden email]> wrote:

> [...]
Re: Tika Integration problem with DIH and JDBC

Dan Davis-2
Thanks, Alexandre. My role is to kick the tires on this, and we're
trying it a couple of different ways. So I'm going to assume this can be
resolved and move on to trying ManifoldCF, to see whether it can do
similar things for me, e.g. what it adds for free to our bag of tricks.

On Fri, Oct 10, 2014 at 3:16 PM, Alexandre Rafalovitch <[hidden email]>
wrote:

> [...]
Re: Tika Integration problem with DIH and JDBC

Dan Davis-2
All,

The problem here was that I gave driver="BinURLDataSource" rather than
type="BinURLDataSource". With no type attribute, DIH had no way to know
the data source was meant to be a BinURLDataSource, which matches the
JdbcDataSource frames in the stack trace.
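
Side by side, the change amounts to one attribute name. The sketch below assumes the data source is named bin, as in Alexandre's earlier example; the rest of the attached configuration is not reproduced here.

```xml
<!-- Before: type was never set, because driver= was written instead.
     <dataSource name="bin" driver="BinURLDataSource"/> -->
<!-- After: the data source type is given explicitly. -->
<dataSource name="bin" type="BinURLDataSource"/>
```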
Re: Tika Integration problem with DIH and JDBC

Alexandre Rafalovitch
Yes,

DIH (and formerly the Solr schema parser too) is great at ignoring the
things it does not know about and just using defaults instead.

Regards,
   Alex.


On 4 November 2014 15:54, Dan Davis <[hidden email]> wrote:
> [...]