Indexing PDF files in SqlBase database


aruna
Hello,

I have been given the task of indexing, in Solr 7.7.1, PDF files that are
stored in a SqlBase database. I have done half the job: I can index all the
table fields and search them, except for the field that stores the PDF file
content. As I am totally new to Solr, I have spent a lot of time trying,
without success, to understand how to make Solr extract and index the field
with the PDF content. I need some help.

Regards,

Aruna

In solrconfig.xml I have:


  <lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

------------------------------------------------------------
db-data-config.xml

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="jdbc.unify.sqlbase.SqlbaseDriver"
              url="jdbc:sqlbase://localhost:2155/PDFDOCS"
              user="sysadm" password="sysadm" />
  <document>
    <entity name="PDFDOCUMENTS" query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
      <field column="ID" name="idx" />
      <field column="PDOCUMENT" name="PDF" />
      <field column="UNIT" name="division" />
    </entity>
  </document>
</dataConfig>

Re: Indexing PDF files in SqlBase database

Erick Erickson
For a lot of reasons, I greatly prefer to put this work on a client rather than use Solr directly. Here’s a place to get started: it connects to a DB and also scans a local file directory for docs to push through (local) Tika and index. So you should be able to modify it relatively easily to get the data from SqlBase, read the associated PDF, combine the two and send them to Solr.

https://lucidworks.com/2012/02/14/indexing-with-solrj/

The code itself is a bit old, but illustrates the process.

Best,
Erick

> On Apr 2, 2019, at 11:46 PM, Arunas Spurga <[hidden email]> wrote:
> [...]


Re: Indexing PDF files in SqlBase database

aruna
Yes, I know the reasons for putting this work on a client rather than using
Solr directly, and that may well be my next task.
But first I need to finish my current task: indexing the PDF files stored in
the SqlBase database. The PDF files are pretty simple, sometimes only dozens
of lines of text.
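
For the DIH-only route, a commonly described pattern is to nest a
TikaEntityProcessor entity inside the JDBC entity and feed it the binary
column through a FieldStreamDataSource. What follows is a minimal sketch, not
a tested configuration. The data source names "db" and "pdfField" and the
inner entity name "PDFTEXT" are made up for illustration. It also assumes
that the SqlBase driver returns PDOCUMENT as a Blob or a byte array, that the
solr-dataimporthandler-extras jar is picked up by the existing lib
directives, and that the schema has an indexed text field named PDF.

```xml
<dataConfig>
  <!-- JDBC source with the same connection settings as before; "db" is a hypothetical name -->
  <dataSource name="db"
              type="JdbcDataSource"
              driver="jdbc.unify.sqlbase.SqlbaseDriver"
              url="jdbc:sqlbase://localhost:2155/PDFDOCS"
              user="sysadm" password="sysadm" />
  <!-- Exposes a column of the parent row as a stream; "pdfField" is a hypothetical name -->
  <dataSource name="pdfField" type="FieldStreamDataSource" />
  <document>
    <entity name="PDFDOCUMENTS" dataSource="db"
            query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
      <field column="ID" name="idx" />
      <field column="UNIT" name="division" />
      <!-- Inner entity: Tika parses the bytes of PDOCUMENT (assumed to be a Blob or byte array);
           the extracted body text arrives in the "text" column -->
      <entity name="PDFTEXT"
              processor="TikaEntityProcessor"
              dataSource="pdfField"
              dataField="PDFDOCUMENTS.PDOCUMENT"
              format="text">
        <field column="text" name="PDF" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

If the import runs but the PDF field stays empty, the first things to check
would probably be the Java type the driver actually returns for PDOCUMENT and
the DIH messages in the Solr log.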

Regards,

Aruna

On Wed, Apr 3, 2019 at 5:03 PM Erick Erickson <[hidden email]>
wrote:

> [...]