How to use DataImportHandler with ExtractingRequestHandler?

Khai Doan
Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan
Re: How to use DataImportHandler with ExtractingRequestHandler?

Noble Paul നോബിള്‍  नोब्ळ्-2
Unfortunately, DIH is not yet integrated with ExtractingRequestHandler.
See https://issues.apache.org/jira/browse/SOLR-1358



On Thu, Sep 3, 2009 at 5:34 AM, Khai Doan<[hidden email]> wrote:

> Hi all,
>
> My name is Khai.  I have a table in a relational database.  I have
> successfully used DataImportHandler to import this data into Apache Solr.
> However, one of the columns stores the location of a PDF file.  How can I
> configure DataImportHandler to use ExtractingRequestHandler to extract the
> content of the PDF?
>
> Thanks!
>
> Khai Doan
>



--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: How to use DataImportHandler with ExtractingRequestHandler?

Sascha Szott
In reply to this post by Khai Doan
Hi Khai,

A few weeks ago I faced the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3):
For each row, extract the content from the corresponding PDF file using
a parser library of your choice (I suggest Apache PDFBox, or Apache Tika
in case you need to process other file types as well), put it between

        <foo><![CDATA[

and

        ]]></foo>

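and store it in a text file. To keep the relationship between a file and
its corresponding database row, use the primary key as the file name
(with the .xml extension, to match the url attribute below).

Within data-config.xml, use the XPathEntityProcessor as follows (replace
dbRow and primaryKey with the name of your database entity and its
primary key column, respectively):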

<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
   <field column="pdftext" xpath="/foo"/>
</entity>
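
For context, here is a minimal sketch of how such an entity might sit inside a complete data-config.xml. It is only an illustration: the JDBC driver and URL, the table `documents`, its columns `id` and `title`, the data source names `db` and `files`, and the directory `/data/pdftext` are all placeholders that do not come from this thread.

```xml
<dataConfig>
  <!-- Hypothetical JDBC connection; driver and url are placeholders. -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:/path/to/db"/>
  <!-- Reads the pre-extracted files; basePath is an assumed directory. -->
  <dataSource name="files" type="FileDataSource"
              basePath="/data/pdftext" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: one Solr document per database row. -->
    <entity name="dbRow" dataSource="db"
            query="SELECT id, title FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- Inner entity: evaluated once per row, reads the file <id>.xml. -->
      <entity name="pdfcontent" dataSource="files"
              processor="XPathEntityProcessor"
              forEach="/foo"
              url="${dbRow.id}.xml">
        <field column="pdftext" xpath="/foo"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because the inner entity is nested, `${dbRow.id}` is resolved against the current row of the outer entity, so each database record picks up the text file that carries its primary key.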


And, by the way, in Solr 1.4 you do not have to put your content between
XML tags: use the PlainTextEntityProcessor instead of the XPathEntityProcessor.
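
A sketch of what the Solr 1.4 variant could look like, assuming the extracted text is saved as plain `.txt` files, the parent database entity is named `dbRow`, and a FileDataSource named `files` is declared elsewhere in the same data-config.xml (these names are illustrative, not from the original post):

```xml
<!-- Nested inside the database entity (assumed here to be named dbRow). -->
<entity name="pdfcontent"
        processor="PlainTextEntityProcessor"
        dataSource="files"
        url="${dbRow.primaryKey}.txt">
  <!-- PlainTextEntityProcessor exposes the whole file content as the
       column plainText; map it to the Solr field of your choice. -->
  <field column="plainText" name="pdftext"/>
</entity>
```

With this processor there is no forEach or xpath to configure, and no CDATA wrapping is needed in the files.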

Best,
Sascha


Re: How to use DataImportHandler with ExtractingRequestHandler?

javaxmlsoapdev
Did you extend DIH to do this work? Can you share code samples? I have a similar requirement: I need to index database records, and each record has a column with a document path, so I need to create another index for the documents (we allow users to search both indexes separately) in parallel with reading some metadata for the documents from the database. I have all sorts of different document formats to index. I am on Solr 1.4.0. Any pointers would be appreciated.

Thanks,


Re: How to use DataImportHandler with ExtractingRequestHandler?

javaxmlsoapdev
Does anyone have any idea?

Thanks,

Re: How to use DataImportHandler with ExtractingRequestHandler?

Shalin Shekhar Mangar
In reply to this post by javaxmlsoapdev
On Fri, Nov 20, 2009 at 9:13 PM, javaxmlsoapdev <[hidden email]> wrote:

>
> Did you extend DIH to do this work? Can you share code samples? I have a
> similar requirement where I need to index database records, and each record
> has a column with a document path, so I need to create another index for
> documents (we allow users to search both indexes separately) in parallel with
> reading some metadata of documents from the database as well. I have all sorts
> of different document formats to index. FYI, I am on Solr 1.4.0. Any
> pointers would be appreciated.
>
>
He did not extend DIH for this. He extracted the text from his documents,
saved it into files, and used XPathEntityProcessor to index them (on Solr 1.4
you can use PlainTextEntityProcessor instead).

I don't know much about ExtractingRequestHandler, but if you want to use DIH,
you'll have to extend it to add Tika support. You may want to look at a
couple of open issues:

   1. https://issues.apache.org/jira/browse/SOLR-1358
   2. https://issues.apache.org/jira/browse/SOLR-1583

--
Regards,
Shalin Shekhar Mangar.