Solr dih extract text from inline images in pdf


lala
Hi,

I am working with Solr 7, indexing multilingual files from a folder
using DIH (FileListEntityProcessor for the base entity and
TikaEntityProcessor for the child entity in the configuration file).
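
For readers following along, a DIH configuration of the kind described above might look roughly like the sketch below. The baseDir path, the fileName pattern, the entity and data source names, and the "content" field are placeholders (assumptions), not taken from the original setup.

```xml
<dataConfig>
  <!-- Binary data source so the Tika entity can read raw file bytes -->
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: walks the folder and emits one row per matching file -->
    <entity name="files" processor="FileListEntityProcessor"
            dataSource="null"
            baseDir="/path/to/docs" fileName=".*\.pdf"
            recursive="true" rootEntity="false">
      <!-- Inner entity: hands each file to Tika for text extraction -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <!-- "text" is the column Tika fills; "content" is a hypothetical schema field -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```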

My problem is this: I want to extract text from images embedded in PDF
files. That works fine with the /update/extract request handler, where I set
the "parseContext.config" attribute to an XML file, say "context.xml",
in which I set the property "extractInlineImages" to true for the
[PDFParserConfig] entry. But I have no idea how to set
parseContext.config in the DIH configuration.
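
For context, a "context.xml" of the kind described above would presumably look something like the following sketch. It follows the entry/property layout used for parseContext.config with the extracting request handler; only the "extractInlineImages" property comes from this message, and the file is an illustration, not the poster's actual file.

```xml
<entries>
  <!-- One entry per Tika config class placed into the ParseContext -->
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Ask the PDF parser to hand embedded images on for extraction -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

The /update/extract handler definition in solrconfig.xml then points at this file through its "parseContext.config" setting, as described above.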

I tried these approaches; neither worked:

    - setting the tikaConfig attribute in the DIH config file to my
      "context.xml"; this obviously won't work, since the Tika config
      format is different.
    - setting the parseContext.config attribute on my "/dataImport"
      request handler.

I googled a lot with no result. I would really appreciate any help here!





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr dih extract text from inline images in pdf

Erick Erickson
It's often much easier to approach this by running Tika separately.
Here's a blog on both the reasoning and sample code:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

Among other things, you have a lot more control over how Tika operates.

Best,
Erick

On Tue, Mar 6, 2018 at 12:36 AM, lala <[hidden email]> wrote:

> [...] I have no idea how to set parseContext.config in the DIH
> configuration.

Re: Solr dih extract text from inline images in pdf

lala
Thanks for your reply, Erick.

Actually, I am already using SolrJ to index files, among other operations
with Solr. But to index a large number of different kinds of files, I'm
sending a DIH request to Solr through the SolrJ API: FileListEntityProcessor
with TikaEntityProcessor.
Why not benefit from this technology if Solr offers it? It simplifies our
work tremendously.
Isn't there any way to extract inline images from PDF docs?

Awaiting your reply. Best regards.




Re: Solr dih extract text from inline images in pdf

Charlie Hull-3
On 07/03/2018 09:32, lala wrote:
> Thanks for your reply, Erick.
>
> Actually, I am already using SolrJ to index files, among other operations
> with Solr. But to index a large number of different kinds of files, I'm
> sending a DIH request to Solr through the SolrJ API: FileListEntityProcessor
> with TikaEntityProcessor.
> Why not benefit from this technology if Solr offers it? It simplifies our
> work tremendously.

It may simplify your work, but it isn't good practice. Tika has some
heavy lifting to do to extract text from some formats and you should
consider how this load will affect Solr. We've often put Tika into a
different process for this reason.

> Isn't there any way to extract inline images from PDF docs?

https://stackoverflow.com/questions/31303735/how-to-extract-images-from-a-file-using-apache-tika 
has some useful suggestions.

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

Re: Solr dih extract text from inline images in pdf

lala
Thanks, Charlie.
It's just confusing to me. In the DIH configuration file, for the inner entity
that uses "TikaEntityProcessor" as its processor, I can easily point the
tikaConfig attribute at an XML file located in the core's config folder,
and in that file I should be able to override the PDFParser
default properties, as with parseContext.config.
The thing is, I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute to "tika-config.xml", but Tika is still not
parsing images inside PDF files.
Let's say this is just an experiment with Solr DIH crawling. Why is it not
working?

This is my tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>       
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
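                <param name="extractInlineImages" type="bool">true</param>
                <param name="sortByPosition" type="bool">true</param>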
            </params>
        </parser>
    </parsers>
</properties>

I've read the code in both TikaEntityProcessor and TikaConfig. It should
read the XML file from the config folder, extract the params, and override
the original PDFParser attributes. But it doesn't!
Any ideas?




Re: Solr dih extract text from inline images in pdf

lala
I don't know what the problem is, but when I posted the message, the XML
inside the <params> element was not rendered correctly. It should contain
<param name="extractInlineImages" type="bool">true</param> and
<param name="sortByPosition" type="bool">true</param>.




Re: Solr dih extract text from inline images in pdf

Erick Erickson
You're missing Charlie's point, and if you read the blog I pointed you
to that point is reiterated.

DIH does the Tika processing on the Solr node that is _also_ indexing
documents and satisfying queries. Parsing a semi-structured document
(PDF in this case) consumes CPU cycles and memory, all _within_ the
Solr process. You can easily create an OOM problem on the Solr node if
someone drops, say, a 2G file in your directory structure and you
blithely send it to Solr via DIH.

Additionally, there are so many variants of, say, the PDF "standard"
that some edge case somewhere can cause (and has caused) Tika to blow its
brains out. The Tika folks have done a marvelous job of fixing these
when they come up, but it's a never-ending battle.

If you do the Tika processing in your own Java process, you isolate
your Solr nodes from these issues.

Up to you of course.
Erick

On Wed, Mar 7, 2018 at 5:39 AM, lala <[hidden email]> wrote:
> [...] It should contain
> <param name="extractInlineImages" type="bool">true</param> and
> <param name="sortByPosition" type="bool">true</param>.

Re: Solr dih extract text from inline images in pdf

Charlie Hull-3
In reply to this post by lala
On 07/03/2018 13:29, lala wrote:

> Thanks, Charlie.
> It's just confusing to me. In the DIH configuration file, for the inner entity
> that uses "TikaEntityProcessor" as its processor, I can easily point the
> tikaConfig attribute at an XML file located in the core's config folder,
> and in that file I should be able to override the PDFParser
> default properties, as with parseContext.config.
> The thing is, I placed my tika-config.xml file in the config folder and
> set the "tikaConfig" attribute to "tika-config.xml", but Tika is still not
> parsing images inside PDF files.
> Let's say this is just an experiment with Solr DIH crawling. Why is it not
> working?
>
> This is my tika-config.xml file:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>      <parsers>
>          <parser class="org.apache.tika.parser.DefaultParser"/>
>          <parser class="org.apache.tika.parser.pdf.PDFParser">
>              <params>
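>                  <param name="extractInlineImages" type="bool">true</param>
>                  <param name="sortByPosition" type="bool">true</param>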
>              </params>
>          </parser>
>      </parsers>
> </properties>
>
> I've read the code in both TikaEntityProcessor and TikaConfig. It should
> read the XML file from the config folder, extract the params, and override
> the original PDFParser attributes. But it doesn't!
> Any ideas?

Hi,

My reading of
https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file
indicates that your customised PDF parser may not run unless you explicitly
exclude PDFParser from the DefaultParser, which I don't think you're doing
above.
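
Going by that Tika configuration page, an exclusion of this kind might look like the sketch below. The two param names are the ones lala gave earlier in the thread; whether DIH's tikaConfig attribute picks this file up in Solr 7 has not been verified here, so treat it as a starting point only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parsers handle everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDFs go to this explicitly configured PDFParser instead -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```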

I'm not an expert on Tika configuration, but I think you should first
try this xml file with standalone Tika and see if it does what you think
it should. Once you're sure, then try it with DIH or SolrJ.

Cheers

Charlie