Help extracting text from PDF images when indexing files
I'm new to Solr. I recently downloaded Solr 8.0.0 and have been
following the tutorials. Using the two example instances they create, I'm
trying to set up my own collection. I made a copy of the _default configset
and used it to create the collection.
In my case, the files I want to index are PDF files composed of images. I
have Tesseract installed, and I can parse the PDF files correctly using a
Tika server instance I downloaded, i.e. I can get the extracted text from
I'm following the instructions on the page "Uploading Data with Solr Cell
Using Apache Tika" to properly configure PDF image extraction, but I have
not been able to get it working. My aim is for the content of the PDF file
to go into a field named content that I've created in my schema. In my
attempts so far, this field is either missing or, when it exists, does not
contain the expected text from the parsed images.
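
As an illustration only (not the exact command used here), a Solr Cell
request that asks for the extracted body to land in the content field can be
sketched as below. The host, the collection name mycollection, the document
id and both helper function names are placeholders I made up for the
example. literal.id, fmap.content and commit are standard Solr Cell request
parameters, and the sketch assumes the handler is registered at
/update/extract.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, collection, doc_id):
    """Build a Solr Cell /update/extract URL for one document (hypothetical helper)."""
    params = {
        "literal.id": doc_id,       # unique key assigned to the indexed document
        "fmap.content": "content",  # route Tika's extracted body into the "content" field
        "commit": "true",           # commit immediately so the document is searchable
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


def post_pdf(url, pdf_path):
    """POST the raw PDF bytes to the extract handler and return Solr's response body."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    req = Request(url, data=body, headers={"Content-Type": "application/pdf"})
    with urlopen(req) as resp:  # needs a running Solr instance
        return resp.read().decode("utf-8")
```

Calling build_extract_url("http://localhost:8983/solr", "mycollection", "doc1")
gives the URL to post to; post_pdf then sends the file. A request-level
fmap.content should take precedence over any fmap.content default set for
the handler in solrconfig.xml, which makes it a quick way to check where the
extracted text ends up.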
For the ExtractingRequestHandler configuration, the lib clauses are
present in my solrconfig.xml. That section is shown below: