Tika and Solr : rejected document due to mime type restrictions

biso
Hello.
I start the Tika server from the command line:
java -jar /opt/tika/tika-server-1.19.1.jar

Using ManifoldCF, I configured a connector to Solr.

When I start ingesting PDF and .xls documents, I see this in the Tika server output:

INFO  Setting the server's publish address to be http://localhost:9998/
INFO  Logging initialized @1053ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 10.0.2+13-Ubuntu-1ubuntu0.18.04.2
INFO  Started ServerConnector@f74e835{HTTP/1.1,[http/1.1]}{localhost:9998}
INFO  Started @1134ms
WARN  Empty contextPath
INFO  Started o.e.j.s.h.ContextHandler@68d6972f{/,null,AVAILABLE}
INFO  Started Apache Tika server at http://localhost:9998/
INFO  meta (application/pdf)
INFO  meta (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-Black'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'ArialMT'
WARN  Using fallback font 'LiberationSans' for 'CourierNewPSMT'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
INFO  tika (application/pdf)
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-Black'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'ArialMT'
WARN  Using fallback font 'LiberationSans' for 'CourierNewPSMT'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
INFO  tika (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)

So it seems that the Tika server processes the documents, but the Solr server doesn't ingest them.
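
One way to confirm that the Tika server extracts text on its own, independently of ManifoldCF, might be to send it a file directly. The following is only a minimal sketch in Python, assuming the server is listening on the http://localhost:9998 address shown in the log and using its /tika endpoint with a PUT request; the function names are hypothetical and chosen for illustration.

```python
import urllib.request

# Assumption: the publish address reported in the Tika server log.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request to the /tika endpoint, asking for plain text back."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send one file to a running Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text() on one of the PDF files and printing the first few hundred characters should show whether extraction works; if it does, the rejection is happening somewhere after Tika.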

I get these errors:
Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
Solr connector rejected document due to mime type restrictions: (application/pdf)

My understanding was that Tika converts all documents to text so that they can be indexed in Solr. Or are there restrictions on MIME types in the Tika server?

Thanks a lot

Mario

Re: Tika and Solr : rejected document due to mime type restrictions

Shawn Heisey-2
On 10/11/2018 9:06 AM, Bisonti Mario wrote:
> I startup tika server from command line:
> java -jar /opt/tika/tika-server-1.19.1.jar
>
> I configured, with ManifoldCF a connector to Solr.
>
> When I start the ingest of pdf and .xls document, I see in the tika server:
<snip>
> so it seems that tika server process the cocuments, but, Solr server doesn't ingest.
>
> I obtain the error:
> Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
> Solr connector rejected document due to mime type restrictions: (application/pdf)

Those errors are not coming from Solr.  Do you see any errors in
solr.log?  If you do, then we can help you with those.

Since ManifoldCF calls its components connectors, I am betting the
errors are being generated by ManifoldCF, and that for those documents,
nothing has actually been sent to Solr, so you won't see errors in the
solr.log for those files.  ManifoldCF is a separate project within
Apache, which has its own support infrastructure.

https://manifoldcf.apache.org/en_US/mail.html
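
When reporting this to the ManifoldCF list, it may help to summarize which MIME types were rejected and how often. A rough sketch in Python, assuming the rejection messages have been copied into a list of text lines in the format quoted above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the rejection message quoted in this thread and captures the MIME type.
REJECTION = re.compile(
    r"Solr connector rejected document due to mime type restrictions: \(([^)]*)\)"
)


def rejected_mime_types(lines):
    """Count how many times each MIME type appears in rejection messages."""
    counts = Counter()
    for line in lines:
        match = REJECTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Lines that do not match the message format are ignored, so the function can be fed a larger block of copied output.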

Thanks,
Shawn