Solr Cell Input Parameter tika.config


Robertson, Eric J
Hello all,

I am currently trying to define a Tika config to use when posting a PDF to Solr Cell, as we may want to override the default Tika configuration depending on the type of document being ingested.

The docs list tika.config as an input parameter to the Solr Cell endpoint, but in my tests it does not seem to work or be acknowledged at all.

Does anyone have a working example using this input parameter?

I am running Solr 7.4.0 on Windows 7.

Thanks!

Re: Solr Cell Input Parameter tika.config

Yasufumi Mizoguchi
Hello,

I could not find any code that parses the tika.config parameter from the
Solr request.
Perhaps the tika.config parameter can only be defined in solrconfig.xml,
as follows.

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <str name="tika.config">tika-config.xml</str>
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
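
For reference, a custom tika-config.xml that the handler above could point to might look like the following. This is a minimal sketch and not taken from the thread: it assumes the Tika 1.x configuration format, and the choice of PDFParser with its extractInlineImages setting is only an illustration of overriding one parser's defaults.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative tika-config.xml; the parser choice and parameter are assumptions -->
<properties>
  <parsers>
    <!-- Keep Tika's default parsers for everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Configure the PDF parser separately with its own parameters -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```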

Thanks,
Yasufumi


Re: Solr Cell Input Parameter tika.config

Jan Høydahl / Cominvent
The tika.config param is documented here:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler

I notice that the code (https://github.com/apache/lucene-solr/blob/964cc88cee7d62edf03a923e3217809d630af5d5/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingRequestHandler.java#L65-L77) uses "new File(tikaConfigLoc)" to resolve the Tika config file, whereas it should probably load it through SolrResourceLoader to play nicely with ZooKeeper.
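
As far as the linked code shows, tika.config is read when the handler is initialized and not per request. One possible way to vary the Tika configuration by document type is therefore to register more than one handler, each pointing to its own file. The sketch below is untested; the handler name /update/extract-pdf and the file name tika-config-pdf.xml are hypothetical. Because the file is loaded with "new File(...)", it would presumably have to exist on the local disk of each Solr node, not only in ZooKeeper.

```xml
<!-- Hypothetical second handler for PDFs; the name and file name are assumptions -->
<requestHandler name="/update/extract-pdf"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <!-- Each handler instance gets its own Tika configuration file -->
  <str name="tika.config">tika-config-pdf.xml</str>
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

Clients would then choose the Tika configuration by posting each document to the matching handler path.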

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
