Solr 7.7: Using Tika in Production

Solr 7.7: Using Tika in Production

Dustin Lebsock

Hi!

 

First off, thank you for the help!

 

I’m currently running SolrCloud based off the helm chart found here: https://github.com/helm/charts/tree/master/incubator/solr

 

Everything works great, but I’d now like to use Tika to start indexing PDFs as well. The documentation recommends against using Solr Cell in a production environment: https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html#solr-cell-performance-implications

 

So I have been trying to figure out how to run Tika as a separate service for extracting file contents, and came up with an idea. I could scale up the number of Solr pods and point a dedicated service at specific Solr pods that host no shards and are used only for content extraction. That way, if content extraction goes wrong, it doesn’t matter if the pod crashes. These nodes would still be connected to the same ZooKeeper as the rest of the cluster, so they could index the file into the correct collection immediately after extraction. I’m not sure if this is how SolrCloud works, though.

 

If I send an extract-and-index request to a pod that doesn’t host the specified collection, is the content extracted before being sent to the correct pod for indexing? Or is the request sent to a pod with the collection and then extracted? If it’s the latter, do you have any advice?

 

Thanks for the help!

 

Dustin Pilkington

Associate Software Engineer

[hidden email]

 

 


Re: Solr 7.7: Using Tika in Production

Erick Erickson
I doubt that’d work. When Solr gets an update, it forwards the document to the leader of the shard it’s going to eventually reside on. Among other things, the Solr node hosting no replicas would need to go to ZK and pull down the config you’ve created for Tika to know what to do. There’s no technical reason this couldn’t be done, but I’m 99.9% certain nobody has, especially since running Tika inside Solr is intended for PoC purposes rather than production.
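
To see which nodes currently lead each shard, and which live nodes host no replicas of a collection at all, the Collections API CLUSTERSTATUS response can be inspected. The following is only a rough sketch: the collection name `docs`, the node names, and the hand-written sample response are all hypothetical, and the field layout is assumed to match what a 7.x cluster returns.

```python
from urllib.parse import urlencode


def clusterstatus_url(solr_base, collection):
    """Build a Collections API CLUSTERSTATUS URL for one collection."""
    query = urlencode({"action": "CLUSTERSTATUS", "collection": collection, "wt": "json"})
    return f"{solr_base.rstrip('/')}/admin/collections?{query}"


def shard_leaders(status, collection):
    """Map each shard name to the node_name of its leader replica."""
    shards = status["cluster"]["collections"][collection]["shards"]
    leaders = {}
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            # The leader flag is reported as the string "true".
            if str(replica.get("leader", "false")).lower() == "true":
                leaders[shard_name] = replica["node_name"]
    return leaders


def nodes_without_replicas(status, collection):
    """Return live nodes that host no replica of the given collection."""
    shards = status["cluster"]["collections"][collection]["shards"]
    hosting = {
        replica["node_name"]
        for shard in shards.values()
        for replica in shard["replicas"].values()
    }
    return sorted(set(status["cluster"]["live_nodes"]) - hosting)


# Hand-written sample (hypothetical names), trimmed to the fields used above.
SAMPLE_STATUS = {
    "cluster": {
        "live_nodes": ["solr-0:8983_solr", "solr-1:8983_solr", "solr-2:8983_solr"],
        "collections": {
            "docs": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"node_name": "solr-0:8983_solr", "leader": "true"},
                        "core_node2": {"node_name": "solr-1:8983_solr"},
                    }},
                    "shard2": {"replicas": {
                        "core_node3": {"node_name": "solr-1:8983_solr", "leader": "true"},
                        "core_node4": {"node_name": "solr-0:8983_solr"},
                    }},
                }
            }
        },
    }
}

if __name__ == "__main__":
    print(clusterstatus_url("http://solr:8983/solr", "docs"))
    print(shard_leaders(SAMPLE_STATUS, "docs"))
    print(nodes_without_replicas(SAMPLE_STATUS, "docs"))
```

In this sample, `solr-2` is the kind of replica-free node described in the question: it is live in the cluster but leads nothing, so any document sent to it would still have to be forwarded to a shard leader elsewhere.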

The article you linked to has some SolrJ code, which is usually a better idea; alternatively, run Tika in server mode.
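
As a rough illustration of the server-mode option (not the SolrJ code from the linked article), a small client can send the file to a standalone Tika server for extraction and then post the extracted text to Solr as an ordinary JSON update. This sketch assumes a Tika server reachable at the hypothetical host `tika` on its default port 9998, a hypothetical collection named `docs`, and a `content_txt` field that matches a `*_txt` dynamic field in the schema; adjust all of these to your setup.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical service names; 9998 is Tika server's default port.
TIKA_URL = "http://tika:9998/tika"
SOLR_UPDATE_URL = "http://solr:8983/solr/docs/update"


def build_tika_request(data, tika_url=TIKA_URL):
    """PUT the raw file bytes to Tika server and ask for plain text back."""
    return Request(tika_url, data=data, method="PUT",
                   headers={"Accept": "text/plain"})


def build_solr_request(doc_id, text, solr_update_url=SOLR_UPDATE_URL,
                       field="content_txt"):
    """POST one document as a JSON array to Solr's update handler."""
    body = json.dumps([{"id": doc_id, field: text}]).encode("utf-8")
    return Request(solr_update_url + "?commitWithin=10000", data=body,
                   method="POST",
                   headers={"Content-Type": "application/json"})


def extract_and_index(path, doc_id, timeout=60):
    """Extract text with Tika, then index it; returns Solr's HTTP status."""
    with open(path, "rb") as fh:
        data = fh.read()
    # Extraction happens outside Solr, so a bad PDF can only hurt Tika.
    with urlopen(build_tika_request(data), timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urlopen(build_solr_request(doc_id, text), timeout=timeout) as resp:
        return resp.status
```

With this split, a file that crashes or hangs the parser takes down only the Tika process, and Solr only ever receives plain text. The update can be sent to any Solr node, which routes it to the right shard leader.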

Best,
Erick

> On Jan 28, 2020, at 5:02 PM, Dustin Lebsock <[hidden email]> wrote:
>