[jira] [Commented] (TIKA-2972) Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963956#comment-16963956 ]

Nick Burch commented on TIKA-2972:

It doesn't quite feel like a perfect solution, but I can't think of anything with fewer drawbacks!

My only suggestion is that we put most of the code in Tika Core, and provide some examples on the website showing how to use it to potentially simplify your Java code. I guess we'd provide, as a minimum, a method on `TikaConfig` to get all the factories? Possibly also one that takes a name and returns the matching factory, though I'm not sure whether that should have an implicit default, take an explicit default, return null, or throw an exception on an invalid name.
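
To make the lookup options concrete, here is a rough sketch of what such a registry might look like. Everything in it is an assumption for illustration: `HandlerFactory` is a stand-in for Tika's `ContentHandlerFactory` so the snippet compiles without Tika on the classpath, and `HandlerFactoryRegistry` and its method names are hypothetical, not existing `TikaConfig` API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;

// Hypothetical stand-in for Tika's ContentHandlerFactory (assumption, not the real interface).
interface HandlerFactory {
    ContentHandler getNewContentHandler();
}

// Hypothetical registry of named factories, e.g. as loaded from tika-config.xml.
final class HandlerFactoryRegistry {
    // LinkedHashMap keeps the factories in registration order.
    private final Map<String, HandlerFactory> factories = new LinkedHashMap<>();

    void register(String name, HandlerFactory factory) {
        factories.put(name, factory);
    }

    // "Get all the factories": a read-only view of the name-to-factory map.
    Map<String, HandlerFactory> getAll() {
        return Collections.unmodifiableMap(factories);
    }

    // Option 1: return null on an unknown name.
    HandlerFactory getOrNull(String name) {
        return factories.get(name);
    }

    // Option 2: the caller supplies an explicit default.
    HandlerFactory getOrDefault(String name, HandlerFactory fallback) {
        return factories.getOrDefault(name, fallback);
    }

    // Option 3: throw on an unknown name.
    HandlerFactory getOrThrow(String name) {
        HandlerFactory factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown handler factory: " + name);
        }
        return factory;
    }
}
```

The three lookup methods correspond to the explicit-default, return-null and throw-exception choices above; an implicit default would just be `getOrDefault` with a built-in fallback.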

> Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml
> -------------------------------------------------------------------------------
>                 Key: TIKA-2972
>                 URL: https://issues.apache.org/jira/browse/TIKA-2972
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
> I'd like to add a tika-eval handler that will calculate text stats at the end of parsing a document so that the user can get a unified/simpler view of the number of tokens, out-of-vocabulary statistics, etc. in the metadata rather than having to run their own post-parse process on the content.
> The problem comes with integrating this into tika-app and tika-server -- tika-app balloons to 134MB.  I don't want to nearly double the size of tika-app just so that I can add some stuff that very few folks will use.
> I think we've discussed this option before, but it would be handy to allow users to specify a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so that users can get custom handling in tika-app and tika-server.
> The idea of a map of ContentHandlerFactories would be to have a name for each content handler factory, so that a user could call different handlers on tika-server like this:
> -{{curl... http://localhost:9998/tika/custom/myhandler1}}-
> -{{curl... http://localhost:9998/tika/custom/myhandler2}}-
> That's not right because we'd want to differentiate classic Tika parsing and the RecursiveParserWrapper...
> {{curl... http://localhost:9998/tika/myhandler1}}
> {{curl... http://localhost:9998/tika/myhandler2}}
> {{curl... http://localhost:9998/rmeta/myhandler1}}
> {{curl... http://localhost:9998/rmeta/myhandler2}}
> or in tika-app:
> {{java -jar tika-app.jar --handlerFactory=myhandler1...}}
