[jira] [Updated] (TIKA-2972) Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2972:
------------------------------
    Description:
I'd like to add a tika-eval handler that calculates text stats at the end of parsing a document, so that the user gets a unified, simpler view of the number of tokens, the number of out-of-vocabulary tokens, etc. in the metadata rather than having to run their own post-parse process on the content.
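
To make the idea concrete, here is a minimal sketch of such a handler using only the JDK's SAX classes. The class name `TextStatsHandler`, the stat keys, and the plain `Map` standing in for Tika's `Metadata` are all hypothetical, and the tokenization is deliberately naive; the actual tika-eval calculations may look quite different.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: buffers text and computes simple stats at endDocument(). */
final class TextStatsHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Assumption: a caller-supplied list of known (lowercase) words.
    private final Set<String> vocabulary;
    // Assumption: a plain map stands in for Tika's Metadata object.
    private final Map<String, String> stats = new HashMap<>();

    TextStatsHandler(Set<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Accumulate all character data seen during the parse.
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        int tokens = 0;
        int outOfVocabulary = 0;
        // Naive tokenization: split on anything that is not a letter or digit.
        String lower = text.toString().toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            tokens++;
            if (!vocabulary.contains(token)) {
                outOfVocabulary++;
            }
        }
        // Hypothetical key names, chosen only for this illustration.
        stats.put("stats:num-tokens", Integer.toString(tokens));
        stats.put("stats:num-oov", Integer.toString(outOfVocabulary));
    }

    Map<String, String> getStats() {
        return stats;
    }
}
```

The point of doing the work in `endDocument()` is that the stats land alongside the rest of the parse output, so no separate post-parse pass over the content is needed.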

The problem comes with integrating this into tika-app and tika-server -- tika-app balloons to 134MB.  I don't want to nearly double the size of tika-app just so that I can add some stuff that very few folks will use.

I think we've discussed this option before, but it would be handy to allow users to specify a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so that users can get custom handling in tika-app and tika-server.

The idea of a map of ContentHandlerFactories would be to give each content handler factory a name, so that a user could call different handlers on tika-server like this:

-{{curl... http://localhost:9998/tika/custom/myhandler1}}-
-{{curl... http://localhost:9998/tika/custom/myhandler2}}-

That's not right, because we'd want to differentiate between classic Tika parsing (/tika) and the RecursiveParserWrapper (/rmeta)...

{{curl... http://localhost:9998/tika/myhandler1}}
{{curl... http://localhost:9998/tika/myhandler2}}

{{curl... http://localhost:9998/rmeta/myhandler1}}
{{curl... http://localhost:9998/rmeta/myhandler2}}

or in tika-app:

{{java -jar tika-app.jar --handlerFactory=myhandler1...}}
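
As a rough sketch of how the name lookup might work, the registry below maps a configured name to a factory and resolves it either from a `--handlerFactory` value or from a `/tika/<name>` or `/rmeta/<name>` path. Everything here is an assumption for illustration: `HandlerFactoryRegistry` is a hypothetical class, `Supplier<ContentHandler>` stands in for Tika's `ContentHandlerFactory`, and nothing is read from tika-config.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

import org.xml.sax.ContentHandler;

/** Hypothetical registry mapping configured names to handler factories. */
final class HandlerFactoryRegistry {
    // Assumption: Supplier<ContentHandler> stands in for Tika's ContentHandlerFactory.
    private final Map<String, Supplier<ContentHandler>> factories = new LinkedHashMap<>();

    void register(String name, Supplier<ContentHandler> factory) {
        factories.put(name, factory);
    }

    /** Creates a fresh handler for a name, e.g. the value of --handlerFactory. */
    ContentHandler newHandler(String name) {
        Supplier<ContentHandler> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown handler factory: " + name);
        }
        return factory.get();
    }

    /** Resolves paths of the form /tika/NAME or /rmeta/NAME. */
    ContentHandler forPath(String path) {
        // "/tika/myhandler1" splits into ["", "tika", "myhandler1"].
        String[] parts = path.split("/");
        boolean knownEndpoint = parts.length == 3
                && parts[0].isEmpty()
                && (parts[1].equals("tika") || parts[1].equals("rmeta"));
        if (!knownEndpoint) {
            throw new IllegalArgumentException("Unsupported path: " + path);
        }
        // The endpoint selects the parsing mode; the last segment selects the handler.
        return newHandler(parts[2]);
    }
}
```

Keeping the endpoint (`tika` vs. `rmeta`) separate from the handler name means the same named factory can be reused for both classic and recursive parsing.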

WDYT?

  was:
I'd like to add a tika-eval handler that will calculate text stats at the end of parsing a document so that the user  can get a unified/simpler view of number of tokens/ out of vocabulary, etc. in the metadata rather than having to run their own post-parse process on the content.

The problem comes with integrating this into tika-app and tika-server -- tika-app balloons to 134MB.  I don't want to nearly double the size of tika-app just so that I can add some stuff that very few folks will use.

I think we've discussed this option before, but it would be handy to allow users to specify a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so that users can get custom handling in tika-app and tika-server.

The idea of a map of ContentHandlerFactories, would be to have a name for each content handler factory, and a user could call different handlers on tika-server like this:

{{curl... http://localhost:9998/tika/custom/myhandler1}}
{{curl... http://localhost:9998/tika/custom/myhandler2}}

or in tika-app:

{{java -jar tika-app.jar --handlerFactory=myhandler1...}}

WDYT?


> Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-2972
>                 URL: https://issues.apache.org/jira/browse/TIKA-2972
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> I'd like to add a tika-eval handler that calculates text stats at the end of parsing a document, so that the user gets a unified, simpler view of the number of tokens, the number of out-of-vocabulary tokens, etc. in the metadata rather than having to run their own post-parse process on the content.
> The problem comes with integrating this into tika-app and tika-server -- tika-app balloons to 134MB.  I don't want to nearly double the size of tika-app just so that I can add some stuff that very few folks will use.
> I think we've discussed this option before, but it would be handy to allow users to specify a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so that users can get custom handling in tika-app and tika-server.
> The idea of a map of ContentHandlerFactories would be to give each content handler factory a name, so that a user could call different handlers on tika-server like this:
> -{{curl... http://localhost:9998/tika/custom/myhandler1}}-
> -{{curl... http://localhost:9998/tika/custom/myhandler2}}-
> That's not right, because we'd want to differentiate between classic Tika parsing (/tika) and the RecursiveParserWrapper (/rmeta)...
> {{curl... http://localhost:9998/tika/myhandler1}}
> {{curl... http://localhost:9998/tika/myhandler2}}
> {{curl... http://localhost:9998/rmeta/myhandler1}}
> {{curl... http://localhost:9998/rmeta/myhandler2}}
> or in tika-app:
> {{java -jar tika-app.jar --handlerFactory=myhandler1...}}
> WDYT?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)