[jira] [Commented] (TIKA-3348) Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315803#comment-17315803 ]

Tim Allison commented on TIKA-3348:
-----------------------------------

And do you actually need the metadata for inline images if you also have the raw bytes with an appropriate file suffix (e.g. .png)?



> Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3348
>                 URL: https://issues.apache.org/jira/browse/TIKA-3348
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.25
>            Reporter: Simon Lucy
>            Priority: Major
>             Fix For: 2.0
>
>
> There's a set of bumps in the road to navigate when extracting images from PDFs, retrieving them and managing the metadata using Tika Server.
> The first is knowing that /unpack will do the basic job and return the embedded objects in a zip file (presuming setExtractInlineImages is True). Documenting this clearly in the Tika Server wiki page would help people enormously.
> But processing those images after they've been extracted will need either inspection with another tool or a call to /rmeta to return the mime types and the rest of the metadata.
> This means that multiple passes need to be made over the same file, and the same steps of extraction, identification and temporary storage are repeated on each pass.
> The server processes of /rmeta and /unpack need to be melded. The simplest may be to generate /rmeta metadata in the __META__ file object added to the returned zip file. A more complicated but perhaps more hypermedia way would be to use Content Negotiation and send an "Accept: application/zip" header in the /rmeta request.
> I've indicated a Fix version of 2.0 because it is, if not a breaking change, a considerable one.
> I'm available for Help Wanted, if that helps.
>  
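For readers unfamiliar with the two-pass workflow the description complains about, here is a minimal client-side sketch in Python. It is an illustration, not anything taken from the issue: the server address, the helper names (tika_put, read_zip_entries, content_types, two_pass) and the use of the X-Tika-PDFextractInlineImages request header (the Tika Server 1.x way of switching on inline image extraction) are all assumptions. It sends the same file to /unpack and then to /rmeta, which is exactly the duplicated work the reporter wants to avoid.

```python
import io
import json
import urllib.request
import zipfile

# Assumption: a Tika Server 1.x instance listening on its default port.
TIKA_URL = "http://localhost:9998"


def tika_put(endpoint, data, headers=None):
    """PUT raw file bytes to a Tika Server endpoint and return the response body."""
    req = urllib.request.Request(
        TIKA_URL + endpoint, data=data, method="PUT", headers=headers or {}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def read_zip_entries(zip_bytes):
    """Return {entry name: raw bytes} for every file in a zip archive.

    An empty body (which the server may send when nothing is embedded)
    yields an empty dict instead of raising BadZipFile.
    """
    if not zip_bytes:
        return {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def content_types(rmeta_json):
    """Return the Content-Type of each metadata object in an /rmeta response."""
    return [meta.get("Content-Type") for meta in json.loads(rmeta_json)]


def two_pass(pdf_bytes):
    """Hypothetical helper: parse the same PDF twice, once per endpoint."""
    # Assumption: this header enables inline image extraction in Tika Server 1.x.
    hdr = {"X-Tika-PDFextractInlineImages": "true"}
    images = read_zip_entries(tika_put("/unpack", pdf_bytes, hdr))  # pass 1: bytes
    types = content_types(
        tika_put("/rmeta", pdf_bytes, dict(hdr, Accept="application/json"))
    )  # pass 2: metadata
    return images, types
```

Calling two_pass(open("document.pdf", "rb").read()) against a running server would return the extracted files and the list of mime types, but nothing in the two responses is guaranteed to line up entry by entry, which is the gap the proposed merge of /rmeta and /unpack is meant to close.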



--
This message was sent by Atlassian Jira
(v8.3.4#803005)