[jira] [Commented] (TIKA-3348) Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..


Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315802#comment-17315802 ]

Tim Allison commented on TIKA-3348:
-----------------------------------

I realize we can improve the documentation, and I appreciate this issue!  

Separately, I'm trying to wrap my mind around the use case.

Do you want rendered images of pages, do you want "attached" images, or do you only want inline images (images that are rendered as part of the page)?

If you want only inline images, what do you do with PDFs that do crazy things with inline images, e.g. http://corpora.tika.apache.org/base/docs/govdocs1/905/905020.pdf ?

> Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3348
>                 URL: https://issues.apache.org/jira/browse/TIKA-3348
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.25
>            Reporter: Simon Lucy
>            Priority: Major
>             Fix For: 2.0
>
>
> There's a set of bumps in the road to navigate when extracting images from PDFs, retrieving them and managing the metadata using Tika Server.
> The first is knowing that /unpack will do the basic job and return the embedded objects in a zip file (presuming setExtractInlineImages is True). Documenting this clearly in the Tika Server wiki page would help people enormously.
> But processing those images after they've been extracted requires either inspecting them with another tool or calling /rmeta to return the MIME types and the rest of the metadata.
> This means that multiple passes have to be made over the same file, and the same extraction, identification and temporary storage steps are repeated on each pass.
> The server processes of /rmeta and /unpack need to be melded. The simplest may be to generate /rmeta metadata in the __META__ file object added to the returned zip file. A more complicated but perhaps more hypermedia way would be to use Content Negotiation and send an Accept: application/zip header in the /rmeta request.
> I've indicated a Fix version of 2.0 because it is, if not a breaking change, a considerable one.
> I'm available for Help Wanted, if that helps.
>  
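
For readers following along, the two-pass workflow described in the issue can be sketched roughly as below. This is only an illustration, not code from the Tika project. It assumes a Tika Server 1.x instance at http://localhost:9998 and that the X-Tika-PDFextractInlineImages request header is honoured by that server. The helper names (build_request, list_zip_entries, summarize_rmeta, two_pass) are made up for this example.

```python
import io
import json
import urllib.request
import zipfile

# Assumption: a local Tika Server listening on its usual default port.
TIKA_URL = "http://localhost:9998"


def build_request(endpoint, pdf_bytes, accept=None):
    """Build a PUT request for a Tika Server endpoint (hypothetical helper)."""
    headers = {
        "Content-Type": "application/pdf",
        # Assumption: this header switches on inline image extraction.
        "X-Tika-PDFextractInlineImages": "true",
    }
    if accept:
        headers["Accept"] = accept
    return urllib.request.Request(
        TIKA_URL + endpoint, data=pdf_bytes, headers=headers, method="PUT"
    )


def list_zip_entries(zip_bytes):
    """Map each entry name in the /unpack zip to its size in bytes."""
    if not zip_bytes:  # an empty body means nothing was embedded
        return {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: len(zf.read(name)) for name in zf.namelist()}


def summarize_rmeta(rmeta_json):
    """Return (embedded path, MIME type) pairs from /rmeta JSON output."""
    records = json.loads(rmeta_json)
    return [
        (r.get("X-TIKA:embedded_resource_path", "/"), r.get("Content-Type"))
        for r in records
    ]


def two_pass(pdf_bytes):
    """Send the same file twice: once to /unpack, once to /rmeta."""
    with urllib.request.urlopen(build_request("/unpack", pdf_bytes)) as resp:
        files = list_zip_entries(resp.read())
    req = build_request("/rmeta", pdf_bytes, accept="application/json")
    with urllib.request.urlopen(req) as resp:
        meta = summarize_rmeta(resp.read())
    return files, meta
```

The file is uploaded and parsed twice, and the caller still has to match zip entries against /rmeta records by name. That duplication is what the proposal to meld the two endpoints is meant to remove.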



--
This message was sent by Atlassian Jira
(v8.3.4#803005)