[jira] [Commented] (TIKA-3026) Consider extracting structure/tags where possible in PDFs with the PDFMarkedContentExtractor


Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036361#comment-17036361 ]

Tim Allison commented on TIKA-3026:
-----------------------------------

I pushed an initial draft to master and branch_1x.  Let me know what you think.

I noticed some oddities in the IRS file we have like {{<p> <p></p></p>}}, but I _think_ this is good enough as an alpha feature.
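
One way to spot oddities like the nested paragraphs above is to run the extracted XHTML through a small SAX check. The following is only a minimal sketch, not part of the committed change: the class name {{NestedParagraphCheck}}, the method {{countNestedParagraphs}} and the sample string are hypothetical, and the sample is an illustration rather than actual output from the IRS file. It assumes the extractor's output is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts <p> elements that open inside another open <p>.
public class NestedParagraphCheck {

    public static int countNestedParagraphs(String xhtml) throws Exception {
        // state[0] = current <p> nesting depth, state[1] = nested <p> count
        final int[] state = {0, 0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("p".equals(qName)) {
                    if (state[0] > 0) {
                        state[1]++; // a <p> opened while another <p> is still open
                    }
                    state[0]++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(qName)) {
                    state[0]--;
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the raw tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return state[1];
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input only, modelled on the pattern quoted above.
        String sample = "<body><p> <p></p></p><p>ok</p></body>";
        System.out.println(countNestedParagraphs(sample));
    }
}
```

A non-zero count would flag a document whose tag structure does not map cleanly onto XHTML paragraphs, which seems worth knowing while the feature is still alpha.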

> Consider extracting structure/tags where possible in PDFs with the PDFMarkedContentExtractor
> --------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3026
>                 URL: https://issues.apache.org/jira/browse/TIKA-3026
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Some PDFs contain tags that _may_ be useful in understanding the structure of the elements within a PDF, e.g. table markup, paragraph breaks, headers, etc.
>
> The quality of the tags depends entirely on the software and human generating the PDF.  There are no guarantees.  Nevertheless, it might be useful in some cases for users to be able to extract content with structure tags.
>  
> Some references:
> [https://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care/]
> [https://www.adobe.com/accessibility/products/acrobat/pdf-repair-add-tags.html]
> [https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/]


