[jira] Created: (TIKA-100) Structured PDF parsing

Tim Allison (Jira)
Structured PDF parsing

                 Key: TIKA-100
                 URL: https://issues.apache.org/jira/browse/TIKA-100
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting
            Priority: Minor

The PDF parser currently extracts the document content and outputs it as a single string. PDFBox could be used to structure the output at least down to the page level, and possibly down to the paragraph level (I am not sure how accurate paragraph detection would be).
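
As a rough illustration of what page- and paragraph-level output might look like, here is a minimal sketch in plain Java. It is not the Tika or PDFBox implementation. It assumes the text of each page has already been extracted separately (for example with PDFBox's PDFTextStripper restricted to one page at a time via setStartPage/setEndPage), and it treats blank lines as paragraph boundaries, which is only a heuristic. The class name StructuredPdfSketch, its methods, and the page/paragraph markup are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: wraps per-page text in page and paragraph elements.
final class StructuredPdfSketch {

    // Splits one page's text into paragraphs. Assumption: paragraphs are
    // separated by one or more blank lines; real PDF text may not follow this.
    static List<String> splitParagraphs(String pageText) {
        List<String> paragraphs = new ArrayList<>();
        String unixText = pageText.replace("\r\n", "\n");
        for (String block : unixText.split("\\n(?:[ \\t]*\\n)+")) {
            // Collapse line breaks and runs of whitespace inside a paragraph.
            String normalized = block.trim().replaceAll("\\s+", " ");
            if (!normalized.isEmpty()) {
                paragraphs.add(normalized);
            }
        }
        return paragraphs;
    }

    // Emits one <div class="page"> per page and one <p> per paragraph.
    static String toXhtml(List<String> pageTexts) {
        StringBuilder out = new StringBuilder();
        for (String pageText : pageTexts) {
            out.append("<div class=\"page\">");
            for (String paragraph : splitParagraphs(pageText)) {
                out.append("<p>").append(escape(paragraph)).append("</p>");
            }
            out.append("</div>");
        }
        return out.toString();
    }

    // Escapes the characters that are significant in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Keeping the page boundary as an explicit element means a consumer can still flatten the output to a single string, while the paragraph split can be refined or dropped independently if it turns out to be too inaccurate.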

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.