[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099848#comment-17099848 ]

Bob Paulin commented on TIKA-3094:

Hey [~tallison], I ran a build on Java 8 and Java 11 and was unable to reproduce #2. Can you provide more details, perhaps the stack trace you're getting?

For #3, I do get errors running the build, but I'm not sure which are expected and which are the ones with the wrong metadata. Can you provide an example?


Also, it might be helpful to split these out into separate JIRAs. This one is snowballing a bit.

> Apache Tika fails to extract text for pptx extension.
> -----------------------------------------------------
>                 Key: TIKA-3094
>                 URL: https://issues.apache.org/jira/browse/TIKA-3094
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24, 1.24.1
>            Reporter: Abhishek Chauhan
>            Assignee: Bob Paulin
>            Priority: Critical
>         Attachments: Sample PPT.pptx
> This is a regression from Apache Tika 1.23. Text extraction for .pptx files, which worked in 1.23, no longer works in 1.24.
> For the .ppt extension, it works fine in both 1.23 and 1.24.
> According to the release notes [https://tika.apache.org/1.24/index.html], POI was updated to 4.1.2, which might be the root cause of this problem. POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], which I guess is not present in the bundle.
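
One quick way to test the reporter's guess about the missing dependency is to check whether the SparseBitSet class is visible on the classpath of the affected application. The following is only a minimal sketch: the class name ClasspathCheck and the method isClassPresent are made up for illustration and are not part of Tika or POI. It assumes the SparseBitSet 1.2 artifact provides the class com.zaxxer.sparsebits.SparseBitSet, and it checks only the class loader it runs under, so inside an OSGi bundle the result may differ.

```java
// Hypothetical helper, not part of Tika or POI: reports whether a class
// can be loaded by the class loader that loaded this helper.
class ClasspathCheck {

    // Returns true if the named class is visible, false otherwise.
    // The class is located but not initialized (second argument is false).
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed fully qualified name of the class shipped in SparseBitSet 1.2.
        String name = "com.zaxxer.sparsebits.SparseBitSet";
        System.out.println(name + " present: " + isClassPresent(name));
    }
}
```

If this prints "present: false" when run with the same classpath as the failing extraction, that would be consistent with the missing-dependency theory, though it would not by itself confirm it as the root cause.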

This message was sent by Atlassian Jira