[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.


Mihir Sharma (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123914#comment-17123914 ]

Hudson commented on TIKA-3094:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #339 (See [https://builds.apache.org/job/tika-branch-1x/339/])
TIKA-3094 add ignored unit test that runs the bundle against all of the (tallison: [https://github.com/apache/tika/commit/f6b07702895af9c12a9c5f91a20db50d506a8bbd])
* (edit) tika-bundle/pom.xml
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
TIKA-3094 -- new metadata for every parse :( (tallison: [https://github.com/apache/tika/commit/098256bd8eaba266f959c3478c7c9812dbf6e114])
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
TIKA-3094: add javax.xml.bind to system packages.  Fix java 11 jaxb. (tallison: [https://github.com/apache/tika/commit/b7c5d2ed1d43430dd29d25f1d7e8954ba48bb46d])
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* (edit) tika-bundle/test-bundles.xml
* (edit) tika-bundle/pom.xml


> Apache Tika fails to extract text for pptx extension.
> -----------------------------------------------------
>
>                 Key: TIKA-3094
>                 URL: https://issues.apache.org/jira/browse/TIKA-3094
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24, 1.24.1
>            Reporter: Abhishek Chauhan
>            Assignee: Bob Paulin
>            Priority: Critical
>         Attachments: Sample PPT.pptx
>
>
> This is a regression from Apache Tika 1.23. Text extraction for the .pptx extension, which worked with Apache Tika 1.23, no longer works in 1.24.
> For the .ppt extension, extraction works fine in both 1.23 and 1.24.
>
> According to the release notes [https://tika.apache.org/1.24/index.html], POI was updated to 4.1.2, which might be the root cause of this problem. POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], which I guess is not present in the bundle.
>  
>  
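
One way to test the missing-dependency guess above is to check whether the suspected classes are visible to a given class loader. The following is only a minimal sketch, not code from the Tika project: the helper name ClassPresenceCheck is hypothetical, and the two class names are assumptions taken from the SparseBitSet artifact and the javax.xml.bind package mentioned in this thread. Inside an OSGi container the result depends on which bundle's class loader is passed in, so a plain classpath run only describes the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): reports which classes a loader cannot see.
final class ClassPresenceCheck {

    /** Returns true if the named class can be located by the given class loader. */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the subset of classNames that the loader cannot locate, in input order. */
    static List<String> missing(List<String> classNames, ClassLoader loader) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isLoadable(name, loader)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed class names for the dependencies discussed in this issue.
        List<String> candidates = Arrays.asList(
                "com.zaxxer.sparsebits.SparseBitSet",
                "javax.xml.bind.JAXBContext");
        ClassLoader loader = ClassPresenceCheck.class.getClassLoader();
        for (String name : missing(candidates, loader)) {
            System.out.println("Not visible to this class loader: " + name);
        }
    }
}
```

Running the same check from code deployed inside the bundle would show what the bundle itself can see, which is the case that matters for this report.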



--
This message was sent by Atlassian Jira
(v8.3.4#803005)