[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.


Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099475#comment-17099475 ]

Tim Allison edited comment on TIKA-3094 at 5/5/20, 1:26 AM:
------------------------------------------------------------

Thank you [~bob]!

For kicks, I ran the osgi'd Tika against all of our test files and found a few more issues.  I left in the Ignored unit test so that you can see what I'm saying.

1) jdom2 is needed by the rss parser (I fixed this in master)
2) java.lang.ClassNotFoundException: javax.xml.bind.JAXBException not found by org.apache.tika.bundle [19] ...can't figure out how to fix this
3) We're left with several exceptions caused by adding the wrong type of metadata, and we aren't seeing those with regular Tika.  I can't figure out why we're getting these in OSGi but not in regular Tika.

On 2), I tried a bunch of variants of the package that should bring that in, but had no luck.
On 3), I'll look more closely tomorrow to try to figure out what's going on.
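
One way to narrow down issues like 1) and 2) is to ask a given class loader directly whether it can see the classes in question. The following is only a minimal sketch, not part of Tika: the `ClassPresenceCheck` class and its `check` method are hypothetical helpers, and apart from `javax.xml.bind.JAXBException` (taken from the exception above) the probed class names are my assumptions about what jdom2 and SparseBitSet provide. Inside an OSGi container you would pass the class loader of a class from the bundle under test rather than the application class loader used in `main`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClassPresenceCheck {

    // Reports, for each class name, whether the given loader can resolve it.
    // The class is loaded without initialization, so static initializers do not run.
    static Map<String, Boolean> check(ClassLoader loader, List<String> classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            try {
                Class.forName(name, false, loader);
                result.put(name, true);
            } catch (ClassNotFoundException | LinkageError e) {
                // Not visible to this loader, or visible but not linkable.
                result.put(name, false);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "javax.xml.bind.JAXBException",        // from the stack trace in 2)
                "org.jdom2.Document",                  // assumed jdom2 class name
                "com.zaxxer.sparsebits.SparseBitSet"); // assumed SparseBitSet class name
        check(ClassPresenceCheck.class.getClassLoader(), names)
                .forEach((name, ok) ->
                        System.out.println((ok ? "found    " : "MISSING  ") + name));
    }
}
```

Run outside OSGi, this only tells you what is on the plain classpath; the interesting comparison is the same check run with the bundle's own class loader.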


was (Author: [hidden email]):
Thank you [~bob]!

For kicks, I ran the osgi'd Tika against all of our test files and found a few more issues.

1) jdom2 is needed by the rss parser
2) java.lang.ClassNotFoundException: javax.xml.bind.JAXBException not found by org.apache.tika.bundle [19] ...can't figure out how to fix this
3) We're left with several exceptions caused by adding the wrong type of metadata, and we aren't seeing those with regular Tika.  I can't figure out why we're getting these in OSGi but not in regular Tika.

On 2), I tried a bunch of variants of the package that should bring that in, but had no luck.
On 3), I'll look more closely tomorrow to try to figure out what's going on.

> Apache Tika fails to extract text for pptx extension.
> -----------------------------------------------------
>
>                 Key: TIKA-3094
>                 URL: https://issues.apache.org/jira/browse/TIKA-3094
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24, 1.24.1
>            Reporter: Abhishek Chauhan
>            Assignee: Bob Paulin
>            Priority: Critical
>         Attachments: Sample PPT.pptx
>
>
> This is a regression from version 1.23 of Apache Tika. Text extraction for the .pptx extension, which worked with Apache Tika 1.23, no longer works in version 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>  
> According to the release notes [https://tika.apache.org/1.24/index.html], POI was updated to 4.1.2, which might be the root cause of this problem. POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], which I guess is not present in the bundle.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)