[jira] [Commented] (TIKA-2461) Wordperfect file identified as Quattro Pro document

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158632#comment-16158632 ]

Nick Burch commented on TIKA-2461:

Assuming you have the Tika App jar to hand, you can run POI's POIFSLister from it to list the entries inside the file's OLE2 container: {{java -classpath tika-app-1.16.jar org.apache.poi.poifs.dev.POIFSLister FullFile.wpd}}
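Before running POIFSLister, it can help to confirm which signature the file starts with. The following is a minimal sketch using only the Java standard library. The class name {{MagicCheck}} and its methods are hypothetical helpers, not part of Tika or POI, and this is not how Tika itself performs detection. It compares the first bytes against the OLE2 compound document signature (D0 CF 11 E0 A1 B1 1A E1) and the plain WordPerfect signature (FF followed by "WPC").

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika or POI.
public class MagicCheck {
    // OLE2 compound document signature (the container format POIFS reads).
    static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // Plain WordPerfect signature: 0xFF followed by the ASCII letters "WPC".
    static final byte[] WPC = {(byte) 0xFF, 'W', 'P', 'C'};

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classify a file by its leading bytes.
    static String classify(byte[] head) {
        if (startsWith(head, OLE2)) return "OLE2 container";
        if (startsWith(head, WPC)) return "WordPerfect";
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java MagicCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[8];
            // readNBytes requires Java 9 or later.
            int n = in.readNBytes(head, 0, head.length);
            System.out.println(file + ": " + classify(Arrays.copyOf(head, n)));
        }
    }
}
```

If the file reports as an OLE2 container, that would be consistent with a container-based detector handling it, and POIFSLister should then be able to show which entries it holds.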

> Wordperfect file identified as Quattro Pro document
> ---------------------------------------------------
>                 Key: TIKA-2461
>                 URL: https://issues.apache.org/jira/browse/TIKA-2461
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.16
>         Environment: Linux Mint 17
>            Reporter: Johan van der Knijff
>            Priority: Minor
> While running Tika 1.16 in detect mode over some legacy files from our repository system, I came across one file with a .wpd extension for which Tika reported the following MIME type:
> {code}
> application/x-quattro-pro; version=7-8
> {code}
> Opening the file in LibreOffice reveals this is actually a WordPerfect document (not sure about which version; the .WPD extension suggests WP 6 or later). I had a look at the Quattro Pro entry in tika-mimetypes.xml:
> {code}
>       <mime-type type="application/x-quattro-pro">
>         <_comment>
>           Quattro Pro - Corel Spreadsheet (part of WordPerfect Office suite)
>         </_comment>
>         <!-- qp2 and wb3 are currently detected by POIFSContainerDetector
>             TODO: add detection for wb2 and wb1 -->
>         <glob pattern="*.qpw"/>
>         <glob pattern="*.wb1"/>
>         <glob pattern="*.wb2"/>
>         <glob pattern="*.wb3"/>
>       </mime-type>
> {code}
> This suggests that the problem originates from POIFSContainerDetector.
> For legal reasons I cannot share the original file. However, I was able to create a derived file by truncating the original after 18 kB, and this derived file shows the same behaviour. The file is available at this link:
> [tika-identified-as-quattro-pro-truncated.wpd|https://github.com/bitsgalore/shared/raw/master/tika-identified-as-quattro-pro-truncated.wpd]
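The issue does not say how the truncated file was produced. As a rough sketch of one way to make such a derived sample, the following copies only the first N bytes of a file using the Java standard library. The class name {{TruncateFile}} is hypothetical, and reading "18 kB" as 18 * 1024 bytes is an assumption; the reporter may have cut at a different offset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not the reporter's actual method.
public class TruncateFile {
    // Copy at most maxBytes from the start of source into target.
    // Returns the number of bytes actually written.
    static long truncateCopy(Path source, Path target, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            // readNBytes(int) requires Java 11 or later; it returns fewer
            // bytes only when the source is shorter than maxBytes.
            byte[] head = in.readNBytes(maxBytes);
            out.write(head);
            return head.length;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java TruncateFile <source> <target>");
            return;
        }
        // Assumption: "18 kB" is taken to mean 18 * 1024 bytes.
        long written = truncateCopy(Paths.get(args[0]), Paths.get(args[1]), 18 * 1024);
        System.out.println("wrote " + written + " bytes");
    }
}
```

Because detection here depends on the start of the file, a truncated copy can still reproduce the result, as the reporter observed, although tools that need the complete container may not be able to read it.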
