[jira] [Commented] (TIKA-2461) Wordperfect file identified as Quattro Pro document

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158539#comment-16158539 ]

Nick Burch commented on TIKA-2461:

Could you try running {{org.apache.poi.poifs.dev.POIFSLister}} against the full file, and then post the listing?

Looking at the header, the file uses the OLE2 container format. To update the detection logic, we'll need to know which entries it contains, so that we can find something that uniquely distinguishes it from a Quattro Pro file.
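
For anyone who wants to confirm the container type on their own files before running the lister, here is a minimal sketch of the header check. It assumes only the well-known 8-byte OLE2 signature (D0 CF 11 E0 A1 B1 1A E1). The class name {{Ole2HeaderCheck}} and its methods are made up for this illustration and are not part of Tika or POI. It only tells you that the file is an OLE2 container; it cannot tell WordPerfect from Quattro Pro, which is why the entry listing is still needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class Ole2HeaderCheck {
    // The 8-byte signature found at the start of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the given bytes start with the OLE2 signature.
    static boolean isOle2(byte[] data) {
        if (data == null || data.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of the file and checks them against the signature.
    static boolean isOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
            return read == header.length && isOle2(header);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            boolean ole2 = isOle2(Paths.get(arg));
            System.out.println(arg + ": " + (ole2 ? "OLE2 container" : "not OLE2"));
        }
    }
}
```

Running it with the path of the truncated sample as the only argument should be enough, since only the first 8 bytes are read.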

> Wordperfect file identified as Quattro Pro document
> ---------------------------------------------------
>                 Key: TIKA-2461
>                 URL: https://issues.apache.org/jira/browse/TIKA-2461
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.16
>         Environment: Linux Mint 17
>            Reporter: Johan van der Knijff
>            Priority: Minor
> While running Tika 1.16 in detect mode over some legacy files from our repository system, I came across one file with a .wpd extension for which Tika reported the following mimetype:
> {code}
> application/x-quattro-pro; version=7-8
> {code}
> Opening the file in LibreOffice reveals this is actually a WordPerfect document (not sure about which version; the .WPD extension suggests WP 6 or later). I had a look at the Quattro Pro entry in tika-mimetypes.xml:
> {code}
>       <mime-type type="application/x-quattro-pro">
>         <_comment>
>           Quattro Pro - Corel Spreadsheet (part of WordPerfect Office suite)
>         </_comment>
>         <!-- qp2 and wb3 are currently detected by POIFSContainerDetector
>             TODO: add detection for wb2 and wb1 -->
>         <glob pattern="*.qpw"/>
>         <glob pattern="*.wb1"/>
>         <glob pattern="*.wb2"/>
>         <glob pattern="*.wb3"/>
>       </mime-type>
> {code}
> This suggests that the problem originates from POIFSContainerDetector.
> For legal reasons I cannot share the original file. However, I was able to create a derived file by truncating the original after 18 kB, and this derived file shows the same behaviour. The file is available at this link:
> [tika-identified-as-quattro-pro-truncated.wpd|https://github.com/bitsgalore/shared/raw/master/tika-identified-as-quattro-pro-truncated.wpd]

This message was sent by Atlassian JIRA