[jira] [Created] (TIKA-2461) Wordperfect file identified as



JIRA jira@apache.org
Johan van der Knijff created TIKA-2461:

             Summary: Wordperfect file identified as
                 Key: TIKA-2461
                 URL: https://issues.apache.org/jira/browse/TIKA-2461
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.16
         Environment: Linux Mint 17
            Reporter: Johan van der Knijff
            Priority: Minor

While running Tika 1.16 in detect mode over some legacy files from our repository system, I came across a file with a .wpd extension for which Tika reported the following MIME type:
application/x-quattro-pro; version=7-8

Opening the file in LibreOffice reveals that it is actually a WordPerfect document (I am not sure which version; the .WPD extension suggests WP 6 or later). I had a look at the Quattro Pro entry in tika-mimetypes.xml:

      <mime-type type="application/x-quattro-pro">
        <_comment>Quattro Pro - Corel Spreadsheet (part of WordPerfect Office suite)</_comment>
        <!-- qp2 and wb3 are currently detected by POIFSContainerDetector
            TODO: add detection for wb2 and wb1 -->
        <glob pattern="*.qpw"/>
        <glob pattern="*.wb1"/>
        <glob pattern="*.wb2"/>
        <glob pattern="*.wb3"/>

This suggests that the problem originates from POIFSContainerDetector.
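
One way to test this suspicion without running Tika is to look at the first bytes of the file. The sketch below is not Tika's detection logic; it only compares the start of the file against two well-known signatures: the OLE2 compound document header (D0 CF 11 E0 A1 B1 1A E1) and the WordPerfect header (FF "WPC"). The function names are illustrative. If the file starts with the OLE2 signature, that would be consistent with a container-based detector being involved.

```python
# Well-known file signatures (magic bytes) at offset 0.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound document
WPC_MAGIC = b"\xffWPC"                          # WordPerfect document


def sniff_header(header: bytes) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(WPC_MAGIC):
        return "wordperfect"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_header(fh.read(8))
```

Calling `sniff_file` on the .wpd file would show whether the WordPerfect data sits at the start of the file or is wrapped inside an OLE2 container.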

For legal reasons I cannot share the original file. However, I was able to create a derived file by truncating the original after 18 kB, and this derived file shows the same behaviour. The file is available at this link:

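For anyone who wants to prepare a similar shareable sample, a truncated copy can be made roughly as follows. This is a minimal sketch, not the exact procedure used for the file above: the function name is illustrative, and it assumes "18 kB" means 18 * 1024 bytes.

```python
def truncate_copy(src: str, dst: str, max_bytes: int = 18 * 1024) -> int:
    """Copy at most max_bytes from the start of src into dst.

    Returns the number of bytes written. The default of 18 * 1024
    is an assumption about what "18 kB" means here.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        data = fin.read(max_bytes)
        fout.write(data)
    return len(data)
```

Running the detector on both the original and the truncated copy confirms whether the sample still reproduces the misidentification.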

This message was sent by Atlassian JIRA