[jira] Created: (TIKA-95) Pluggable magic header detectors


JIRA jira@apache.org
Pluggable magic header detectors
--------------------------------

                 Key: TIKA-95
                 URL: https://issues.apache.org/jira/browse/TIKA-95
             Project: Tika
          Issue Type: New Feature
            Reporter: Jukka Zitting
            Priority: Minor


Some file formats, such as MS Office files or specific XML schemas, don't have simple magic marker bytes that could be used to easily identify the type of the document. In many cases, however, it would be possible to detect such formats with more complex parsing logic.

Also, there are some external libraries (like Sanselan as mentioned in TIKA-92) that contain their own magic header rules. Instead of duplicating such rules in Tika, it would be better if Tika could just invoke the existing external functionality.

To support these cases, Tika should provide a mechanism to plug in custom magic header detector components in addition to the traditional configured magic patterns.
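
As a rough illustration of the kind of mechanism described above, here is a minimal sketch in Java. It is not Tika's API: HeaderDetector, MagicPatternDetector, XmlRootDetector, CompositeDetector and DetectorSketch are all hypothetical names invented for this example. It assumes a detector receives the leading bytes of a document and either returns a media type or declines, so that custom detectors and traditional magic patterns can sit in the same chain.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical plug-in point: inspects the leading bytes of a document. */
interface HeaderDetector {
    /** Returns a media type if recognized, or empty so the next detector can try. */
    Optional<String> detect(byte[] header);
}

/** Traditional configured magic pattern: fixed bytes at offset 0. */
final class MagicPatternDetector implements HeaderDetector {
    private final byte[] magic;
    private final String type;

    MagicPatternDetector(byte[] magic, String type) {
        this.magic = magic.clone();
        this.type = type;
    }

    @Override
    public Optional<String> detect(byte[] header) {
        if (header.length < magic.length) {
            return Optional.empty();
        }
        boolean matches = Arrays.equals(Arrays.copyOf(header, magic.length), magic);
        return matches ? Optional.of(type) : Optional.empty();
    }
}

/** Custom detector with parsing logic: compares the name of the first XML element. */
final class XmlRootDetector implements HeaderDetector {
    // Crude approximation: the first "<name" in the header is taken as the root
    // element; "<?xml" and "<!--" are skipped because '?' and '!' cannot start a name.
    private static final Pattern FIRST_ELEMENT = Pattern.compile("<([A-Za-z_][\\w.:-]*)");
    private final String rootElement;
    private final String type;

    XmlRootDetector(String rootElement, String type) {
        this.rootElement = rootElement;
        this.type = type;
    }

    @Override
    public Optional<String> detect(byte[] header) {
        Matcher m = FIRST_ELEMENT.matcher(new String(header, StandardCharsets.UTF_8));
        boolean matches = m.find() && m.group(1).equals(rootElement);
        return matches ? Optional.of(type) : Optional.empty();
    }
}

/** Tries each plugged-in detector in order and returns the first answer. */
final class CompositeDetector implements HeaderDetector {
    private final List<HeaderDetector> detectors;

    CompositeDetector(List<HeaderDetector> detectors) {
        this.detectors = detectors;
    }

    @Override
    public Optional<String> detect(byte[] header) {
        for (HeaderDetector detector : detectors) {
            Optional<String> result = detector.detect(header);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}

final class DetectorSketch {
    /** Example chain: one magic pattern plus one custom detector. */
    static HeaderDetector defaultDetector() {
        return new CompositeDetector(Arrays.asList(
                new MagicPatternDetector("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
                new XmlRootDetector("svg", "image/svg+xml")));
    }

    public static void main(String[] args) {
        byte[] header = "<?xml version=\"1.0\"?><svg width=\"1\"/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(defaultDetector().detect(header));
    }
}
```

In this sketch a wrapper around an external library (such as the one mentioned above) would simply be another HeaderDetector implementation added to the list; how such a library is actually invoked is left out because it depends on that library's own API.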

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-95) Pluggable magic header detectors

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-95:
----------------------------------

    Component/s: mime

> Pluggable magic header detectors
> --------------------------------
>
>                 Key: TIKA-95
>                 URL: https://issues.apache.org/jira/browse/TIKA-95
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> Some file formats, such as MS Office files or specific XML schemas, don't have simple magic marker bytes that could be used to easily identify the type of the document. In many cases, however, it would be possible to detect such formats with more complex parsing logic.
> Also, there are some external libraries (like Sanselan as mentioned in TIKA-92) that contain their own magic header rules. Instead of duplicating such rules in Tika, it would be better if Tika could just invoke the existing external functionality.
> To support these cases, Tika should provide a mechanism to plug in custom magic header detector components in addition to the traditional configured magic patterns.
