[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

David Pilato (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977615#comment-16977615 ]

Nick Burch commented on TIKA-2986:

How do we know which ones are a _must_ though? Many we expect, but for some formats we've had to guess, and some formats are actually more flexible than the spec suggests... That's why I thought a different detector mode might be cleaner / more obvious

> Edge case (?) in file type detection
> ------------------------------------
>                 Key: TIKA-2986
>                 URL: https://issues.apache.org/jira/browse/TIKA-2986
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
> I recently came across a file that was identified as an Acrobat fdf file.  The particular file was some kind of binary file with a ".fdf" extension, but not an Acrobat fdf.  
> Our current MimeTypes algorithm runs magic detection first and then tries the file extension.  If the file extension suggests a child mime type of what was found via magic, the child type is used.  The problem with this file was that the magic {{%FDF-}} was not found, so the magic step yielded {{application/octet-stream}}; the ".fdf" extension then selected {{application/vnd.fdf}}, because that type is a child of {{application/octet-stream}}.
> It feels like we might want to add a rule that if a mime definition has a defined magic and that magic is not found, we should not then fall back to the file extension. Or, is there a better way to prevent this from happening? Or, is this just an edge case that we should ignore?
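
To make the detection order from the issue description concrete, here is a minimal, self-contained sketch. It is not Tika's actual code: the class `DetectionSketch`, its methods, the one-entry "registry", and the `strictMagic` flag are all hypothetical names invented for illustration. With `strictMagic` off it models the behaviour reported above; with it on it models the proposed rule (or, loosely, a separate detector mode), assuming we can tell which types define magic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical toy model of "magic first, then extension"; not Tika's implementation. */
final class DetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String FDF = "application/vnd.fdf";
    private static final byte[] FDF_MAGIC = "%FDF-".getBytes(StandardCharsets.US_ASCII);

    /** Step 1: magic detection. Returns the root type when no magic matches. */
    static String detectByMagic(byte[] header) {
        return startsWith(header, FDF_MAGIC) ? FDF : OCTET_STREAM;
    }

    /** Step 2: extension lookup. Returns null when the extension is unknown. */
    static String detectByExtension(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".fdf") ? FDF : null;
    }

    /** In this toy registry, every type is a child of application/octet-stream. */
    static boolean isSpecialization(String candidate, String base) {
        return candidate.equals(base) || OCTET_STREAM.equals(base);
    }

    /** Whether the (toy) definition of this type declares magic bytes. */
    static boolean hasMagic(String type) {
        return FDF.equals(type);
    }

    /**
     * Combines both steps. With strictMagic == false the extension wins whenever
     * it names a child of the magic result. With strictMagic == true the extension
     * is ignored if its type declares magic that was not found in the data.
     */
    static String detect(byte[] header, String fileName, boolean strictMagic) {
        String byMagic = detectByMagic(header);
        String byName = detectByExtension(fileName);
        if (byName == null || !isSpecialization(byName, byMagic)) {
            return byMagic;
        }
        if (strictMagic && hasMagic(byName) && !byName.equals(byMagic)) {
            return byMagic; // magic was expected but missing: do not trust the extension
        }
        return byName;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, a binary file named `report.fdf` that lacks the `%FDF-` prefix comes back as `application/vnd.fdf` when `strictMagic` is `false` and as `application/octet-stream` when it is `true`. The sketch sidesteps the open question of which magics are truly mandatory: `hasMagic` simply treats any declared magic as required.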
