[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2986:
------------------------------
Description:
One of my colleagues, Philip Southam, recently came across a file that was identified as an Acrobat fdf file. The particular file was some kind of binary file with a ".fdf" extension, but not an Acrobat fdf.
Our current MimeTypes algorithm runs magic detection first, and then it consults the file extension. If the file extension suggests a child mime type of what was found via magic, the child type is used. The problem with this file was that the magic {{%FDF-}} was not found, so the magic step yielded {{application/octet-stream}}, and then the type for the ".fdf" extension was selected because {{application/vnd.fdf}} is a child of {{application/octet-stream}}.
It feels like we might want to add a rule: if a mime definition has a defined magic and that magic is not found, we should not then fall back to the file extension. Or is there a better way to prevent this from happening? Or is this just an edge case that we should ignore?
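To make the behavior and the proposed rule concrete, here is a minimal, self-contained sketch. It is a toy model and not Tika's actual implementation: {{DetectionSketch}}, {{TypeDef}}, the {{strictMagic}} flag, and the {{application/x-example}} type are all hypothetical names made up for illustration, and the sketch assumes magic is matched only at the start of the file and that the hierarchy is one level deep.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model only: these names are hypothetical and are not Tika's API.
final class DetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Simplified mime definition: name, parent type, optional leading magic. */
    static final class TypeDef {
        final String name;
        final String parent;
        final byte[] magic; // null when the definition has no magic

        TypeDef(String name, String parent, String magic) {
            this.name = name;
            this.parent = parent;
            this.magic = magic == null ? null : magic.getBytes(StandardCharsets.US_ASCII);
        }
    }

    static final Map<String, TypeDef> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put(".fdf", new TypeDef("application/vnd.fdf", OCTET_STREAM, "%FDF-"));
        // Hypothetical type without a magic: the extension is the only evidence.
        BY_EXTENSION.put(".example", new TypeDef("application/x-example", OCTET_STREAM, null));
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Step 1: magic detection; falls back to the root type when nothing matches. */
    static String detectByMagic(byte[] head) {
        for (TypeDef t : BY_EXTENSION.values()) {
            if (t.magic != null && startsWith(head, t.magic)) return t.name;
        }
        return OCTET_STREAM;
    }

    /** Step 2: let the extension refine the magic result, optionally with the proposed rule. */
    static String detect(byte[] head, String fileName, boolean strictMagic) {
        String byMagic = detectByMagic(head);
        int dot = fileName.lastIndexOf('.');
        TypeDef byExt = dot < 0 ? null
                : BY_EXTENSION.get(fileName.substring(dot).toLowerCase(Locale.ROOT));
        // The extension is only considered when it names a child of the magic result.
        if (byExt == null || !byExt.parent.equals(byMagic)) return byMagic;
        // Proposed rule: the type defines a magic and it did not match, so distrust the extension.
        if (strictMagic && byExt.magic != null) return byMagic;
        return byExt.name;
    }

    public static void main(String[] args) {
        byte[] junk = {0x00, 0x01, 0x02, 0x03, 0x04};
        byte[] fdf = "%FDF-1.2".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(junk, "mystery.fdf", false));  // current behavior
        System.out.println(detect(junk, "mystery.fdf", true));   // with the proposed rule
        System.out.println(detect(fdf, "form.fdf", true));       // real magic still wins
        System.out.println(detect(junk, "data.example", true));  // no magic defined, extension still used
    }
}
```

In this sketch the proposed rule only changes the outcome for types that define a magic. Types without one can still be selected by extension, which is probably the case we would want to keep.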
was:
I recently came across a file that was identified as an Acrobat fdf file. The particular file was some kind of binary file with a ".fdf" extension, but not an Acrobat fdf.
Our current MimeTypes algorithm runs magic detection first, and then it consults the file extension. If the file extension suggests a child mime type of what was found via magic, the child type is used. The problem with this file was that the magic {{%FDF-}} was not found, so the magic step yielded {{application/octet-stream}}, and then the type for the ".fdf" extension was selected because {{application/vnd.fdf}} is a child of {{application/octet-stream}}.
It feels like we might want to add a rule: if a mime definition has a defined magic and that magic is not found, we should not then fall back to the file extension. Or is there a better way to prevent this from happening? Or is this just an edge case that we should ignore?
> Edge case (?) in file type detection
> ------------------------------------
>
> Key: TIKA-2986
> URL: https://issues.apache.org/jira/browse/TIKA-2986
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial