[jira] Created: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)
Mime type detection fails with upper case file extensions such as "PDF".
------------------------------------------------------------------------

                 Key: TIKA-56
                 URL: https://issues.apache.org/jira/browse/TIKA-56
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 0.1-incubator
            Reporter: Keith R. Bennett
            Priority: Critical
             Fix For: 0.1-incubator


Mime type detection only seems to work when the file extension is lower case.  Both PDF and DOC extensions failed.

To test this, add the following method to TestParsers:

    public void testGetParsers() throws TikaException, MalformedURLException {
        assertNotNull(ParseUtils.getParser(new URL("file:x.pdf"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.PDF"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.doc"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.DOC"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.txt"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.TXT"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.html"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.HTML"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.HtMl"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.htm"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.HTM"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.ppt"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.PPT"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.xls"), tc));
        assertNotNull(ParseUtils.getParser(new URL("file:x.XLS"), tc));
        // more?
    }


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534157 ]

Chris A. Mattmann commented on TIKA-56:
---------------------------------------

Hi Keith:

I'm not necessarily sure that this is a bug, no? If you're doing mime detection with magic turned off, and it has to use the file extension, is it ever the case (no pun intended ;) ) that the "case" of the file extension matters? If so, then I would suggest we not change the mime system to be case insensitive.

Know of any cases where this is true?

[jira] Commented: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534160 ]

Keith R. Bennett commented on TIKA-56:
--------------------------------------

Chris -

I don't know of any such cases, but then we've reached the limits of my knowledge of MIME types. ;)

However, if we have a utility that determines the MIME type from an extension, my sense is that it is reasonable to make the extension comparisons case insensitive.  Especially in the Windows world, there are huge numbers of files out there with upper case extensions.  To me, it makes sense for the default to be to consider "PDF" equal to "pdf"; otherwise, we will get lots of "bugs" reported. ;)

If there are any obscure cases where case matters, I think it may be reasonable to require the user to use other means of determining the MIME type (have the user determine it himself, or use "magic"?).  

- Keith
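
For readers unfamiliar with the "magic" alternative mentioned above: it means looking at the first bytes of the content instead of the file name. The following is only a minimal sketch of that idea, not Tika's implementation. The class and method names are hypothetical; the one fact it relies on is that PDF files start with the ASCII bytes "%PDF-".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: checks content bytes rather than the file name.
class PdfMagicCheck {

    // PDF files begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the PDF signature. */
    static boolean looksLikePdf(byte[] header) {
        if (header == null || header.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (header[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this does not depend on the extension at all, so the case of "PDF" versus "pdf" never comes into play.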




[jira] Assigned: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-56:
-------------------------------------

    Assignee: Chris A. Mattmann

[jira] Commented: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534194 ]

Chris A. Mattmann commented on TIKA-56:
---------------------------------------

Keith: agreed. If no one else has any objections, I'll get a patch together that adapts the mime type repository to handle file extensions in a case-insensitive fashion.


[jira] Updated: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-56:
------------------------------

    Priority: Minor  (was: Critical)

Dropping priority to minor (a client can already work around the issue by explicitly lower-casing the name).

+1 to fixing this as discussed.

I guess the simplest fix is to use name.toLowerCase() in MimeTypes.getMimeType(String name).
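
The suggested fix can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code committed to Tika: ExtensionMimeLookup and its small extension table are hypothetical stand-ins, and only the idea of lower-casing the name before the lookup comes from the comment above. It passes Locale.ROOT to toLowerCase() because the no-argument form uses the default locale, which under a Turkish locale maps "I" to a dotless "ı" and would miss extensions such as "TIFF".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an extension-based lookup; not Tika's MimeTypes class.
class ExtensionMimeLookup {

    // Extensions are stored in lower case only.
    private final Map<String, String> typesByExtension = new HashMap<>();

    ExtensionMimeLookup() {
        typesByExtension.put("pdf", "application/pdf");
        typesByExtension.put("doc", "application/msword");
        typesByExtension.put("html", "text/html");
        typesByExtension.put("txt", "text/plain");
    }

    /** Returns the MIME type for the file name, or null if the extension is unknown. */
    String getMimeType(String name) {
        // Normalize case once, independent of the JVM's default locale.
        String lower = name.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        if (dot < 0 || dot == lower.length() - 1) {
            return null; // no extension present
        }
        return typesByExtension.get(lower.substring(dot + 1));
    }
}
```

With this shape, getMimeType("x.PDF") and getMimeType("x.pdf") go through the same map entry, and callers no longer need to lower-case names themselves.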

[jira] Commented: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534673 ]

Chris A. Mattmann commented on TIKA-56:
---------------------------------------

+1 to using name.toLowerCase(), Jukka. I'll commit a fix for this shortly.

[jira] Closed: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann closed TIKA-56.
---------------------------------


- fixed in r584602.

[jira] Resolved: (TIKA-56) Mime type detection fails with upper case file extensions such as "PDF".

Chris Mattmann (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-56.
-----------------------------------

    Resolution: Fixed

- fix implemented as suggested by Jukka (use of .toLowerCase() in getMimeType(String filename))
- added unit test to test different cases of ".pdf" for regression purposes
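
A regression check of the kind described might look roughly like the sketch below. It is not the unit test that was committed: the detect() method is a hypothetical stand-in for the real lookup, and plain Java is used instead of a test framework so the example stays self-contained.

```java
import java.util.Locale;

// Hypothetical regression check; detect() is a stand-in, not Tika code.
class PdfCaseRegressionCheck {

    // Stand-in detector: lower-cases the name, then matches the ".pdf" suffix.
    static String detect(String filename) {
        return filename.toLowerCase(Locale.ROOT).endsWith(".pdf") ? "application/pdf" : null;
    }

    /** Returns true only if every case variant of ".pdf" resolves to application/pdf. */
    static boolean allVariantsDetected() {
        String[] variants = {"x.pdf", "x.PDF", "x.Pdf", "x.pDf", "x.pdF"};
        for (String name : variants) {
            if (!"application/pdf".equals(detect(name))) {
                return false;
            }
        }
        return true;
    }
}
```

Listing the variants in an array keeps the check easy to extend if other extensions need the same coverage later.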

