[jira] Created: (TIKA-225) [PATCH] Various bugfixes for MIME detection

[jira] Created: (TIKA-225) [PATCH] Various bugfixes for MIME detection

JIRA jira@apache.org
[PATCH] Various bugfixes for MIME detection
-------------------------------------------

                 Key: TIKA-225
                 URL: https://issues.apache.org/jira/browse/TIKA-225
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.4
            Reporter: Jeremias Maerki
             Fix For: 0.4
         Attachments: detection-bugfixes.diff, test-files.zip

Here's a patch that fixes the following issues:
- The priority of text/plain is too high. BOMs also appear at the start of XML files, so text/plain must not be matched before the XML types have been checked.
- *.xsl, *.xslt and *.xsd files are XML, not text/plain. XSLT also has its own MIME type.
- Consolidated the two XHTML entries.
- Fixed a bug in the existing XML magics that caused plain XML files to be detected as text/plain.
- Added magics for the UTF-16 encodings. (Some magics are still missing, see http://www.w3.org/TR/xml/#sec-guessing)
- Added an entry for XSLT.
- XML namespace detection didn't work when namespace prefixes are used (examples: XSLT stylesheets or SVG graphics). Corrected this by adding an extra detection step that runs an XML parser to determine the root element. This could probably be done without an XML parser, but I had limited time available.
- Added a test case for some files (the test files are in a separate ZIP, to be placed under tika-core\src\test\resources\org\apache\tika\mime).
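
The root-element detection step described in the list above can be sketched roughly as follows. This is not the code from detection-bugfixes.diff; it is a minimal illustration using the JDK's SAX API, and the class name RootElementSniffer and the method sniffRoot are made up for this example. It assumes a namespace-aware parser and stops at the first start tag, so a prefixed root such as xsl:stylesheet still yields its namespace URI and local name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: reports the qualified name of the
// root element of an XML stream, or null if the stream is not parseable XML.
final class RootElementSniffer {

    static QName sniffRoot(InputStream in) {
        final QName[] root = new QName[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness is what resolves prefixes such as "xsl:".
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts)
                        throws SAXException {
                    root[0] = new QName(uri, localName);
                    // Abort parsing: only the first element is of interest.
                    throw new SAXException("root element found, stopping early");
                }

                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Return an empty source so external DTDs are not fetched.
                    return new InputSource(new StringReader(""));
                }
            });
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Either our own stop signal or a genuine failure; in the latter
            // case no root element was recorded and root[0] is still null.
        }
        return root[0];
    }
}
```

A caller could then map the returned namespace URI (and, where needed, the local name) to a MIME type, and fall back to the magic-based result when sniffRoot returns null.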

HTH

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-225) [PATCH] Various bugfixes for MIME detection

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremias Maerki updated TIKA-225:
---------------------------------

    Attachment: test-files.zip


[jira] Updated: (TIKA-225) [PATCH] Various bugfixes for MIME detection

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremias Maerki updated TIKA-225:
---------------------------------

    Attachment: detection-bugfixes.diff


[jira] Resolved: (TIKA-225) [PATCH] Various bugfixes for MIME detection

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-225.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks! Patch and test cases committed in revision 776859.
