[jira] Created: (TIKA-285) Update media type registry to the latest httpd mime type database


[jira] Created: (TIKA-285) Update media type registry to the latest httpd mime type database

Prajeeth Emanuel (Jira)
Update media type registry to the latest httpd mime type database
-----------------------------------------------------------------

                 Key: TIKA-285
                 URL: https://issues.apache.org/jira/browse/TIKA-285
             Project: Tika
          Issue Type: Improvement
          Components: mime
            Reporter: Jukka Zitting


The MIME type database included in the Apache HTTP Server is one of the more complete and accurate media type and file extension resources out there.

Their magic byte settings don't seem to be as complete as the ones in Tika, but it would be good to also check those settings for extra information.

... and we should contribute any of the recent Tika settings back to httpd where they don't already know of those details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-285) Update media type registry to the latest httpd mime type database

Prajeeth Emanuel (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760023#action_12760023 ]

Ken Krugler commented on TIKA-285:
----------------------------------

The "file" command-line utility also has a pretty good set of magic byte settings - we'd looked at it when working on Krugle. FWIR, it also has a slightly more sophisticated method for processing magic bytes than what Nutch (and I guess now Tika) has.

One of the issues we'd run into was the need to be able to use a regex against the header bytes to determine true file type, versus fixed offsets/values.
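
To illustrate the difference between the two approaches, here is a minimal sketch in Java. It is not Tika's, Nutch's or file(1)'s actual matching code; the class name MagicSketch, the method names and the sample patterns are all made up for this example. A fixed-offset rule only sees the magic bytes if they sit exactly where expected, while a regex applied to the decoded header can tolerate things like leading whitespace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from Tika.
final class MagicSketch {

    /** Fixed-offset rule: the header must contain exactly these bytes at this offset. */
    public static boolean matchesAt(byte[] header, int offset, byte[] magic) {
        if (offset < 0 || header.length - offset < magic.length) {
            return false; // header too short to hold the magic at this offset
        }
        byte[] slice = Arrays.copyOfRange(header, offset, offset + magic.length);
        return Arrays.equals(slice, magic);
    }

    /** Regex rule: decode the header as ISO-8859-1 (one char per byte) and search it. */
    public static boolean matchesRegex(byte[] header, Pattern pattern) {
        String text = new String(header, StandardCharsets.ISO_8859_1);
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        byte[] xmlMagic = "<?xml".getBytes(StandardCharsets.ISO_8859_1);
        // An XML declaration preceded by whitespace: offset 0 no longer holds "<?xml".
        byte[] header = "  \n<?xml version=\"1.0\"?>".getBytes(StandardCharsets.ISO_8859_1);
        Pattern xmlDecl = Pattern.compile("^\\s*<\\?xml\\s");

        System.out.println("fixed offset: " + matchesAt(header, 0, xmlMagic));
        System.out.println("regex:        " + matchesRegex(header, xmlDecl));
    }
}
```

Decoding as ISO-8859-1 is a deliberate choice in this sketch: it maps every byte to exactly one character, so the regex sees the raw header without any decoding errors.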





[jira] Resolved: (TIKA-285) Update media type registry to the latest httpd mime type database

Prajeeth Emanuel (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-285.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Yes, the file(1) command comes with a pretty impressive set of magic byte patterns. I'll file a separate issue for getting those included in Tika.

Meanwhile I've now updated the Tika type registry to contain everything included in the mime.types and magic files in the latest Apache HTTP Server trunk. The summary is pretty impressive:

 * The media type registry in Tika was synchronized with the MIME type
   configuration in the Apache HTTP Server. Tika now knows about 1274
   different media types and can detect 672 of those using 927 file
   extension and 280 magic byte patterns. (TIKA-285)
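
For readers unfamiliar with the httpd format, the following is a rough sketch of how a mime.types file could be read into a type-to-extensions map. It is not the code used for this change; the class name MimeTypesSketch and the sample input are invented for illustration. It assumes the usual layout of one media type per line followed by whitespace-separated extensions, with lines starting with '#' treated as comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the importer used for TIKA-285.
final class MimeTypesSketch {

    /** Parses mime.types-style text into a map of media type -> file extensions. */
    public static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> types = new LinkedHashMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split("\\s+");
            // fields[0] is the media type, the rest (possibly none) are extensions
            List<String> extensions = types.computeIfAbsent(fields[0], k -> new ArrayList<>());
            extensions.addAll(Arrays.asList(fields).subList(1, fields.length));
        }
        return types;
    }

    public static void main(String[] args) {
        // Sample input written for this sketch, in the assumed format.
        String sample = String.join("\n",
                "# a comment line",
                "application/pdf\t\tpdf",
                "image/jpeg\t\tjpeg jpg jpe",
                "text/html\t\thtml htm");
        parse(sample).forEach((type, exts) -> System.out.println(type + " -> " + exts));
    }
}
```

A map like this gives the extension side of the registry only; the magic byte patterns live in a separate file and would need their own parser.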


