FW: [jira] Resolved: (NUTCH-562) Port mime type framework to use Tika mime detection framework

chrismattmann
Folks, apologies for those also subscribed to nutch-dev, but just wanted to
let ya know as an FYI: Nutch now relies on Tika to handle its mime type
detection...

Cheers,
  Chris

______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

------ Forwarded Message
From: "Chris A. Mattmann (JIRA)" <[hidden email]>
Reply-To: <[hidden email]>
Date: Mon, 8 Oct 2007 17:24:50 -0700 (PDT)
To: <[hidden email]>
Subject: [jira] Resolved: (NUTCH-562) Port mime type framework to use Tika
mime detection framework


     [ https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-562.
-------------------------------------

    Resolution: Fixed

- Applied patch, with minor changes to use static version of MimeUtils Tika
interface, and to only instantiate once per object family
- Tested on small crawl of apache.org sites, mime type set appropriately

> Port mime type framework to use Tika mime detection framework
> -------------------------------------------------------------
>
>                 Key: NUTCH-562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-562
>             Project: Nutch
>          Issue Type: Improvement
>          Components: mime_type_detector
>    Affects Versions: 1.0.0
>         Environment: Mac Book Pro, Intel Core Duo 2.0 Ghz, 2.0 GB RAM,
>                      Mac OS X 10.4 although improvement is indep of env
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-562.Mattmann.patch.txt, tika-0.1-dev.jar
>
>
> With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release
> candidate, I think it would be a good time to patch Nutch to use Tika's mime
> detection system (an improvement over the existing Nutch one written
> primarily by Jerome). Tika's mime system is based on the mime system from
> Freedesktop.org and includes several improvements over the existing Nutch
> mime system such as:
> 1. reliable XML-based content detection (a clear issue plaguing Nutch for
>    some time now), ability to delineate between RSS, XML, ATOM, etc.
> 2. mime magic pattern matching, including support for multiple patterns
> 3. glob pattern matches (ability to support > 1)
> I'll get together a patch and then attach it to the list once it's
> relatively stable.
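
For readers unfamiliar with the three techniques listed above, here is a rough, self-contained Java sketch of how magic-byte matching, XML root-element sniffing, and glob matching can be combined. It does not use Tika's or Nutch's API. The class name MimeSketch, the lookup tables, and the detection order are all assumptions made up for illustration, and the glob matcher only understands the simple "*.ext" form.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical detector, not Tika's implementation.
public class MimeSketch {
    // Hypothetical tables; a real system would load these from a config file.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        // More than one glob may map to the same type.
        GLOBS.put("*.html", "text/html");
        GLOBS.put("*.htm", "text/html");
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.xml", "application/xml");
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    // First start tag, with an optional namespace prefix; "<?" and "<!" never match.
    private static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    public static String detect(String fileName, byte[] head) {
        // 1. Magic: compare the leading bytes against each known signature.
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        // 2. XML: look at the root element to tell RSS and Atom from plain XML.
        String text = new String(head, StandardCharsets.UTF_8).trim();
        if (text.startsWith("<?xml")) {
            Matcher m = ROOT.matcher(text);
            if (m.find()) {
                String root = m.group(1);
                if (root.equals("rss")) return "application/rss+xml";
                if (root.equals("feed")) return "application/atom+xml";
            }
            return "application/xml";
        }
        // 3. Glob: fall back to the file name (simplified to "*.ext" patterns).
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            String glob = e.getKey();
            if (glob.startsWith("*") && lower.endsWith(glob.substring(1))) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] rss = "<?xml version=\"1.0\"?><rss version=\"2.0\">"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(MimeSketch.detect("feed.xml", rss));
        System.out.println(MimeSketch.detect("index.htm",
            "<html>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Checking content before the file name is a design choice in this sketch: a feed served as "feed.xml" would otherwise be reported as generic XML, which is the kind of ambiguity item 1 refers to.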

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


------ End of Forwarded Message


Re: FW: [jira] Resolved: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Bertrand Delacretaz-2
On 10/9/07, Chris Mattmann <[hidden email]> wrote:
> ...as an FYI: Nutch now relies on Tika to handle its mime type
> detection...

Cool - we have users now!

-Bertrand