[jira] Created: (TIKA-121) MimeType.clean method no longer exists as a capability

JIRA jira@apache.org
MimeType.clean method no longer exists as a capability
------------------------------------------------------

                 Key: TIKA-121
                 URL: https://issues.apache.org/jira/browse/TIKA-121
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.1-incubating
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
             Fix For: 0.2-incubating


For some reason, in r591743 (http://svn.apache.org/viewvc?rev=591743&view=rev), the MimeType.clean functionality was removed and never replaced. This is a problem because I'm trying to upgrade Nutch to tika-0.1-incubating, and Nutch relied on MimeType.clean.

I've been scratching my head trying to determine an appropriate workaround for the same capability within the tika-0.1-incubating code, but have yet to find one. This functionality needs to be replaced in some form or fashion, or, if someone knows of a simple way to achieve the same functionality, please let me know.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-121) MimeType.clean method no longer exists as a capability

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567445#action_12567445 ]

Jukka Zitting commented on TIKA-121:
------------------------------------

I dropped the method under the "Simplified type name handling and validation" label.

Within Tika (at least currently) we only need support for MIME types of the form "type/subtype" with no additional parameters, so I didn't see any need for the clean() feature.

[jira] Commented: (TIKA-121) MimeType.clean method no longer exists as a capability

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567450#action_12567450 ]

Chris A. Mattmann commented on TIKA-121:
----------------------------------------

Hi Jukka:

Thanks for the explanation. For Tika to be useful in Nutch out of the box (without any special MIME utility code in Nutch), it needs to handle MIME types as returned from the server, in the form:

<primary type>/<sub type> ; <optional additional parameters>

So, a perfect example of this would be to take the string:

"text/html; charset=UTF-8"

And then "clean" it to parse out the MIME type portion, "text/html", and drop the optional params. I suppose one could argue that this is a web-specific feature and so belongs in Nutch. However, I'm not positive that this only occurs on the web. Thoughts?
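
The cleaning step described above can be sketched as follows. This is only an illustration, not the removed MimeType.clean implementation, whose exact behavior is not shown in this thread. The class name MediaTypeCleaner and its clean method are hypothetical, and the choices to lower-case the result and to return null for malformed input are assumptions.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the Tika API. */
final class MediaTypeCleaner {

    private MediaTypeCleaner() {}

    /**
     * Returns the "type/subtype" portion of a Content-Type value,
     * lower-cased, or null if the input has no such portion.
     */
    static String clean(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Everything after the first ';' is the optional parameter list.
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        // Require exactly one '/' with a non-empty type and subtype.
        int slash = base.indexOf('/');
        if (slash <= 0
                || slash == base.length() - 1
                || base.indexOf('/', slash + 1) >= 0) {
            return null;
        }
        return base;
    }

    public static void main(String[] args) {
        // Prints "text/html"
        System.out.println(clean("text/html; charset=UTF-8"));
    }
}
```

Lower-casing is safe because media type names are case-insensitive. A caller that still needs the charset would have to parse the parameter list separately, which this sketch deliberately discards.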

Cheers,
 Chris


[jira] Updated: (TIKA-121) MimeType.clean method no longer exists as a capability

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-121:
-------------------------------

    Attachment: AutoDetectParser.patch

The current mime type registry in Tika is tightly integrated with parser configuration, and for now I'd prefer to avoid coupling it too tightly with client code.

I assume you're using the incoming Content-Type header to select (either manually or via AutoDetectParser) which parser to use, so I'd prefer to put the relevant code there. See the attached patch (AutoDetectParser.patch) for the required changes to AutoDetectParser.

Looking forward, it might be good to factor such generic code into a standalone media type package, but as long as our current media type code is tightly coupled with Tika configuration, I'd prefer to avoid MimeType dependencies outside configuration code.
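
The attached patch is not reproduced in this thread. As a rough sketch of the general idea only (stripping parameters at the point where a parser is selected, rather than in the MIME type registry), something like the following could be used. ParserSelector, register, and select are hypothetical names, parsers are represented by plain strings for brevity, and none of this reflects the actual AutoDetectParser code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for parser lookup; not Tika's AutoDetectParser. */
final class ParserSelector {

    // Maps a bare "type/subtype" string to a parser name (assumption:
    // a real implementation would map to parser instances instead).
    private final Map<String, String> parsersByType = new HashMap<>();
    private final String fallback;

    ParserSelector(String fallback) {
        this.fallback = fallback;
    }

    void register(String mediaType, String parserName) {
        parsersByType.put(mediaType.toLowerCase(Locale.ROOT), parserName);
    }

    /** Picks a parser for a raw Content-Type header value. */
    String select(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return fallback;
        }
        // Drop optional parameters such as "; charset=UTF-8".
        int semicolon = contentTypeHeader.indexOf(';');
        String base = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return parsersByType.getOrDefault(base, fallback);
    }

    public static void main(String[] args) {
        ParserSelector selector = new ParserSelector("default");
        selector.register("text/html", "html");
        // Prints "html"
        System.out.println(selector.select("text/html; charset=UTF-8"));
    }
}
```

Keeping the normalization inside the selection step means client code can pass the header value through unchanged and never needs to touch the registry's type classes.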
