Default MIME Type?


Default MIME Type?

Keith R. Bennett
All -

I tested Tika with a bunch of miscellaneous text files (shell scripts, etc.), and found that an unknown (or nonexistent) extension results in the failure to get a parser using ParseUtils.getParser(URL, TikaConfig).  I think that means that a MIME type could not be determined from the URL.  Should an unknown file type default to text/plain and use the text parser?

Also, I believe there was code added to determine the MIME type from the stream of bytes itself, wasn't there?  How would that be used?

Thanks,
- Keith

Re: Default MIME Type?

chrismattmann
Hi Keith,

 The default MIME type in Tika is application/octet-stream. It gets set when
the MIME type can't be determined by any of the three main means (URL
resolution, extension resolution, or magic chars). This is in the
MimeTypes.java file within the mime package. No parser gets called because
there is no parser registered to handle that MIME type.
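
 As a rough illustration of the fallback described above, here is a minimal sketch. It is not Tika's MimeTypes code: the class name, the extension table and the parser names are all made up for the example, and it only models extension resolution.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of "unknown extension falls back to the default type". */
final class MimeFallbackSketch {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Illustrative extension table (assumption, not Tika's real repository).
    private static final Map<String, String> TYPE_BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "html", "text/html",
            "pdf", "application/pdf");

    // Illustrative parser registry; note there is no entry for DEFAULT_TYPE.
    private static final Map<String, String> PARSER_BY_TYPE = Map.of(
            "text/plain", "TextParser",
            "text/html", "HtmlParser",
            "application/pdf", "PdfParser");

    /** Resolves a type from the name's extension, else returns the default. */
    static String resolveType(String name) {
        String base = name.substring(name.lastIndexOf('/') + 1);
        int dot = base.lastIndexOf('.');
        if (dot < 0) {
            return DEFAULT_TYPE; // no extension at all
        }
        String ext = base.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPE_BY_EXTENSION.getOrDefault(ext, DEFAULT_TYPE);
    }

    /** Returns the registered parser name, or null when none is registered. */
    static String parserFor(String name) {
        return PARSER_BY_TYPE.get(resolveType(name));
    }

    public static void main(String[] args) {
        System.out.println(resolveType("notes.txt"));
        System.out.println(resolveType("bin/deploy"));
        System.out.println(parserFor("bin/deploy"));
    }
}
```

 In this sketch a script with no extension resolves to the default type, and the lookup in the parser registry then comes back empty, which matches the symptom Keith reported.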

 Are you suggesting that there is another, more sensible default?

Thanks!

Cheers,
  Chris




______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Default MIME Type?

Keith R. Bennett
Chris -

I'm not sure...on the one hand, since Tika is basically a text parsing tool, we might want to make plain text the default MIME type.  We couldn't really do anything with an octet stream anyway, right?  

On the other hand, we wouldn't want to attempt to parse something that does not have text, so a nonparseable MIME type such as octet stream as default might make more sense.

Isn't our framework supposed to determine the MIME type based on the content?  Is there perhaps just a configuration or code change that needs to be made?  If so, then this is not an issue.

- Keith


Re: Default MIME Type?

chrismattmann
Hi Keith,

 The system can determine the mime type based on the content, if the magic
parameter is enabled. This enables the mime system to use magic chars and
examine the contents of the stream to determine its mime type. However, for
efficiency reasons, this is typically turned off by default because it's not
as fast as simply doing filename or URL comparisons.

 This parameter is controlled by the attribute "magic" within the
tika-config.xml file. Take a look at the mimeTypeRepository tag, and check
the magic attribute. If set to "true", then magic resolution is done. That
should get rid of the default "application/octet-stream" issue you're
having.
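
 To make "magic resolution" concrete, here is a minimal sketch of checking the leading bytes of a stream against known signatures. It is only an illustration: the class name and the small signature list are assumptions and not Tika's implementation, and mapping a "#!" shebang to text/plain is just a choice made for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of content-based (magic) MIME detection. */
final class MagicSketch {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] SHEBANG = "#!".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** True when data begins with exactly the bytes in magic. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Inspects the first bytes of the content; falls back to the default type. */
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, SHEBANG)) {
            return "text/plain"; // illustrative choice: a shebang implies a text script
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] script = "#!/bin/sh\necho hi\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(script));
    }
}
```

 The extra cost mentioned above shows up here as having to read and compare bytes from the stream, rather than only looking at the file name or URL.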

 We should also have a look at the default mime types available within the
tika-mimetypes.xml file. We may need to add some more in there. What forms
of content did you test on? Which specific mime types did you see trouble
with? Could you post them to the list? I'll look through them and add in the
gaps to the tika-mimetypes.xml file.

 Thanks!

Cheers,
  Chris





Re: Default MIME Type?

Bertrand Delacretaz-2
In reply to this post by Keith R. Bennett
On 10/12/07, Keith R. Bennett <[hidden email]> wrote:

> ...We couldn't really
> do anything with an octet stream anyway, right?...

We could have an UnknownBinaryParser that outputs...not much: maybe
just the filename, size and mime-type as metadata.

To me this seems better than trying to handle unknown files with the
text parser and outputting junk when they are really binary files.

-Bertrand

Re: Default MIME Type?

Jukka Zitting
Hi,

On 10/12/07, Bertrand Delacretaz <[hidden email]> wrote:
> On 10/12/07, Keith R. Bennett <[hidden email]> wrote:
> > ...We couldn't really
> > do anything with an octet stream anyway, right?...
>
> We could have an UnknownBinaryParser that outputs...not much: maybe
> just the filename, size and mime-type as metadata.

We could perhaps do something like the Unix strings(1) command does,
i.e. detect sequences of printable (ASCII) characters within a binary
and output just those sequences. It's surprising how much useful data
you can extract even from weird binary formats with a tool like that.
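
A minimal sketch of that idea, assuming "printable" means the ASCII range 0x20 to 0x7E and that runs shorter than a chosen minimum are dropped. The class and method names are made up for illustration; this is not an existing Tika parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical strings(1)-style extractor for printable ASCII runs. */
final class StringsSketch {

    /** Returns every run of printable ASCII bytes that is at least minLength long. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            // Bytes >= 0x80 are negative in Java, so they fail this range check too.
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);
            } else {
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0); // a non-printable byte ends the current run
            }
        }
        if (run.length() >= minLength) {
            found.add(run.toString()); // flush a run that reaches the end of the data
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] data = {0, 'h', 'e', 'l', 'l', 'o', 1, 'a', 'b', (byte) 0xFF, 'w', 'o', 'r', 'l', 'd'};
        System.out.println(extract(data, 4));
    }
}
```

The minimum length is what keeps short accidental matches such as "ab" out of the output.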

BR,

Jukka Zitting

Re: Default MIME Type?

Bertrand Delacretaz-2
On 10/12/07, Jukka Zitting <[hidden email]> wrote:

> ...We could perhaps do something like the Unix strings(1) command does,..

+1, good idea. I'm a big fan of strings(1).

-Bertrand

Re: Default MIME Type?

Keith R. Bennett
+1 for me too.  That could be really useful.

BTW, I recently inspected bytes from a Word Document, and it contained text that I had already deleted, even though I didn't use Fast Save to save the changes.  If you try this:

Create a new document containing:

____ is a dork.

Save it, then change "dork" to "genius".  Save the file again.  Inspect the file using strings or another utility.  You still see "dork"! :)

Regarding the properties that the parser can output, the parser will only see the InputStream, not the File or URL from which it came.  So we can have it output size and some other information, but not the name.

- Keith



Re: Default MIME Type?

Keith R. Bennett
In reply to this post by Bertrand Delacretaz-2
By the way, strings detects sequences of ASCII characters, so it would not work at all in many locales (unless ICU or someone else has figured out how to do this).

As long as this limitation is documented, however, I think it would still be extremely useful, and trivial to implement.

- Keith

Re: Default MIME Type?

chrismattmann
Hi Folks,

 Thinking this through more, it probably makes a lot of sense for the
default MIME type in Tika to be application/octet-stream. Essentially what
this is saying is: "this is content for which I could not discern a mime
type". application/octet-stream indicates only that the content is a
sequence of bytes, so in effect it is the supertype of all MIME types.

 I like the idea of implementing a UNIX strings style parser: there was
discussion on the Nutch list a year or so ago regarding this same issue. If
there isn't exact consensus on this issue, however, we could always make the
default mime type a settable attribute of the mimeTypeRepository tag in the
tika-config.xml file. That way, we could ship Tika with a default mime
type of application/octet-stream, and then if that doesn't work for users,
they simply update the attribute in their xml file (and perhaps
additionally turn on magic detection) and that would probably solve the
issue.

 Thoughts?  If you all agree, I'll create an issue about this in JIRA. Come
to think of it: I'll probably do that anyways.


Cheers,
 Chris





Re: Default MIME Type?

Keith R. Bennett
Chris -

I agree.  I now see the wisdom of making application/octet-stream the default mime type, possibly with the ability to override that.

In addition, though, I think we want to consider what Tika should do with such a byte stream.  One option is to run it through strings to get ASCII text.  Another is to have it fail the parse so that the user can be notified that Tika could not find a (definitely) suitable parser.  Another might be to parse it as an empty string (if, for example, the text is known to be in Chinese, and the output of strings would be meaningless random garbage).  In the future, maybe the user would consider it important enough to write a custom parser for application/octet-stream, and plug it into Tika.

- Keith


Re: Default MIME Type?

Bertrand Delacretaz-2
On 10/13/07, Keith R. Bennett <[hidden email]> wrote:

> ...I agree.  I now see the wisdom of making application/octet-stream the
> default mime type, possibly with the ability to override that....

Same here, +1 on that.

> ...I think we want to consider what Tika should do with
> such a byte stream.  One option is to run it through strings to get ASCII
> text.  Another is to have it fail the parse so that the user can be notified
> that Tika could not find a (definitely) suitable parser....

Agreed, this should probably be a "fail on unknown mime-type"
configuration switch.
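
A minimal sketch of what such a switch could look like, covering the options raised so far (fail, empty output, ASCII text). The enum, its values and the method are hypothetical names for illustration only, and the ASCII branch is a simplified filter rather than a full strings(1) equivalent.

```java
/** Hypothetical policy switch for content whose type could not be determined. */
final class UnknownTypePolicySketch {

    enum Policy { FAIL, EMPTY, ASCII_ONLY }

    /** Applies the configured policy to a stream typed application/octet-stream. */
    static String handleUnknown(byte[] data, Policy policy) {
        switch (policy) {
            case FAIL: {
                // Let the caller know that no suitable parser was found.
                throw new IllegalStateException(
                        "no parser registered for application/octet-stream");
            }
            case EMPTY: {
                return ""; // parse succeeds but yields no text
            }
            case ASCII_ONLY: {
                // Simplified: keep printable ASCII bytes, drop everything else.
                StringBuilder text = new StringBuilder();
                for (byte b : data) {
                    if (b >= 0x20 && b <= 0x7E) {
                        text.append((char) b);
                    }
                }
                return text.toString();
            }
            default:
                throw new AssertionError("unhandled policy: " + policy);
        }
    }

    public static void main(String[] args) {
        System.out.println(handleUnknown(new byte[]{'o', 'k', 0, '!'}, Policy.ASCII_ONLY));
    }
}
```

With FAIL as the setting, the caller gets an exception instead of silently receiving junk or nothing.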

-Bertrand

Re: Default MIME Type?

Bertrand Delacretaz-2
In reply to this post by Keith R. Bennett
On 10/12/07, Keith R. Bennett <[hidden email]> wrote:

> ...Regarding the properties that the parser can output, the parser will only
> see the InputStream, not the File or URL from which it came.  So we can have
> it output size and some other information, but not the name....

Unless the filename is included in the input metadata.

Besides possibly giving hints to the parsers, this input metadata
could also contain any useful information that the user wants to
include in the generated metadata.

The simplest thing to do might be to copy the input metadata to the
output, unless its value is overwritten by the Tika parsing.
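
As a sketch of that rule, with plain maps standing in for the metadata (an assumption for illustration; the key names and the class are made up and this is not Tika's Metadata API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: input metadata is copied to the output unless parsing overwrites it. */
final class MetadataMergeSketch {

    /** Starts from the caller-supplied entries, then lets parsed values win on conflicts. */
    static Map<String, String> merge(Map<String, String> input, Map<String, String> parsed) {
        Map<String, String> output = new LinkedHashMap<>(input);
        output.putAll(parsed); // values found by the parser overwrite the supplied hints
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("filename", "report.bin");
        input.put("Content-Type", "text/plain"); // caller's guess

        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("Content-Type", "application/octet-stream");
        parsed.put("size", "1024");

        System.out.println(merge(input, parsed));
    }
}
```

Here the file name supplied by the caller survives into the output, while the caller's guessed content type is replaced by what the parsing step determined.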

-Bertrand

Re: Default MIME Type?

robert burrell donkin-2
On 10/13/07, Bertrand Delacretaz <[hidden email]> wrote:
> On 10/12/07, Keith R. Bennett <[hidden email]> wrote:
>
> > ...Regarding the properties that the parser can output, the parser will only
> > see the InputStream, not the File or URL from which it came.  So we can have
> > it output size and some other information, but not the name....
>
> Unless the filename is included in the input metadata.

IMHO this would be generally useful

- robert

Re: Default MIME Type?

Jukka Zitting
In reply to this post by Bertrand Delacretaz-2
Hi,

On 10/13/07, Bertrand Delacretaz <[hidden email]> wrote:
> The simplest thing to do might be to copy the input metadata to the
> output, unless its value is overwritten by the Tika parsing.

+1 This is exactly how I envisioned the Metadata parameter in
Parser.parse() to be used.

BR,

Jukka Zitting