[jira] Created: (TIKA-309) Mime type application/rdf+xml not correctly detected

[jira] Created: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org
Mime type application/rdf+xml not correctly detected
----------------------------------------------------

                 Key: TIKA-309
                 URL: https://issues.apache.org/jira/browse/TIKA-309
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.5
            Reporter: Yuan-Fang Li
            Priority: Minor


Mime type detector using AutoDetectParser and Metadata returns "application/xml" for the URL http://www.w3.org/2002/07/owl#, where it should be "application/rdf+xml". The correct mime type is also suggested here: http://www.w3.org/TR/owl-ref/#MIMEType.

P.S., Tika was downloaded from svn and built with Maven last week.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-309.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Added improved RDF/XML type metadata and test cases to verify that the type is correctly detected.

[jira] Reopened: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuan-Fang Li reopened TIKA-309:
-------------------------------


This fix worked for me until yesterday. When I updated to the latest version (829668) from svn, my test cases for the application/rdf+xml mime type failed again for the URLs "http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl" and "http://www.w3.org/2002/07/owl#". The mime type returned is "application/xml" for the first one and "text/html" for the second one, so I'm reopening this issue.

[jira] Commented: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774551#action_12774551 ]

Chris A. Mattmann commented on TIKA-309:
----------------------------------------

Hey Guys, I think we just need another line in the tika-mimetypes.xml file for this. I'll take a crack at it, if there are no objections. Thanks!

[jira] Assigned: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-309:
--------------------------------------

    Assignee: Chris A. Mattmann  (was: Jukka Zitting)

[jira] Issue Comment Edited: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777564#action_12777564 ]

Chris A. Mattmann edited comment on TIKA-309 at 11/13/09 5:16 PM:
------------------------------------------------------------------

This turned out to be a tricky nightmare. Yuan-Fang,

1. the first file, http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl
  a. uses XML entities to obfuscate the rdf, rdfs, owl, etc. URIs and localname, which defeats rootXML detection.
  b. includes XML byte chars that match application/xml magic detection, nullifying the magic bytes that I put in to detect RDF and OWL.
2. the second file, http://www.w3.org/2002/07/owl#
  a. is easier to detect, since the URIs and localname are not obfuscated, but magic detection still doesn't work out of the box.

I'm going to add in some better magic detection for RDF/OWL files that start with <rdf:RDF or <owl:Ontology, as well as better detection based on glob patterns and so forth, but in the end, my suggestion for this particular problem is:

1. use the o.a.tika.detect.NameDetector and set the Metadata.RESOURCE_NAME_KEY value before calling (pseudo-code):

{code}
AutoDetectParser parser = new AutoDetectParser();
parser.setDetector(new NameDetector());
Metadata met = new Metadata();
met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file");
parser.parse(InputStream stream, some ContentHandler, met);
{code}

A commit is forthcoming that includes my updated tika-mimetypes.xml and unit tests showing that your examples are correctly detected when the RESOURCE_NAME_KEY is set.
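
To illustrate what name-based detection amounts to, here is a minimal sketch in plain Java. It is not Tika's NameDetector: the class name NameBasedDetectionSketch, the patterns, and the fallback value are all assumptions made for illustration. It maps regex patterns on the resource name to media type strings and returns the first match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for name-based detection; NOT Tika's NameDetector.
final class NameBasedDetectionSketch {

    // Insertion-ordered so that more specific patterns are tried first.
    private static final Map<Pattern, String> PATTERNS = new LinkedHashMap<>();
    static {
        // Assumed extensions for RDF/OWL documents; adjust to your own data.
        PATTERNS.put(Pattern.compile(".*\\.(rdf|owl)$", Pattern.CASE_INSENSITIVE),
                "application/rdf+xml");
        PATTERNS.put(Pattern.compile(".*\\.xml$", Pattern.CASE_INSENSITIVE),
                "application/xml");
    }

    /** Returns the media type of the first matching pattern, or a generic default. */
    static String detect(String resourceName) {
        if (resourceName != null) {
            for (Map.Entry<Pattern, String> entry : PATTERNS.entrySet()) {
                if (entry.getKey().matcher(resourceName).matches()) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect("http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl"));
        System.out.println(detect("http://www.w3.org/2002/07/owl#"));
    }
}
```

Under these assumed patterns the Process.owl URL resolves by name alone, while http://www.w3.org/2002/07/owl# has no recognizable extension and falls through to the default. In this sketch, at least, a name hint on its own does not cover every case, and an empty pattern map would always yield the default.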

[jira] Resolved: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-309.
------------------------------------

    Resolution: Fixed

Fixed in r836035:

* RDF/OWL mime types are now correctly identified using magic. I changed the regex pattern for localName in MimeTypes.java to handle the case where only the opening <ns:localName... is read and there is no ">" at the end, since we only read the first N bytes of the magic header.
* added unit tests and URLs from this issue for regression
* refactored o.a.tika.mime.MimeDetectionTest to support URLs as InputStreams (as well as Files)
* took out <match value="&lt;!--" type="string" offset="0"/> for HTML detection since comments can appear in HTML, XML, etc., and aren't specific to HTML
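
The localName change described above can be sketched roughly as follows. This is not the code from MimeTypes.java: the class name RootElementSketch, the regex, and the ISO-8859-1 decoding are assumptions for illustration. The idea is to find the first start tag in a truncated header without requiring a closing ">".

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual pattern used in Tika's MimeTypes.java.
final class RootElementSketch {

    // "<", an optional "prefix:", then the local name. A name must start with a
    // letter or underscore, so "<?xml", "<!--" and "<!DOCTYPE" are skipped.
    // No closing ">" is required, so a header cut off mid-tag still matches.
    private static final Pattern START_TAG =
            Pattern.compile("<(?:([A-Za-z_][\\w.-]*):)?([A-Za-z_][\\w.-]*)");

    /** Returns the local name of the first start tag in the header, or null. */
    static String rootLocalName(byte[] header) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(header, StandardCharsets.ISO_8859_1);
        Matcher m = START_TAG.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Header truncated in the middle of the root element's attributes.
        String truncated = "<?xml version='1.0'?>\n<rdf:RDF xmlns:rdf=\"http://www.w3.org/19";
        System.out.println(rootLocalName(truncated.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A sketch like this still has limits: if the header is cut off in the middle of the element name itself, only the partial name is returned, and a tag that appears inside a comment would be picked up as well.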



[jira] Reopened: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuan-Fang Li reopened TIKA-309:
-------------------------------


Hi Chris,

Thanks a lot for the fix. However, I have to reopen the ticket due to some problems with InputStream, and some other issues.

1. In your comment you suggested that I do the following (pseudo code):

AutoDetectParser parser = new AutoDetectParser();
parser.setDetector(new NameDetector());
Metadata met = new Metadata();
met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file");
parser.parse(InputStream stream, some ContentHandler, met);

Since NameDetector takes a map as the parameter for the constructor, I have to do the following:

parser.setDetector(new NameDetector(new HashMap<Pattern, MediaType>()));

Doing so invalidates my tests: because the map in NameDetector is empty, the mime type returned is always "application/octet-stream". Is there another way to initialize the NameDetector?

2. The detection for the 2 URLs works perfectly now based on your suggestion (not adding NameDetector to the parser but adding met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file"); ). However, if my input is an input stream, the test still fails since the parser doesn't have the hint from file/URL names.

[jira] Commented: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779798#action_12779798 ]

Chris A. Mattmann commented on TIKA-309:
----------------------------------------

Yuan-Fang,

There is a unit test that should determine whether this is working on your system. Does /lucene/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java pass on your system?

Regarding #1 and #2 above, I was assuming you could pass in a regex pattern->MediaType map to NameDetector. If you didn't want to pass that in, you may want to take a look at the other Detectors in the o.a.t.detect package. For #2, if the test above passes, it should prove that InputStream detection properly works?

Cheers,
Chris


[jira] Resolved: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-309.
--------------------------------

    Resolution: Fixed

Re-resolving this as Fixed, as the test case we have works. Please file a new issue with a clear test case in case the current behaviour does not work for you.

Note that with the fix Chris made, you should be able to auto-detect the mentioned RDF files with the normal AutoDetectParser even without any setDetector() customizations.

[jira] Commented: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781311#action_12781311 ]

Yuan-Fang Li commented on TIKA-309:
-----------------------------------

Hi Chris, Jukka,

Yes, the Tika tests are passing for me. However, my test for one of the ontologies ("http://www.w3.org/2002/07/owl#") is still failing, and here is why.

In the test tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java, the method testUrl(String expected, String url, String file) actually tests the content of the file named "file", with the url serving only as a clue for the detection. My test, however, opens an input stream on the actual url and uses that to detect the mime type. For the above URL, tika is testing against the file named "test-difficult-rdf2.xml". The only difference I can see between this file and the actual content of the URL is one line at the top: "<?xml version='1.0' encoding='ISO-8859-1'?>". This line is present in the tika test file but not in the URL.

So, if you remove or comment out that line from "test-difficult-rdf2.xml" and run the test with the maven command mvn -Dtest=MimeDetectionTest test, it will fail. Or, you could use the following test case to test against the real URL.

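    @Test
    public void testRDFStreamMimeType() throws IOException {
        // Fetch the live document rather than a local test resource.
        URL url = new URL("http://www.w3.org/2002/07/owl#");
        // Buffered so that the stream supports mark/reset, which detection relies on.
        final InputStream stream = new BufferedInputStream(url.openStream());
        try {
            MimeTypes mimeTypes = TikaConfig.getDefaultConfig().getMimeRepository();
            // No resource name or content-type hint is supplied; only the bytes are used.
            Metadata metadata = new Metadata();
            String mime = mimeTypes.detect(stream, metadata).toString();
            assertEquals("application/rdf+xml", mime);
        } finally {
            stream.close();
        }
    }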
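
A minimal sketch of the first option above, assuming the test resource has already been read into a string: it strips a leading XML declaration so that the content matches what the URL serves. The class XmlDeclarationStripper and its strip method are hypothetical helpers for illustration and are not part of Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class XmlDeclarationStripper {
    // One XML declaration at the very start of the document, plus surrounding whitespace.
    // Requiring whitespace after "xml" avoids matching other processing
    // instructions such as <?xml-stylesheet ...?>.
    private static final Pattern XML_DECL =
            Pattern.compile("\\A\\s*<\\?xml\\s[^>]*\\?>\\s*");

    private XmlDeclarationStripper() {
    }

    // Returns the document without its leading XML declaration, or unchanged if it has none.
    static String strip(String xml) {
        return XML_DECL.matcher(xml).replaceFirst("");
    }
}
```

Feeding the stripped content to the detector should then exercise the same case as the live URL, without needing network access.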

Cheers
Yuan-Fang



[jira] Commented: (TIKA-309) Mime type application/rdf+xml not correctly detected

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782663#action_12782663 ]

Chris A. Mattmann commented on TIKA-309:
----------------------------------------

Yuan-Fang:

I've confirmed what you mentioned. When the XML declaration on the first line is removed from test-difficult-rdf2.xml (so that the file matches what the remote URL serves), I get this:

[chipotle:~/src/tika/trunk] mattmann% mvn -Dtest=MimeDetectionTest clean test
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO]   Apache Tika parent
[INFO]   Apache Tika core
[INFO]   Apache Tika parsers
[INFO]   Apache Tika application
[INFO]   Apache Tika
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika parent
[INFO]    task-segment: [clean, test]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean]
[INFO] Setting property: classpath.resource.loader.class => 'org.codehaus.plexus.velocity.ContextClassLoaderResourceLoader'.
[INFO] Setting property: velocimacro.messages.on => 'false'.
[INFO] Setting property: resource.loader => 'classpath'.
[INFO] Setting property: resource.manager.logwhenfound => 'false'.
[INFO] [remote-resources:process {execution: default}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika core
[INFO]    task-segment: [clean, test]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean]
[INFO] [remote-resources:process {execution: default}]
[INFO] [resources:resources]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 20 resources
[INFO] Copying 3 resources
[INFO] [compiler:compile]
[INFO] Compiling 86 source files to /Users/mattmann/src/tika/trunk/tika-core/target/classes
[INFO] [resources:testResources]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 24 resources
[INFO] Copying 3 resources
[INFO] [compiler:testCompile]
[INFO] Compiling 19 source files to /Users/mattmann/src/tika/trunk/tika-core/target/test-classes
[INFO] [surefire:test]
[INFO] Surefire report directory: /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.tika.mime.MimeDetectionTest
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.568 sec <<< FAILURE!

Results :

Failed tests:
  testDetection(org.apache.tika.mime.MimeDetectionTest)

Tests run: 2, Failures: 1, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] There are test failures.

Please refer to /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports for the individual test results.
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8 seconds
[INFO] Finished at: Wed Nov 25 14:45:52 PST 2009
[INFO] Final Memory: 15M/31M
[INFO] ------------------------------------------------------------------------
[chipotle:~/src/tika/trunk] mattmann%

[chipotle:~/src/tika/trunk] mattmann% more tika-core/target/surefire-reports/org.apache.tika.mime.MimeDetectionTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.tika.mime.MimeDetectionTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.573 sec <<< FAILURE!
testDetection(org.apache.tika.mime.MimeDetectionTest)  Time elapsed: 0.44 sec  <<< FAILURE!
junit.framework.ComparisonFailure: http://www.w3.org/2002/07/owl# is not properly detected. expected:<application/rdf+xml> but was:<text/plain>
        at junit.framework.Assert.assertEquals(Assert.java:81)
        at org.apache.tika.mime.MimeDetectionTest.testStream(MimeDetectionTest.java:87)
        at org.apache.tika.mime.MimeDetectionTest.testUrl(MimeDetectionTest.java:71)
        at org.apache.tika.mime.MimeDetectionTest.testDetection(MimeDetectionTest.java:54)
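
The failure above shows detection falling back to text/plain once the declaration is gone. As an illustration only (this is not Tika's implementation, and nothing in this thread says the eventual fix works this way), the sketch below classifies a document by its root element using the JDK's StAX parser, so the result does not depend on whether the declaration is present. RootElementSniffer and sniff are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not Tika code.
final class RootElementSniffer {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    private RootElementSniffer() {
    }

    // Convenience overload for in-memory content.
    static String sniff(String content) {
        return sniff(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
    }

    // Reads up to the first start element and classifies the document by its qualified name.
    static String sniff(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not fetch external entities while sniffing untrusted content.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        QName root = reader.getName();
                        boolean isRdf = RDF_NS.equals(root.getNamespaceURI())
                                && "RDF".equals(root.getLocalPart());
                        return isRdf ? "application/rdf+xml" : "application/xml";
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML; fall through to the generic type.
        }
        return "text/plain";
    }
}
```

Matching on the namespace URI rather than the "rdf" prefix is a deliberate choice here, since documents are free to bind that namespace to any prefix.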

I'm looking into this now and will file a separate issue for it.

