HTML mime-types

HTML mime-types

kkrugler
Currently the tika-config.xml file maps three mime-types to the HtmlParser:

         <parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
                 <mime>text/html</mime>
                 <mime>application/xhtml+xml</mime>
                 <mime>application/x-asp</mime>
         </parser>

I notice that facebook.com returns the following mime-type when the
request carries no Accept: header:

application/vnd.wap.xhtml+xml

I'm wondering whether this should be added to the set and, if so, what
other variants like this are floating around.

Or perhaps we need something like "application/*.xhtml.xml", so that
wildcards can be used in mimetype patterns.
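
To make the wildcard idea concrete, here is a minimal sketch in Java of how such patterns might be matched. The class MediaTypeGlob and its methods are hypothetical and not part of Tika. The pattern is written as application/*.xhtml+xml so that it lines up with the +xml suffix of the type above, which is an assumption about the intent.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika: glob-style matching of media type names. */
final class MediaTypeGlob {
    private final Pattern regex;

    MediaTypeGlob(String glob) {
        // Split on '*' (keeping empty trailing parts), quote each literal run,
        // and join the runs with a wildcard that never crosses the '/' separator.
        String[] parts = glob.toLowerCase(Locale.ROOT).split("\\*", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append("[^/]*");
            }
            sb.append(Pattern.quote(parts[i]));
        }
        this.regex = Pattern.compile(sb.toString());
    }

    /** Returns true if the media type (parameters such as charset ignored) matches the glob. */
    boolean matches(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0 ? mediaType.substring(0, semicolon) : mediaType;
        return regex.matcher(base.trim().toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        MediaTypeGlob glob = new MediaTypeGlob("application/*.xhtml+xml");
        System.out.println(glob.matches("application/vnd.wap.xhtml+xml")); // true
        System.out.println(glob.matches("application/xhtml+xml"));         // false
        System.out.println(glob.matches("text/html"));                     // false
    }
}
```

Note that this particular pattern requires a dot before "xhtml", so plain application/xhtml+xml would still have to be listed explicitly.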

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




Re: HTML mime-types

Jukka Zitting
Hi,

On Mon, Dec 7, 2009 at 3:47 AM, Ken Krugler wrote:

> Currently the tika-config.xml file maps three mime-types to the HtmlParser:
>
>        <parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
>                <mime>text/html</mime>
>                <mime>application/xhtml+xml</mime>
>                <mime>application/x-asp</mime>
>        </parser>
>
> I notice that facebook.com, if you don't specify an Accept: value in the
> request header, returns this for the mime-type:
>
> application/vnd.wap.xhtml+xml
>
> Wondering if this should be added to the set, and if so then what other
> variants like this are floating around.

Sounds good to me. For now we can add more types as we encounter them.

> Or if we need something like "application/*.xhtml.xml" so that wildcards can
> be used in mimetype patterns.

Ideally I'd like to see the media type registry be smart enough to
resolve such type relationships and the CompositeParser class improved
to take advantage of that when choosing the best parser for an
incoming document.
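
As a rough illustration of that idea, the sketch below walks up a chain of declared supertypes until it finds a type that has a parser. It is written in plain Java with hypothetical names (SupertypeParserLookup, addSupertype, addParser, findParser) and is not the actual registry or CompositeParser code. Treating application/vnd.wap.xhtml+xml as a specialization of application/xhtml+xml is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: choose a parser by falling back to declared supertypes. */
final class SupertypeParserLookup {
    private final Map<String, String> supertypes = new HashMap<>();
    private final Map<String, String> parsers = new HashMap<>();

    /** Declares that 'type' is a specialization of 'supertype'. */
    void addSupertype(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    /** Registers the parser class name that handles 'type'. */
    void addParser(String type, String parserClass) {
        parsers.put(type, parserClass);
    }

    /** Returns the parser for the type itself or for its nearest supertype that has one. */
    Optional<String> findParser(String type) {
        String current = type;
        // The depth bound keeps the walk finite even if the declarations contain a cycle.
        for (int depth = 0; current != null && depth <= supertypes.size(); depth++) {
            String parser = parsers.get(current);
            if (parser != null) {
                return Optional.of(parser);
            }
            current = supertypes.get(current);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SupertypeParserLookup lookup = new SupertypeParserLookup();
        // Assumed relationship, for illustration only.
        lookup.addSupertype("application/vnd.wap.xhtml+xml", "application/xhtml+xml");
        lookup.addParser("application/xhtml+xml", "org.apache.tika.parser.html.HtmlParser");
        System.out.println(lookup.findParser("application/vnd.wap.xhtml+xml"));
    }
}
```

With relationships like this recorded once, new variants would only need a supertype declaration rather than another entry under every parser.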

BR,

Jukka Zitting