[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494098#comment-16494098 ]

Hudson commented on TIKA-2100:

FAILURE: Integrated in Jenkins build tika-2.x-windows #260 (See [https://builds.apache.org/job/tika-2.x-windows/260/])
TIKA-2100 extract content language from html lang attribute (gbouchar: rev 7536ed91afd9a3fe744464e34f95f3108c6bd5a2)
* (edit) tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java

> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.19, 2.0.0
> Parsing a very simple HTML document such as
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you won't be able to access the html tag's attributes (here lang="en") in the ContentHandler:
> * In the method startElement(String ns, String localName, String name, Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through the HtmlMapper.mapSafeAttribute method either.
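
For illustration, here is a minimal sketch of the kind of handler the report describes: one that inspects the attributes passed to startElement for the html element. It uses only the JDK's built-in SAX parser, not Tika's HtmlParser, so it does not reproduce the bug. It only shows the attributes a handler would expect to receive. The class and method names (HtmlTagAttributesDemo, HtmlAttributeCollector, htmlAttributes) are hypothetical, and the sample page is fed in as well-formed XML without the DOCTYPE line.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; plain JDK SAX, no Tika dependency.
class HtmlTagAttributesDemo {

    /** Collects the attributes seen on the <html> element. */
    static class HtmlAttributeCollector extends DefaultHandler {
        final Map<String, String> htmlAttributes = new LinkedHashMap<>();

        @Override
        public void startElement(String ns, String localName, String name, Attributes atts) {
            // 'name' is the qualified element name; this is where the report
            // says 'atts' arrives empty when the handler is driven by Tika.
            if ("html".equalsIgnoreCase(name)) {
                for (int i = 0; i < atts.getLength(); i++) {
                    htmlAttributes.put(atts.getQName(i), atts.getValue(i));
                }
            }
        }
    }

    /** Parses well-formed (X)HTML and returns the attributes of the html tag. */
    static Map<String, String> htmlAttributes(String xhtml) throws Exception {
        HtmlAttributeCollector handler = new HtmlAttributeCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.htmlAttributes;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html lang=\"en\">"
                + "<head><title>Page Title</title></head>"
                + "<body><h1 align=\"left\">My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";
        System.out.println(htmlAttributes(page)); // prints {lang=en}
    }
}
```

Passing the same handler to Tika's HTML parser is the case the report describes, where the collected map would come back empty.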

This message was sent by Atlassian JIRA