[jira] Created: (TIKA-268) HTMLParser omits necessary space characters when parsing table data

[jira] Created: (TIKA-268) HTMLParser omits necessary space characters when parsing table data

Jason Grey (Jira)
HTMLParser omits necessary space characters when parsing table data
---------------------------------------------------------------------

                 Key: TIKA-268
                 URL: https://issues.apache.org/jira/browse/TIKA-268
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
         Environment: Win, Mac, Lin; Java 5+
            Reporter: Joachim Zittmayr
            Priority: Critical


When an HTML file containing a table is passed to Tika, the HTML parser does not output whitespace between table cells.

Example:

Input
------------------------------
<table>
  <tr>
    <td>Apache LUCENE</td><td>is f****** amazing!</td>
  </tr>
  <tr>
    <td>Apache TIKA</td><td>freaks you out!</td>
  </tr>
</table>
------------------------------

Output
------------------------------

Apache LUCENEis f****** amazing!

Apache TIKAfreaks you out!

------------------------------

Unfortunately, I didn't have time to investigate HTMLParser.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-268) HTMLParser omits necessary space characters when parsing table data

Jason Grey (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joachim Zittmayr updated TIKA-268:
----------------------------------

    Affects Version/s:     (was: 0.5)
                       0.3

> HTMLParser omits necessary space characters when parsing table data
> ---------------------------------------------------------------------
>
>                 Key: TIKA-268
>                 URL: https://issues.apache.org/jira/browse/TIKA-268
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3, 0.4
>         Environment: Win, Mac, Lin; Java 5+
>            Reporter: Joachim Zittmayr
>            Priority: Critical
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When an HTML file containing a table is passed to Tika, the HTML parser does not output whitespace between table cells.
> Example:
> Input
> ------------------------------
> <table>
>   <tr>
>     <td>Apache LUCENE</td><td>is f****** amazing!</td>
>   </tr>
>   <tr>
>     <td>Apache TIKA</td><td>freaks you out!</td>
>   </tr>
> </table>
> ------------------------------
> Output
> ------------------------------
> Apache LUCENEis f****** amazing!
> Apache TIKAfreaks you out!
> ------------------------------
> Unfortunately, I didn't have time to investigate HTMLParser.



[jira] Commented: (TIKA-268) HTMLParser omits necessary space characters when parsing table data

Jason Grey (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744436#action_12744436 ]

Uwe Schindler commented on TIKA-268:
------------------------------------

The problem is that the HTML parser strips all tags that are not in SAFE_ELEMENTS. <TABLE> tags are replaced by <P>, and all inner tags are simply ignored and not passed through. Since all other ContentHandlers (such as OOXML, OpenXML, ...) produce XHTML table tags, the HTML parser should preserve the table as well. This can be achieved by modifying the SAFE_ELEMENTS map.

If you then convert the output to plain text, it will contain tabs and newlines, because XHTMLContentHandler adds ignorableWhitespace between table tags and newlines after HTML block tags.
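
To illustrate the effect described above, here is a minimal, self-contained sketch. It is not Tika's code: it uses the JDK's SAX parser on well-formed input, and the names SafeElementsSketch, TextCollector and toText are hypothetical. The map stands in for a SAFE_ELEMENTS-style lookup (input tag name to output tag name), and the tab and newline placement is a simplified assumption about how a text-only conversion could separate cells and rows.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's implementation.
class SafeElementsSketch {

    /** Collects text; only elements listed in the map influence the output. */
    static class TextCollector extends DefaultHandler {
        private final Map<String, String> safeElements; // input name -> output name
        private final StringBuilder text = new StringBuilder();

        TextCollector(Map<String, String> safeElements) {
            this.safeElements = safeElements;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Character data is always kept, even inside dropped elements.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String mapped = safeElements.get(qName.toLowerCase());
            if (mapped == null) {
                return; // element is not "safe": the tag is dropped silently
            }
            if (mapped.equals("td") || mapped.equals("th")) {
                text.append('\t'); // assumed cell separator
            } else if (mapped.equals("tr") || mapped.equals("p")) {
                text.append('\n'); // assumed newline after rows and blocks
            }
        }

        String text() {
            return text.toString();
        }
    }

    static String toText(String xhtml, Map<String, String> safeElements) throws Exception {
        TextCollector handler = new TextCollector(safeElements);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<table><tr><td>Apache LUCENE</td><td>is amazing!</td></tr>"
                + "<tr><td>Apache TIKA</td><td>freaks you out!</td></tr></table>";

        // Table replaced by a paragraph, inner tags unknown: cell text runs together.
        System.out.print(toText(xhtml, Map.of("table", "p")));

        // Table tags preserved: cells are separated by tabs, rows by newlines.
        System.out.print(toText(xhtml,
                Map.of("table", "table", "tr", "tr", "td", "td")));
    }
}
```

With the first map the cell texts are concatenated without any separator, which matches the reported symptom; with the second map each cell is followed by a tab and each row by a newline.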



[jira] Resolved: (TIKA-268) HTMLParser omits necessary space characters when parsing table data

Jason Grey (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-268.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Fixed in revision 806887 based on Uwe's suggestion.
