[jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471358#comment-16471358 ]

Geoff Baskwill commented on TIKA-2479:

Hi [~gagravarr] ... what I found when rendering HTML tables that didn't have the right number of cells was that they looked really bad (and were semantically incorrect) without the missing rows and right-hand cells. I suppose a post-processing step could go through and fill in the missing columns for people who know about this behaviour, but we can't fix the missing rows in post-processing, as the knowledge that the rows are missing is lost.
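To make the post-processing idea concrete, here is a rough sketch of what such a column-filling step might look like. It is not Tika code: the class RowPadder and the method padRows are hypothetical names, and it assumes the table has already been read into a list of rows, each a list of cell strings. It pads every row out to the width of the widest row. It cannot put back rows that were skipped entirely, because nothing in the input records where they were.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
final class RowPadder {

    private RowPadder() {
    }

    /**
     * Returns a copy of the table in which every row has as many cells
     * as the widest row, padding short rows on the right with "".
     * Rows that are absent from the input stay absent.
     */
    static List<List<String>> padRows(List<List<String>> rows) {
        // First pass: find the widest row.
        int width = 0;
        for (List<String> row : rows) {
            width = Math.max(width, row.size());
        }

        // Second pass: copy each row and append empty cells as needed.
        List<List<String>> padded = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> copy = new ArrayList<>(row);
            while (copy.size() < width) {
                copy.add("");
            }
            padded.add(copy);
        }
        return padded;
    }
}
```

Note that this only handles cells missing at the end of a row. If the extractor also drops empty cells in the middle of a row, the remaining cells have already shifted left and padding on the right would put them in the wrong columns.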

I would agree that it would be preferable not to add config options, but perhaps there's no other way to balance between "I'd like to get an HTML table that accurately represents the sheet content so I can properly extract meaning from it" and "I'd like to have something close to the old behaviour and amount of output when my sheet has sparse data"?

The motivation for trying to get an accurate representation comes from an accessibility project I was working on with sheets that had merged cells (another problem that I didn't manage to fully solve in the time I had available). Without the correct number of cells, the merging goes really wrong really quickly.


> Handle empty cells in tables uniformly
> --------------------------------------
>                 Key: TIKA-2479
>                 URL: https://issues.apache.org/jira/browse/TIKA-2479
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: patch.diff
> It looks like we output a <td/> for empty cells in xls, and tables in doc, docx and pptx.  However, we don't retain empty cells in xlsx or tables in ppt.  We should make this handling uniform.
