[jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471940#comment-16471940 ]

Nick Burch commented on TIKA-2479:
----------------------------------

As long as we add in missing left / mid cells, your table should look fine-ish, and it would certainly be semantically meaningful, as you'd just have some rows that don't extend as far to the right as others. We don't want to have to do two passes to find the widest row to pad to. Plus, with just one "accidentally broken" row where someone put something in column ZZ, your Tika output would suddenly explode! Specifying different minimum row cell counts on a per-sheet basis (the only way to avoid the double pass) seems fiddly / brittle to me.

You'd then end up with something like:
||A||B||C||
|12|22|42|
| |24|42|
| | |42|
|1|2|3|
|42|
|1|2|3|
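
To make the single-pass idea concrete, here is a minimal sketch in plain Java, not taken from the Tika parsers or from the attached patch. It assumes each row arrives as a sorted map from zero-based column index to cell text; the class and method names (`SparseRowPadding`, `padLeftAndMid`) are hypothetical. Gaps to the left of and between populated cells are filled with empty strings, and nothing is appended after the last populated cell, so no second pass over the sheet is needed to find the widest row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SparseRowPadding {

    /**
     * Pads one sparse row. Keys are zero-based column indexes, values are
     * cell text. Missing left / mid cells become empty strings; the row is
     * not extended past its last populated cell.
     */
    public static List<String> padLeftAndMid(SortedMap<Integer, String> sparseRow) {
        List<String> cells = new ArrayList<>();
        int nextColumn = 0;
        for (Map.Entry<Integer, String> entry : sparseRow.entrySet()) {
            // Fill the gap between the previous populated cell and this one.
            while (nextColumn < entry.getKey()) {
                cells.add("");
                nextColumn++;
            }
            cells.add(entry.getValue());
            nextColumn++;
        }
        return cells;
    }

    public static void main(String[] args) {
        // Row with only column C populated, like "| | |42|" above.
        SortedMap<Integer, String> row = new TreeMap<>();
        row.put(2, "42");
        System.out.println(padLeftAndMid(row).size()); // 3 cells: two empty, then 42
    }
}
```

Because each row is padded only up to its own last cell, a stray value in column ZZ widens that one row and leaves every other row untouched.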

I see the point about missing rows, though. I think we'd need an option to turn that on for when someone cares about them.

What if we hard-coded the XLS and XLSX parsers to always output missing left/mid cells, and put an option on {{OfficeParserConfig}} to let you request missing rows?
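
As a rough illustration of what an opt-in for missing rows could look like, here is a self-contained sketch. It deliberately does not use Tika classes: `SketchConfig` and `setIncludeMissingRows` are hypothetical stand-ins for whatever option might be added, and are not existing `OfficeParserConfig` methods. Rows are assumed to arrive as a sorted map from zero-based row index to already-rendered row text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MissingRowSketch {

    /** Hypothetical stand-in for the proposed option; not a real Tika class. */
    public static class SketchConfig {
        private boolean includeMissingRows = false; // off by default

        public void setIncludeMissingRows(boolean includeMissingRows) {
            this.includeMissingRows = includeMissingRows;
        }

        public boolean isIncludeMissingRows() {
            return includeMissingRows;
        }
    }

    /**
     * Returns rows in order. When the option is on, an empty string is
     * emitted for every row index that has no data before a populated row.
     */
    public static List<String> renderRows(SortedMap<Integer, String> rowsByIndex,
                                          SketchConfig config) {
        List<String> out = new ArrayList<>();
        int nextRow = 0;
        for (Map.Entry<Integer, String> entry : rowsByIndex.entrySet()) {
            if (config.isIncludeMissingRows()) {
                // Emit placeholders for the skipped row indexes.
                while (nextRow < entry.getKey()) {
                    out.add("");
                    nextRow++;
                }
            }
            out.add(entry.getValue());
            nextRow = entry.getKey() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "|1|2|3|");
        rows.put(2, "|42|");

        SketchConfig config = new SketchConfig();
        config.setIncludeMissingRows(true);
        renderRows(rows, config).forEach(System.out::println);
    }
}
```

Keeping the flag off by default would leave existing output unchanged for callers who do not care about row positions.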

(Merged cell handling would probably be best split out as a new issue!)

> Handle empty cells in tables uniformly
> --------------------------------------
>
>                 Key: TIKA-2479
>                 URL: https://issues.apache.org/jira/browse/TIKA-2479
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: patch.diff
>
>
> It looks like we output a <td/> for empty cells in xls, and tables in doc, docx and pptx.  However, we don't retain empty cells in xlsx or tables in ppt.  We should make this handling uniform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)