[jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470703#comment-16470703 ]

Nick Burch commented on TIKA-2479:
----------------------------------

Having hit a similar thing with TIKA-2641, I'm tempted to make the XLS and XLSX parsers output missing left/mid cells up to a limit, but ignore missing rows and missing right-hand cells. That would prevent very sparse spreadsheets from suddenly generating far more text output than they currently do, whilst giving us the correct table layout for files with just the odd missing cell.
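
As a rough sketch of the rule described above (not the attached patch, and not Tika's actual parser code), the padding could look something like the following. The class name, the method names, the per-row cap and the choice to simply stop padding once the cap is reached are all assumptions made for illustration; cell text is appended without XML escaping to keep the example short.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only: none of these names exist in Tika.
class SparseRowSketch {

    // Render one row. 'cells' maps zero-based column index to cell text
    // and contains only the cells actually present in the file.
    // 'maxFiller' is an assumed per-row cap on empty <td/> cells.
    static String renderRow(SortedMap<Integer, String> cells, int maxFiller) {
        StringBuilder sb = new StringBuilder("<tr>");
        int nextCol = 0;     // column we expect to write next
        int fillerUsed = 0;  // empty cells emitted so far in this row
        for (Map.Entry<Integer, String> e : cells.entrySet()) {
            int col = e.getKey();
            // Pad missing left/mid cells, but only until the cap is hit.
            while (nextCol < col && fillerUsed < maxFiller) {
                sb.append("<td/>");
                nextCol++;
                fillerUsed++;
            }
            sb.append("<td>").append(e.getValue()).append("</td>");
            nextCol = col + 1;
        }
        // Missing right-hand cells are ignored: nothing is emitted after
        // the last present cell.
        return sb.append("</tr>").toString();
    }

    // Render a sheet. Only rows present in the map are visited, so
    // missing rows produce no output at all.
    static String renderSheet(SortedMap<Integer, SortedMap<Integer, String>> rows,
                              int maxFiller) {
        StringBuilder sb = new StringBuilder();
        for (SortedMap<Integer, String> row : rows.values()) {
            sb.append(renderRow(row, maxFiller));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> row = new TreeMap<>();
        row.put(1, "a");
        row.put(3, "b");
        // prints <tr><td/><td>a</td><td/><td>b</td></tr>
        System.out.println(renderRow(row, 10));
    }
}
```

With a cap of 1 instead of 10, the same row would come out as `<tr><td/><td>a</td><td>b</td></tr>`: the first gap is padded and the second is not, which is the trade-off between correct layout and bounded output size.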

I don't want to suddenly make the output from sparse files huge, and I'd rather not add lots of config options that people then have to tune, but equally we want to avoid surprising users.

Anyone have any thoughts / suggestions / objections to that plan, before I apply a slightly modified form of the attached pull request + matching changes for XLS?

> Handle empty cells in tables uniformly
> --------------------------------------
>
>                 Key: TIKA-2479
>                 URL: https://issues.apache.org/jira/browse/TIKA-2479
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: patch.diff
>
>
> It looks like we output a <td/> for empty cells in xls, and tables in doc, docx and pptx.  However, we don't retain empty cells in xlsx or tables in ppt.  We should make this handling uniform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)