[jira] [Commented] (TIKA-2641) Unit test for consistency between tabular/columnar formats

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470671#comment-16470671 ]

Nick Burch commented on TIKA-2641:

I've generated test files for CSV, SAS7BDAT, XLS and XLSX using a small SAS program, and created an ODS file manually. No DB format files have been generated yet.

Currently, the XLS and XLSX file tests are part-disabled, because blank cells are being skipped in the SAX output. Do we want to enable missing-cells support in POI and output empty td elements for these, or not?
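To illustrate what "output empty td's" would mean for the XHTML, here is a minimal sketch in plain Java. It does not use POI's missing-cell support or Tika's handlers; `SparseRowRenderer`, `renderRow` and the column-index map are hypothetical names invented for this example. The assumption is that the parser only reports the non-blank cells of a row, and that the table width is known.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class SparseRowRenderer {

    /**
     * Renders one row as XHTML, emitting an empty td for every column
     * that has no entry in the map, so columns stay aligned across rows.
     *
     * @param cells column index -> text, containing only the non-blank cells
     * @param width total number of columns expected in the table
     */
    static String renderRow(Map<Integer, String> cells, int width) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (int col = 0; col < width; col++) {
            String text = cells.get(col);
            if (text == null) {
                // Blank cell: keep the column position with an empty element.
                sb.append("<td/>");
            } else {
                sb.append("<td>").append(escape(text)).append("</td>");
            }
        }
        return sb.append("</tr>").toString();
    }

    /** Minimal escaping for element text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With cells at columns 0 and 2 and a width of 3, the row keeps three td elements, with the middle one empty. Without the padding, the third value would shift into the second column, which is the kind of mismatch that makes the XLS and XLSX output hard to compare against the other formats.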

The ODS test is disabled because we're getting an "org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared" error when trying to generate the XML version of the file with the TikaTest helper. Not sure whether this highlights a parser bug on simple files, or a mistake in the unit test helper.

CSV is part-enabled: because we don't have a dedicated CSV parser, we just get plain text output.

SAS7BDAT testing is enabled.

> Unit test for consistency between tabular/columnar formats
> ----------------------------------------------------------
>                 Key: TIKA-2641
>                 URL: https://issues.apache.org/jira/browse/TIKA-2641
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.18
>            Reporter: Nick Burch
>            Priority: Minor
> We now have a number of parsers that deal with file formats which are either wholly or optionally "table-based", with consistency in the data types held in a given column. This includes multi-table formats like sqlite, single-table formats like sas7bdat, and anything-goes-table formats like csv or xlsx.
> We should firstly try to create a simple-ish, small but rich file for each of these formats, similar to what we do for archive formats with the {{test-documents}} archives. Then, we should add unit tests that verify that, as much as the formats permit, you get basically the same XHTML out for the "same" input. Oh, and fix up any obvious inconsistencies...
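One way to check "basically the same XHTML" is to compare only the table cells and ignore the surrounding markup. The sketch below is not the test that was added for this issue and does not use the TikaTest helper; `TableCells` and `extract` are hypothetical names. It uses only the JDK's DOM parser and assumes the XHTML is well-formed, has no DOCTYPE to resolve, and contains no nested tables.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not part of Tika. */
final class TableCells {

    /** Returns the trimmed text of every td, grouped by tr, in document order. */
    static List<List<String>> extract(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            List<List<String>> rows = new ArrayList<>();
            // "*" matches any namespace, so only the local element name matters.
            NodeList trs = doc.getElementsByTagNameNS("*", "tr");
            for (int i = 0; i < trs.getLength(); i++) {
                NodeList tds = ((Element) trs.item(i)).getElementsByTagNameNS("*", "td");
                List<String> row = new ArrayList<>();
                for (int j = 0; j < tds.getLength(); j++) {
                    row.add(tds.item(j).getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

A consistency test could then assert that `TableCells.extract(xlsxXhtml)` equals `TableCells.extract(sas7bdatXhtml)`, where the two strings are the XHTML produced for the "same" data in each format. Formats that only yield plain text, as CSV currently does, would need a looser comparison.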

This message was sent by Atlassian JIRA