[jira] [Commented] (TIKA-2641) Unit test for consistency between tabular/columnar formats


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463076#comment-16463076 ]

Hudson commented on TIKA-2641:

FAILURE: Integrated in Jenkins build tika-2.x-windows #246 (See [https://builds.apache.org/job/tika-2.x-windows/246/])
Stub a unit test for TIKA-2641 (nick: rev d4719f63ffb381dbbfc53e667379389cb26593c1)
* (add) tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java

> Unit test for consistency between tabular/columnar formats
> ----------------------------------------------------------
>                 Key: TIKA-2641
>                 URL: https://issues.apache.org/jira/browse/TIKA-2641
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.18
>            Reporter: Nick Burch
>            Priority: Minor
> We now have a number of parsers that deal with file formats which are either wholly or optionally "table-based", with consistency in the data types held in a given column. This includes multi-table formats like sqlite, single-table formats like sas7bdat, and anything-goes-table formats like csv or xlsx.
> We should first try to create a simple-ish, small but rich file for each of these formats, similar to what we do for archive formats with the {{test-documents}} archives. Then, we should add unit tests that verify that, as much as the formats permit, you get basically the same XHTML out for the "same" input. Oh, and fix up any obvious inconsistencies...
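
The consistency check described in the issue could be sketched roughly as below. This is only an illustration, not the contents of TabularFormatsTest.java: the class name, the extractRows helper, and the two XHTML strings are hypothetical stand-ins for what two different parsers might emit for the "same" table. It uses only JDK XML classes and compares cell text row by row, ignoring wrapper elements and whitespace between cells.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TabularConsistencySketch {

    // Hypothetical helper: returns the trimmed text of every <th>/<td> cell,
    // grouped by <tr>, so layout differences between parsers do not matter.
    static List<List<String>> extractRows(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child instanceof Element) {
                    String name = ((Element) child).getTagName();
                    if (name.equals("td") || name.equals("th")) {
                        cells.add(child.getTextContent().trim());
                    }
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample output from two parsers for the same logical table.
        String fromCsv =
                "<table><tr><th>id</th><th>name</th></tr>"
              + "<tr><td>1</td><td>a</td></tr></table>";
        String fromXlsx =
                "<table><tbody>\n<tr><th>id</th> <th>name</th></tr>\n"
              + "<tr><td>1</td><td>a</td></tr></tbody></table>";

        boolean same = extractRows(fromCsv).equals(extractRows(fromXlsx));
        System.out.println("consistent: " + same);
    }
}
```

In a real test the two strings would presumably come from running each format's parser over its sample file; how strictly to compare (cell text only, or also column types and headers) is left open by the issue.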

This message was sent by Atlassian JIRA