[jira] [Commented] (TIKA-1332) Create tika-eval module

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870492#comment-15870492 ]

Hudson commented on TIKA-1332:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1199 (See [https://builds.apache.org/job/Tika-trunk/1199/])
TIKA-1332 -- fix one report for eval profiler and clean up whitespace (tallison: rev 506b572560f6c7f44270b55877f110719a7d4b1f)
* (edit) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
* (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
* (edit) tika-eval/src/main/resources/comparison-reports.xml
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (edit) tika-eval/src/main/resources/profile-reports.xml
* (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
TIKA-1332 -- downgrade Lucene to 5.x to allow for Java 7 (tallison: rev d194ba4022dffa61cad2a12ea0092f6ec00588d2)
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
* (edit) tika-eval/pom.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java


> Create tika-eval module
> -----------------------
>
>                 Key: TIKA-1332
>                 URL: https://issues.apache.org/jira/browse/TIKA-1332
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 2.0, 1.15
>
>         Attachments: comparison_reports.xml
>
>
> For this issue, we can start with code to gather statistics on each run (# of exceptions per file type, most common exceptions per file type, number of metadata items, total text extracted, etc).  We should also be able to compare one run against another.  Going forward, there's plenty of room to improve.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)