[jira] [Commented] (TIKA-2750) Update regression corpus

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676812#comment-16676812 ]

floyd commented on TIKA-2750:

Over the last 4 days I ran another test on the regression VM. I wanted to see how long it would take to narrow down /data1/docs/commoncrawl3/ with 6 worker threads (using nearly 100% CPU on the regression VM):

$ /data1/fuzzing/tools/afl-kit/afl-cmin.py --no-dedup -i '/data1/docs/commoncrawl3/*/' -o /data1/fuzzing/tika-corpus-data1-docs-cmined/ -m none -t 30000 -w 6 /data1/fuzzing/tools/jqf-zip/bin/jqf-afl-target edu.berkeley.cs.jqf.examples.tika.TikaParserTest fuzz @@
Hint: install python module "tqdm" to show progress bar
2018-11-02 17:06:34,070 - INFO - Found 819070 input files in 1024 directories
2018-11-02 17:06:34,071 - INFO - Skipping file deduplication.
2018-11-02 17:06:34,071 - INFO - Sorting files.
2018-11-02 17:06:44,474 - INFO - Testing the target binary
2018-11-02 17:06:52,606 - INFO - ok, 2729 tuples recorded
2018-11-02 17:06:52,689 - INFO - Obtaining trace results

However, it seems that afl-cmin.py is not able to create traces faster than about 300 files per hour. After around 4 days it was still nowhere near done:

$ ls -1 ./tika-corpus-data1-docs-cmined/.traces/ |wc -l

As that commoncrawl folder has 819,070 files, tracing alone would take roughly 4 months or more at that rate (819,070 files at no more than 300 per hour is at least 2,730 hours, or about 114 days). Only then would the sorting and best-candidate selection start. And if the data then turns out to be too big for an operation (e.g. not enough RAM or disk), it would probably all fail and the run would have been useless.
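The estimate can be checked with a few lines of arithmetic. This is only a sketch: the function name `estimate_days` is made up for illustration, and the inputs are the file count from the afl-cmin.py log and the observed rate of about 300 traces per hour. Because that rate is an upper bound, the result is a lower bound on the duration.

```python
def estimate_days(total_files: int, files_per_hour: float) -> float:
    """Return the number of days needed to trace total_files at a constant rate."""
    if files_per_hour <= 0:
        raise ValueError("files_per_hour must be positive")
    return total_files / files_per_hour / 24


if __name__ == "__main__":
    # 819070 input files (from the log above) at no more than ~300 traces per hour.
    days = estimate_days(819070, 300)
    print(f"{days:.0f} days, or about {days / 30:.1f} months")
```

This covers only the trace phase; the sorting and candidate selection that follow are not included.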

So maybe it would be better to do some manual cleanup first (e.g. remove ASCII-only files) and then do several runs on smaller parts of the entire corpus.
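A minimal sketch of what such a pre-filter could look like, assuming "ASCII-only" means that every byte in the file is below 0x80. The helper names (`is_ascii_only`, `batches`, `plan_runs`) and the batch size are hypothetical and not part of afl-cmin.py or Tika.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def is_ascii_only(path: Path, block_size: int = 1 << 20) -> bool:
    """Return True if every byte in the file is below 0x80 (empty files count as ASCII)."""
    with path.open("rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                return True
            if not block.isascii():
                return False


def batches(paths: Iterable[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield successive lists of at most batch_size paths."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Path] = []
    for path in paths:
        batch.append(path)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def plan_runs(corpus_dir: Path, batch_size: int) -> List[List[Path]]:
    """Drop ASCII-only files, then group the remaining files into batches."""
    candidates = sorted(
        p for p in corpus_dir.rglob("*") if p.is_file() and not is_ascii_only(p)
    )
    return list(batches(candidates, batch_size))
```

Each batch could then be copied or symlinked into its own directory and minimized in a separate, shorter run, so that a failure in one run does not throw away the work done on the others.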

> Update regression corpus
> ------------------------
>                 Key: TIKA-2750
>                 URL: https://issues.apache.org/jira/browse/TIKA-2750
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: CC-MAIN-2018-39-charset_lang_by_tld.zip, CC-MAIN-2018-39-mimes-charsets-by-tld.zip, CC-MAIN-2018-39-mimes-v-detected.zip
> I think we've had great success with the current data on our regression corpus.  I'd like to refresh some data from common crawl with three primary goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely more ooxml)
> 3) identify and re-fetch truncated documents from the original site -- CommonCrawl truncates docs at 1 MB.  I think some truncated documents have been quite useful, similar to fuzzing, for identifying serious problems with some of our parsers.  However, it would be useful to have more complete files, esp. for PDFs.  In short, we should keep some truncated documents, but I'd also like to get more complete docs.
