I took the top 20k tokens by document frequency from the Wikipedia dump for each language.
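The counting step above can be sketched as follows. This is a minimal illustration (not the actual tika-eval code): each token is counted at most once per article, so the counts are document frequencies rather than raw term frequencies.

```python
from collections import Counter

def top_tokens_by_doc_freq(articles, k=20_000):
    """articles: iterable of token lists, one list per article.
    Counts each token once per article (document frequency),
    then keeps the k most frequent tokens."""
    df = Counter()
    for tokens in articles:
        df.update(set(tokens))  # set(): one count per document
    return df.most_common(k)

# Toy usage: "common" appears in both documents, "rare" in one.
docs = [["common", "rare", "common"], ["common", "other"]]
print(top_tokens_by_doc_freq(docs, k=2))
```

Note that `set(tokens)` is what makes a word repeated many times in one article count only once, which keeps a single verbose page from dominating the list.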
I ignored wikipedia pages that conflicted with Optimaize's language id (e.g. if I was processing the ptwiki, and Optimaize identified it as "es", I ignored the page).
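The filter is just an agreement check between the wiki's language and the detector's output. The real processing used Optimaize's Java language detector; the `detect_lang` stub below is a hypothetical stand-in so the sketch is self-contained.

```python
def detect_lang(text):
    # Hypothetical stand-in for a real language detector such as
    # Optimaize (used in the actual processing). Trivial rule here:
    # treat text containing the Spanish article " el " as Spanish.
    return "es" if " el " in text else "pt"

def keep_page(page_text, expected_lang):
    """Skip pages whose detected language disagrees with the
    language of the wiki being processed (e.g. ptwiki -> 'pt')."""
    return detect_lang(page_text) == expected_lang
```

So while processing ptwiki, a page the detector labels "es" is dropped before it can contribute tokens to the Portuguese model.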
I used some heuristics to try to ignore pages that were link/reference articles or other non-content articles.
I attempted to randomly sample 500k articles. For English, I only pulled the first 10 bzips. For the other languages, I pulled all.
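Sampling a fixed number of articles from a dump whose size isn't known up front can be done with reservoir sampling; this is one reasonable way to implement the step above, not necessarily the one actually used.

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length
    in one pass (e.g. ~500k articles from a full wiki dump)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a reservoir slot with probability k / (i + 1)
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample
```

The one-pass property matters here because decompressing the dump is the expensive step; a two-pass approach would double that cost.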
I removed common html markup tokens (e.g. body, html, script). If we kept those, a page whose html extraction failed and yielded raw markup would incorrectly inflate the "common tokens" count.
I removed terms that were < 4 characters long except for CJK.
I added ___url___ and ___email___ so that those would exist for every language model.
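The three token-level rules above (drop markup tokens, drop short terms except CJK, add the synthetic tokens) can be sketched together. The markup stopword set shown is an assumed subset of the full removal list, and the CJK check covers only the basic ideograph block for brevity.

```python
# Assumed subset of the full markup-token removal list
MARKUP_STOPWORDS = {"body", "html", "script"}

def is_cjk(token):
    """True if the token contains a CJK ideograph.
    Checks only the CJK Unified Ideographs block (U+4E00-U+9FFF);
    the real filter would need to cover more ranges."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)

def filter_tokens(tokens):
    kept = set()
    for t in tokens:
        if t in MARKUP_STOPWORDS:
            continue  # drop common html markup tokens
        if len(t) < 4 and not is_cjk(t):
            continue  # drop short terms, except CJK
        kept.add(t)
    # Synthetic tokens guaranteed to exist in every language model
    kept.update({"___url___", "___email___"})
    return kept
```

The CJK exception is needed because many CJK "words" are one or two characters long, so a blanket length cutoff would discard most of the vocabulary.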
If we change the underlying Lucene analysis chain, we'll have to reprocess the wikidumps.
The files are sorted by descending document frequency. The wiki markup stripper clearly wasn't perfect (words from links/references show up frequently), but this seems like a reasonable start.
full list of words removed:
> Add common tokens files for tika-eval
> Key: TIKA-2267
> URL: https://issues.apache.org/jira/browse/TIKA-2267
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
> We should add some common tokens files for popular languages for tika-eval so that users don't have to generate their own.
This message was sent by Atlassian JIRA