Here is my draft of the report. Let me know if you guys concur, and I'll
add it to the wiki:
Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.
There have been a number of positive items within Tika during the last few
months. The traffic on the Tika mailing list has increased significantly
(with typically 2 or 3 questions, and 1 or 2 commits, every day or every other
day), and there have been a lot of recent inquiries from external projects
wanting to collaborate with Tika (including Aperture, PDFBox and a fellow
developing a JSON library currently hosted at Google Code). In addition,
Tika's architecture has recently become a topic of discussion (as we'll see
below).
We recently elected Keith Bennett as a new committer to Tika. Keith has been
spearheading many of the new patches committed to Tika, as well as
participating in discussions about the architecture and future direction of
Tika.
Tika will be represented at the "Fast Feather" track at ApacheCon US by
Jukka Zitting. The rest of the community is helping to create the content
for the presentation. The abstract is listed below:
Tika is a new content analysis framework born from the desire to factor out
commonality from the Apache Nutch search engine framework. Tika provides a
mime detection framework, an extensible parsing framework and metadata
environment for content analysis. Though Tika is in its nascent stages, progress
has recently taken shape and the project is nearing a stable 0.1 release.
In this talk, we'll describe the core APIs of Tika and discuss its use in
several distinct domains including search engines, scientific data
dissemination and an industrial setting.
There has been a flurry of JIRA issues and code activity (including 47
issues currently in JIRA, with 32 resolved issues, 14 closed issues, and 2
open major/minor issues in progress).
Tika's Parser interface (one of its key components) has just undergone a
major overhaul led by Jukka Zitting, and Chris Mattmann has recently
contributed a MimeType system (with help from fellow Apache Nutch committer
Jerome Charron) to Tika. We also cleaned up and refactored large parts of
the rest of the code (removing references to LuisLite and branding the
project wherever possible with the Tika name), in preparation for an
upcoming 0.1 release.
Chris Mattmann has led an effort to carve out the existing MimeType
detection system in Apache Nutch and replace it with Tika's improved
MimeType detection system. There is a patch sitting in JIRA right now,
and barring objections, Nutch will rely on Tika for its MimeType detection.
Also active recently were committers Bertrand Delacretaz, Sami Siren and
Rida Benjelloun, committing patches and improvements wherever needed.
Issues before graduation
No changes since our last report: the Tika project is still at an
early stage of incubation. We need to continue bringing in the initial
codebases and are targeting an initial incubating release (0.1) probably
within the next month. We also need to work on growing the community and
figuring out how to best interact with external parser projects.
On 10/8/07, Bertrand Delacretaz <[hidden email]> wrote:
> On 10/8/07, Chris Mattmann <[hidden email]> wrote:
> > ....Let me know what you guys think....
> +1 to the report, thanks Chris!