Metadata Discussion Status

Paul Jakubik
Hi,

A while ago I added the http://wiki.apache.org/tika/MetadataDiscussion page
to the Tika wiki.

Since then, with the help of Jukka Zitting, a solution has been described
for using the current Tika library to capture nested document metadata and
associate that with the text extracted for each nested document.

What hasn't been accomplished is identifying a way to get to both the
metadata and text for nested documents without the user writing a
ContentHandler.

Here are some possibilities for moving forward:

   - Decide that anyone who wants to identify the text and metadata
   associated with each nested document must write their own ContentHandler and
   ParserDecorator that gathers and associates text with the corresponding
   metadata.
   - Point out easier ways to accomplish the same thing with the existing
   Tika libraries.
   - Provide a new Parser and ContentHandler combination that gathers
   subdocument text and metadata together and provides a stream of events
   (maybe something other than XHTML) with easier recursive document and
   metadata handling.
   - Come up with a way to add nested metadata to the XHTML stream without
   producing invalid XHTML.

Are there any thoughts on how to move forward? Is it okay if users who want
to extract nested documents with metadata resort to writing their own
content handlers and parser decorators? Or would the Tika team prefer to
offer an easier way for users to extract nested documents with metadata?

Paul

Re: Metadata Discussion Status

Jukka Zitting
Hi,

On Mon, Aug 2, 2010 at 10:36 PM, Paul Jakubik <[hidden email]> wrote:
> A while ago I added the http://wiki.apache.org/tika/MetadataDiscussion page
> to the Tika wiki.
>
> Since then, with the help of Jukka Zitting, a solution has been described
> for using the current Tika library to capture nested document metadata and
> associate that with the text extracted for each nested document.

Thanks for documenting this all on the wiki!

> What hasn't been accomplished is identifying a way to get to both the
> metadata and text for nested documents without the user writing a
> ContentHandler.
> [...]
> Are there any thoughts on how to move forward? Is it okay if users who want
> to extract nested documents with metadata resort to writing their own
> content handlers and parser decorators? Or would the Tika team prefer to
> offer an easier way for users to extract nested documents with metadata?

It would be great if you or someone else could come up with some nice
and clean utility classes for this.

PS. You wondered about how to get the text content of a component
document. That's pretty simple. Just extend my earlier example to:

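   // Overrides parse(...) in the class from the earlier example.
   @Override
   public void parse(
           InputStream stream, ContentHandler handler,
           Metadata metadata, ParseContext context)
           throws IOException, SAXException, TikaException {
       // Collect this component document's text in its own handler
       // instead of passing it on to the outer handler.
       ContentHandler content = new BodyContentHandler();
       super.parse(stream, content, metadata, context);

       // Print the component document's metadata, then its text.
       System.out.println("----");
       System.out.println(metadata);
       System.out.println("----");
       System.out.println(content.toString());
   }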
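
A utility class along these lines could collect each document's metadata
and text as a pair instead of printing them. What follows is only a minimal
sketch under stated assumptions, not part of Tika: DocumentRecord and
RecordCollector are hypothetical names, metadata is simplified to a plain
Map, and a JDK SAX parser stands in for the super.parse(...) call so the
example runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical holder: one document's metadata paired with its text.
final class DocumentRecord {
    final Map<String, String> metadata;
    final String text;

    DocumentRecord(Map<String, String> metadata, String text) {
        this.metadata = metadata;
        this.text = text;
    }
}

// Hypothetical collector: keeps records in a list instead of printing them.
final class RecordCollector {
    private final List<DocumentRecord> records = new ArrayList<>();

    // Parses one document with its own text handler and stores the result.
    // A JDK SAX parser stands in here for the super.parse(...) call.
    void add(Map<String, String> metadata, String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        records.add(new DocumentRecord(metadata, text.toString()));
    }

    List<DocumentRecord> records() {
        return records;
    }

    public static void main(String[] args) {
        RecordCollector collector = new RecordCollector();
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", "inner.txt"); // made-up sample value
        collector.add(metadata, "<p>Hello <b>world</b></p>");
        for (DocumentRecord record : collector.records()) {
            System.out.println(record.metadata + " -> " + record.text);
        }
    }
}
```

In a real Tika setup the fresh handler would presumably be a
BodyContentHandler and the record would be stored right after
super.parse(...) returns, so the caller gets a list of metadata and text
pairs without writing a ContentHandler of their own.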

PPS. I'm currently writing a chapter about this technique and other
ways to use Tika parsers for our Tika In Action book [1]. This chapter,
chapter five, should become available on the Manning early access
program within a month or two. We'd love to see comments on the
existing chapters and on topics to be covered in future chapters. The
book forum is at [2].

[1] http://www.manning.com/mattmann/
[2] http://www.manning-sandbox.com/forum.jspa?forumID=678

BR,

Jukka Zitting