Packages and attributes

Packages and attributes

Paul Jakubik
Hi,

I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like
to get access to the metadata for the individual files inside of the
package.

It looks like there has been some discussion about how to provide the
metadata, and from looking at the code I don't think any of the proposed
solutions have been implemented yet:
http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/200906.mbox/%3C3949E4F8-0ACF-4BA4-8FFC-57AF8A783C69@...%3E
http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/200907.mbox/%3C510143ac0907300409u699a3953t9b2dfbd6bb63367a@...%3E

It looks like the last suggestion was to add attributes to the <div> element
for each file for each metadata entry. Unfortunately I don't think the code
does this today.

I'm left with the following questions:
- Has a consensus been reached for how to provide access to the metadata?
- If consensus has been reached, will this be implemented soon, or can I
help by implementing the preferred solution?
- If consensus has not been reached, how can a consensus be reached so
someone can implement this functionality?

Please let me know how I can help move this forward.

Paul

Re: Packages and attributes

Nick Burch-4
On Mon, 12 Jul 2010, Paul Jakubik wrote:
> I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd
> like to get access to the metadata for the individual files inside of
> the package.

I believe there are two different tika enhancements for container formats
needed.

The first is for detection of files which are held in a container format,
eg .doc (several named streams in an OLE2 file) or .xlsx (several named
xml files in a zip file). TIKA-391 and TIKA-447 cover these. This is an
area with some ideas for a possible solution, but more work and review
needed.

Secondly, there's the issue of embedded documents, which could be a .zip
file with half a dozen text files in it, but could equally be a .doc file
with two embedded Excel spreadsheets in it.

For this latter one, there is a little bit of support in Tika already, but
it's not complete, and certainly needs more work. OutlookExtractor is one
place I know of which uses it.

Easy access to embedded document metadata is, I believe, still an
outstanding issue. The solution needs to handle embedded documents,
container formats filled with multiple files (eg get me the metadata of
the 2nd embedded Excel file vs get me the metadata of the file
/foo/bar.jpg in the zip), as well as ideally coping with a single file with
different metadata for different bits of it (I think PDF can do this?)

Assuming I've got all of the above correct, it might be worth creating a
wiki page for this (probably + referencing jira entry), and start trying
to work up a proposed solution that'll handle all the above problems and
use cases.

Nick

Re: Packages and attributes

Paul Jakubik
On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch <[hidden email]>wrote:

> On Mon, 12 Jul 2010, Paul Jakubik wrote:
>
>> I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like
>> to get access to the metadata for the individual files inside of the
>> package.
>>
>
> I believe there are two different tika enhancements for container formats
> needed.
>

I've tried to summarize the various use cases mentioned in your email.
Please let me know if I have correctly captured everything.

- *Containers that are conceptually a single document.* eg .doc (several
named streams in an OLE2 file), or .xlsx (several named xml files in a zip
file)

- *Containers that are conceptually containers of many separate documents.* eg
a zip file with several text files in it, or a tar file with zip files, doc
files, and text files in it.

- *Containers that are both a single document and separate documents.* eg an
email with multiple parts and/or attachments, or a .doc with embedded
spreadsheets.

- *Single documents with metadata associated with regions of the document.* eg
PDF?

From the point of view of reporting metadata for documents, it might be
useful to group these use cases the following way:

- Single documents with multiple sets of metadata
    - Containers that are conceptually single documents
    - PDF?

- Containers that contain many distinct documents and/or containers
    - Containers that are conceptually containers
    - Containers that are conceptually documents and containers

Paul

Re: Packages and attributes

Nick Burch-4
On Mon, 12 Jul 2010, Paul Jakubik wrote:
> I've tried to summarize the various use cases mentioned in your email.
> Please let me know if I have correctly captured everything.

You seem to have got all the cases I can think of, but it's quite possible
that someone else will think up another one :)

Nick

Re: Packages and attributes

Alex Ott
In reply to this post by Paul Jakubik
Re

Paul Jakubik  at "Mon, 12 Jul 2010 11:26:16 -0500" wrote:
 PJ> On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch <[hidden email]>wrote:
 PJ> I've tried to summarize the various use cases mentioned in your email.
 PJ> Please let me know if I have correctly captured everything.

 PJ> - *Containers that are conceptually a single document.* eg .doc (several
 PJ> named streams in an OLE2 file), or .xlsx (several named xml files in a zip
 PJ> file)

 PJ> - *Containers that are conceptually containers of many separate documents.* eg
 PJ> a zip file with several text files in it, or a tar file with zip files, doc
 PJ> files, and text files in it.

 PJ> - *Containers that are both a single document and separate documents.* eg an
 PJ> email with multiple parts and/or attachments, or a .doc with embedded
 PJ> spreadsheets.

 PJ> - *Single documents with metadata associated with regions of the document. *eg
 PJ> PDF?

What about multimedia containers (ASF, OGG, etc.) that can contain data
in different formats? Conceptually they look like a single file, but with
different metadata.

 PJ> From the point of view of reporting metadata for documents, it might be
 PJ> useful to group these use cases the following way:

 PJ> - Single documents with multiple sets of metadata
 PJ>     - Containers that are conceptually single documents
 PJ>     - PDF?

 PJ> - Containers that contain many distinct documents and/or containers
 PJ>     - Containers that are conceptually containers
 PJ>     - Containers that are conceptually documents and containers

Maybe it is worth separating the metadata of top-level objects from the
metadata of embedded objects, and allowing traversal through the hierarchy
of embedded objects? We could then provide several implementations, such as
a collector of metadata for all embedded objects, a collector of top-level
metadata only, etc.

This could improve performance in some cases (imho), because for some
tasks people need only the top-level metadata.

--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott
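
The collector idea above can be sketched in a few lines. This is only an illustration under stated assumptions: MetadataCollector, TopLevelOnlyCollector, AllLevelsCollector and DocNode are hypothetical names that do not exist in Tika, and the document tree is built by hand rather than produced by a parser. The point it shows is that a collector which declines a depth lets the traversal skip the whole subtree, which is where the performance gain would come from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical interface: none of these names exist in Tika.
interface MetadataCollector {
    boolean wants(int depth);                              // visit objects at this depth?
    void collect(int depth, Map<String, String> metadata); // receive one object's metadata
}

class TopLevelOnlyCollector implements MetadataCollector {
    final List<Map<String, String>> collected = new ArrayList<>();
    public boolean wants(int depth) { return depth == 0; }
    public void collect(int depth, Map<String, String> metadata) { collected.add(metadata); }
}

class AllLevelsCollector implements MetadataCollector {
    final List<Map<String, String>> collected = new ArrayList<>();
    public boolean wants(int depth) { return true; }
    public void collect(int depth, Map<String, String> metadata) { collected.add(metadata); }
}

// Hand-built stand-in for a document with embedded objects.
class DocNode {
    final Map<String, String> metadata;
    final List<DocNode> children = new ArrayList<>();
    DocNode(String name) { this.metadata = Collections.singletonMap("resourceName", name); }
    DocNode add(DocNode child) { children.add(child); return this; }
}

class CollectorDemo {
    // Returns how many objects were actually visited; unwanted subtrees are skipped.
    static int traverse(DocNode node, int depth, MetadataCollector collector) {
        if (!collector.wants(depth)) {
            return 0;
        }
        collector.collect(depth, node.metadata);
        int visited = 1;
        for (DocNode child : node.children) {
            visited += traverse(child, depth + 1, collector);
        }
        return visited;
    }

    public static void main(String[] args) {
        DocNode root = new DocNode("mail.msg")
                .add(new DocNode("report.doc").add(new DocNode("sheet.xls")))
                .add(new DocNode("photo.jpg"));
        System.out.println(traverse(root, 0, new TopLevelOnlyCollector())); // prints 1
        System.out.println(traverse(root, 0, new AllLevelsCollector()));    // prints 4
    }
}
```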

Re: Packages and attributes

Paul Jakubik
On Mon, Jul 12, 2010 at 12:59 PM, Alex Ott <[hidden email]> wrote:

>
> Maybe it is worth separating the metadata of top-level objects from the
> metadata of embedded objects, and allowing traversal through the hierarchy
> of embedded objects? We could then provide several implementations, such as
> a collector of metadata for all embedded objects, a collector of top-level
> metadata only, etc.
>
>
As long as the final solution can also handle containers with thousands or
millions of documents, each document having its own set of metadata.

In other words, I need a streaming way of accessing the metadata. To
me, I think that means that either there is some place in the SAX api
where I can get the metadata for the current subdocument (attributes in
the DIV element?) or there is a point in time in using the SAX api where
I can query a metadata object and know it only contains metadata for
the current subdocument (get the current metadata object from the parse
context at the time startElement is called for a new DIV?).

Paul
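
A minimal sketch of the first streaming option described above (metadata carried as attributes on each DIV element), using only the JDK's SAX classes. The "meta:" attribute prefix is an invented convention, and the thread itself notes that Tika did not emit such attributes at the time; the "package-entry" class value is likewise an assumption. The events are simulated by hand instead of coming from a real parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Builds one metadata map per <div class="package-entry"> as the events stream past.
// The "meta:" prefix is a hypothetical convention, not something Tika emits.
class EntryMetadataHandler extends DefaultHandler {
    final List<Map<String, String>> entries = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(localName) || !"package-entry".equals(atts.getValue("class"))) {
            return; // not the start of a subdocument
        }
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            if (name.startsWith("meta:")) {
                metadata.put(name.substring("meta:".length()), atts.getValue(i));
            }
        }
        // Kept in a list here for the demo; a streaming consumer would
        // process the map and discard it instead of accumulating.
        entries.add(metadata);
    }
}

class DivAttributeDemo {
    // Simulates the attributes a package parser could attach to one entry's <div>.
    static AttributesImpl entry(String resourceName, String contentType) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "class", "class", "CDATA", "package-entry");
        atts.addAttribute("", "resourceName", "meta:resourceName", "CDATA", resourceName);
        atts.addAttribute("", "Content-Type", "meta:Content-Type", "CDATA", contentType);
        return atts;
    }

    public static void main(String[] args) {
        EntryMetadataHandler handler = new EntryMetadataHandler();
        handler.startElement("", "div", "div", entry("foo/bar.txt", "text/plain"));
        handler.startElement("", "div", "div", entry("foo/baz.pdf", "application/pdf"));
        System.out.println(handler.entries);
    }
}
```

Because each map is built inside startElement, the handler never needs more than one subdocument's metadata at a time, which is what makes this shape workable for archives with very many entries.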

Re: Packages and attributes

Paul Jakubik
In reply to this post by Nick Burch-4
On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch <[hidden email]>wrote:

> Assuming I've got all of the above correct, it might be worth creating a
> wiki page for this (probably + referencing jira entry), and start trying to
> work up a proposed solution that'll handle all the above problems and use
> cases.
>

I created a wiki page for this discussion (
http://wiki.apache.org/tika/MetadataDiscussion). I don't know if that is
what you were thinking of. Maybe appropriate developers can edit the page I
created into something that is useful to them.

I'm hoping that the developers can quickly reach a consensus on how to
change the metadata handling so users can get to metadata for nested
documents. For my use of Tika, I'll need to add this soon. I'd prefer to add
this in a way that could be submitted as a patch to Tika, or at least in a
way that is close enough to how Tika will eventually handle this problem
that it won't be hard to adapt to Tika's real solution once available.

Please let me know if there is a way I can help move this process along.

Paul

Re: Packages and attributes

Nick Burch-4
On Wed, 14 Jul 2010, Paul Jakubik wrote:
> I created a wiki page for this discussion (
> http://wiki.apache.org/tika/MetadataDiscussion). I don't know if that is
> what you were thinking of.

Looks good to me!

Having looked through your proposed solutions, I can't see easy ways to
implement these use cases:
* enumerate all the Metadata objects at this depth
   eg top level has one Metadata object (for the parent file), 1 level
   down may have 3 Metadata objects, one for each of the 3 child documents
* get the Metadata for a specific embedded document
   eg I know my zip file has "/foo/bar.doc" in it, give me the metadata
   for that

There should probably be some mention of how users can opt in or out of
the nested metadata extraction. Some people won't want anything from
embedded documents, so they'll set the context appropriately, and the
parser won't touch the embedded files. Some may want text content, but not
care about the metadata (I think someone on the list raised this use
case). Some may want both text and metadata.

Nick
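
The two lookups above (everything at a given depth, and one named embedded document) could be served by a small index filled in while parsing. The sketch below is an assumption-laden illustration: EmbeddedMetadataIndex is a hypothetical helper, not a Tika class, plain string maps stand in for Metadata objects, and the entries are recorded by hand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika.
class EmbeddedMetadataIndex {
    private static final class Entry {
        final int depth;
        final String name;
        final Map<String, String> metadata;

        Entry(int depth, String name, Map<String, String> metadata) {
            this.depth = depth;
            this.name = name;
            this.metadata = metadata;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // depth 0 is the parent file, depth 1 its direct children, and so on.
    void record(int depth, String name, Map<String, String> metadata) {
        entries.add(new Entry(depth, name, metadata));
    }

    // Use case 1: enumerate all metadata objects at one depth.
    List<Map<String, String>> atDepth(int depth) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.depth == depth) {
                result.add(e.metadata);
            }
        }
        return result;
    }

    // Use case 2: metadata for one named embedded document, or null if absent.
    Map<String, String> forName(String name) {
        for (Entry e : entries) {
            if (e.name.equals(name)) {
                return e.metadata;
            }
        }
        return null;
    }
}

class EmbeddedMetadataIndexDemo {
    static EmbeddedMetadataIndex sample() {
        EmbeddedMetadataIndex index = new EmbeddedMetadataIndex();
        index.record(0, "archive.zip", Collections.singletonMap("Content-Type", "application/zip"));
        index.record(1, "/foo/bar.doc", Collections.singletonMap("Content-Type", "application/msword"));
        index.record(1, "/a.txt", Collections.singletonMap("Content-Type", "text/plain"));
        index.record(1, "/b.txt", Collections.singletonMap("Content-Type", "text/plain"));
        return index;
    }

    public static void main(String[] args) {
        EmbeddedMetadataIndex index = sample();
        System.out.println(index.atDepth(1).size());     // prints 3
        System.out.println(index.forName("/foo/bar.doc"));
    }
}
```

Note that an index like this keeps every entry in memory, so it trades away the streaming behaviour needed for very large archives.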

Re: Packages and attributes

Jukka Zitting
In reply to this post by Paul Jakubik
Hi,

On Thu, Jul 15, 2010 at 1:14 AM, Paul Jakubik <[hidden email]> wrote:
> I'm hoping that the developers can quickly reach a consensus on how to
> change the metadata handling so users can get to metadata for nested
> documents.

The way I recommend is to pass a custom Parser implementation through
the ParseContext. This gives you detailed access to each component
document.

You noted that this approach wouldn't work for recursive metadata. Why?

BR,

Jukka Zitting
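
Why passing a parser through the context reaches every component document can be shown with simplified stand-ins. Everything below is hypothetical (MiniParser, MiniContext, MiniDoc and the two parser classes are not Tika types); the only assumption carried over from Tika is that the container parser asks the context for the parser to use on each embedded document. A wrapper registered in the context is therefore called once per component, at any nesting level.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins; none of these are Tika classes.
interface MiniParser {
    void parse(MiniDoc doc, MiniContext context);
}

class MiniDoc {
    final String name;
    final List<MiniDoc> embedded = new ArrayList<>();
    MiniDoc(String name) { this.name = name; }
    MiniDoc add(MiniDoc child) { embedded.add(child); return this; }
}

// A type-keyed bag of objects handed down through the whole parse.
class MiniContext {
    private final Map<Class<?>, Object> values = new HashMap<>();
    <T> void set(Class<T> key, T value) { values.put(key, value); }
    <T> T get(Class<T> key) { return key.cast(values.get(key)); }
}

// Container parser: hands each embedded document to the parser found in the context.
class MiniContainerParser implements MiniParser {
    public void parse(MiniDoc doc, MiniContext context) {
        MiniParser forChildren = context.get(MiniParser.class);
        if (forChildren == null) {
            return; // no parser registered, so embedded documents are skipped
        }
        for (MiniDoc child : doc.embedded) {
            forChildren.parse(child, context);
        }
    }
}

// Wrapper: delegates first, then records the document it was asked to parse.
class RecordingParser implements MiniParser {
    final List<String> seen = new ArrayList<>();
    private final MiniParser delegate;
    RecordingParser(MiniParser delegate) { this.delegate = delegate; }

    public void parse(MiniDoc doc, MiniContext context) {
        delegate.parse(doc, context);
        seen.add(doc.name);
    }
}

class ContextDemo {
    static MiniDoc sample() {
        return new MiniDoc("outer.tar.gz")
                .add(new MiniDoc("a.txt"))
                .add(new MiniDoc("inner.zip").add(new MiniDoc("b.pdf")));
    }

    public static void main(String[] args) {
        RecordingParser recording = new RecordingParser(new MiniContainerParser());
        MiniContext context = new MiniContext();
        context.set(MiniParser.class, recording);
        recording.parse(sample(), context);
        System.out.println(recording.seen); // prints [a.txt, b.pdf, inner.zip, outer.tar.gz]
    }
}
```

Since the wrapper records after delegating, children appear before their container, so the output is a depth-first, post-order listing.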

Re: Packages and attributes

Paul Jakubik
On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting <[hidden email]>wrote:

> The way I recommend is to pass a custom Parser implementation through
> the ParseContext. This gives you detailed access to each component
> document.
>
> You noted that this approach wouldn't work for recursive metadata. Why?
>
>
I didn't think of passing in a custom parser as a way to get metadata. Now
that you mention it, for my needs I could clone the AutoDetectParser,
change the code to handle Metadata however I want (e.g. keep a metadata
stack, send notifications, or some other solution I haven't thought of), and
pass this new parser through the ParseContext.

Given this solution, I'm left wondering if capturing the metadata for
nested documents is an oddball use case that most users don't want, or if
this is a common use case that many users would like to see Tika support.
In other words, should a new parser type be added to Tika's library of
parsers, or should this be left as an exercise for the users who want
metadata?

Paul

Re: Packages and attributes

Paul Jakubik
In reply to this post by Nick Burch-4
On Thu, Jul 15, 2010 at 6:30 AM, Nick Burch <[hidden email]> wrote:

>
> Having looked through your proposed solutions, I can't see easy ways to
> implement these use cases:
> * enumerate all the Metadata objects at this depth
>  eg top level has one Metadata object (for the parent file), 1 level
>   down may have 3 Metadata objects, one for each of the 3 child documents
> * get the Metadata for a specific embedded document
>  eg I know my zip file has "/foo/bar.doc" in it, give me the metadata
>  for that
>
>
I'm new to Tika, and pretty focused on my use case. I don't have
answers for the above use cases, and instead I have a couple of
questions:
* How does Tika support enumerating the text for all documents at
  a certain depth?
* How does Tika support getting the text for a specific embedded
  document?

Paul

Re: Packages and attributes

Paul Jakubik
In reply to this post by Jukka Zitting
On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting <[hidden email]>wrote:

> The way I recommend is to pass a custom Parser implementation through
> the ParseContext. This gives you detailed access to each component
> document.
>
>
I looked at the code a little further, and I don't see exactly how I can do
this.

I am using an AutoDetectParser, and in my ParseContext I've placed
another AutoDetectParser. At the top level I might be parsing a tar.gz,
and inside this tar.gz there are text, PDF, and zip files.

As far as I can tell, when I start to parse files embedded in one of the
containers (tar.gz or zip), it is actually PackageExtractor that gets the
parser from the ParseContext, and it is also PackageExtractor that
creates a new Metadata object that it doesn't share, thus keeping me
from being able to look at the metadata.

Does this mean that, to get access to the metadata for subdocuments,
I would need to do the following?
* Create replacements for PackageParser and PackageExtractor
  that do what I want with the metadata
* Use get parsers and set parsers on the AutoDetectParser to
  replace the parser for each of the following MediaTypes:

                MediaType.application("x-archive"),
                MediaType.application("x-bzip"),
                MediaType.application("x-bzip2"),
                MediaType.application("x-cpio"),
                MediaType.application("x-gtar"),
                MediaType.application("x-gzip"),
                MediaType.application("x-tar"),
                MediaType.application("zip")

I wonder if it would be easier to update PackageExtractor to check if
there is a metadata stack in the ParseContext, and if so, push the
new metadata object just before parsing a subdocument, and pop the
metadata object just after the parse (maybe just after writing the
end of the <div> section).

Paul
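
The push/pop idea above can be sketched without any Tika classes. MetadataStack is a hypothetical name, plain string maps stand in for Metadata objects, and the walk method only simulates what a package extractor would do around each subdocument; it is not PackageExtractor's actual code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stack that an extractor could look up in the parse context.
class MetadataStack {
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    void push(Map<String, String> metadata) { stack.push(metadata); }
    Map<String, String> pop() { return stack.pop(); }
    Map<String, String> current() { return stack.peek(); } // null when empty
    int depth() { return stack.size(); }
}

class MetadataStackDemo {
    // Simulates extracting nested entries: push before parsing, pop afterwards.
    static void walk(String name, Map<String, List<String>> children,
                     MetadataStack stack, List<String> seen) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("resourceName", name);
        stack.push(metadata); // just before parsing the subdocument

        // A content handler could query the stack at this point and be sure
        // that current() belongs to the subdocument being parsed.
        seen.add(stack.depth() + ":" + stack.current().get("resourceName"));

        for (String child : children.getOrDefault(name, Collections.<String>emptyList())) {
            walk(child, children, stack, seen);
        }
        stack.pop(); // just after the subdocument's closing </div>
    }

    static Map<String, List<String>> sampleTree() {
        Map<String, List<String>> children = new HashMap<>();
        children.put("outer.tar.gz", Arrays.asList("a.txt", "inner.zip"));
        children.put("inner.zip", Arrays.asList("b.pdf"));
        return children;
    }

    public static void main(String[] args) {
        MetadataStack stack = new MetadataStack();
        List<String> seen = new ArrayList<>();
        walk("outer.tar.gz", sampleTree(), stack, seen);
        System.out.println(seen); // prints [1:outer.tar.gz, 2:a.txt, 2:inner.zip, 3:b.pdf]
    }
}
```

The stack never holds more than one metadata object per nesting level, so memory use depends on how deeply containers are nested rather than on how many entries they hold.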

Re: Packages and attributes

Jukka Zitting
Hi,

On Fri, Jul 16, 2010 at 2:43 AM, Paul Jakubik <[hidden email]> wrote:
> On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting <[hidden email]>wrote:
>> The way I recommend is to pass a custom Parser implementation through
>> the ParseContext. This gives you detailed access to each component
>> document.
>
> I looked at the code a little further, and I don't see exactly how I can do
> this.

Looks like you're approaching this from the wrong perspective. See the
example below (or at http://pastebin.com/ZNfCQ9bk) for a recursive
depth-first traversal that prints out the metadata of all the
component documents.

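    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ParserDecorator;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // The enclosing class name is arbitrary.
    public class RecursiveMetadataExample {

        public static void main(String[] args) throws Exception {
            // Register the decorator in the context so that it is also
            // used for each component document, not only the top level.
            Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
            ParseContext context = new ParseContext();
            context.set(Parser.class, parser);

            ContentHandler handler = new DefaultHandler();
            Metadata metadata = new Metadata();

            InputStream stream = TikaInputStream.get(new File(args[0]));
            try {
                parser.parse(stream, handler, metadata, context);
            } finally {
                stream.close();
            }
        }

        private static class RecursiveMetadataParser extends ParserDecorator {

            public RecursiveMetadataParser(Parser parser) {
                super(parser);
            }

            @Override
            public void parse(
                    InputStream stream, ContentHandler handler,
                    Metadata metadata, ParseContext context)
                    throws IOException, SAXException, TikaException {
                super.parse(stream, handler, metadata, context);

                // Printed after the delegate returns, so the metadata of
                // nested documents appears before that of their container.
                System.out.println("----");
                System.out.println(metadata);
            }

        }

    }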

BR,

Jukka Zitting

Re: Packages and attributes

Paul Jakubik
Thank you for this example! Is there any chance this example could be
added to the Tika wiki?

On Fri, Jul 16, 2010 at 1:30 AM, Jukka Zitting <[hidden email]>wrote:

> Hi,
>
> On Fri, Jul 16, 2010 at 2:43 AM, Paul Jakubik <[hidden email]>
> wrote:
> > On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting <[hidden email]
> >wrote:
> >> The way I recommend is to pass a custom Parser implementation through
> >> the ParseContext. This gives you detailed access to each component
> >> document.
> >
> > I looked at the code a little further, and I don't see exactly how I can
> do
> > this.
>
> Looks like you're approaching this from the wrong perspective. See the
> example below (or at http://pastebin.com/ZNfCQ9bk) for a recursive
> depth-first traversal that prints out the metadata of all the
> component documents.
>
>    public static void main(String[] args) throws Exception {
>        Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
>        ParseContext context = new ParseContext();
>        context.set(Parser.class, parser);
>
>        ContentHandler handler = new DefaultHandler();
>        Metadata metadata = new Metadata();
>
>        InputStream stream = TikaInputStream.get(new File(args[0]));
>        try {
>            parser.parse(stream, handler, metadata, context);
>        } finally {
>            stream.close();
>        }
>    }
>
>    private static class RecursiveMetadataParser extends ParserDecorator {
>
>        public RecursiveMetadataParser(Parser parser) {
>            super(parser);
>        }
>
>        @Override
>        public void parse(
>                InputStream stream, ContentHandler handler,
>                Metadata metadata, ParseContext context)
>                throws IOException, SAXException, TikaException {
>            super.parse(stream, handler, metadata, context);
>
>            System.out.println("----");
>            System.out.println(metadata);
>        }
>
>    }
>
> BR,
>
> Jukka Zitting
>

Re: Packages and attributes

Mattmann, Chris A (3010)
Hi Paul,

Sure. Feel free to sign up for an account (it's free and pretty simple) and then you can just copy/paste and start a wiki page on your own. We welcome your contribution!

Cheers,
Chris


On 7/16/10 8:29 AM, "Paul Jakubik" <[hidden email]> wrote:

Thank you for this example! Is there any chance this example could be
added to the Tika wiki?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [hidden email]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Packages and attributes

Paul Jakubik
I have added Jukka Zitting's recursive metadata example to the Tika wiki at
http://wiki.apache.org/tika/RecursiveMetadata. I also added some notes on
what I did so I could get the metadata for a nested document along with the
text for that document.

Finally, I modified the http://wiki.apache.org/tika/MetadataDiscussion page
to point to the RecursiveMetadata page for a solution for ContentHandler
implementers to get access to recursive metadata.

I hope this helps.

Paul

Re: Packages and attributes

Mattmann, Chris A (3010)
Thanks Paul!


On 8/2/10 1:18 PM, "Paul Jakubik" <[hidden email]> wrote:

I have added Jukka Zitting's recursive metadata example to the Tika wiki at
http://wiki.apache.org/tika/RecursiveMetadata. I also added some notes on
what I did so I could get the metadata for a nested document along with the
text for that document.

Finally, I modified the http://wiki.apache.org/tika/MetadataDiscussion page
to point to the RecursiveMetadata page for a solution for ContentHandler
implementers to get access to recursive metadata.

I hope this helps.

Paul


