metadata key declarations

chriscorbell
Hi,

I'm a newb here but I've browsed the source, Jira and list archives so I
think I'm getting a feel for what tika is.  (Also I'll be at ApacheCon next
week & look forward to the tika session).

I have some general questions about metadata keys in tika.  I see some
common metadata key declarations, e.g. DublinCore.java and
CreativeCommons.java.  I also see some basic introspection in the Metadata
class to get the list of keys it contains and whether an entry is
multi-valued.  However, for the most part the actual list of defined keys
seems to be compiled knowledge in each concrete parser's source code, which
the client code is presumably expected to be closely aligned with.  I don't
see any pattern of public declaration of key sets apart from the couple of
string-constant files mentioned, nor runtime introspection (or
configuration) of the keys or sets of keys a particular parser extracts.
Also there doesn't appear to be any simple textual approach to
key-name-collision avoidance, e.g. package-style names or some other
namespace convention.
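
To make the namespace idea concrete, here is a rough sketch of what I mean
by package-style key names plus simple introspection.  Everything in it is
hypothetical: NamespacedKeys, the key strings and namespaceOf() are made up
for illustration and are not taken from the tika source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: none of these names exist in tika.
final class NamespacedKeys {
    // Package-style prefixes keep two "title" keys from colliding.
    // Both prefixes are invented for this example.
    static final String DC_TITLE = "org.dublincore.title";
    static final String ACME_TITLE = "com.acme.broadcast.title";

    // Insertion-ordered store of key -> values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value, so one key may hold several values.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Lists the keys that currently hold at least one value.
    Set<String> names() {
        return values.keySet();
    }

    // True when more than one value is stored under the key.
    boolean isMultiValued(String key) {
        List<String> stored = values.get(key);
        return stored != null && stored.size() > 1;
    }

    // Everything before the last dot is treated as the namespace.
    static String namespaceOf(String key) {
        int dot = key.lastIndexOf('.');
        return dot < 0 ? "" : key.substring(0, dot);
    }
}
```

With a convention like this a client could group or filter keys by prefix
without knowing which parser produced them.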

Has there been discussion of anything like well-defined metadata "schemas"
where collections of keys supported for particular mime-types are declared
and possibly extended in a predictable way?
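
As a strawman for what such a declaration could look like, here is a
minimal sketch of a per-mime-type key registry.  KeySchemaRegistry and its
methods are invented for this post (nothing like it exists in tika as far
as I can tell); the point is only that declaring keys up front makes both
lookup and collision detection cheap.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch only: not an existing tika class.
final class KeySchemaRegistry {
    // mime type -> sorted set of keys declared for it
    private final Map<String, Set<String>> keysByMimeType = new HashMap<>();

    // Declares (or extends) the key set for a mime type.
    // Returns the keys that had already been declared, i.e. the collisions.
    Set<String> declare(String mimeType, Collection<String> keys) {
        Set<String> declared =
            keysByMimeType.computeIfAbsent(mimeType, m -> new TreeSet<>());
        Set<String> collisions = new TreeSet<>(keys);
        collisions.retainAll(declared);
        declared.addAll(keys);
        return collisions;
    }

    // Read-only view of the keys declared for a mime type (empty if none).
    Set<String> keysFor(String mimeType) {
        return Collections.unmodifiableSet(
            keysByMimeType.getOrDefault(mimeType, Collections.emptySet()));
    }
}
```

A parser (or a subclass of one) would call declare() when it is registered,
and a non-empty return value would flag keys that some other parser already
claims for the same mime type.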

I have a few use cases in mind; one involves a configuration/deployment for
a particular vertical (e.g., a records management repository for a specific
industry, or a DAM repository for a specialized broadcast media producer,
etc.) Here the end-user wants to aggregate a lot of "deep" extraction
capability for some formats, some more or less standardized and some very
domain- or organization-specific, and possibly configure which metadata is
extracted from each file type based on workflow details that may change with
business context.  Having some high(er)-level way of determining what keys
different parsers generate from each mime type and what "standard" keys are
implemented may be useful.  Another use case is just for the development
community adding new parsers including possibly subclasses or aggregations
of existing parsers; how do we keep track of all implemented keys for each
mime type, avoid collisions or unintentional overwrites, etc?  Thanks for
discussion, & pointers to other relevant threads are welcome too.

- Chris