Metadata use by Apache Java projects

9 messages
Metadata use by Apache Java projects

Jeremias Maerki-2
(I realize this is heavy cross-posting but it's probably the best way to
reach all the players I want to address.)

As you may know, I've started developing an XMP metadata package inside
XML Graphics Commons in order to support XMP metadata (and ultimately
PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.

What is XMP? For those who don't know about it: XMP (Extensible Metadata
Platform) is based on a subset of RDF and provides a flexible, extensible
way of storing and representing document metadata.

Yesterday, I was surprised to discover that Adobe has published an XMP
Toolkit with Java support under the BSD license. In contrast to my
effort, Adobe's toolkit is quite complete if maybe a bit more
complicated to use. That got me thinking:

Every project I'm sending this message to is using document metadata in
some form:
- Apache XML Graphics: embeds document metadata in the generated files
(just FOP at the moment, but Batik is a similar candidate)
- Tika (in incubation): has as one of its main purposes the extraction
of metadata
- Sanselan (in incubation): extracts and embeds metadata from/in bitmap
images
- PDFBox (incubation in discussion): extracts and embeds XMP metadata
from/in PDF files (see also JempBox)

Every one of these projects has its own means to represent metadata in
memory. Wouldn't it make sense to have a common approach? I've worked
with XMP for some time now and I can say it's ideal to work with. It
also defines guidelines to embed XMP metadata in various file formats.
It's also relatively easy to map metadata between different file formats
(Dublin Core, EXIF, PDF Info etc.).

Sanselan and Tika have both chosen a very simple approach, but is it
versatile enough for the future? While the simple Map<String, String[]> in
Tika allows for multiple authors, for example, it doesn't support
language alternatives for things such as dc:title or dc:description.
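To make the limitation concrete, here is a minimal sketch in plain Java (all names are illustrative, not Tika's actual API): a flat Map<String, String[]> can carry multiple values per key, but has nowhere to record which language each value is in, whereas an XMP-style Alt keyed by language tag can.

```java
import java.util.*;

public class LangAltSketch {
    public static void main(String[] args) {
        // Flat model: multiple values per key, but no way to say
        // which language each value is in.
        Map<String, String[]> flat = new HashMap<String, String[]>();
        flat.put("dc:creator", new String[] { "John Doe", "Jane Doe" });
        flat.put("dc:title", new String[] { "Manual", "Bedienungsanleitung" }); // which one is German?

        // A richer, XMP-like model (hypothetical, for illustration only):
        // an "Alt" keyed by language tag, with an x-default entry.
        Map<String, Map<String, String>> alts =
                new HashMap<String, Map<String, String>>();
        Map<String, String> title = new LinkedHashMap<String, String>();
        title.put("x-default", "Manual");
        title.put("de", "Bedienungsanleitung");
        title.put("fr", "Mode d'emploi");
        alts.put("dc:title", title);

        // Look up the title in the user's language, falling back to x-default.
        String lang = "de";
        Map<String, String> t = alts.get("dc:title");
        String s = t.containsKey(lang) ? t.get(lang) : t.get("x-default");
        System.out.println(s); // prints Bedienungsanleitung
    }
}
```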

I'm seriously thinking about abandoning most of my XMP package work in
XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
support, though:
- Metadata merging functionality (which I need for synchronizing the PDF
Info object and the XMP packet for PDF/A)
- Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
easier programming (which both Ben and I have written for JempBox and
XML Graphics Commons). Adobe's toolkit only allows generic access.

Some links:
Adobe XMP website: http://www.adobe.com/products/xmp/
Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
JempBox: http://sourceforge.net/projects/jempbox
Apache XML Graphics Commons:
  http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/

My questions:
- Any interest in converging on a unified model/approach?
- If yes, where shall we develop this? As part of Tika (although it's
still in incubation)? As a separate project (maybe as an Apache Commons
subproject)? If more than XML Graphics uses this, XML Graphics is
probably not the right home.
- Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
the JempBox or XML Graphics Commons approach more interesting?
- Where's the best place to discuss this? We can't keep posting to
several mailing lists.

At any rate, I would volunteer to spearhead this effort, especially
since I have immediate need to have complete XMP functionality. I've
almost finished mapping all XMP structures in XG Commons but I haven't
committed my latest changes (for structured properties) and I may still
not cover all details of XMP.

Thanks for reading this far,
Jeremias Maerki


Re: Metadata use by Apache Java projects

Jukka Zitting
Hi,

[Responding just on tika-dev@. I guess Jeremias follows all these
forums, and can summarize in the end...]

On Nov 19, 2007 11:26 AM, Jeremias Maerki <[hidden email]> wrote:
> Every one of these projects has its own means to represent metadata in
> memory. Wouldn't it make sense to have a common approach?

+1

> Sanselan and Tika have both chosen a very simple approach but is it
> versatile enough for the future? While the simple Map<String, String[]> in
> Tika allows for multiple authors, for example, it doesn't support
> language alternatives for things such as dc:title or dc:description.

IMHO it would be good to have a more flexible metadata model in Tika.
Better yet if it's a standard used across multiple projects. Best if
we don't need to implement it in Tika. :-)

> My questions:
> - Any interest in converging on a unified model/approach?

Certainly.

> - If yes, where shall we develop this? As part of Tika (although it's
> still in incubation)? As a separate project (maybe as an Apache Commons
> subproject)? If more than XML Graphics uses this, XML Graphics is
> probably not the right home.
> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> the JempBox or XML Graphics Commons approach more interesting?

If there already exists acceptably licensed good code outside the ASF,
then I would prefer using that instead of reinventing the wheel within
the foundation.

BR,

Jukka Zitting

Re: Metadata use by Apache Java projects

chrismattmann
Hi Folks,
 
>> Sanselan and Tika have both chosen a very simple approach but is it
>> versatile enough for the future? While the simple Map<String, String[]> in
>> Tika allows for multiple authors, for example, it doesn't support
>> language alternatives for things such as dc:title or dc:description.
>
> IMHO it would be good to have a more flexible metadata model in Tika.
> Better yet if it's a standard used across multiple projects. Best if
> we don't need to implement it in Tika. :-)

I'm not quite sure I understand how Tika's metadata model isn't flexible
enough. Of course, I'm a bit biased, but I'm really trying to understand
here and haven't been able to. I think it's important to realize that a
balance must be struck between over-bloating a metadata library (by
attaching RDF support, inference, synonym support, etc.) and making sure
that the smallest subset of it is actually useful.

Also, I'd be against moving metadata support out of Tika, because that was
one of the project's original goals, and I think it's advantageous for
Tika to be a provider of a metadata capability (of course, one related to
document/content extraction).

I'm also wondering what it means that Tika doesn't support "language
alternatives". Do you mean synonyms? You also mention that it's relatively
easy in other libraries to map between different file formats' metadata. I
think this is fairly easy to do in Tika too, seeing as its primary purpose
is supporting metadata extraction from different file formats.

>
>> My questions:
>> - Any interest in converging on a unified model/approach?
>
> Certainly.

+1

>
>> - If yes, where shall we develop this? As part of Tika (although it's
>> still in incubation)? As a separate project (maybe as an Apache Commons
>> subproject)? If more than XML Graphics uses this, XML Graphics is
>> probably not the right home.
>> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
>> the JempBox or XML Graphics Commons approach more interesting?
>
> If there already exists acceptably licensed good code outside the ASF,
> then I would prefer using that instead of reinventing the wheel within
> the foundation.

I'm not sure we're "re-inventing the wheel" here, Jukka. Tika's Metadata
framework began in Nutch and, at the time, based on a short survey that
Jerome Charron and I undertook, there was no easy-to-use metadata library
that met the needs of the kinds of things done in Nutch/Tika --
extraction of metadata from large document corpora, supporting multiple
values per key, mapping between keys, etc. So, in my mind, we're definitely
not re-inventing any wheel; the framework was born more out of need and
ease of use than anything else.

In any case, the idea of a common framework is a good one to discuss, and
I'm open to it, so long as people like me can better understand the gaps
in the current Tika Metadata framework and the benefits, to all the
projects that would need it, of addressing those gaps.

Thanks!

Cheers,
  Chris
 


______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Metadata use by Apache Java projects

Jeremias Maerki-2
Hi Chris

On 19.11.2007 18:27:56 Chris Mattmann wrote:

> Hi Folks,
>  
> >> Sanselan and Tika have both chosen a very simple approach but is it
> >> versatile enough for the future? While the simple Map<String, String[]> in
> >> Tika allows for multiple authors, for example, it doesn't support
> >> language alternatives for things such as dc:title or dc:description.
> >
> > IMHO it would be good to have a more flexible metadata model in Tika.
> > Better yet if it's a standard used across multiple projects. Best if
> > we don't need to implement it in Tika. :-)
>
> I'm not quite sure I understand how Tika's metadata model isn't flexible
> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> and haven't been able to. I think it's important to realize that a balance
> must be struck between over-bloating a metadata library (and attaching on
> RDF support, inference, synonym support, etc.) and making sure that the
> smallest subset of it is actually useful.

I'm sorry. I didn't intend to step on anyone's toes.

At any rate, I'm not talking about full RDF support. I'm talking about
XMP, which uses only a subset of RDF.

> Also, I'd be against moving Metadata support out of Tika because that was
> one of the project's original goals (Metadata support), and I think it's
> advantageous for Tika to be a provider for a Metadata capability (of course,
> one related to document/content extraction).

Metadata capability in the context of content extraction, certainly yes.
Nobody disputes that. But other projects have different needs (like
embedding metadata). So in all this there are certain common needs and
I'm trying to see if we can find a common ground in the form of a
uniform way of manipulating and storing metadata in memory while at the
same time working off a freely available standard.

> I'm wondering too what it means that Tika doesn't support "language
> alternatives"? Do you mean synonyms?

Frankly, I don't know if those are synonyms; maybe they are in RDF
terminology. The XMP spec talks about "property qualifiers", of which
"language alternatives" (using xml:lang) are a special case. The easiest
way to explain is by example:

<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:creator>
        <rdf:Seq>
          <rdf:li>John Doe</rdf:li>
          <rdf:li>Jane Doe</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Manual</rdf:li>
          <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
          <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:date>2006-06-02T10:36:40+02:00</dc:date>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

You can see that the title is available in three languages. The example
also shows the case with multiple authors.
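As an aside, the rdf:Alt structure in the example can be read with nothing but the JDK's namespace-aware DOM parser. The following is only a sketch of the mechanics, using the sample data above inlined as a string; it is not a substitute for a real XMP parser:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class XmpAltDemo {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    public static void main(String[] args) throws Exception {
        // The dc:title Alt from the XMP example, inlined for brevity.
        String xmp = "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
            + "<rdf:RDF xmlns:rdf='" + RDF_NS + "'>"
            + "<rdf:Description xmlns:dc='" + DC_NS + "' rdf:about=''>"
            + "<dc:title><rdf:Alt>"
            + "<rdf:li xml:lang='x-default'>Manual</rdf:li>"
            + "<rdf:li xml:lang='de'>Bedienungsanleitung</rdf:li>"
            + "<rdf:li xml:lang='fr'>Mode d'emploi</rdf:li>"
            + "</rdf:Alt></dc:title>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for rdf:/dc:/xml: handling
        Document doc = dbf.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xmp.getBytes("UTF-8")));

        // Walk the rdf:li items and pick the entry whose xml:lang is "de".
        NodeList items = doc.getElementsByTagNameNS(RDF_NS, "li");
        for (int i = 0; i < items.getLength(); i++) {
            Element li = (Element) items.item(i);
            if ("de".equals(li.getAttributeNS(XML_NS, "lang"))) {
                System.out.println(li.getTextContent()); // prints Bedienungsanleitung
            }
        }
    }
}
```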

To access the title using Adobe's XMP toolkit you'd do the following:

XMPMeta meta = XMPMetaFactory.parse(in);
String s;

//Get default title
s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);

//Get title in user language if available
String userLang = System.getProperty("user.language");
s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);

Easy, isn't it? :-) That's the generic access to properties as Adobe's
XMP toolkit provides it. But it can also be useful to have concrete
adapters for easier use and higher type-safety. Here's what I do in XML
Graphics Commons at the moment:

Metadata meta = XMPParser.parseXMP(url);
DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
String s;
s = dc.getTitle();
String userLang = System.getProperty("user.language");
s = dc.getTitle(userLang);

(Obviously, the same could be done for Adobe's XMP toolkit.)

> Also, you mention it's relatively easy
> in other libraries to map between different file format metadata. I think
> that this is fairly easy to do in Tika too, seeing as though its primary
> purpose is support metadata extraction from different file formats.

No argument there. I don't claim I know all the requirements and use
cases of Tika. But I would imagine it's important to preserve as much
metadata as possible. XMP is certainly one of the best containers I've
seen to achieve that goal.


Jeremias Maerki


Re: Metadata use by Apache Java projects

Antoni Mylka-2
In reply to this post by Jeremias Maerki-2
Hi Jeremias, tika-dev

My name is Antoni Mylka. I am involved in aperture.sourceforge.net,
which addresses similar things as Tika; we got your mail on the
tika-dev mailing list. I also work for the Nepomuk Social Semantic
Desktop project, where I'm the maintainer of the Nepomuk Information
Element Ontology. More below.

Your mail addresses four more-or-less orthogonal issues.

1. The standardization of schemas: how the metadata should be
represented, i.e. the URIs of classes and properties.

2. The standardization of the representational language: the conventions
for how to use RDF (e.g. Bags, Seqs, Alts, etc.) and the formal
semantics.

3. The standardization of the API that will work with the RDF triples
and handle operations such as adding, deleting and querying triples.
(And maybe the inference).

4. The standardization of the RDF storage mechanisms.

XMP provides its answers to all these questions, but they aren't the
only ones. I know of at least two other standardization initiatives:

1. The Freedesktop.org XESAM project, a gathering of the major
open-source desktop search engines:
http://xesam.org/main

2. The Nepomuk Social Semantic Desktop project, an EU-funded research
project with a Semantic Web background:
http://nepomuk.semanticdesktop.org

Many of the issues you are bound to run into have already been
recognized and some answers have been given. Naturally, the requirements
may have been different and the solutions aren't optimal, but it may be
interesting for you to skim through the output of those projects. To
sum it up:

1.
Freedesktop.org schema:
<http://xesam.org/main/XesamOntology90>

Nepomuk schema: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
Let the pointers take you from there.
There is also an archive of discussions around the drafts of NIE (there
have been 8 so far):
<http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>

2.
Freedesktop doesn't use any specific representational language, but they
support property inheritance, which they implement themselves without
any general-purpose RDF inference.

Nepomuk uses the Nepomuk Representational Language (NRL). It has been
considered better for our purposes, since it employs more intuitive
semantics: the so-called closed-world assumption. In normal RDF, if you
say that the value of the nie:kisses property is a Human, and you write
"Antoni nie:kisses Frog", you can infer that the frog is a human; in NRL
you can't.

3.
No one has tried to standardize the API; there are many libraries that
work with both in-memory and persistent RDF repositories.

A few pointers:

There are many APIs out there:
* jena.sourceforge.net - a big RDF API by HP
* www.openrdf.org - an RDF API optimized for client/server setups
* http://wiki.ontoworld.org/wiki/RDF2Go - an abstraction API over the above

There are also several APIs that generate "schema-specific adapters"; the
well-known ones in Java are:
* http://wiki.ontoworld.org/wiki/RDFReactor
* elmo
** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
**
http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
* https://sommer.dev.java.net/

Of the above, Elmo is quite stable and advanced.

There have been murmurs about standardizing RDF APIs. Max Völkel (FZI,
maintainer of RDF2Go), Henry Story (www.bblfish.net), and Leo Sauermann
(DFKI, http://leobard.twoday.net) have repeatedly thought about starting
a JSR discussion on an RDF API, but that never happened. The W3C may be
interested in doing something like this (they did it for DOM and XML, I
think); the contact people would be the deployment group:

So, to sum it up: there are many things out there handling RDF in Java,
but nothing dominates yet. In my surroundings (my company,
aperture.sourceforge.net) we prefer to use RDF2Go as "the API"; it's not
perfect but it seems to work quite well.

4.
XMP prescribes that the metadata be contained within the files
themselves. There are many scenarios where this is a limitation. Each
application will have to maintain its indexes by itself and possibly use
a different API to work with XMP storage (in the files) and the common
storage (e.g. an index). There are ongoing efforts to combine the
flexibility of RDF with the search capabilities of Lucene. Two of the
more prominent ones are:

Sesame Lucene Sail
<https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
AFAIK there is no project page yet, but this idea has been worked on for
at least two years now, e.g. in the gnowsis project
www.gnowsis.org

Boca TextIndexing feature
Part of the IBM SLRP
<http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>

In our opinion, such an initiative deserves at least a separate mailing
list. We have already been working on metadata standardization for some
time now and would be happy to help. Chris Mattmann has written that it's
necessary to strike a balance between functionality and over-bloating.
From my own experience I can say that it is VERY difficult :).

Antoni Mylka
[hidden email]


Re: Metadata use by Apache Java projects

chrismattmann
In reply to this post by Jeremias Maerki-2
Hi Jeremias,

>> I'm not quite sure I understand how Tika's metadata model isn't flexible
>> enough? Of course, I'm a bit biased, but I'm really trying to understand here
>> and haven't been able to. I think it's important to realize that a balance
>> must be struck between over-bloating a metadata library (and attaching on
>> RDF support, inference, synonym support, etc.) and making sure that the
>> smallest subset of it is actually useful.
>
> I'm sorry. I didn't intend to step on anyone's toes.
>
> At any rate, I'm not talking about full RDF support. I'm talking about
> XMP, which uses only a subset of RDF.

Great, and I wouldn't worry about stepping on anyone's toes. You certainly
didn't step on mine. My point was, at some point, we're just building
libraries on top of libraries on top of... well, you get the picture. What
I'm interested in is building the smallest metadata library that's actually
useful and can be built upon to add higher-level capabilities, just as Solr
builds on top of Lucene to provide faceted search, etc. Lucene itself
doesn't provide a means for understanding facets, but provides a
library for text indexing; Solr adds that understanding. Similarly here, I
think it would be great for Tika to provide a library to handle Metadata
representation/access, and then for others, to build on top of it to provide
higher level library support (RDF access/etc.).

>
>> Also, I'd be against moving Metadata support out of Tika because that was
>> one of the project's original goals (Metadata support), and I think it's
>> advantageous for Tika to be a provider for a Metadata capability (of course,
>> one related to document/content extraction).
>
> Metadata capability in the context of content extraction, certainly yes.
> Nobody disputes that. But other projects have different needs (like
> embedding metadata). So in all this there are certain common needs and
> I'm trying to see if we can find a common ground in the form of a
> uniform way of manipulating and storing metadata in memory while at the
> same time working off a freely available standard.

Yep I get that. I'm all for that. Could you explain what you mean by
"embedding" metadata? Within a document?

>
>> I'm wondering too what it means that Tika doesn't support "language
>> alternatives"? Do you mean synonyms?
>
[..snip..]
>       <dc:title>
>         <rdf:Alt>
>           <rdf:li xml:lang="x-default">Manual</rdf:li>
>           <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
>           <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
>         </rdf:Alt>
>       </dc:title>
[..snip..]

>
> You can see that the title is available in three languages. The example
> also shows the case with multiple authors.
>
> To access the title using Adobe's XMP toolkit you'd do the following:
>
> XMPMeta meta = XMPMetaFactory.parse(in);
> String s;
>
> //Get default title
> s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);
>
> //Get title in user language if available
> String userLang = System.getProperty("user.language");
> s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);
>
> Easy, isn't it? :-) That's the generic access to properties as Adobe's
> XMP toolkit provides it. But it can also be useful to have concrete
> adapters for easier use and higher type-safety. Here's what I do in XML
> Graphics Commons at the moment:
>
> Metadata meta = XMPParser.parseXMP(url);
> DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
> String s;
> s = dc.getTitle();
> String userLang = System.getProperty("user.language");
> s = dc.getTitle(userLang);

Great example, Jeremias. I think the same type of thing could be built
into Tika, and Tika currently supports some of the functionality that you
mention above. Instead of meta.getLocalizedText, you could make a call to
Tika like:

/* pseudo code of course */
Metadata meta = new Metadata();
TikaParser p = ParserFactory.createParser();
ContentHandler handler = ...;
p.parse(stream, handler, meta);

String s;

s = meta.getMetadata(DublinCore.TITLE);

/* or if you want back all the titles parsed (if more than one) */
List<String> titles = meta.getAllMetadata(DublinCore.TITLE);

So, then you could build a DublinCoreAdapter on top of Tika's Metadata class
too.
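For illustration, such an adapter could be a thin wrapper over a simple string multimap. All class names and the "dc:title[lang]" key convention below are hypothetical sketches, not Tika's or JempBox's actual API:

```java
import java.util.*;

// Hypothetical sketch of a DublinCoreAdapter layered over a simple
// string-multimap metadata model (names are illustrative only).
public class DublinCoreAdapterSketch {
    static class Metadata {
        private final Map<String, List<String>> values =
                new HashMap<String, List<String>>();
        void add(String name, String value) {
            List<String> list = values.get(name);
            if (list == null) {
                list = new ArrayList<String>();
                values.put(name, list);
            }
            list.add(value);
        }
        String get(String name) {
            List<String> list = values.get(name);
            return (list == null || list.isEmpty()) ? null : list.get(0);
        }
    }

    static class DublinCoreAdapter {
        private final Metadata meta;
        DublinCoreAdapter(Metadata meta) { this.meta = meta; }
        String getTitle() { return meta.get("dc:title"); }
        // One possible convention for language alternatives:
        // look up a language-qualified key, falling back to the default.
        String getTitle(String lang) {
            String s = meta.get("dc:title[" + lang + "]");
            return (s != null) ? s : getTitle();
        }
    }

    public static void main(String[] args) {
        Metadata meta = new Metadata();
        meta.add("dc:title", "Manual");
        meta.add("dc:title[de]", "Bedienungsanleitung");
        DublinCoreAdapter dc = new DublinCoreAdapter(meta);
        System.out.println(dc.getTitle());      // prints Manual
        System.out.println(dc.getTitle("de"));  // prints Bedienungsanleitung
    }
}
```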

>> Also, you mention it's relatively easy
>> in other libraries to map between different file format metadata. I think
>> that this is fairly easy to do in Tika too, seeing as though its primary
>> purpose is support metadata extraction from different file formats.
>
> No argument there. I don't claim I know all the requirements and use
> cases of Tika. But I would imagine it's important to preserve as much
> metadata as possible. XMP is certainly one of the best containers I've
> seen to achieve that goal.

Yep exactly. That's one of the key requirements of Tika's Metadata
framework. So yeah, long story short, it would be great to collaborate: I
just want to make sure that there is proper understanding of all the pieces
going forward so we know where there are gaps, and where there are not.

Cheers,
  Chris




Re: Metadata use by Apache Java projects

chrismattmann
In reply to this post by Antoni Mylka-2
Hi Antoni,

> Chris Mattmann has written that it's necessary to strike a balance
> between functionality and over-bloating. From my own experience I can
> say that it is VERY difficult :).


Well from my own experience I can tell you that it *is* difficult, but
certainly doable.

I've been working with different forms of metadata (Dublin Core, ISO 11179,
RDF, OWL/etc.), been involved in international standards organizations
(CCSDS, ISO) that are developing metadata standards, and worked on several
projects that deal with metadata (Object Oriented Data Technology [OODT],
Semantic Web for Earth and Environmental Terminology [SWEET]) in different
domains (earth science, planetary science, space science, cancer
research/etc.) for almost 7 years now.

Sure, there are a lot of standards, and people can talk about coming up
with a one-size-fits-all, cookie-cutter library for these capabilities;
however, I think it's important to understand that developing such
libraries (rather than striking the balance) is, in my mind, the most
difficult problem to tackle. I think that in the end, all we can do as
software developers, as
people who are trying to standardize metadata, is to try and develop core
libraries and functions that others can build upon for their own needs. I
don't think the Tika folks should be in the business of trying to develop
high capability metadata libraries, because in the end, just as everyone is
saying, those need to be tailored to a specific use-case or domain. On the
other hand, I think it's a much-more attainable goal to come up with a
simple, easy-to-use metadata library, that folks who need higher level
capability (inference, multi-language support, representation/etc.) can
build upon for their own needs. In other words, someone shouldn't have to
rewrite the ability to have met keys, with multiple values associated with
them, with ways to map between the keys, etc., however, it's reasonable that
someone may need to rewrite the ability to represent metadata in RDF (versus
OWL), to rewrite the ability to do language translation (e.g., using XMP
versus Adobe's toolkit), that type of thing.
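To make that "smallest useful core" concrete, here is a minimal sketch of such a library: multi-valued keys plus a simple alias table for mapping between key names. All names here are hypothetical and not Tika's actual API.

```java
import java.util.*;

/** Hypothetical sketch of a minimal metadata store: multi-valued keys
    plus an alias table for mapping between key names. Not Tika's API. */
class SimpleMetadata {
    private final Map<String, List<String>> values = new HashMap<>();
    private final Map<String, String> aliases = new HashMap<>();

    /** Registers an alias so that lookups under either name resolve
        to the same canonical key (e.g. "Author" -> "dc:creator"). */
    public void addAlias(String alias, String canonical) {
        aliases.put(alias, canonical);
    }

    private String resolve(String key) {
        return aliases.getOrDefault(key, key);
    }

    /** Adds one value to a (possibly multi-valued) key. */
    public void add(String key, String value) {
        values.computeIfAbsent(resolve(key), k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value, or null if the key is absent. */
    public String get(String key) {
        List<String> list = values.get(resolve(key));
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns all values for a key (possibly an empty list). */
    public List<String> getAll(String key) {
        return values.getOrDefault(resolve(key), Collections.emptyList());
    }
}
```

Higher-level capabilities (RDF serialization, inference, language alternatives) would then be layered on top of such a core rather than baked into it.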

In any case, I'm happy to participate in any standardization efforts wearing
my Tika hat, with the understanding that whatever gets developed needs to
"fit in" the right place, be architected for extensibility, and have
cognizance of what was done previously, what the gaps are, and why the gaps
should be addressed.

Thanks!

Cheers,
  Chris

______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Metadata use by Apache Java projects

Jeremias Maerki-2
In reply to this post by Antoni Mylka-2
Hi Antoni

Thanks for the interesting information. Frankly, you've scared me there
just a bit. It's interesting to see that such encompassing efforts are
underway in some places. To me, full RDF still has a scare factor; at
least the subset XMP provides is "manageable" for mere mortals. :-) At
least, that's my impression. Maybe I still just know too little about
RDF. IMO, XMP strikes a good compromise between expressiveness and
simplicity. The points in favor of Adobe's XMP toolkit: it is in Java,
available now, and under a license we can easily use in Apache
projects.

In your point 4, you mention some restrictions you see in XMP. But
since XMP is a subset of RDF, does it really restrict you from an RDF
point of view? I didn't quite understand that point.

We'll see how this works out.

Jeremias Maerki



On 20.11.2007 15:25:44 Antoni Mylka wrote:

> Hi Jeremias, tika-dev
>
> My name is Antoni Mylka, I am involved in aperture.sourceforge.net,
> which is addressing similar things as Tika, we got your mail on the
> tika-dev mailing list. I also work for the Nepomuk Social Semantic
> Desktop project, I'm the maintainer of the Nepomuk Information Element
> Ontology. More below.
>
> Your mail addresses four more-or-less orthogonal issues.
>
> 1. The standardization of schemas, how the metadata should be
> represented i.e. URIs of classes and properties.
>
> 2. The standardization of the representational language. This means
> the conventions about how to use RDF (e.g. Bags, Seqs, Alts etc.) and
> the formal semantics.
>
> 3. The standardization of the API that will work with the RDF triples
> and handle operations such as adding, deleting and querying triples
> (and maybe inference).
>
> 4. The standardization of the RDF storage mechanisms.
>
> XMP provides its answers to all these questions but they aren't the only
> ones. I know of at least two such standardization initiatives,
>
> 1. Freedesktop.org's XESAM project. A gathering of the major
> open-source desktop search engines.
> http://xesam.org/main
>
> 2. Nepomuk Social Semantic Desktop Project. An EU-Funded research
> project with the Semantic-Web background.
> http://nepomuk.semanticdesktop.org
>
> Many of the issues you are bound to come into have already been
> recognized and some answers have been given, naturally the requirements
> might have been different and the solutions aren't optimal, but it may
> be interesting for you to skim through the output of those projects. To
> sum it up:
>
> 1.
> Freedesktop.org schema:
> <http://xesam.org/main/XesamOntology90>
>
> Nepomuk schema: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
> Let the pointers take you from there.
> There is also an archive of discussions around the drafts of NIE
> (there have been 8 so far).
> <http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>
>
> 2.
> Freedesktop doesn't use any specific representational language, but
> they support property inheritance. They implement it themselves,
> without any general-purpose RDF inference.
>
> Nepomuk uses the Nepomuk Representational Language (NRL). It has been
> considered better for our purposes, since it employs more intuitive
> semantics (the so-called closed-world assumption: in normal RDF, if
> you say that the value of the nie:kisses property is a Human, and you
> write "Antoni nie:kisses Frog", you can infer that the frog is a
> human; in NRL you can't).
>
> 3.
> No one has tried to standardize the API; there are many libraries
> that work with both in-memory and persistent RDF repositories.
>
> A few pointers:
>
> There are many APIs out there:
> * jena.sourceforge.net - a big RDF API by HP
> * www.openrdf.org - an RDF API optimized for client/server setups
> * http://wiki.ontoworld.org/wiki/RDF2Go - an abstraction API over the above
>
> There are many APIs generating "Schema-Specific Adapters"; the
> well-known ones in Java are:
> * http://wiki.ontoworld.org/wiki/RDFReactor
> * elmo
> ** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
> **
> http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
> * https://sommer.dev.java.net/
>
> Of the above, Elmo is quite stable and advanced.
>
> There are murmurs of standardization of RDF APIs:
> Max Völkel (FZI, maintainer of RDF2Go), Henry Story (www.bblfish.net),
> and Leo Sauermann (DFKI, http://leobard.twoday.net) repeatedly thought
> about starting a JSR discussion on an RDF API, but that never
> happened. The W3C may be interested in doing something like this (they
> did it for DOM and XML, I think); the contact people would be the
> deployment group:
> http://www.w3.org/2006/07/SWD/
>
> So, to sum it up:
> There are many things out there handling RDF in Java, but nothing
> dominates yet. In my surroundings (my company,
> aperture.sourceforge.net) we prefer to use RDF2Go as "the API"; it's
> not perfect but it seems to work quite well.
>
> 4.
> XMP prescribes that the metadata be contained within the files
> themselves. There are many scenarios where this is a limitation: each
> application will have to maintain its indexes by itself and possibly
> use a different API to work with XMP storage (in the files) and
> common storage (e.g. an index). There are ongoing efforts to combine
> the flexibility of RDF with the search capabilities of Lucene. Two of
> the more prominent ones are:
>
> Sesame Lucene Sail
> <https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
> AFAIK there is no project page yet, but this idea has been worked on for
> at least two years now, e.g. in the gnowsis project
> www.gnowsis.org
>
> Boca TextIndexing feature
> Part of the IBM SLRP
> <http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>
>
> In our opinion, such an initiative deserves at least a separate mailing
> list. We have already been working on metadata standardization for some
> time now and would be happy to help. Chris Mattmann has written that
> it's necessary to strike a balance between functionality and
> over-bloating. From my own experience I can say that it is VERY
> difficult :).
>
> Antoni Mylka
> [hidden email]
>
> On Nov 19, 2007 10:26 AM, Jeremias Maerki <[hidden email]> wrote:
> > (I realize this is heavy cross-posting but it's probably the best way to
> > reach all the players I want to address.)
> >
> > As you may know, I've started developing an XMP metadata package inside
> > XML Graphics Commons in order to support XMP metadata (and ultimately
> > PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.
> >
> > What is XMP? XMP, for those who don't know about it, is based on a
> > subset of RDF to provide a flexible and extensible way of
> > storing/representing document metadata.
> >
> > Yesterday, I was surprised to discover that Adobe has published an XMP
> > Toolkit with Java support under the BSD license. In contrast to my
> > effort, Adobe's toolkit is quite complete if maybe a bit more
> > complicated to use. That got me thinking:
> >
> > Every project I'm sending this message to is using document metadata in
> > some form:
> > - Apache XML Graphics: embeds document metadata in the generated files
> > (just FOP at the moment, but Batik is a similar candidate)
> > - Tika (in incubation): has as one of its main purposes the extraction
> > of metadata
> > - Sanselan (in incubation): extracts and embeds metadata from/in bitmap
> > images
> > - PDFBox (incubation in discussion): extracts and embeds XMP metadata
> > from/in PDF files (see also JempBox)
> >
> > Every one of these projects has its own means to represent metadata in
> > memory. Wouldn't it make sense to have a common approach? I've worked
> > with XMP for some time now and I can say it's ideal to work with. It
> > also defines guidelines to embed XMP metadata in various file formats.
> > It's also relatively easy to map metadata between different file formats
> > (Dublin Core, EXIF, PDF Info etc.).
> >
> > Sanselan and Tika have both chosen a very simple approach but is it
> > versatile enough for the future? While the simple Map<String, String[]> in
> > Tika allows for multiple authors, for example, it doesn't support
> > language alternatives for things such as dc:title or dc:description.
> >
> > I'm seriously thinking about abandoning most of my XMP package work in
> > XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
> > support, though:
> > - Metadata merging functionality (which I need for synchronizing the PDF
> > Info object and the XMP packet for PDF/A)
> > - Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
> > easier programming (which both Ben and I have written for JempBox and
> > XML Graphics Commons). Adobe's toolkit only allows generic access.
> >
> > Some links:
> > Adobe XMP website: http://www.adobe.com/products/xmp/
> > Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> > JempBox: http://sourceforge.net/projects/jempbox
> > Apache XML Graphics Commons:
> >   http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
> >
> > My questions:
> > - Any interest in converging on a unified model/approach?
> > - If yes, where shall we develop this? As part of Tika (although it's
> > still in incubation)? As a separate project (maybe as an Apache Commons
> > subproject)? If more than XML Graphics uses this, XML Graphics is
> > probably not the right home.
> > - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> > the JempBox or XML Graphics Commons approach more interesting?
> > - Where's the best place to discuss this? We can't keep posting to
> > several mailing lists.
> >
> > At any rate, I would volunteer to spearhead this effort, especially
> > since I have immediate need to have complete XMP functionality. I've
> > almost finished mapping all XMP structures in XG Commons but I haven't
> > committed my latest changes (for structured properties) and I may still
> > not cover all details of XMP.
> >
> > Thanks for reading this far,
> > Jeremias Maerki
> >
> >
>
>
>
> --
> Antoni Myłka
> [hidden email]


Re: Metadata use by Apache Java projects

Jeremias Maerki-2
In reply to this post by chrismattmann
Hi Chris

On 20.11.2007 18:06:25 Chris Mattmann wrote:

> Hi Jeremias,
>
> >> I'm not quite sure I understand how Tika's metadata model isn't flexible
> >> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> >> and haven't been able to. I think it's important to realize that a balance
> >> must be struck between over-bloating a metadata library (and attaching on
> >> RDF support, inference, synonym support, etc.) and making sure that the
> >> smallest subset of it is actually useful.
> >
> > I'm sorry. I didn't intend to stand on anyone's toes.
> >
> > At any rate, I'm not talking about full RDF support. I'm talking about
> > XMP, which uses only a subset of RDF.
>
> Great, and I wouldn't worry about stepping on anyone's toes. You certainly
> didn't step on mine. My point was, at some point, we're just building
> libraries on top of libraries on top of...well you get the picture. What I'm
> interested in is building the smallest metadata library that's actually
> useful and can be built upon to add higher level capabilities, just as Solr
> builds on top of Lucene to provide faceted search, etc. Lucene itself
> doesn't provide a means for understanding facets/etc., but provides a
> library for text/indexing: Solr adds that understanding. Similarly here, I
> think it would be great for Tika to provide a library to handle Metadata
> representation/access, and then for others, to build on top of it to provide
> higher level library support (RDF access/etc.).

I think Adobe's XMP toolkit accomplishes exactly that, at least for the
generic part. Every project will certainly have some extra needs: XML
Graphics needs metadata merging and concrete adapters (like in my
previous example) for easier programming, and other projects might need
other tools, or the same ones. If we find common parts we can put those
in a little metadata library (Commons?!).

You keep saying that Tika should be providing a library to handle
metadata representation/access. But is Tika really the right container?
Tika's goal is clearly metadata extraction, while the requirements for
such a library go a little beyond that focus. I think I'd have a hard
time selling Tika with all its dependencies to the XML Graphics project
for just metadata handling (but not extraction). However, if that
library were a separate product of the Tika project, fine. Then we'd
only have the problem of Tika being in the incubator at the moment. Can
we use incubator releases in non-incubator projects? I don't really
know.

> >
> >> Also, I'd be against moving Metadata support out of Tika because that was
> >> one of the project's original goals (Metadata support), and I think it's
> >> advantageous for Tika to be a provider for a Metadata capability (of course,
> >> one related to document/content extraction).
> >
> > Metadata capability in the context of content extraction, certainly yes.
> > Nobody disputes that. But other projects have different needs (like
> > embedding metadata). So in all this there are certain common needs and
> > I'm trying to see if we can find a common ground in the form of a
> > uniform way of manipulating and storing metadata in memory while at the
> > same time working off a freely available standard.
>
> Yep I get that. I'm all for that. Could you explain what you mean by
> "embedding" metadata? Within a document?

Again, an example is probably best: document production in FOP.
Imagine a workflow where some application generates XML files which are
formatted to PDF by FOP. Besides the actual document content, the XSLT
stylesheet builds up an XMP packet from the XML data; that packet is
embedded in the fo:declarations element of the resulting XSL-FO
document. The PDFs are generated with the PDF/A-1b profile for
long-term storage. The PDFs go into a searchable archive, so metadata,
especially application-specific metadata (for example, patent
bibliographic data like a subset of WIPO ST.36 for patent documents),
needs to be provided. During formatting, FOP needs to add its own
metadata (production time of the document, PDF producer, required PDF/A
indicators). That's where I do the merging: the XMP packet from the
XSL-FO gets merged with a packet generated by FOP. The end result is an
XMP document that is embedded in the PDF file.
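As a rough illustration of the merge step, here is a sketch using a plain key/value map instead of a real XMP model (the class, its names, and the "document packet wins" policy are all made up for this sketch; real code has to deal with structured and multi-valued properties too):

```java
import java.util.*;

/** Hypothetical sketch of merging two metadata packets, e.g. the packet
    from the XSL-FO document with the one FOP generates while formatting.
    A real implementation would operate on XMP models, not flat maps. */
class MetadataMerger {

    /** Returns a merged packet containing all entries of base, with
        overlay entries filling in keys that base does not define
        (i.e. base wins on conflict, which is just one possible policy). */
    public static Map<String, String> merge(Map<String, String> base,
                                            Map<String, String> overlay) {
        Map<String, String> merged = new LinkedHashMap<>(base);
        for (Map.Entry<String, String> e : overlay.entrySet()) {
            merged.putIfAbsent(e.getKey(), e.getValue());
        }
        return merged;
    }
}
```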

> >
> >> I'm wondering too what it means that Tika doesn't support "language
> >> alternatives"? Do you mean synonyms?
> >
> [..snip..]
> >       <dc:title>
> >         <rdf:Alt>
> >           <rdf:li xml:lang="x-default">Manual</rdf:li>
> >           <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
> >           <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
> >         </rdf:Alt>
> >       </dc:title>
> [..snip..]
>
> >
> > You can see that the title is available in three languages. The example
> > also shows the case with multiple authors.
> >
> > To access the title using Adobe's XMP toolkit you'd do the following:
> >
> > XMPMeta meta = XMPMetaFactory.parse(in);
> > String s;
> >
> > //Get default title
> > s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);
> >
> > //Get title in user language if available
> > String userLang = System.getProperty("user.language");
> > s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);
> >
> > Easy, isn't it? :-) That's the generic access to properties as Adobe's
> > XMP toolkit provides it. But it can also be useful to have concrete
> > adapters for easier use and higher type-safety. Here's what I do in XML
> > Graphics Commons at the moment:
> >
> > Metadata meta = XMPParser.parseXMP(url);
> > DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
> > String s;
> > s = dc.getTitle();
> > String userLang = System.getProperty("user.language");
> > s = dc.getTitle(userLang);
>
> Great example Jeremias. I think that the same type of thing could be built
> into Tika, and Tika currently supports some of the functionality that you
> mention above. Instead of meta.getLocalizedText, you could make a call to
> Tika like:
>
> /* pseudo code of course */
> Metadata meta = new Metadata();
> TikaParser p = ParserFactory.createParser();
> ContentHandler handler;
> p.parse(stream, handler, meta);
>
> String s;
>
> s = meta.getMetadata(DublinCore.TITLE);
>
> /* or if you want back all the titles parsed (if more than one) */
> List<String> titles = meta.getAllMetadata(DublinCore.TITLE);

Ah, so you do get multiple titles, but you probably still lose the
information about which title is in which language, right?
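Keeping that information boils down to storing a per-language map for the property, with an x-default fallback, along the lines of XMP's rdf:Alt (a sketch with made-up names, not any existing API):

```java
import java.util.*;

/** Hypothetical sketch of a language-alternative property, modeled
    after XMP's rdf:Alt construct. Names are made up. */
class LangAltProperty {
    public static final String X_DEFAULT = "x-default";
    private final Map<String, String> byLang = new LinkedHashMap<>();

    public void set(String lang, String value) {
        byLang.put(lang, value);
    }

    /** Returns the value for the requested language, falling back to
        the x-default entry when that language is not available. */
    public String get(String lang) {
        String v = byLang.get(lang);
        return (v != null) ? v : byLang.get(X_DEFAULT);
    }
}
```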

> So, then you could build a DublinCoreAdapter on top of Tika's Metadata class
> too.
>
> >> Also, you mention it's relatively easy
> >> in other libraries to map between different file format metadata. I think
> >> that this is fairly easy to do in Tika too, seeing as though its primary
> >> purpose is support metadata extraction from different file formats.
> >
> > No argument there. I don't claim I know all the requirements and use
> > cases of Tika. But I would imagine it's important to preserve as much
> > metadata as possible. XMP is certainly one of the best containers I've
> > seen to achieve that goal.
>
> Yep exactly. That's one of the key requirements of Tika's Metadata
> framework. So yeah, long story short, it would be great to collaborate: I
> just want to make sure that there is proper understanding of all the pieces
> going forward so we know where there are gaps, and where there are not.

Me happy!

Jeremias Maerki