XML as Only Route to TikaConfig

Keith R. Bennett
Hi, all.  Whenever we talk about the TikaConfig object, we talk of configuring it using XML.

I'd like to suggest that we provide an API as well.  In general, it would be easier to use in many cases.  Also:

* The default configuration may be suitable most of the time, but if a user wants to change only 1 or 2 options, having to create a new XML file is overkill.  It would be much easier to load the default and call a method or two to deviate from the default.

* The desired options may not be known until runtime.  Having to build a Document in memory, or write an XML file, seems like more work than should be necessary.

* One might want to keep a single instance in memory and modify it over the life of the program as necessary.

* In general, I think one shouldn't need to know about an external representation of an object (TikaConfig's XML representation in this case) if working with the object directly is simpler.
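
To make the points above concrete, here is a minimal sketch of what "load the default and call a method or two" could look like. Everything in it is an assumption made for illustration: the class name SketchConfig, its methods, and the option names are hypothetical and are not part of Tika's actual TikaConfig.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a programmatically configurable config object.
// None of these names come from Tika itself.
class SketchConfig {
    private boolean headerDetectionEnabled = true;
    private final Map<String, String> parsersByMimeType = new LinkedHashMap<>();

    // Stands in for "load the default configuration".
    static SketchConfig defaults() {
        SketchConfig config = new SketchConfig();
        config.parsersByMimeType.put("text/plain", "example.TextParser");
        return config;
    }

    boolean isHeaderDetectionEnabled() {
        return headerDetectionEnabled;
    }

    void setHeaderDetectionEnabled(boolean enabled) {
        this.headerDetectionEnabled = enabled;
    }

    // Registers (or replaces) the parser class name used for a mime type.
    void addParser(String mimeType, String parserClassName) {
        parsersByMimeType.put(mimeType, parserClassName);
    }

    String getParserClassName(String mimeType) {
        return parsersByMimeType.get(mimeType);
    }
}
```

A caller would start from SketchConfig.defaults(), then call setHeaderDetectionEnabled(false) or addParser(...) at runtime, with no XML file or in-memory Document involved.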

What do you think?  Should I create a JIRA issue for this?

- Keith

Re: XML as Only Route to TikaConfig

robert burrell donkin-2
On 10/13/07, Keith R. Bennett <[hidden email]> wrote:
>
> Hi, all.  Whenever we talk about the TikaConfig object, we talk of
> configuring it using XML.
>
> I'd like to suggest that we provide an API as well.  In general, it would be
> easier to use in many cases.

and essential in some use cases :-)

here's a couple of mine (i've been meaning to jump in with this for a
while now - glad to see that keith beat me to it ;-)

1. mime guessing antlib (to allow filtering on mime types rather than
just extension)
2. improved support for mime types in RAT

(BTW IMHO describing some typical use cases would be a good way to
kick off the documentation for tika. the code's easy to understand but
i've found it tough to see the bigger picture. use cases might be a
good way in for new developers.)

> Also:
>
> * The default configuration may be suitable most of the time, but if a user
> wants to change only 1 or 2 options, having to create a new XML file is
> overkill.  It would be much easier to load the default and call a method or
> two to deviate from the default.
>
> * The desired options may not be known until runtime.  Having to build a
> Document in memory, or write an XML file, seems like more work than should
> be necessary.
>
> * One might want to keep a single instance in memory and modify it over the
> life of the program as necessary.
>
> * In general, I think one shouldn't need to know about an external
> representation of an object (TikaConfig's XML representation in this case)
> if working with the object directly is simpler.

+1

IMHO IoC typically works best in the long run: lift out an interface
and switch to factories creating implementations
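
One way to read "lift out an interface and switch to factories" is sketched below. The names ParserLookup, MapParserLookup and ParserLookupFactory are hypothetical and do not exist in Tika; the point is only that callers depend on the interface while a factory decides which implementation they get.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface lifted out of a concrete configuration class.
interface ParserLookup {
    String parserClassFor(String mimeType);
}

// One possible implementation, backed by an in-memory map.
class MapParserLookup implements ParserLookup {
    private final Map<String, String> byMimeType;

    MapParserLookup(Map<String, String> byMimeType) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.byMimeType = new HashMap<>(byMimeType);
    }

    @Override
    public String parserClassFor(String mimeType) {
        return byMimeType.get(mimeType);
    }
}

// Factory that hides which implementation callers receive.
class ParserLookupFactory {
    static ParserLookup fromMap(Map<String, String> byMimeType) {
        return new MapParserLookup(byMimeType);
    }
}
```

An XML-backed implementation could sit behind the same interface, which would make XML one route among several rather than the only one.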

- robert

Re: XML as Only Route to TikaConfig

Bertrand Delacretaz-2
In reply to this post by Keith R. Bennett
On 10/13/07, Keith R. Bennett <[hidden email]> wrote:

> ...I'd like to suggest that we provide an API as well.  In general, it would be
> easier to use in many cases....

> ...What do you think?  Should I create a JIRA issue for this?...

We might want to discuss the design here first. Can you give an
example of how your configuration API would work?

Me, I like to use Properties for configuration whenever possible, as
they are easy to create, support default values, and make it easy to
override individual settings. But there are other ways, of course.
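
As a rough illustration of the Properties approach, the sketch below layers caller overrides on top of a defaults table using java.util.Properties. The property keys and values are hypothetical, chosen only for the example; Tika defines no such keys in this thread.

```java
import java.util.Properties;

class PropertiesConfigSketch {
    // Hypothetical default settings; the keys are illustrative only.
    static Properties defaults() {
        Properties defaults = new Properties();
        defaults.setProperty("detect.byHeader", "true");
        defaults.setProperty("parser.text/plain", "example.TextParser");
        return defaults;
    }

    // Returns a view where lookups fall back to the defaults
    // for any key the caller did not override.
    static Properties withOverrides(Properties overrides) {
        Properties effective = new Properties(defaults());
        effective.putAll(overrides);
        return effective;
    }
}
```

Calling getProperty on the result returns the override when one was supplied and the default otherwise, so changing a single setting needs a single setProperty call.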

-Bertrand

Re: XML as Only Route to TikaConfig

Jukka Zitting
In reply to this post by Keith R. Bennett
Hi,

On 10/13/07, Keith R. Bennett <[hidden email]> wrote:
> Hi, all.  Whenever we talk about the TikaConfig object, we talk of
> configuring it using XML.
>
> I'd like to suggest that we provide an API as well.  In general, it would be
> easier to use in many cases.

My proposal for the Parser interface was to use the JavaBean
conventions for any static configuration.

We can have an optional XML configuration mechanism in Tika, but you
could just as well configure the parsers with your favourite IoC
container or with explicit Java/Groovy/etc. code.

In fact, at the moment the XML parser configuration in Tika does
basically nothing more than associate the parser class name with the
set of mime types that it supports. We could (and perhaps even should)
easily drop the whole config file.
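
A minimal sketch of that idea follows, assuming a hypothetical SketchParser interface and a made-up defaultEncoding property (neither is taken from Tika). The parser is configured through JavaBean-style setters, and plain Java code does the one job described above: associating a parser with the mime types it supports.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser interface; the real Parser interface is not shown in this thread.
interface SketchParser {
    String describe();
}

// A parser configured through JavaBean-style getters and setters.
class SketchTextParser implements SketchParser {
    private String defaultEncoding = "UTF-8";

    public String getDefaultEncoding() {
        return defaultEncoding;
    }

    public void setDefaultEncoding(String defaultEncoding) {
        this.defaultEncoding = defaultEncoding;
    }

    @Override
    public String describe() {
        return "text parser, encoding " + defaultEncoding;
    }
}

class ExplicitWiring {
    // Associates one configured parser instance with the mime types it handles.
    static Map<String, SketchParser> build() {
        SketchTextParser text = new SketchTextParser();
        text.setDefaultEncoding("ISO-8859-1");

        Map<String, SketchParser> parsers = new HashMap<>();
        parsers.put("text/plain", text);
        parsers.put("text/csv", text);
        return parsers;
    }
}
```

Because the parser is an ordinary bean, an IoC container or a script could perform the same wiring without any Tika-specific config format.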

BR,

Jukka Zitting

Re: XML as Only Route to TikaConfig

robert burrell donkin-2
On 10/13/07, Jukka Zitting <[hidden email]> wrote:

> Hi,
>
> On 10/13/07, Keith R. Bennett <[hidden email]> wrote:
> > Hi, all.  Whenever we talk about the TikaConfig object, we talk of
> > configuring it using XML.
> >
> > I'd like to suggest that we provide an API as well.  In general, it would be
> > easier to use in many cases.
>
> My proposal for the Parser interface was to use the JavaBean
> conventions for any static configuration.
>
> We can have an optional XML configuration mechanism in Tika, but you
> could just as well configure the parsers with your favourite IoC
> container or with explicit Java/Groovy/etc. code.

+1

> In fact, at the moment the XML parser configuration in Tika does
> basically nothing more than associates the parser class name with the
> set of mime types that it supports. We could (and perhaps even should)
> easily drop the whole config file.

+1

- robert

Re: XML as Only Route to TikaConfig

Keith R. Bennett
In reply to this post by Bertrand Delacretaz-2
Bertrand -

The bean approach, as Jukka suggests, would work fine.  I just wanted there to be a way to simply turn on or off byte header MIME type detection, add a parser to the configuration, etc., without having to do it via XML.

Properties are often fine.  However, having methods for access ensures compile-time correctness (i.e. it avoids the risk that the programmer misspelled the property name).  Also, if we use Properties and can name them anything we want, then we may lose predictability that could be helpful.  For example, in the Metadata issue described elsewhere, if the file name property is called "filename" in one place and "FileName" in another, then we lose the ability to reliably find out the file name from any Metadata instance.  If, instead, it is exposed as setFilename() and getFilename(), we're fine.
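
The difference can be sketched with two hypothetical classes (LooseMetadata and TypedMetadata are illustrative names, not Tika classes). With string keys, "FileName" and "filename" are simply two different entries; with accessor methods, a misspelled name does not compile.

```java
import java.util.HashMap;
import java.util.Map;

// String-keyed storage: nothing stops two callers from spelling a key differently.
class LooseMetadata {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }
}

// Typed accessors: a misspelled method name is rejected by the compiler.
class TypedMetadata {
    private String filename;

    String getFilename() {
        return filename;
    }

    void setFilename(String filename) {
        this.filename = filename;
    }
}
```

Storing a value under "FileName" in LooseMetadata and reading it back under "filename" returns null with no warning, whereas TypedMetadata has only one way to reach the field.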

Regards,
Keith


Re: XML as Only Route to TikaConfig

chrismattmann
Hi Keith,

> we want, then we may lose the predictability that could
> be helpful.  For example, in the Metadata issue described elsewhere, if the
> file name property is called "filename" in one place and "FileName" in
> another, then we lose the ability to reliably find out the file name from
> any Metadata instance.

This is the whole point of the interfaces that the Metadata object
implements. The interface classes are meant to aggregate the common string
metadata keys that are used over and over again in something like Tika.

Then, if desired, you could always write wrapper functions around things
like Metadata.FILENAME such as:

public String getFilename(){ return Metadata.FILENAME;}

That is only if you absolutely had to. Alternatively, you could make the
class that implements getFilename() implement the Java interface that
defines FILENAME. Writing methods such as the above, however, is typically
only practical for interfaces that don't change often; otherwise, you'll be
adding more get and set methods as you go along, as opposed to simply adding
a key in a table (which is what the public static final Strings in the Java
interface classes are equivalent to) and then using the addMetadata and
getMetadata methods in the Metadata object.
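
A minimal sketch of the constants-in-an-interface pattern is shown below. SketchKeys and KeyedMetadata are hypothetical stand-ins, and the put/get methods are illustrative rather than the real Metadata signatures. Here the optional wrapper looks up the stored value under the shared key, which is one plausible reading of the wrapper idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface that aggregates shared key names as constants.
// Fields declared in an interface are implicitly public static final.
interface SketchKeys {
    String FILENAME = "filename";
}

// Hypothetical metadata holder; method names are illustrative only.
class KeyedMetadata implements SketchKeys {
    private final Map<String, String> values = new HashMap<>();

    void put(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    // Optional convenience wrapper around the shared constant.
    String getFilename() {
        return get(FILENAME);
    }
}
```

Adding a new key then means adding one constant to the interface, while the generic put and get methods stay unchanged; typed wrappers are added only for the keys that merit them.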

Cheers,
 Chris




______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.