Moving Functionality from CLI to ParseUtils

Moving Functionality from CLI to ParseUtils

Keith R. Bennett
Hi, all.  Long time no talk.  I had been working part time and on a kind of
sabbatical during which I abandoned Java in favor of studying Ruby and Clojure,
and attending and organizing BarCamps.

About three months ago, I started a new job, working with Java again.  The need
to extract structured data from Excel spreadsheets arose, and I wrote a JRuby
script that called Tika to manage the parsing.

In the process, I think I identified some possible improvements to Tika. It
would be nice to simplify one of the simplest use cases, where you want Tika to
parse a document using default configurations, and specify its output stream.

There is a very general mechanism for parsing in CLI, but it is not possible to
override the output stream default (stdout), and awkward to call it from a
program rather than on the command line.  I have two suggestions:

1) Make the output destination a configuration option (a command line parameter)
that defaults to stdout (perhaps "-o").  Although it's easy to redirect output
on the command line, it's not quite so simple when that command is called within
a script that itself may be redirected.  Also, when the command is executed from
within another program, there may be issues as well.

2) Move the methods that do the work to ParseUtils, and leave only a thin
command line wrapper around them in CLI.  It would be helpful for scripts and
Java programs to have these easy-to-use methods available too.  It seems
wasteful to force the caller to construct a command line to do this.
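
A minimal sketch of how the two suggestions might fit together, assuming
hypothetical class names (ParseUtilsSketch, CliSketch) and a hypothetical
"-o" option. None of this is Tika's actual code, and the parse method here
only copies bytes as a stand-in for real parsing:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical library-side helper; not Tika's actual ParseUtils.
class ParseUtilsSketch {
    // Stand-in for the real parsing work: it only copies bytes so the sketch
    // stays self-contained. A real version would delegate to a Tika parser.
    static void parse(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = input.read(buffer)) != -1) {
            output.write(buffer, 0, n);
        }
        output.flush();
    }
}

// Hypothetical thin command line wrapper around the helper above.
class CliSketch {
    // Returns the value following "-o", or null when the option is absent.
    static String outputPath(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-o".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String path = outputPath(args);
        if (path == null) {
            ParseUtilsSketch.parse(System.in, System.out); // default: stdout
        } else {
            try (OutputStream out = new FileOutputStream(path)) {
                ParseUtilsSketch.parse(System.in, out);
            }
        }
    }
}
```

A script or Java program could then call ParseUtilsSketch.parse directly with
any OutputStream, without building a command line at all.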

What do you think?

Cheers,
Keith

Re: Moving Functionality from CLI to ParseUtils

Jukka Zitting
Hi,

On Sat, Jul 4, 2009 at 9:56 PM, Keith R. Bennett<[hidden email]> wrote:
> In the process, I think I identified some possible improvements to Tika. It
> would be nice to simplify one of the simplest use cases, where you want
> Tika to parse a document using default configurations, and specify its output
> stream.
> [...]
> What do you think?

Instead of a fixed facade like ParseUtils I personally prefer a set of
components that I can combine in different ways to solve all kinds of
use cases. For example your case would be easy to solve like this:

    InputStream input = ...; // Where your input is coming from
    OutputStream output = ...; // Where your output is going to
    new AutoDetectParser().parse(
        input, new BodyContentHandler(output), new Metadata());

Of course a static facade method like ParseUtils.parse(File input,
File output) might be easier for occasional users.

Did you have some specific method signatures in mind?

BR,

Jukka Zitting

Re: Moving Functionality from CLI to ParseUtils

keithrbennett
Jukka -

Having pluggable parts, as you suggest, is definitely the
way to go for optimum power and flexibility.  However, IMHO,
for the simplest use cases, and for beginning users,
this approach may discourage and complicate Tika's use.
I suggest an alternate simplified interface (see below)
for these uses/users.

Renovating the entrance gate to Tika-land in this way
could result in an increase in the number of
beginning users, who continue on to be advanced users,
and hopefully developers. A larger installed base could
then result in attracting more resources to the project, human
and otherwise.

* * *

It's been a while since I worked on Tika, and it's evolved in the
meantime, so I'm not very adept at it these days.

As such, let me use this to the project's advantage, and let you know
what I would value in Tika as a new user.

For the simple cases, I would suggest hiding things like parser
implementations, metadata objects, and content handlers.  The simplest
cases with document type autodetection could be handled by:

parse(InputStream inputStream, OutputStream outputStream)

Then, to specify the document type, we could add a MimeType string
argument:

parse(InputStream inputStream, OutputStream outputStream,
        String mimeType)

I realize that this approach is not very efficient with multiple
documents, since there is setup work that needs to be done for each
document, but it is probably not an issue for most casual users.

Another question...I used Tika to parse an Excel spreadsheet, and it
created an XML file.  How could I insert a handler for parsing
documents with multiple records (such as an Excel spreadsheet), so
that I could, for example, insert the record into a database instead
of writing XML to a file?  Rather than writing a full-blown XML
content handler, I wonder if we could simplify it to something like
this:

public interface RecordProcessor {  
    void processRecord(Object [] fields); // or List
}

... and then have a method like:

parseSpreadsheet(InputStream inputStream,
        RecordProcessor recordProcessor)

For the above methods, we might also provide convenience methods for
Files, URLs, Strings, etc.
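
To make the callback idea concrete, here is a rough sketch of how a
RecordProcessor might be driven and implemented. SpreadsheetSketch and
CollectingProcessor are made-up names, and the parseSpreadsheet body is only
a stand-in that reads tab-separated text; a real version would read cells
through a spreadsheet library:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// The callback interface proposed above.
interface RecordProcessor {
    void processRecord(Object[] fields);
}

class SpreadsheetSketch {
    // Stand-in for real spreadsheet parsing: treats the input as UTF-8 text
    // with one tab-separated record per line and skips empty lines.
    static void parseSpreadsheet(InputStream inputStream,
            RecordProcessor recordProcessor) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty()) {
                recordProcessor.processRecord(line.split("\t", -1));
            }
        }
    }
}

// Example processor: collects records in memory where a real one might
// issue a database INSERT per record.
class CollectingProcessor implements RecordProcessor {
    final List<Object[]> records = new ArrayList<>();

    @Override
    public void processRecord(Object[] fields) {
        records.add(fields);
    }
}
```

The caller only ever sees one record at a time, so no XML handling is needed
on that side.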

IMHO, having extremely simple methods like these would make it more
likely for new users to attempt to use Tika, and to succeed in using
it.

I realize everyone's busy, and my time is limited too; this is just a
wish list.  Also, to the extent that these suggestions are based on a lack
of understanding of how Tika works, I apologize for that and welcome
any clarification.

Regards,
Keith


Re: Moving Functionality from CLI to ParseUtils

Jukka Zitting
Hi,

2009/7/11 keithrbennett <[hidden email]>:
> Having pluggable parts, as you suggest, is definitely the
> way to go for optimum power and flexibility.  However, IMHO,
> for the simplest use cases, and for beginning users,
> this approach may discourage and complicate Tika's use.
> I suggest an alternate simplified interface (see below)
> for these uses/users.

Agreed, the more I think about this the more I think having something
like this would be useful.

My proposal would be to add an org.apache.tika.Tika facade class with
static methods for the most important simple use cases.

> For the simple cases, I would suggest hiding things like parser
> implementations, metadata objects, and content handlers.  The simplest
> cases with document type autodetection could be handled by:
>
> parse(InputStream inputStream, OutputStream outputStream)

I guess the most important parsing use case is to produce a Reader for
use in Lucene indexing. Thus I would add a method like this:

    Reader parse(InputStream);

Some clients may prefer to have it all in a simple string (with all
the caveats of large inputs, perhaps we should have some built-in
output size limit), so we could also do:

    String parseToString(InputStream);

The XHTML output is probably only useful in more sophisticated use
cases, where the Parser interface and an appropriate ContentHandler
can be used directly.
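
As a rough illustration of the output size limit idea, the sketch below
drains a Reader into a String but stops at a caller-supplied maximum.
BoundedText and readUpTo are hypothetical names, not part of Tika, and a
StringReader stands in for the Reader that a parse method would return:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Tika: drains a Reader into a String but
// stops after maxChars characters, so a huge document cannot exhaust memory.
class BoundedText {
    static String readUpTo(Reader reader, int maxChars) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        while (text.length() < maxChars) {
            // Never ask for more characters than the remaining budget.
            int wanted = Math.min(buffer.length, maxChars - text.length());
            int n = reader.read(buffer, 0, wanted);
            if (n == -1) {
                break; // end of input reached before the limit
            }
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the Reader a parse method would return.
        System.out.println(readUpTo(new StringReader("extracted text"), 9));
        // prints "extracted"
    }
}
```

Whether the limit should truncate silently, as here, or signal that the
text was cut off is a design choice this sketch leaves open.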

> Then, to specify the document type, we could add a MimeType string
> argument:
>
> parse(InputStream inputStream, OutputStream outputStream,
>        String mimeType)

Tika is already pretty good at auto-detecting the document type, and
in my experience the file name is much more useful in helping type
detection than any externally provided type information. Tika likely
has a much more complete set of file name glob patterns than what
probably was used to produce the external type information.

Thus I'd rather give the proposed parse method information about the
file name when available. And instead of adding an explicit argument,
we could just as well add overloaded methods that also take care of
correctly opening and closing the file (or URL resource) as needed.
Something like this:

    Reader parse(File);
    Reader parse(URL);

Similarly for the parseToString method. In more complex cases (e.g. if
the file is inside a database field) one can always use the Parser
interface directly.

And while we're at it, there are many cases where an application needs
to figure out the type of a given document. Instead of coming up with
its own glob patterns and the like, an application could use Tika
functionality through potential facade methods like the following that
would return the auto-detected media type of the given document:

    String detect(InputStream);
    String detect(File);
    String detect(URL);
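
For illustration only, here is what name-based detection with glob patterns
can look like in plain Java. GlobDetectSketch and its three hand-picked
patterns are assumptions made for this example; this is not how Tika
implements detection, and it ignores the document content entirely:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a tiny name-based detector with three hand-picked globs.
class GlobDetectSketch {
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put("*.xls", "application/vnd.ms-excel");
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.txt", "text/plain");
    }

    // Returns the media type of the first glob matching the file name.
    static String detect(String fileName) {
        // Match against the last path element only, e.g. "q2.xls".
        Path name = Paths.get(fileName).getFileName();
        for (Map.Entry<String, String> entry : GLOBS.entrySet()) {
            PathMatcher matcher =
                    FileSystems.getDefault().getPathMatcher("glob:" + entry.getKey());
            if (name != null && matcher.matches(name)) {
                return entry.getValue();
            }
        }
        return "application/octet-stream"; // generic fallback for unknown names
    }
}
```

An application would call GlobDetectSketch.detect("report.xls") and branch on
the returned media type string.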

WDYT?

> Another question...I used Tika to parse an Excel spreadsheet, and it
> created an XML file.  How could I insert a handler for parsing
> documents with multiple records (such as an Excel spreadsheet), so
> that I could, for example, insert the record into a database instead
> of writing XML to a file?

That's a big can of worms as each document type comes with its own
structure and semantics. Tika avoids this problem by focusing on just
the contained text and some very generic structural information.

If you need more detailed structural information, you'll inevitably
hit type-specific features and my recommendation would be to directly
use the appropriate parser library. For example, I'd use POI directly
for pulling specific information out of Excel spreadsheets.

BR,

Jukka Zitting

Re: Moving Functionality from CLI to ParseUtils

keithrbennett
Jukka and All -

I think a Tika facade would be awesome.

I guess where I mentioned streams, I should have mentioned readers
and writers instead.

BTW, how can I insert new text into quoted sections of a message
in Nabble?

A method that returns a Reader may well be better for Lucene, but
for other use cases a method that takes a Writer might be more
convenient (for writing to files, for example).  Having a method
that takes a Writer would, I think, be more useful than having a
method returning a string because it could 1) support sizes larger
than memory capacity, 2) easily support output to files, and
3) still support strings (by using a StringWriter).
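
A small sketch of that argument, with made-up names (WriterFacadeSketch,
extract, extractToString) and a plain character copy standing in for the
actual text extraction:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch, not Tika API: the Writer-based method is the primitive
// and the String-returning one is a thin convenience built on a StringWriter.
class WriterFacadeSketch {
    // Stand-in for text extraction: streams characters in small chunks, so the
    // whole document never has to fit in memory at once.
    static void extract(Reader source, Writer destination) throws IOException {
        char[] buffer = new char[4096];
        int n;
        while ((n = source.read(buffer)) != -1) {
            destination.write(buffer, 0, n);
        }
        destination.flush();
    }

    // String convenience: same code path, with an in-memory Writer.
    static String extractToString(Reader source) throws IOException {
        StringWriter writer = new StringWriter();
        extract(source, writer);
        return writer.toString();
    }
}
```

Writing to a file would use the same extract method with, for example, a
Writer obtained from java.nio.file.Files.newBufferedWriter.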

Speaking of Lucene, I have never used Lucene directly, so I lack
the context to understand the Tika/Lucene integration.  All my
input is from the point of view of someone who just wants to
parse text from documents and do things other than text search.
So if I neglect to include Lucene in my outlook, rest assured
that it is just ignorance and nothing more. ;)

Regarding XHTML, we already support it on the command line. My
sense is that Excel spreadsheet parsing would be used more often
for structured data than for raw text (that's certainly true for
me), so I hope we could keep that.  I understand your suggestion
to use POI directly for more sophisticated document handling,
though.

Everything else sounded good to me.

Regards,
Keith