Multiple documents per input stream

Multiple documents per input stream

kkrugler
Hi all,

I recently wrote an mbox parser for Tika, since I need that for my  
Bixo web crawler.

One issue I ran into - a single mbox file logically decomposes into  
multiple documents. I can and do currently treat it as a single  
document, where I use XHTML <ul> lists for each message's headers. But  
it would work better from the client perspective if the metadata being  
returned by the parse() call could be used as expected - e.g.  
DublinCore's SUBJECT, DATE, and CREATOR match up with each email's  
subject, date and author header fields.

Currently, though, the metadata can't be used this way - at least  
from what I can see - since parse() is a single call that returns a  
single instance.

Has this been discussed previously? Just curious, as I'd thought about  
changing my mbox parser to handle incremental calls to parse(), and  
save state in the context object being passed in. This would require a  
small change to how I call the parser, as it would then be a loop  
(while (is.available() > 0) { parser.parse(is, xxx); })
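
Concretely, the calling pattern I have in mind looks something like  
this toy sketch - note that MessageParser and the metadata map here  
are made-up stand-ins, not real Tika classes; each parse() call just  
consumes one newline-terminated "message" and fills per-message  
metadata:

```java
import java.io.*;
import java.util.*;

// Toy stand-in for the mbox parser: each parse() call consumes exactly one
// newline-terminated "message" from the stream and fills per-message metadata.
// None of these names are real Tika classes.
class MessageParser {
    void parse(InputStream is, Map<String, String> metadata) throws IOException {
        StringBuilder subject = new StringBuilder();
        int c;
        // Read up to (and including) the next newline
        while ((c = is.read()) != -1 && c != '\n') {
            subject.append((char) c);
        }
        metadata.put("subject", subject.toString());
    }
}

public class IncrementalLoop {
    public static List<Map<String, String>> parseAll(InputStream is)
            throws IOException {
        MessageParser parser = new MessageParser();
        List<Map<String, String>> docs = new ArrayList<>();
        // The proposed loop: one parse() call per logical document
        while (is.available() > 0) {
            Map<String, String> metadata = new HashMap<>();
            parser.parse(is, metadata);
            docs.add(metadata);
        }
        return docs;
    }
}
```

Feeding it "Hi\nRe: Hi\n" would yield two metadata maps, one per  
message, which is the per-document behavior I'm after.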

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Multiple documents per input stream

Jukka Zitting
Hi,

On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler
<[hidden email]> wrote:
> Has this been discussed previously? Just curious, as I'd thought about
> changing my mbox parser to handle incremental calls to parse(), and save
> state in the context object being passed in. This would require a small
> change to how I call the parser, as it would then be a loop (while
> (is.available() > 0) { parser.parse(is, xxx); })

See TIKA-252 [1] for a related feature request.

Tika has been designed to deal with documents as single entities,
since there is no comprehensive composite document abstraction that we
could easily use. Trying to solve that problem you quickly end up with
questions about whether an inline image should be treated the same as
a file attachment, or whether things like <img> tags in HTML documents
should be resolved and the images included in the parse output. It's
not an unsolvable problem, but it's complex enough that so far we've
scoped the issue outside Tika.

However, within the current Tika design there are a couple of options
you could pursue:

* As suggested in TIKA-252, you could extend the PackageParser to
embed per-component metadata into the produced XHTML output. Your
application would then need to detect the component boundaries and the
included metadata from the XHTML output.

* Alternatively you could inject a custom delegate parser that
intercepts each component stream and handles it separately without
producing output to be included in the top-level parse result.

[1] https://issues.apache.org/jira/browse/TIKA-252

BR,

Jukka Zitting

Re: Multiple documents per input stream

kkrugler
Hi Jukka,

> On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler
> <[hidden email]> wrote:
>> Has this been discussed previously? Just curious, as I'd thought  
>> about
>> changing my mbox parser to handle incremental calls to parse(), and  
>> save
>> state in the context object being passed in. This would require a  
>> small
>> change to how I call the parser, as it would then be a loop (while
>> (is.available() > 0) { parser.parse(is, xxx); })
>
> See TIKA-252 [1] for a related feature request.
>
> Tika has been designed to deal with documents as single entities,
> since there is no comprehensive composite document abstraction that we
> could easily use. Trying to solve that problem you quickly end up with
> questions about whether an inline image should be treated the same as
> a file attachment, or whether things like <img> tags in HTML documents
> should be resolved and the images included in the parse output. It's
> not an unsolvable problem, but it's complex enough that so far we've
> scoped the issue outside Tika.

OK, and I agree that trying to deal with embedded documents is a tough  
problem.

My particular issue is that I'm using Tika in Bixo as the general  
parser, via the AutoDetectParser.

Which means I need to be able to generically extract the title,  
author, last modified date, etc. from the metadata, without having to  
know any specific details about the XHTML output.

So one way to slice the above problem would be to only worry about  
correct handling of "container" document formats, where sub-docs are  
all peers and typically contain standard metadata such as title,  
author, last modified date, etc.

I'll look into the options you outline below, for current releases.

Longer term it would be great to not have to worry about handling two  
different cases - e.g. by being able to call

while (parser.parse(is, handler, metadata, context)) {
        <process the doc>
}

Though I think this would also require passing in metadata like  
RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to  
avoid having to worry about selectively clearing out metadata. But I  
think that would be better anyway, versus the co-mingling of input &  
output data in the metadata container.
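
To make the idea concrete, here's a toy version of the proposed API  
(MultiDocParser and the map-based metadata are made-up names, not  
Tika classes):

```java
import java.util.*;

// Hypothetical sketch of the proposed API: parse() handles one document per
// call and returns true as long as more documents remain in the stream.
// None of these names are real Tika classes.
class MultiDocParser {
    private final Iterator<String> docs;

    MultiDocParser(List<String> stream) {
        this.docs = stream.iterator();
    }

    // Fills metadata for the next document; returns false when exhausted.
    boolean parse(Map<String, String> metadata) {
        if (!docs.hasNext()) {
            return false;
        }
        metadata.put("title", docs.next());
        return true;
    }
}

public class MultiDocLoop {
    public static List<String> titles(List<String> stream) {
        MultiDocParser parser = new MultiDocParser(stream);
        List<String> titles = new ArrayList<>();
        Map<String, String> metadata = new HashMap<>();
        // The proposed loop: while (parser.parse(...)) { <process the doc> }
        while (parser.parse(metadata)) {
            titles.add(metadata.get("title"));
            metadata.clear();   // input hints would live in the context instead
        }
        return titles;
    }
}
```

The metadata.clear() between iterations is exactly the bookkeeping  
that passing the input hints via context would make unnecessary.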

Thanks,

-- Ken

> However, within the current Tika design there are a couple of options
> you could pursue:
>
> * As suggested in TIKA-252, you could extend the PackageParser to
> embed per-component metadata into the produced XHTML output. Your
> application would then need to detect the component boundaries and the
> included metadata from the XHTML output.
>
> * Alternatively you could inject a custom delegate parser that
> intercepts each component stream and handles it separately without
> producing output to be included in the top-level parse result.
>
> [1] https://issues.apache.org/jira/browse/TIKA-252
>
> BR,
>
> Jukka Zitting

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Multiple documents per input stream

Jukka Zitting
Hi,

On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler
<[hidden email]> wrote:

> Longer term it would be great to not have to worry about handling two
> different cases - e.g. by being able to call
>
> while (parser.parse(is, handler, metadata, context)) {
>        <process the doc>
> }
>
> Though I think this would also require passing in metadata like
> RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to avoid
> having to worry about selectively clearing out metadata. But I think that
> would be better anyway, versus the co-mingling of input & output data in the
> metadata container.

The second option I gave in my earlier message is now a bit more
straightforward with the parsing context option introduced recently in
Tika trunk. You can now explicitly pass a delegate parser to be used
to process any component documents:

    Parser myComponentParser = new Parser() {
        public void parse(...) throws ... {
            // Process the component document stream
            // in any way you like, optionally passing the
            // extracted text also to the top level parser
            // through the given ContentHandler
        }
    };

    Map<String, Object> context = new HashMap<String, Object>();
    context.put(Parser.class.getName(), myComponentParser);
    parser.parse(stream, handler, metadata, context);

In this example myComponentParser.parse() would get called once for
each component document inside a package.

BR,

Jukka Zitting

Re: Multiple documents per input stream

kkrugler
Hi Jukka,

> On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler
> <[hidden email]> wrote:
>> Longer term it would be great to not have to worry about handling two
>> different cases - e.g. by being able to call
>>
>> while (parser.parse(is, handler, metadata, context)) {
>>        <process the doc>
>> }
>>
>> Though I think this would also require passing in metadata like
>> RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context,  
>> to avoid
>> having to worry about selectively clearing out metadata. But I  
>> think that
>> would be better anyway, versus the co-mingling of input & output  
>> data in the
>> metadata container.
>
> The second option I gave in my earlier message is now a bit more
> straightforward with the parsing context option introduced recently in
> Tika trunk. You can now explicitly pass a delegate parser to be used
> to process any component documents:
>
>    Parser myComponentParser = new Parser() {
>        public void parse(...) throws ... {
>            // Process the component document stream
>            // in any way you like, optionally passing the
>            // extracted text also to the top level parser
>            // through the given ContentHandler
>        }
>    };
>
>    Map<String, Object> context = new HashMap<String, Object>();
>    context.put(Parser.class.getName(), myComponentParser);
>    parser.parse(stream, handler, metadata, context);
>
> In this example myComponentParser.parse() would get called once for
> each component document inside a package.

OK, thanks.

Though I don't think this would address the fundamental question of  
how to generically extract metadata like the title from compound  
documents, right?

You'd still have to know something about how the delegate parser  
embeds this information in the actual XHTML output.

Thanks,

-- Ken


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Multiple documents per input stream

Jukka Zitting
Hi,

On Sun, Sep 27, 2009 at 2:59 PM, Ken Krugler
<[hidden email]> wrote:
> Though I don't think this would address the fundamental question of how to
> generically extract metadata like the title from compound documents, right?
>
> You'd still have to know something about how the delegate parser embeds this
> information in the actual XHTML output.

Not necessarily, as the delegate parser could well decide to process
the document in some way other than simply reporting the extracted
text back to the top-level parser (e.g. creating a separate Lucene
index entry per component).

Such use does bend the Parser interface contract, but it does allow
you to do pretty much anything you want with the component documents.
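
As a toy model of that arrangement (ContainerParser and
ComponentParser below are made-up stand-ins, not Tika APIs), the
container calls the delegate once per component, and the delegate
collects per-component metadata instead of reporting text upwards:

```java
import java.util.*;

// Made-up model of the delegate-parser arrangement: the container parser
// calls the delegate once per component document, and the delegate does
// whatever it likes with each one (here: collect per-component metadata)
// instead of feeding extracted text back into the top-level output.
interface ComponentParser {
    void parse(String componentName, String componentBody);
}

class ContainerParser {
    private final ComponentParser delegate;

    ContainerParser(ComponentParser delegate) {
        this.delegate = delegate;
    }

    // Each entry in the map stands for one component document of the package
    void parse(Map<String, String> pack) {
        for (Map.Entry<String, String> e : pack.entrySet()) {
            delegate.parse(e.getKey(), e.getValue());
        }
    }
}

public class DelegateDemo {
    public static List<Map<String, String>> collect(Map<String, String> pack) {
        List<Map<String, String>> perComponent = new ArrayList<>();
        // The delegate records each component separately - the analogue of
        // one Lucene document apiece - rather than reporting to the container.
        ComponentParser collector = (name, body) -> {
            Map<String, String> meta = new HashMap<>();
            meta.put("name", name);
            meta.put("length", String.valueOf(body.length()));
            perComponent.add(meta);
        };
        new ContainerParser(collector).parse(pack);
        return perComponent;
    }
}
```

The point is that the per-component metadata never has to round-trip
through the container's XHTML output at all.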

BR,

Jukka Zitting