Not-yet-broken breaking changes for Tika 2?

Not-yet-broken breaking changes for Tika 2?

Nick Burch-3
Hi All

Based on the plan on the wiki
<https://wiki.apache.org/tika/Tika2_0RoadMap>
<https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a
major breaking change or two planned for Tika 2 that we haven't yet
"broken". (There's also removing some deprecated stuff etc)


As I understand it, the biggest breaking TODO change is around having
multiple parsers available + active for a given format. This could be to
support fallback parsers, eg "try this fancy new parser, but if it fails
retry with this simpler one" or "try this xml parser, if that fails just
try strings". A related but different case is to cleanly support multiple
parsers covering different aspects, eg OCR an image plus extract metadata,
or NER on the contents of a scientific PDF + text + metadata + NER of the
OCR of embedded images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly)
handled via one parser (eg OCR or NER) having an embedded hard-coded
reference to another (eg Image or PDF).


We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers
interact with the SAX content handler. For the fallback case, that's how
to say "sorry, ignore all that XML we already sent, we're starting again
with this XML now". For the multiple parser case, it's how we could have
the image parser "finish" the (empty) XHTML but then have the OCR one send
some text, or have the NER parser get at the XHTML text of the PDF + OCR
of embedded images to enhance with the entities.


What do we think for this? Can we come up with a solution to let this go
forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and
do this stuff in Tika 3 instead?

Nick
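
For the fallback case described above, one possible shape is to record each parser's SAX events and forward them to the real handler only once that parser has finished cleanly. The following is a minimal sketch under that assumption, using only the JDK's org.xml.sax classes. BufferingHandler, SaxParser and parseWithFallback are hypothetical names for illustration and are not part of Tika; only three callbacks are recorded, and the whole output of a parse is held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: records SAX events instead of
// forwarding them, so a failed parse can be discarded before anything
// reaches the real handler.
class BufferingHandler extends DefaultHandler {
    interface Event { void replay(ContentHandler target) throws SAXException; }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Copy the attributes: SAX lets the caller reuse the Attributes object.
        Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(t -> t.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters: the caller may reuse its buffer after returning.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    // "Sorry, ignore all that XML we already sent."
    void reset() { events.clear(); }

    void replayTo(ContentHandler target) throws SAXException {
        for (Event e : events) e.replay(target);
    }
}

class FallbackDemo {
    // Hypothetical stand-in for a parser; the real Tika Parser interface differs.
    interface SaxParser { void parse(String input, ContentHandler handler) throws Exception; }

    // Try each parser in order; only the first successful one reaches 'out'.
    static void parseWithFallback(String input, ContentHandler out, SaxParser... parsers)
            throws Exception {
        BufferingHandler buffer = new BufferingHandler();
        Exception last = null;
        for (SaxParser parser : parsers) {
            buffer.reset();
            try {
                parser.parse(input, buffer);
                buffer.replayTo(out);
                return;
            } catch (Exception e) {
                last = e; // remember the failure and move on to the next parser
            }
        }
        throw last != null ? last : new IllegalArgumentException("no parsers given");
    }

    public static void main(String[] args) throws Exception {
        StringBuilder text = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SaxParser fancy = (in, h) -> {
            h.characters("half-written".toCharArray(), 0, 12);
            throw new IllegalStateException("fancy parser gave up");
        };
        SaxParser simple = (in, h) -> h.characters(in.toCharArray(), 0, in.length());
        parseWithFallback("plain text", sink, fancy, simple);
        System.out.println(text);
    }
}
```

The trade-off is that nothing streams to the caller until a parser has completed, which may matter for large documents.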

RE: Not-yet-broken breaking changes for Tika 2?

Allison, Timothy B.
At this point, I'm willing to punt to 3.x, unless there's momentum for either of these two.  They would be great to have!

1) chaining multiple parsers -- additive
This shouldn't be too bad, except where there's conflicting metadata -- parser1 says author is 'bob', parser2 says author is 'alice'.  We would break some uniqueness guarantees for some Properties that should only allow a single value if we added those values...  Overwriting feels like a bad idea.  Perhaps we remove the uniqueness guarantees when in "additive" mode ... or let users select additive/overwrite?

2) fallback parsers
>The biggest stumbling block, as I see it, is how to let multiple parsers interact with the SAX content handler. For the fallback case, that's how to say "sorry, ignore all that XML we already sent, we're starting again with this XML now".

Y, this has been what's holding me back.  How do we create a resettable handler that doesn't have us mucking too much with all of our current handlers?  For those with outputstreams/writers, I imagine we'd require a resettable OutputStream...TikaOutputStream(?)

TikaOutputStream() -- underlying StringWriter; on reset(), it would just be replaced with a new StringWriter??? Not quite right...
TikaOutputStream.get(Path/File) -- would hold the underlying file/path, close the writer, and just rewrite on reset()
TikaOutputStream.get(ByteArrayOutputStream) -- baos has a reset() so that should work...

What other use cases?
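
The resettable-output idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: ResettableSink, MemorySink and FileSink are hypothetical names (no TikaOutputStream exists in this sketch), and callers are assumed to call stream() again after every reset() rather than holding on to an old stream. The in-memory variant relies on ByteArrayOutputStream.reset(); the file variant closes the stream and reopens the file truncated.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical interface for illustration; not an existing Tika type.
interface ResettableSink extends AutoCloseable {
    OutputStream stream();              // current stream; fetch again after reset()
    void reset() throws IOException;    // discard everything written so far
    @Override void close() throws IOException;
}

class MemorySink implements ResettableSink {
    private final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    public OutputStream stream() { return baos; }
    public void reset() { baos.reset(); }   // ByteArrayOutputStream already supports this
    public void close() { }
    byte[] bytes() { return baos.toByteArray(); }
}

class FileSink implements ResettableSink {
    private final Path path;
    private OutputStream out;

    FileSink(Path path) throws IOException {
        this.path = path;
        this.out = open();
    }

    private OutputStream open() throws IOException {
        // TRUNCATE_EXISTING drops whatever an earlier attempt wrote.
        return Files.newOutputStream(path, StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
    }

    public OutputStream stream() { return out; }

    public void reset() throws IOException {
        out.close();
        out = open();
    }

    public void close() throws IOException { out.close(); }
}

class SinkDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sink-", ".txt");
        try (FileSink sink = new FileSink(tmp)) {
            sink.stream().write("output of failed parser".getBytes(StandardCharsets.UTF_8));
            sink.reset();
            sink.stream().write("output of fallback parser".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

A sink like this only helps handlers that write to a stream or writer; handlers that hold their own state would still need their own reset.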





Re: Not-yet-broken breaking changes for Tika 2?

Chris Mattmann
In reply to this post by Nick Burch-3
Why don’t we just store N copies of the stream, and parse it twice?

Of course that’s the ugly way, but currently the way I’ve hacked this in all of
my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just use
that as the weakest baseline and work backwards from there?

Chris





Re: Not-yet-broken breaking changes for Tika 2?

Nick Burch-2
On Thu, 26 Oct 2017, Chris Mattmann wrote:
> Why don’t we just store N copies of the stream, and parse it twice?

I'm not sure that's the challenge though? Using TikaInputStream we can
buffer to a temp file if needed to re-read the input

> Of course that’s the ugly way, but currently the way I’ve hacked this in
> all of my projects is simply to call Tika N times OUTSIDE of Tika. Why
> don’t we just use that as the weakest baseline and work backwards from
> there?

I think our main challenge right now is on the output end. How do you deal
with multiple different Metadata results that might clash after running
Tika several times? How do you deal with multiple (some potentially empty,
some overlapping) XHTML outputs from multiple parses? Can we copy those
approaches?

Thanks
Nick

Re: Not-yet-broken breaking changes for Tika 2?

Chris Mattmann
Thanks Nick.

My general approach to conflicting metadata is simply to define precedence orders.

For example here is one documented from OODT:

https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence 

We can do similar things with Tika, e.g.,

[CoreMetadata.PROPERTIES]
[ImageParser.METADATA]
[TikaOCR.METADATA]


And then start with the top, and then overlay heading downwards. Make sense?

Cheers,
Chris

P.S. The metadata key/value merging principles could be configurable, but a default base one of
overlay according to some configured precedence order maybe in tika-config.xml would be a fine
start.
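
A precedence overlay of this kind might look like the sketch below. It assumes that the first source in the list has the highest precedence and wins on collision; the thread does not pin down the direction, so that is an assumption, as are the class name PrecedenceMerge and the use of plain string maps in place of Tika's Metadata class.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PrecedenceMerge {
    // Overlay metadata maps. Assumption: sources are listed from highest to
    // lowest precedence, so an earlier source wins when keys collide.
    static Map<String, String> overlay(List<Map<String, String>> sourcesByPrecedence) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : sourcesByPrecedence) {
            // putIfAbsent keeps the value already set by a higher-precedence source.
            source.forEach(merged::putIfAbsent);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> core = Map.of("dc:creator", "bob");
        Map<String, String> image = Map.of("dc:creator", "alice", "tiff:ImageWidth", "640");
        // core is listed first, so its dc:creator is the one that survives.
        System.out.println(overlay(List.of(core, image)));
    }
}
```

Reversing the list gives the opposite reading, where later sources overwrite earlier ones.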





Re: Not-yet-broken breaking changes for Tika 2?

Nick Burch-2
On Thu, 26 Oct 2017, Chris Mattmann wrote:

> My general approach to conflicting metadata is simply to define
> precedence orders.
>
> For example here is one documented from OODT:
>
> https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>
> We can do similar things with Tika, e.g.,
>
> [CoreMetadata.PROPERTIES]
> [ImageParser.METADATA]
> [TikaOCR.METADATA]

What happens if two different parsers both output the same bit of metadata
though? eg Tim's example of one giving dc:creator of Tim and the second
giving dc:creator of Chris?


Secondly, what about the XHTML sax events stream? I think that's probably
the harder case...

Nick

Re: Not-yet-broken breaking changes for Tika 2?

Chris Mattmann
On collision, the precedence order defines which key takes precedence and _overwrites_ the
other. Overwrite is but one option (you could save *all* the values; it’s a multi-valued key structure,
so…)

Cheers,
Chris
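
The two collision options mentioned here, overwrite or keep all values, could be made a selectable policy. The sketch below assumes metadata is modelled as a map from key to a mutable list of values; CollisionPolicyDemo and its Policy enum are hypothetical names and not Tika API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CollisionPolicyDemo {
    enum Policy { OVERWRITE, KEEP_ALL }

    // Merge 'incoming' into 'target' in place. The lists stored in 'target'
    // must be mutable. With OVERWRITE the incoming values replace existing
    // ones; with KEEP_ALL new values are appended, skipping exact duplicates.
    static void merge(Map<String, List<String>> target,
                      Map<String, List<String>> incoming,
                      Policy policy) {
        incoming.forEach((key, values) -> {
            if (policy == Policy.OVERWRITE || !target.containsKey(key)) {
                target.put(key, new ArrayList<>(values));
            } else {
                List<String> existing = target.get(key);
                for (String v : values) {
                    if (!existing.contains(v)) existing.add(v);
                }
            }
        });
    }

    public static void main(String[] args) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        merged.put("dc:creator", new ArrayList<>(List.of("Tim")));
        merge(merged, Map.of("dc:creator", List.of("Chris")), Policy.KEEP_ALL);
        System.out.println(merged);
    }
}
```

KEEP_ALL is where the earlier concern about single-valued properties would bite: a property that may only hold one value would still need OVERWRITE or an error.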





Re: Not-yet-broken breaking changes for Tika 2?

Nick Burch-2
Sorry to ignore this for so long...

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> On collision, the precedence order defines what key takes precedence and
> _overwrites_ the other. Overwrite is but one option (you could save
> *all* the values it’s a multi-valued key structure so…)

OK, I think that's fine. I've had a go at updating the wiki for the
metadata case:
https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
And example Tika Config settings for it
https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
If people are happy with how that sounds/looks, I can have a stab at
implementing it, as I *think* it's quite easy


However... that still leaves the Content (XHTML SAX events) case to solve!

Anyone have any ideas on how we can append to or cancel/reset the Content
Handler series of SAX events when we move onto a second+ parser for a
file?

Thanks
Nick
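
For the "append" half of that question, one conceivable approach is a decorator that forwards only the first startDocument, swallows every endDocument, and closes the document once after the last parser has run. The sketch below assumes exactly that and nothing more: SharedDocumentHandler and finish() are hypothetical names, and the XHTML wrapper elements (html, body) that each parser opens and closes would need the same treatment, which is not shown.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator, not part of Tika: lets several parsers write into
// one document by collapsing their startDocument/endDocument calls.
class SharedDocumentHandler extends DefaultHandler {
    private final ContentHandler target;
    private boolean started = false;

    SharedDocumentHandler(ContentHandler target) { this.target = target; }

    @Override
    public void startDocument() throws SAXException {
        if (!started) {            // only the first parser opens the document
            started = true;
            target.startDocument();
        }
    }

    @Override
    public void endDocument() {
        // Swallowed: the document is closed once, in finish().
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        target.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        target.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        target.characters(ch, start, length);
    }

    // Call once after the last parser has run.
    void finish() throws SAXException {
        if (started) target.endDocument();
    }
}

class SharedDocumentDemo {
    public static void main(String[] args) throws SAXException {
        StringBuilder text = new StringBuilder();
        SharedDocumentHandler shared = new SharedDocumentHandler(new DefaultHandler() {
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // Two stand-in "parsers", each believing it owns the whole document.
        for (String part : new String[] {"image metadata; ", "OCR text"}) {
            shared.startDocument();
            shared.characters(part.toCharArray(), 0, part.length());
            shared.endDocument();
        }
        shared.finish();
        System.out.println(text);
    }
}
```

This does not help the NER case, where a later parser needs to read what earlier parsers emitted; that would need buffering as well.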


Re: Not-yet-broken breaking changes for Tika 2?

Nick Burch-2
Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:

> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence and
>> _overwrites_ the other. Overwrite is but one option (you could save *all*
>> the values it’s a multi-valued key structure so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the metadata
> case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Content (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the Content
> Handler series of SAX events when we move onto a second+ parser for a file?
>
> Thanks
> Nick

Re: Not-yet-broken breaking changes for Tika 2?

Chris Mattmann
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )




Re: Not-yet-broken breaking changes for Tika 2?

Nick Burch-2
On Mon, 5 Feb 2018, Chris Mattmann wrote:
> Let's have a go at implementing it! You know my thoughts (make it like
> OODT ;) )

I'm still keen to hear how we can do the text content like OODT!

I have tried to copy the OODT model for the proposed metadata case though
:)

Nick


Re: Not-yet-broken breaking changes for Tika 2?

Mattmann, Chris A (3010)
Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
In short just run through the stream 2x....

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer, OCIO
Manager, Advanced IT Research and Open Source Projects Office (1761)
Manager, NSF and Open Source Programs and Applications Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 

RE: Not-yet-broken breaking changes for Tika 2?

Allison, Timothy B.
Spool to temp file?
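
If the spooling is meant for the input side, so that the stream can be run through twice, it might look like the sketch below. It uses only JDK calls for illustration; as noted earlier in the thread, TikaInputStream can already buffer to a temp file, so SpoolDemo and spool() are hypothetical stand-ins rather than a proposal for new API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SpoolDemo {
    // Copy a one-shot stream to a temp file so it can be opened repeatedly.
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        // createTempFile already created the file, hence REPLACE_EXISTING.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        InputStream once = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path tmp = spool(once);
        try {
            // Each pass gets a fresh stream positioned at the start of the data.
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream again = Files.newInputStream(tmp)) {
                    System.out.println("pass " + pass + ": "
                            + again.readAllBytes().length + " bytes");
                }
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```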


RE: Not-yet-broken breaking changes for Tika 2?

Allison, Timothy B.
In reply to this post by Mattmann, Chris A (3010)
To my mind, the real challenge is what to do with content that should be ignored...

If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the SAX events that have already been written?  Do we need a new handler type that has a reset() method?

Or do we just say, hey, now we're trying a different parser...


-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:[hidden email]]
Sent: Monday, February 5, 2018 12:29 PM
To: [hidden email]
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
In short just run through the stream 2x....

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer, OCIO Manager, Advanced IT Research and Open Source Projects Office (1761) Manager, NSF and Open Source Programs and Applications Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 
On 2/5/18, 9:25 AM, "Nick Burch" <[hidden email]> wrote:

    On Mon, 5 Feb 2018, Chris Mattmann wrote:
    > Let's have a go at implementing it! You know my thoughts (make it like
    > OODT ;) )\
   
    I'm still keen to hear how we can do the text content like OODT!
   
    I have tried to copy the OODT model for the proposed metadata case though
    :)
   
    Nick
   
    > On 2/5/18, 8:37 AM, "Nick Burch" <[hidden email]> wrote:
    >
    >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
    >    any ideas on the content part?
    >
    >    On Tue, 2 Jan 2018, Nick Burch wrote:
    >    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >    >> On collision, the precedence order defines which key takes precedence and
    >    >> _overwrites_ the other. Overwriting is just one option (you could save *all*
    >    >> the values, since it's a multi-valued key structure, so…)
    >    >
    >    > OK, I think that's fine. I've had a go at updating the wiki for the metadata
    >    > case:
    >    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
    >    > And example Tika Config settings for it
    >    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
    >    > If people are happy with how that sounds/looks, I can have a stab at
    >    > implementing it, as I *think* it's quite easy
    >    >
    >    >
    >    > However... that still leaves the Content (XHTML SAX events) case to solve!
    >    >
    >    > Anyone have any ideas on how we can append to or cancel/reset the Content
    >    > Handler series of SAX events when we move onto a second+ parser for a file?
    >    >
    >    > Thanks
    >    > Nick
    >    >
    >    >> On 10/26/17, 9:43 AM, "Nick Burch" <[hidden email]> wrote:
    >    >>
    >    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >    >>    > My general approach to conflicting metadata is simply to define
    >    >>    > precedence orders.
    >    >>    >
    >    >>    > For example here is one documented from OODT:
    >    >>    >
    >    >>    >
    >    >> https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
    >    >>    >
    >    >>    > We can do similar things with Tika, e.g.,
    >    >>    >
    >    >>    > [CoreMetadata.PROPERTIES]
    >    >>    > [ImageParser.METADATA]
    >    >>    > [TikaOCR.METADATA]
    >    >>
    >    >>    What happens if two different parsers both output the same bit of
    >    >>    metadata though? eg Tim's example of one giving dc:creator of Tim and
    >    >>    the second giving dc:creator of Chris?
    >    >>
    >    >>
    >    >>    Secondly, what about the XHTML SAX events stream? I think that's
    >    >>    probably the harder case...
    >    >>
    >    >>    Nick
    >
    >
    >


RE: Not-yet-broken breaking changes for Tika 2?

Allison, Timothy B.
In reply to this post by Nick Burch-2
On the metadata stuff, I'm coming around to Ray Gauss's proposal.  I wanted too much back then, and his solution is super elegant, IIRC.


Re: Not-yet-broken breaking changes for Tika 2?

Chris Mattmann
In reply to this post by Allison, Timothy B.
I think we should just say, OK now we're trying a different parser....




Re: Not-yet-broken breaking changes for Tika 2?

Luís Filipe Nassif
From a forensic point of view, it is better to just say we are trying
another parser and not reset the content handler, because the first parser
may have extracted relevant content before the exception.

To avoid spooling everything to temp files in order to re-read the stream,
I think we can create an optional setinputstreamfactory() method in
TikaInputStream, so that users can implement an InputStreamFactory interface
with a getInputStream method if they do not want to pay the performance hit
of temp files for everything.

Luis
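
A rough sketch of the factory idea, under the assumption that the interface has a single getInputStream method as described above. The interface shape and the ByteArrayStreamFactory class are illustrative only, and how TikaInputStream would accept such a factory is left open here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical interface from the proposal: returns a fresh stream on every call. */
interface InputStreamFactory {
    InputStream getInputStream() throws IOException;
}

/** Example implementation backed by an in-memory byte array, so no temp file is needed. */
class ByteArrayStreamFactory implements InputStreamFactory {
    private final byte[] data;

    ByteArrayStreamFactory(byte[] data) {
        this.data = data;
    }

    @Override
    public InputStream getInputStream() {
        // Each call starts again from the first byte.
        return new ByteArrayInputStream(data);
    }
}
```

With something like this, a fallback parser could ask the factory for a new stream instead of relying on the first stream having been spooled to disk. A caller whose data lives in a database or an evidence container would supply its own implementation.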


RE: Not-yet-broken breaking changes for Tika 2?

Allison, Timothy B.
Do we worry about properly closing tags on an exception?

<body>
        <div parser="parser1">
                <p>
kaboom
        <div parser="parser2">
....

My focus is normally text so broken tags aren't a problem for me...but others?


Re: Not-yet-broken breaking changes for Tika 2?

Luís Filipe Nassif
Mine too, but I know it is important for many use cases. Maybe we could add
some tracking of open tags to XHTMLContentHandler, plus a new method to
close them?
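
One way the open-tag tracking could look, sketched as a standalone SAX filter instead of a change to XHTMLContentHandler itself. The class OpenTagTracker and its closeOpenElements method are hypothetical names; the sketch relies only on the JDK's XMLFilterImpl, which forwards events to the handler set via setContentHandler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: remembers which elements are open so they can be closed after a failure. */
class OpenTagTracker extends XMLFilterImpl {
    // Each entry holds {uri, localName, qName} of an element that has not been closed yet.
    private final Deque<String[]> open = new ArrayDeque<>();

    OpenTagTracker(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.push(new String[] {uri, localName, qName});
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        open.poll(); // forget the innermost open element, if any
        super.endElement(uri, localName, qName);
    }

    /** Close open elements, innermost first, until only `keep` of them remain open. */
    public void closeOpenElements(int keep) throws SAXException {
        while (open.size() > keep) {
            String[] e = open.pop();
            super.endElement(e[0], e[1], e[2]);
        }
    }
}
```

After parser1 fails inside a paragraph, calling closeOpenElements(1) would emit the missing end tags for the paragraph and the parser1 div while leaving the body open for parser2.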


Re: Not-yet-broken breaking changes for Tika 2?

Chris Mattmann
In reply to this post by Allison, Timothy B.
IMO, if parser p1 throws an exception and we move to p2 before p1 has finished
creating its SAX events, we can create a special tag indicating the exception, e.g.
<span class="tika-exception">Message here</span>, and output that before moving to p2 in the chain...
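
A small sketch of how such a marker could be emitted through the SAX ContentHandler interface. The helper class ExceptionMarker is hypothetical; the class value "tika-exception" comes from the suggestion above, and using the exception message as the element text is an assumption.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper: writes a marker element describing why a parser gave up. */
final class ExceptionMarker {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private ExceptionMarker() {
    }

    static void write(ContentHandler handler, Throwable failure) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "class", "class", "CDATA", "tika-exception");
        handler.startElement(XHTML, "span", "span", atts);
        // String.valueOf guards against exceptions that carry no message.
        String message = String.valueOf(failure.getMessage());
        handler.characters(message.toCharArray(), 0, message.length());
        handler.endElement(XHTML, "span", "span");
    }
}
```

A fallback chain would call ExceptionMarker.write(handler, e) in its catch block before handing the same handler to p2. Any elements p1 left open would still need closing first for the output to stay well-formed.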



On 2/7/18, 7:00 AM, "Allison, Timothy B." <[hidden email]> wrote:

    Do we worry about properly closing tags on an exception?
   
    <body>
    <div parser="parser1">
    <p>
    kaboom
    <div parser="parser2>
    ....
   
    My focus is normally text so broken tags aren't a problem for me...but others?
   
    -----Original Message-----
    From: Luís Filipe Nassif [mailto:[hidden email]]
    Sent: Monday, February 5, 2018 5:34 PM
    To: [hidden email]
    Subject: Re: Not-yet-broken breaking changes for Tika 2?
   
    From a forensic use case it is better just saying we are trying another parser and not resetting the content handler, because the first parser can extract relevant content before the exception.
   
    To not spool everything to temp files to re-read the stream, I think we can create an optional setinputstreamfactory() method in TikaInputStream, so the user can implement an InputStreamFactory interface with a getInputStream method, if he does not want to pay a performance hit with temp files for everything.
   
    Luis
   
    Em 5 de fev de 2018 4:52 PM, "Chris Mattmann" <[hidden email]>
    escreveu:
   
    I think we should just say, OK now we're trying  a different parser....
   
   
   
    On 2/5/18, 9:51 AM, "Allison, Timothy B." <[hidden email]> wrote:
   
        To my mind, the real challenge is what to do with content that should be ignored...
   
        If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written?  Do we need a new handler type that has a reset() method?
   
        Or do we just say, hey, now we're trying a different parser...
   
   
        -----Original Message-----
        From: Mattmann, Chris A (1761) [mailto:[hidden email]]
        Sent: Monday, February 5, 2018 12:29 PM
        To: [hidden email]
        Subject: Re: Not-yet-broken breaking changes for Tika 2?
   
        Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
        In short just run through the stream 2x....
   
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    ++++++++++++++
        Chris Mattmann, Ph.D.
        Associate Chief Technology and Innovation Officer, OCIO Manager, Advanced IT Research and Open Source Projects Office (1761) Manager, NSF and Open Source Programs and Applications Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
        Office: 180-503E, Mailstop: 180-502
        Email: [hidden email]
        WWW:  http://sunset.usc.edu/~mattmann/
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    ++++++++++++++
        Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
        WWW: http://irds.usc.edu/
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    ++++++++++++++
   
   
        On 2/5/18, 9:25 AM, "Nick Burch" <[hidden email]> wrote:
   
            On Mon, 5 Feb 2018, Chris Mattmann wrote:
            > Let's have a go at implementing it! You know my thoughts (make it like
            > OODT ;) )\
   
            I'm still keen to hear how we can do the text content like OODT!
   
            I have tried to copy the OODT model for the proposed metadata case though
            :)
   
            Nick
   
            > On 2/5/18, 8:37 AM, "Nick Burch" <[hidden email]> wrote:
            >
            >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
            >    any ideas on the content part?
            >
            >    On Tue, 2 Jan 2018, Nick Burch wrote:
            >    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
            >    >> On collision, the precedence order defines what key takes precedence and
            >    >> _overwrites_ the other. Overwrite is but one option (you could save *all*
            >    >> the values; it’s a multi-valued key structure so…)
            >    >
            >    > OK, I think that's fine. I've had a go at updating the wiki for the metadata
            >    > case:
            >    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
            >    > And example Tika Config settings for it
            >    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
            >    > If people are happy with how that sounds/looks, I can have a stab at
            >    > implementing it, as I *think* it's quite easy
            >    >
            >    >
            >    > However... that still leaves the Content (XHTML SAX events) case to solve!
            >    >
            >    > Anyone have any ideas on how we can append to or cancel/reset the Content
            >    > Handler series of SAX events when we move onto a second+ parser for a file?
            >    >
            >    > Thanks
            >    > Nick
            >    >
            >    >> On 10/26/17, 9:43 AM, "Nick Burch" <[hidden email]> wrote:
            >    >>
            >    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
            >    >>    > My general approach to conflicting metadata is simply to define
            >    >>    > precedence orders.
            >    >>    >
            >    >>    > For example here is one documented from OODT:
            >    >>    >
            >    >>    >
            >    >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
            >    >>    >
            >    >>    > We can do similar things with Tika, e.g.,
            >    >>    >
            >    >>    > [CoreMetadata.PROPERTIES]
            >    >>    > [ImageParser.METADATA]
            >    >>    > [TikaOCR.METADATA]
            >    >>
            >    >>    What happens if two different parsers both output the same bit of metadata
            >    >>    though? eg Tim's example of one giving dc:creator of Tim and the second
            >    >>    giving dc:creator of Chris?
            >    >>
            >    >>
            >    >>    Secondly, what about the XHTML sax events stream? I think that's probably
            >    >>    the harder case...
            >    >>
            >    >>    Nick
            >
            >
            >
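
The precedence-order idea quoted above, with its two collision options (overwrite, or keep all values in a multi-valued structure), can be sketched roughly as follows. `MetadataMerger` and `OnCollision` are hypothetical names, plain maps stand in for Tika's Metadata class, and the direction of precedence (later sources win) is an assumption of this sketch rather than something the thread settles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: merges per-parser metadata according to a precedence order. */
class MetadataMerger {
    enum OnCollision { OVERWRITE, KEEP_ALL }

    /**
     * @param sources per-parser metadata, listed from lowest to highest precedence
     *                (an assumption of this sketch)
     * @param policy  what to do when two sources supply the same key
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> sources, OnCollision policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sources) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                if (policy == OnCollision.OVERWRITE || !merged.containsKey(e.getKey())) {
                    // New key, or the higher-precedence source replaces the earlier values.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else {
                    // KEEP_ALL: append, relying on the multi-valued structure.
                    merged.get(e.getKey()).addAll(e.getValue());
                }
            }
        }
        return merged;
    }
}
```

With the dc:creator example from the thread (one parser reporting Tim, a second reporting Chris), OVERWRITE keeps only the second parser's value, while KEEP_ALL keeps both in source order.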