TIKA-1509 (2.x breaking parser change) - ready for first review!


TIKA-1509 (2.x breaking parser change) - ready for first review!

Nick Burch-3
Hi All

As promised, I've finally had a go to try and implement my ideas for
TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
breaking 2.x parser change

My work so far is in this github branch, and is ready for review!
https://github.com/apache/tika/tree/multiple-parsers


It seems to work fine for the Fallback case, and for the Supplemental
case. You can set a policy that controls how clashing metadata is handled,
currently "first one to set a key wins", "last one to set a key wins",
"ignore previous parsers", and "keep old and new unique values"
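
To make the four policies concrete, here is a minimal sketch of how such a merge could behave. It is not the code from the branch: the class, the enum constants and the merge method are hypothetical names, and metadata is modelled as a plain map of key to list of values rather than Tika's own Metadata class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: all names here are assumptions, not the branch's API. */
class MetadataClashSketch {

    /** One constant per policy described in the mail. */
    enum ClashPolicy { FIRST_WINS, LAST_WINS, DISCARD_PREVIOUS, KEEP_ALL_UNIQUE }

    /**
     * Combines metadata from earlier parsers with metadata from the latest
     * parser and returns a new map; neither input is modified.
     */
    static Map<String, List<String>> merge(Map<String, List<String>> previous,
                                           Map<String, List<String>> latest,
                                           ClashPolicy policy) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        // "Ignore previous parsers" starts from an empty map.
        if (policy != ClashPolicy.DISCARD_PREVIOUS) {
            previous.forEach((k, v) -> result.put(k, new ArrayList<>(v)));
        }
        for (Map.Entry<String, List<String>> e : latest.entrySet()) {
            List<String> existing = result.get(e.getKey());
            if (existing == null) {
                // No clash: the key is new, so every policy accepts it.
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
                continue;
            }
            switch (policy) {
                case FIRST_WINS:
                    break; // keep what the earlier parser set
                case LAST_WINS:
                    result.put(e.getKey(), new ArrayList<>(e.getValue()));
                    break;
                case KEEP_ALL_UNIQUE:
                    for (String v : e.getValue()) {
                        if (!existing.contains(v)) {
                            existing.add(v);
                        }
                    }
                    break;
                default:
                    break; // DISCARD_PREVIOUS never reaches a clash
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> previous = new LinkedHashMap<>();
        previous.put("title", Arrays.asList("A"));
        Map<String, List<String>> latest = new LinkedHashMap<>();
        latest.put("title", Arrays.asList("B"));
        for (ClashPolicy p : ClashPolicy.values()) {
            System.out.println(p + " -> " + merge(previous, latest, p));
        }
    }
}
```

Returning a fresh map keeps each parser's own output untouched, which makes it easy to compare policies side by side.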

I've also done a proof of concept for "pick best" case, to try running the
text parser with a specified set of different charsets, capture the text
from each, "pick the best" (hard coded 1st...) then run for real with that
one.
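
The "pick best" flow described above (try each charset, capture the text, choose one, then run for real) can be sketched roughly as follows. This is only an illustration with hypothetical names. The scoring rule here, fewest U+FFFD replacement characters with ties going to the earliest candidate, is an assumption made for the example; the proof of concept itself hard codes the first candidate.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: names and the scoring heuristic are assumptions. */
class PickBestCharsetSketch {

    /** Trial run: decode the same bytes once per candidate charset. */
    static Map<Charset, String> decodeWithEach(byte[] data, List<Charset> candidates) {
        Map<Charset, String> trials = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            // new String(byte[], Charset) substitutes U+FFFD for malformed input.
            trials.put(cs, new String(data, cs));
        }
        return trials;
    }

    /** Counts replacement characters, a crude sign of a wrong charset. */
    static long replacementCount(String text) {
        return text.chars().filter(c -> c == '\uFFFD').count();
    }

    /** Picks the candidate with the fewest replacements; null if there are none. */
    static Charset pickBest(Map<Charset, String> trials) {
        Charset best = null;
        long bestScore = Long.MAX_VALUE;
        for (Map.Entry<Charset, String> e : trials.entrySet()) {
            long score = replacementCount(e.getValue());
            if (score < bestScore) { // strict: earlier candidates win ties
                best = e.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        Map<Charset, String> trials = decodeWithEach(data, candidates);
        Charset chosen = pickBest(trials);
        // "Run for real": here that just means using the chosen decoding.
        System.out.println(chosen + ": " + trials.get(chosen));
    }
}
```

Note that single-byte charsets such as ISO-8859-1 never produce replacement characters, so with this heuristic the order of the candidates matters; a real selector would need a better measure of text quality.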


Key TODOs - Support InputStreamFactory, properly work out what mimetypes
to claim to support, Tika Config XML friendly helper for the metadata
clash policy, review ContentHandlerFactory signature and tweak if needed.

Proposed breaking 2.x change - add a second parse method that takes a
ContentHandlerFactory instead of a ContentHandler, with most parsers
just grabbing a single handler from the factory and using that as before
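
As a rough illustration of that proposal, the sketch below uses stand-in interfaces rather than Tika's real Parser and ContentHandlerFactory types. HandlerFactory, SketchParser and the other names are hypothetical, and the Metadata and ParseContext arguments are left out to keep it short. The point is only that a default overload can take one handler from the factory and delegate, so simple parsers need no changes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: stand-in types, not Tika's actual interfaces. */
class HandlerFactorySketch {

    /** Stand-in factory: hands out a handler each time it is asked. */
    interface HandlerFactory {
        ContentHandler newHandler();
    }

    /** Stand-in parser interface with both parse overloads. */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler)
                throws IOException, SAXException;

        /** Default for simple parsers: grab a single handler and carry on as before. */
        default void parse(InputStream stream, HandlerFactory factory)
                throws IOException, SAXException {
            parse(stream, factory.newHandler());
        }
    }

    /** Collects character events so the output can be inspected. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** A trivial parser that only implements the single-handler method. */
    static class PlainTextParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler)
                throws IOException, SAXException {
            char[] chars = new String(stream.readAllBytes(), StandardCharsets.UTF_8).toCharArray();
            handler.startDocument();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();
        }
    }

    public static void main(String[] args) throws Exception {
        TextCollector collector = new TextCollector();
        HandlerFactory factory = () -> collector;
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        new PlainTextParser().parse(in, factory);
        System.out.println(collector.text);
    }
}
```

A parser that runs several sub-parsers could override the factory overload instead and request a fresh handler per attempt, which is where the factory becomes useful.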


Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
I stop? Carry on? Modify it? Other?

Nick

Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

David Meikle
Nice one Nick!  Will take a look this week.

Cheers,
Dave

On 14 March 2018 at 17:38, Nick Burch <[hidden email]> wrote:

> Hi All
>
> As promised, I've finally had a go to try and implement my ideas for
> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
> breaking 2.x parser change
>
> My work so far is in this github branch, and is ready for review!
> https://github.com/apache/tika/tree/multiple-parsers
>
>
> It seems to work fine for the Fallback case, and for the Supplemental
> case. You can set a policy that controls how clashing metadata is handled,
> currently "first one to set a key wins", "last one to set a key wins",
> "ignore previous parsers", and "keep old and new unique values"
>
> I've also done a proof of concept for "pick best" case, to try running the
> text parser with a specified set of different charsets, capture the text
> from each, "pick the best" (hard coded 1st...) then run for real with that
> one.
>
>
> Key TODOs - Support InputStreamFactory, properly work out what mimetypes
> to claim to support, Tika Config XML friendly helper for the metadata clash
> policy, review ContentHandlerFactory signature and tweak if needed.
>
> Proposed breaking 2.x change - add second parse method that takes
> ContentHandlerFactory instead of ContentHandler, with most parsers getting
> that just grabbing a single one and using that as before
>
>
> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> I stop? Carry on? Modify it? Other?
>
> Nick
>

Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

Chris Mattmann
Completely agree, awesome job Nick.

I will definitely try this week as well.

Thank you!

Sincerely,
Chris



On 3/18/18, 2:47 PM, "David Meikle" <[hidden email]> wrote:

    Nice one Nick!  Will take a look this week.
   
    Cheers,
    Dave
   

RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

Allison, Timothy B.
Yes, this is an impressive step forward.  Thank you, Nick!

-----Original Message-----
From: Chris Mattmann [mailto:[hidden email]]
Sent: Sunday, March 18, 2018 6:00 PM
To: [hidden email]
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

Completely agree, awesome job Nick.

I will definitely try this week as well.

Thank you!

Sincerely,
Chris




Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

Nick Burch-2
In the absence of complaints, I've gone ahead and merged this to Tika's
master branch for 1.x.  If I've done it right, there won't be any breaking
changes for 1.18, as everything is either new or marked as deprecated
pending finalisation.

I haven't merged to 2.x yet, as it'd be good to get some feedback on the
proposed overloaded Parser parse method taking a ContentHandlerFactory
(to go alongside the long-standing ContentHandler one for simpler
cases)

Nick


RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

Allison, Timothy B.
Nick,
  It looks like you merged to master, which, I think, is the base for 2.0.0-SNAPSHOT.  I've been treating branch_1x as the master for 1.x.[1]
  Any objections to me cutting 1.18-SNAPSHOT from branch_1x?

        Best,

                 Tim
 
[1] https://lists.apache.org/thread.html/12342a115623d157063eb9f40064ccf21561cdab5cfb327f3f368aca@%3Cdev.tika.apache.org%3E

-----Original Message-----
From: Nick Burch [mailto:[hidden email]]
Sent: Sunday, April 8, 2018 8:47 AM
To: [hidden email]
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

In the absence of complaints, I've gone ahead and merged this to Tika's master branch for 1.x.  If I've done it right, there won't be any breaking changes for 1.18, as everything is either new or marked as deprecated pending finalisation.

I haven't merged to 2.x yet, as it'd be good to get some feedback on the proposed overloaded Parser parse method taking a ContentHandlerFactory (to go alongside the long-standing ContentHandler one for simpler cases)

Nick


RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

Nick Burch-2
On Tue, 10 Apr 2018, Allison, Timothy B. wrote:
> It looks like you merged to master, which, I think is the base for
> 2.0.0-SNAPSHOT.  I've been treating branch_1x as the master for 1.x.[1]

Ah, I'd thought that the 2.x branch (with the tika-parser-bundles /
tika-parser-modules folders) was the one for 2.x, and master was still for
1.x. I haven't done any of my other fixes to the branch_1x branch

> Any objections to me cutting 1.18-SNAPSHOT from branch_1x?

As long as that has all the other fixes on, not from me. I can merge over
my multi-parser stuff to branch_1x next week for trying in 1.19

Nick

RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

Allison, Timothy B.
Sorry...2.x is the model.  I think I stopped making updates to 2.x around last ApacheCon, and I didn't want to risk losing changes from master.

Should we rename 2.x -> 2.x_working_draft? Or similar?

-----Original Message-----
From: Nick Burch [mailto:[hidden email]]
Sent: Monday, April 9, 2018 10:52 PM
To: [hidden email]
Subject: RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

On Tue, 10 Apr 2018, Allison, Timothy B. wrote:
> It looks like you merged to master, which, I think is the base for
> 2.0.0-SNAPSHOT.  I've been treating branch_1x as the master for
> 1.x.[1]

Ah, I'd thought that the 2.x branch (with the tika-parser-bundles / tika-parser-modules folders) was the one for 2.x, and master was still for 1.x. I haven't done any of my other fixes to the branch_1x branch

> Any objections to me cutting 1.18-SNAPSHOT from branch_1x?

As long as that has all the other fixes on, not from me. I can merge over my multi-parser stuff to branch_1x next week for trying in 1.19

Nick