Tika 1.18?

Tika 1.18?

Allison, Timothy B.
All,
There have been some important bug fixes, a few new capabilities, and the upgrading of dependencies because of CVEs.  There are a bunch of mime tickets from Andreas Meier that I’d like to get into 1.18.  Is there anything else that is critical?
Schedule-wise, I propose getting changes in by, say, next Friday (3/9), running regression tests the next week, and cutting RC1 the following week[0].
WDYT?

Cheers,

            Tim

[0] week = “open source week” which can be significantly longer than a calendar week when surprises emerge. 😊

Timothy B. Allison, Ph.D.
Principal Artificial Intelligence Engineer
T835/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)



Re: Tika 1.18?

Nick Burch-2
On Thu, 1 Mar 2018, Allison, Timothy B. wrote:
> There have been some important bug fixes, a few new capabilities, and
> the upgrading of dependencies because of CVEs.  There are a bunch of
> mime tickets from Andreas Meier that I’d like to get into 1.18.  Is
> there anything else that is critical?

I've had a busy few weeks, so I haven't yet had a chance to try out my
proposed multi-parser stuff for 2.x. I'll hopefully take a look next week,
but even assuming the fastest review cycle and everyone loving it, I can't see
us all being ready to sign off on those "2.x breaking changes" until
probably April.

Given that, doing an interim 1.x release soon makes sense to me!

Nick

Re: Tika 1.18?

Chris Mattmann
Same: makes perfect sense to me, and let's do it. (I just (finally) updated Tika Python
downstream to be based on Tika 1.16; I guess I should get it based on 1.17 soon too.)

https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17

Cheers,
Chris


Re: Tika 1.18?

Luís Filipe Nassif
I think we should work around TIKA-2591, and I would like to work
on TIKA-1466 (what do you think?) and fix TIKA-2568.

Cheers,
Luis


RE: Tika 1.18?

Allison, Timothy B.
TIKA-2591 and TIKA-2568
+1

TIKA-1466 -- how long will it take, do you think?  This seems potentially non-trivial...


Re: Tika 1.18?

Luís Filipe Nassif
If I make no progress on TIKA-1466 until 3/9, you can start the release
process without it. But do you devs agree with the proposed change: allow
overriding of glob patterns in custom-mimetypes.xml?


RE: Tika 1.18?

Allison, Timothy B.
> But do you devs agree with the proposed change: allow overriding of glob patterns in custom-mimetypes.xml?

+1 from me


Re: Tika 1.18?

Nick Burch-2
In reply to this post by Luís Filipe Nassif
On Fri, 2 Mar 2018, Luís Filipe Nassif wrote:
> If I make no progress on TIKA-1466 until 3/9, you can start the release
> process without it. But do you devs agree with the proposed change: allow
> overriding of glob patterns in custom-mimetypes.xml?

What happens if you have two different custom files which both claim the
same glob?

We have historically been a bit stricter about overriding built-in types,
in part to avoid people doing silly things by mistake, and in part to push
people a bit more towards contributing fixes/enhancements for built-in
types. I think the latter is less of a thing today, as we've a lot more
covered as standard, so it's just the former we need to worry about.

How do we help people know when they have conflicting overrides (possibly
from different projects), help them sensibly merge or turn off Tika-provided
magic+definitions, and alert them when their copied +
customised version probably wants updating following a Tika upgrade that gives
a newer definition? If we do a better job of those than we currently do,
then I'm very happy to +1 it :)

Nick

Re: Tika 1.18?

Luís Filipe Nassif
I thought about logging any custom-mimetype override that is applied, so the user
will be warned about it. Maybe we could additionally create a specific attribute
in the mimetype definition XML to indicate that it must override the default one
instead of aborting. As for multiple conflicting custom mimes from different
(external) projects, Tika currently aborts, and that is already a problem now.
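
To make the idea above concrete, here is a minimal sketch in Python (not Tika code, and not a proposed patch). It assumes a hypothetical `GlobRegistry` with a per-definition `override` flag standing in for the suggested XML attribute; the mime type names are placeholders. An override of a built-in glob is logged as a warning, while a conflict between two custom definitions still aborts, as it does today.

```python
import logging

logger = logging.getLogger("mime_glob_sketch")


class GlobConflictError(Exception):
    """Two definitions claim the same glob and the conflict cannot be resolved."""


class GlobRegistry:
    """Hypothetical glob -> mime type table; an illustration only."""

    def __init__(self):
        # glob pattern -> (mime type, source file, is_builtin)
        self._globs = {}

    def register(self, glob, mime_type, source, builtin=False, override=False):
        existing = self._globs.get(glob)
        if existing is not None and existing[0] != mime_type:
            old_type, old_source, old_builtin = existing
            # Only a custom definition flagged with override may replace
            # a built-in one; every other conflict aborts.
            if not (override and old_builtin and not builtin):
                raise GlobConflictError(
                    f"{glob}: {mime_type} ({source}) conflicts with "
                    f"{old_type} ({old_source})"
                )
            logger.warning(
                "glob %s: %s from %s overrides built-in %s",
                glob, mime_type, source, old_type,
            )
        self._globs[glob] = (mime_type, source, builtin)

    def lookup(self, glob):
        entry = self._globs.get(glob)
        return entry[0] if entry else None
```

With this shape, the user sees every override in the log, and two external projects that both claim the same glob still fail loudly rather than silently shadowing each other.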

So I think it needs additional discussion and should not be addressed in
the next release. I will copy/paste this discussion into the Jira issue.

I would still like to see the detection of MTS videos fixed, but it conflicts
with another existing mime glob. Is there any workaround for this specific
case? If yes, I can open a different ticket.




RE: Tika 1.18?

Allison, Timothy B.
In reply to this post by Allison, Timothy B.
All,

  I think I've made the updates that I wanted to make sure got into 1.18.  It looks like PDFBox is going to start their release cycle shortly.  Should we wait for PDFBox 2.0.9?

  That may add a week or two to our release, although, frankly, it might not.  We can start running the regression tests March 9(ish) and see if anything dire appears...

  Cheers,

          Tim


Re: Tika 1.18?

Chris Mattmann
Sounds good to me, thanks Tim. Happy to line it up with PDFBox 2.0.9.



RE: Tika 1.18?

Allison, Timothy B.
I'm working with PDFBox on regression tests for 2.0.9 now.  I'll probably kick off our own preliminary full corpus regression tests shortly... ~2018-03-12T20:00 UTC

Anyone have anything they'd like to get in before I run the regression tests?  I can certainly put it off a few days.

Cheers,

             Tim


RE: Tika 1.18?

Nick Burch-2
On Mon, 12 Mar 2018, Allison, Timothy B. wrote:
> Anyone have anything they'd like to get in before I run the regression
> tests?  I can certainly put it off a few days.

I've made some progress on the metadata-only fallback/merge multiple
parser work from https://wiki.apache.org/tika/CompositeParserDiscussion,
but it's still some way from finished. I don't think I can cause any
regressions though! It can also wait for 1.19 if I don't get it stable in
time to come off a branch.

Nick