1.20?

Tim Allison
All,
   POI 4.0.1 will be out shortly with some important bug fixes.  What would
you all think of targeting the 1st/2nd week of December for 1.20?

     Cheers,
         Tim

Re: 1.20?

Chris Mattmann
Love it and I can align tika-python with that too ☺

From: Tim Allison <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Tuesday, November 20, 2018 at 3:04 PM
To: "[hidden email]" <[hidden email]>
Subject: 1.20?

All,
   POI 4.0.1 will be out shortly with some important bug fixes.  What would
you all think of targeting 1st/2nd week of December for 1.20?

     Cheers,
         Tim


Re: 1.20?

Tim Allison
Dave,
  Should I try to get the Docker plugin working again?

On Tue, Nov 20, 2018 at 6:21 PM Chris Mattmann <[hidden email]> wrote:

> Love it and I can align tika-python with that too ☺

Re: 1.20?

lewis john mcgibbney-2
In reply to this post by Tim Allison
+1. It would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison <[hidden email]> wrote:
> All,
>    POI 4.0.1 will be out shortly with some important bug fixes.  What would
> you all think of targeting 1st/2nd week of December for 1.20?
>
>      Cheers,
>          Tim
>

Re: 1.20?

David Meikle
In reply to this post by Tim Allison
Hi,
On Wed, 21 Nov 2018 at 13:00, Tim Allison <[hidden email]> wrote:

> Dave,
>   Should I try to get the Docker plugin working again?
>

That would be great. I think I may have gone down the wrong path building
an image at package time, as there doesn't seem to be an easy way to
publish it under an Apache-labelled org on Docker Hub unless it builds from
source.

I have some time over the weekend, so I could update it to where I got to and
see what you think.

Cheers,
Dave

Re: 1.20?

Tim Allison
Any blockers on 1.20?  I'm going to kick off the regression tests shortly.

On Fri, Nov 30, 2018 at 7:39 PM <[hidden email]> wrote:

>
> Hi,
> On Wed, 21 Nov 2018 at 13:00, Tim Allison <[hidden email]> wrote:
>
> > Dave,
> >   Should I try to get the Docker plugin working again?
> >
>
> That would be great. I think I may have went down the wrong path building
> an image at package time, as there doesn't seem to be an easy way to
> publish it as an Apache labelled org on Dockerhub unless it builds from
> source.
>
> I have some time over the weekend, so could update to where I got to and
> see what you think.
>
> Cheers,
> Dave

Re: 1.20?

Tim Allison
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".

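To illustrate the reasoning above, here is a minimal sketch of why repeated header/footer text raises the total token count while leaving the unique token count unchanged. The tokenizer, the sample strings, and the choice of which extract carries the repeats are all assumptions made for the illustration; this is not the code that produced the reports.

```python
import re


def token_stats(text):
    """Return (num_tokens, num_unique_tokens) using a naive word tokenizer."""
    tokens = re.findall(r"\w+", text.lower())
    return len(tokens), len(set(tokens))


# Hypothetical slide text and footer, invented for this illustration.
slide_body = "quarterly results revenue grew"
footer = "acme confidential"

# Assumption: extract A repeats the footer, extract B emits it once.
extract_a = " ".join([slide_body, footer, footer, footer])
extract_b = " ".join([slide_body, footer])

num_tokens_a, num_unique_tokens_a = token_stats(extract_a)
num_tokens_b, num_unique_tokens_b = token_stats(extract_b)

print(num_tokens_a, num_unique_tokens_a)
print(num_tokens_b, num_unique_tokens_b)
```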
I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was (incorrectly, IMHO) inserting -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2
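
As a rough sketch of how a column like this could be derived, the snippet below counts tokens that occur more often in extract A than in extract B and prints the top entries in the same "token: count" style. The function name, the tokenizer, and the sample inputs are hypothetical, and the exact definition used by the reports may differ.

```python
import re
from collections import Counter


def top_unique_token_diffs(text_a, text_b, n=10):
    """Return up to n (token, surplus) pairs for tokens more frequent in A than in B."""
    # Naive tokenizer (an assumption): alphabetic words, or digit groups joined by commas.
    pattern = r"\d+(?:,\d+)*|[a-z]+"
    counts_a = Counter(re.findall(pattern, text_a.lower()))
    counts_b = Counter(re.findall(pattern, text_b.lower()))
    surplus = counts_a - counts_b  # Counter subtraction keeps only positive counts
    # Sort by descending count, then alphabetically, so the order is deterministic.
    return sorted(surplus.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical extracts: A still carries style markup, B does not.
text_a = ("color: rgb(0,0,0); background: rgb(247,247,247); "
          "font-weight: bold; public class Foo")
text_b = "public class Foo"

diffs = top_unique_token_diffs(text_a, text_b)
print(" | ".join(f"{token}: {count}" for token, count in diffs))
```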

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.

On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <[hidden email]> wrote:

>
> Any blockers on 1.20?  I'm going to kick off the regression tests shortly.

Re: 1.20?

Chris Mattmann
Roll forward! Yay!

From: Tim Allison <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Thursday, December 13, 2018 at 7:02 AM
To: "[hidden email]" <[hidden email]>
Subject: Re: 1.20?

Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".

I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was (incorrectly, IMHO, inserting) -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.

On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <[hidden email]> wrote:

Any blockers on 1.20?  I'm going to kick off the regression tests shortly.


Re: 1.20?

Luís Filipe Nassif
In reply to this post by Tim Allison
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files
in the likely-broken folder, but it seems Tika was able to extract some text
from them before. Do you know whether those files are really broken, and why
Tika extracted text from them before?

Thank you,
Luis

On Thu, Dec 13, 2018 at 13:02, Tim Allison <[hidden email]> wrote:

> Reports are here:
>
> http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>
> I'm going to revert the mp4 parser, and commit the few dependency
> upgrades I ran.
>
> The _major_ difference in content for ppt is explained by the
> duplication of header/footer info.  To confirm this, note that the
> values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> identical for nearly all ppt->ppt, but there are far more tokens in
> "num_tokens_a" vs "num_tokens_b".
>
> I also see that we're losing content in x-java and x-groovy, etc., but
> that's because we're now suppressing the style markup that our parser
> was (incorrectly, IMHO, inserting) -- check the values in
> "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> weight: 3 | family: 2
>
> In short, I think we're good to go.  Will roll rc1 later today or
> (more likely) tomorrow unless there are objections.

Re: 1.20?

Tim Allison
Thank you for reading the reports!!!

The files are very likely broken.  I can take a look.  The change was
probably because of an "upgrade" to junrar.  Should I revert to the
version we used in 1.19.1?

On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <[hidden email]> wrote:

>
> Hi Tim,
>
> Reading your great reports, I also saw some new exceptions with RAR files
> in likely broken folder, but seems tika was able to extract some text from
> them before. Do you know if those files are really broken and why tika
> extracted text from them before?
>
> Thank you,
> Luis

Re: 1.20?

Tim Allison
Let me actually take a look before answering. Sorry!

On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <[hidden email]> wrote:

>  Thank you for reading the reports!!!
>
> The files are very likely broken.  I can take a look.  The change was
> probably because of an "upgrade" to junrar.  Should I revert to the
> version we used in 1.19.1?

Re: 1.20?

Tim Allison
Thank you, again, Luís Filipe Nassif!  There's no point in having
reports unless we pay attention to them :P.  I reverted junrar to
where it was in 1.19.1. I also reverted jackcess based on the reports.

All,
  On the theory that it isn't a great idea to push to production on a
Friday, I'm going to let the recent changes rest over the weekend.
I'll rerun some tests on a subset of the regression corpus on Monday
and then roll rc1.  If anyone wants to kick the tires on the recent
version changes, including parsers that depend on the upgraded guava,
that'd be great!

Onward!

Cheers,

           Tim

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison <[hidden email]> wrote:

>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <[hidden email]> wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <[hidden email]> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in likely broken folder, but seems tika was able to extract some text from
>> > them before. Do you know if those files are really broken and why tika
>> > extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > Em qui, 13 de dez de 2018 às 13:02, Tim Allison <[hidden email]>
>> > escreveu:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was (incorrectly, IMHO, inserting) -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <[hidden email]> wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests
>> > > shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM <[hidden email]> wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <[hidden email]> wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have went down the wrong path
>> > > building
>> > > > > an image at package time, as there doesn't seem to be an easy way to
>> > > > > publish it as an Apache labelled org on Dockerhub unless it builds from
>> > > > > source.
>> > > > >
>> > > > > I have some time over the weekend, so could update to where I got to
>> > > and
>> > > > > see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >

Re: 1.20?

Tim Allison
In reply to this post by Tim Allison
Reports on mp4s, junrar, msaccess and a random subset of the
regression corpus are available here:
http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz


On Thu, Dec 13, 2018 at 5:34 PM Tim Allison <[hidden email]> wrote:

>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <[hidden email]> wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <[hidden email]> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in likely broken folder, but seems tika was able to extract some text from
>> > them before. Do you know if those files are really broken and why tika
>> > extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > Em qui, 13 de dez de 2018 às 13:02, Tim Allison <[hidden email]>
>> > escreveu:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was (incorrectly, IMHO, inserting) -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <[hidden email]> wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests
>> > > shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM <[hidden email]> wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <[hidden email]> wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have went down the wrong path
>> > > building
>> > > > > an image at package time, as there doesn't seem to be an easy way to
>> > > > > publish it as an Apache labelled org on Dockerhub unless it builds from
>> > > > > source.
>> > > > >
>> > > > > I have some time over the weekend, so could update to where I got to
>> > > and
>> > > > > see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >