including refactored docs from govdocs1 in test suite

Allison, Timothy B.
All,

  As part of TIKA-1512, I found that I can delete all of the contents, including the metadata, except for one hyperlink in two documents from govdocs1 and still get the proper behavior -- fail before fix, work after fix.

  These documents are in the public domain.

  Is it ok to include these modified documents in our test suite or should I avoid inclusion?

  Happy to avoid inclusion for the sake of a quick release of 1.8, and then we have time to discuss/determine the way ahead... unless the answer is obvious.
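
The "fail before fix, work after fix" check described above can be sketched roughly as follows. This is only an illustration in Python, not Tika or POI code: the function names, the `HYPERLINK` field layout, and the one-line stand-in for the stripped-down document are all assumptions made up for the example, since the thread does not say what makes the real hyperlinks non-standard.

```python
# Hypothetical sketch: none of these names come from Tika or POI.

def extract_link_buggy(field: str) -> str:
    """Pre-fix behaviour: assumes the target is always quoted, e.g. HYPERLINK "url"."""
    return field.split('"')[1]  # IndexError when the target is not quoted


def extract_link_fixed(field: str) -> str:
    """Post-fix behaviour: tolerate an unquoted target instead of throwing."""
    parts = field.split('"')
    if len(parts) >= 3:
        return parts[1]
    # Fall back to the last whitespace-separated token for unquoted targets.
    tokens = field.split()
    return tokens[-1] if len(tokens) > 1 else ""


# Stand-in for the minimized test file: everything removed except one hyperlink.
MINIMAL_DOC = "HYPERLINK http://example.com/"


def fails_before_fix() -> bool:
    """True if the pre-fix code throws on the minimized input."""
    try:
        extract_link_buggy(MINIMAL_DOC)
    except IndexError:
        return True
    return False
```

A regression test built this way asserts both halves: the old code throws on the minimized input, and the new code returns the link.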

         Best,

                     Tim

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Monday, March 30, 2015 7:03 AM
To: [hidden email]
Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1

Unless there are objections, I'd like these to be resolved before 1.8:

TIKA-1584 -- I'll fix
TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs (IndexOutOfBoundsExceptions), but I'll leave this open and do some more digging to see if we need to open a ticket at the POI level
TIKA-1511 -- I'll remove "provided" for xerial

TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?

I'll have these fixes completed by noon EDT.  Should I run against govdocs1 before or after the RC?

My last build of Tika app (a few days ago) ballooned to ~43MB, and that's before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last build, we are still including ~4MB of PDFs (README.NLDAS1.pdf and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server jars.

Best,

              Tim



-----Original Message-----
From: Tyler Palsulich [mailto:[hidden email]]
Sent: Sunday, March 29, 2015 9:13 AM
To: [hidden email]
Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
something else pops up).

Thank you everyone.

Tyler
On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <[hidden email]> wrote:

> +1 for 1.8
>
> Hong-Thai
>
> > On 28 Mar 2015, at 16:01, Tyler Palsulich <[hidden email]> wrote:
> >
> > Hi Folks,
> >
> > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
> > release a new version of Tika. I'll volunteer to be the release manager
> > again.
> >
> > Should we release this as 1.8 or 1.7.1?
> >
> > Does anyone have any last minute issues they'd like to finish and see in
> > Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
> > TIKA-1586). Any others?
> >
> > Have a good weekend,
> > Tyler
>

Re: including refactored docs from govdocs1 in test suite

Tyler Palsulich
Can you copy the hyperlink into a new doc and change the URL? I have no
idea about including the modified version.

Tyler
On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <[hidden email]> wrote:

> All,
>
>   As part of TIKA-1512, I found that I can delete all of the contents,
> including the metadata, except for one hyperlink in two documents from
> govdocs1 and still get the proper behavior -- fail before fix, work after
> fix.
>
>   These documents are in the public domain.
>
>   Is it ok to include these modified documents in our test suite or should
> I avoid inclusion?
>
>   Happy to avoid inclusion for the sake of a quick release of 1.8 and then
> we have time to discuss/determine way ahead... unless the answer is obvious.
>
>          Best,
>
>                      Tim

RE: including refactored docs from govdocs1 in test suite

Allison, Timothy B.
Unfortunately, no.  MSOffice fixes the document when I do that.

-----Original Message-----
From: Tyler Palsulich [mailto:[hidden email]]
Sent: Monday, March 30, 2015 9:24 AM
To: [hidden email]
Subject: Re: including refactored docs from govdocs1 in test suite

Can you copy the hyperlink into a new doc and change the URL? I have no
idea about including the modified version.

Tyler

RE: including refactored docs from govdocs1 in test suite

Tyler Palsulich
Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <[hidden email]> wrote:

>
> Unfortunately, no.  MSOffice fixes the document when I do that.

RE: including refactored docs from govdocs1 in test suite

Allison, Timothy B.
I think this is an open question within Tika.  Some parsers prefer one thing over another.  And there are different levels of corruption.

In the two cases where govdocs1 docs might be useful in tests, the hyperlinks in .doc files do not appear to be "standard", but MSWord opens them without a problem.  In cases where an application can open and correctly process the content, I think we ought to try to extract content without throwing exceptions.

-----Original Message-----
From: Tyler Palsulich [mailto:[hidden email]]
Sent: Monday, March 30, 2015 9:39 AM
To: [hidden email]
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

Re: including refactored docs from govdocs1 in test suite

Konstantin Gribov
At the very least, a parser should not hang when processing a corrupted document.
IMHO, cases where parser code hangs should be considered blocker issues.
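
One way to keep a caller from waiting forever on a hung parser is to put a time limit around the parse call. The sketch below is a Python illustration under stated assumptions, not how Tika does it: `parse_with_timeout` and `slow_parser` are hypothetical names, and the approach only abandons the stuck worker thread rather than stopping it. Isolating the parser in a child process would be the more robust variant.

```python
import threading
import time


def parse_with_timeout(parse, data, timeout_s):
    """Run parse(data) in a worker thread and give up after timeout_s seconds."""
    outcome = {}

    def worker():
        try:
            outcome["text"] = parse(data)
        except Exception as exc:  # hand parser errors back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The worker is abandoned, not killed; it is a daemon thread, so it
        # will not keep the interpreter alive on exit.
        raise TimeoutError(f"parser still running after {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["text"]


def slow_parser(data):
    """Hypothetical stand-in for a parser that hangs on a corrupted file."""
    time.sleep(5)
    return data
```

Calling `parse_with_timeout(slow_parser, "x", 0.1)` raises `TimeoutError` after about a tenth of a second instead of blocking for the full five.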

Personally, I prefer the variant that returns a partial result plus some metadata
saying that document parsing failed somehow. But that can be hard to do.
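
The "partial result plus metadata" idea might look something like the following. This is a minimal Python sketch, assuming the input can be processed piece by piece: `ParseResult`, `parse_leniently`, `strict_chunk_parser`, and the `parse_failed` / `failure` keys are all hypothetical names chosen for illustration, not existing Tika metadata keys.

```python
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    text: str = ""
    metadata: dict = field(default_factory=dict)


def parse_leniently(chunks, parse_chunk):
    """Parse chunk by chunk; keep what succeeded and record the first failure."""
    result = ParseResult()
    parts = []
    for index, chunk in enumerate(chunks):
        try:
            parts.append(parse_chunk(chunk))
        except Exception as exc:
            # Flag the failure in metadata instead of discarding earlier output.
            result.metadata["parse_failed"] = True
            result.metadata["failure"] = f"chunk {index}: {type(exc).__name__}"
            break
    result.text = "".join(parts)
    return result


def strict_chunk_parser(chunk):
    """Hypothetical per-chunk parser that rejects a corrupted (None) chunk."""
    if chunk is None:
        raise ValueError("corrupted chunk")
    return chunk
```

The hard part mentioned above remains: many formats cannot be split into independent chunks, so a real parser may have little usable output at the point where it fails.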

--
Best regards,
Konstantin Gribov

Mon, 30 Mar 2015 at 16:52, Allison, Timothy B. <[hidden email]> wrote:

> I think this is an open question within Tika.  Some parsers prefer one
> thing over another.  And there are different levels of corruption.
>
> In the two cases where govdocs1 docs might be useful in tests, the
> hyperlinks in .doc files do not appear to be "standard", but  MSWord opens
> them without a problem.  In cases where an application can open and
> correctly process the content, I think we ought to try to extract content
> without throwing exceptions.

Re: including refactored docs from govdocs1 in test suite

Mattmann, Chris A (3010)
In reply to this post by Allison, Timothy B.
+1 to including the modified docs.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Allison>, "Timothy B." <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Monday, March 30, 2015 at 6:51 AM
To: "[hidden email]" <[hidden email]>
Subject: RE: including refactored docs from govdocs1 in test suite

>I think this is an open question within Tika.  Some parsers prefer one
>thing over another.  And there are different levels of corruption.
>
>In the two cases where govdocs1 docs might be useful in tests, the
>hyperlinks in .doc files do not appear to be "standard", but  MSWord
>opens them without a problem.  In cases where an application can open and
>correctly process the content, I think we ought to try to extract content
>without throwing exceptions.