Tika 1.15

Tika 1.15

Allison, Timothy B.
Unless there are blockers, I'll kick off the regression runs now.  It may take a few days to have results.

Given that we don't test the ObjectRecognition code in the regression runs, we can still add the updates there if desired before 1.15.

Cheers,

           Tim

Re: Tika 1.15

thammegowda
Tim,

Merged the InceptionV4 update PR to ObjectRecognitionParser.
Since this parser is not enabled by default, we are good there, right?


--
*Thamme Gowda *
Grad. Student at Univ. of Southern California
@thammegowda <https://twitter.com/thammegowda> | http://scf.usc.edu/~tnarayan/
~Sent via somebody's Webmail server


RE: Tika 1.15

Allison, Timothy B.
Should be, y.  Thank you!


RE: Tika 1.15

Allison, Timothy B.
Reports for 1.14 v 1.15-SNAPSHOT are available here:

https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14V1_15.zip 

I haven't looked at any of the reports yet.  

The "Report" phase took 8 hours; I think because of paging.  We've clearly hit the limits of running H2 in embedded mode.




Tika 1.15

Allison, Timothy B.
In reply to this post by Allison, Timothy B.
With the added TSD parser, I think I should rerun the regression testing.  Given that, I also fixed 2099, and we'll benefit from a rerun.

Anything else before I rerun the regression testing?

Any problems observed in first run?


Re: Tika 1.15

Mattmann, Chris A (3010)
I want to see if I can get in the VideoRecognition parser, and also the Sentiment one.

I hope to get it done in the next day or so. Thanks.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 


RE: Tika 1.15

Allison, Timothy B.
Oh.  Ok.  Will wait, then?


Re: Tika 1.15

Mattmann, Chris A (3010)
Thank you!


RE: Tika 1.15

Allison, Timothy B.
I finally had a chance to look through the results of the first regression run.

I made a few trivial changes to our parsers and to tika-eval.

We appear to have many more exceptions in files parsed by our CompressorParser, but this is a reporting artifact, not a real regression: the exception is now raised on the container file rather than on an attachment, and tika-eval wasn't matching A and B correctly.

There is a regression that's been fixed in PDFBox trunk (PDFBOX-3717), but I don't see that as a blocker.

We have new exceptions in the new parsers (EMF, WMF, .xlsb, WordPerfect), but that's because we're actually parsing those now. :)

All else looks to be in decent shape.

Chris and Team and All,
  Let me know when you're ready for me to kick off the next regression run.

          Cheers,

                  Tim
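
The CompressorParser artifact described in this message can be illustrated with a small sketch. This is not tika-eval's actual matching logic; the record layout (container file, embedded path) and the sample file names are assumptions invented for illustration. It shows how an exact-key comparison reports a "new" exception in B when the same failure has simply moved from the attachment to the container, while a per-container comparison reports no change.

```python
from collections import Counter


def exceptions_per_container(records):
    """Count exceptions per container file, ignoring which embedded part raised them."""
    return Counter(container for container, _embedded in records)


# Invented sample records: (container file, embedded path).
# An empty embedded path means the container itself raised the exception.
run_a = [("archive.gz", "inner.doc")]  # run A: exception attributed to the attachment
run_b = [("archive.gz", "")]           # run B: same failure, attributed to the container

# Exact-key matching treats the two records as unrelated, so B appears to gain an exception.
naive_new_in_b = set(run_b) - set(run_a)

# Aggregating per container first shows that nothing actually changed.
# Counter subtraction keeps only positive counts.
by_container_new_in_b = exceptions_per_container(run_b) - exceptions_per_container(run_a)

print("naive:", naive_new_in_b)
print("per container:", dict(by_container_new_in_b))
```

Whether to aggregate by container or to remap keys between runs is a design choice; the sketch only demonstrates why the raw counts can mislead.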





Re: Tika 1.15

Chris Mattmann
Thanks Tim. I am going to try and get tika-dl added (if possible), and also try the
Sentiment Parser next. If I can get one or both of those (in the next day or so), then
I will give you the heads up to begin testing. Video recognition is in!






RE: Tika 1.15

Allison, Timothy B.
Sounds good.  W00t!


RE: Tika 1.15

Tyler Palsulich-2
How exactly did you "evaluate" the results? I opened the zip and looked at
a few of the sheets, but it's a bit daunting.

Any way we could dump JSON? That's a bit easier to build visualizations for.

Tyler


Re: Tika 1.15

Chris Mattmann
JSON + D3 = win





RE: Tika 1.15

Allison, Timothy B.
In reply to this post by Tyler Palsulich-2
Y.  It is daunting at this point, and please do help!

The key sheets I look at:

exceptions/exceptions_compared_by_mime_type.xlsx
exceptions/new_exceptions_in_B_by_mime.xlsx

mimes/mime_diffs_A_to_B.xlsx

attachments/attachment_diffs.xlsx

metadata/metadata_value_count_diffs.xlsx

I can dump JSON, but wouldn't it be easier for you to pull directly from the db?

My vision is to put a GUI on the db that would allow you to visualize the reports, see the data, and have links to the original (binary) files plus the extract files for both A and B (perhaps with a diff visualization).

Three cheers for d3.


-----Original Message-----
From: Tyler Bui-Palsulich [mailto:[hidden email]]
Sent: Monday, May 1, 2017 11:39 PM
To: [hidden email]
Subject: RE: Tika 1.15

How exactly did you "evaluate" the results? I opened the zip and looked at a few of the sheets, but it's a bit daunting.

Any way we could dump JSON? That's a bit easier to build visualizations for.

Tyler

On May 1, 2017 3:59 PM, "Allison, Timothy B." <[hidden email]> wrote:

> Sounds good.  W00t!
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[hidden email]]
> Sent: Monday, May 1, 2017 4:57 PM
> To: [hidden email]
> Subject: Re: Tika 1.15
>
> Thanks Tim. I am going to try and get tika-dl added (if possible), and
> also try the Sentiment Parser next. If I can get one or both of those
> (in the next day or so), then I will give you the heads up to begin testing.
> Video recognition is in!
>
>
>
>
>
> On 5/1/17, 12:42 PM, "Allison, Timothy B." <[hidden email]> wrote:
>
>     I finally had a chance to look through the results of the first
> regression run.
>
>     I made a few trivial changes to our parsers and to tika-eval.
>
>     We appear to have many more exceptions in files parsed by our
> CompressorParser, but this is because of reporting...not because of
> reality
> -- the exception is now coming in the container file, not an
> attachment...and tika-eval wasn't matching A and B correctly.
>
>     There is a regression that's been fixed in PDFBox trunk
> (PDFBOX-3717), but I don't see that as a blocker.
>
>     We have new exceptions in the new parsers, EMF, WMF, .xlsb,
> wordperfect, but that's because we're actually parsing those now. :)
>
>     All else looks to be in decent shape.
>
>     Chris and Team and All,
>       Let me know when you're ready for me to kick off the next
> regression run.
>
>               Cheers,
>
>                       Tim
>
>
>
>
>     -----Original Message-----
>     From: Mattmann, Chris A (3010) [mailto:[hidden email]]
>     Sent: Wednesday, April 26, 2017 12:48 PM
>     To: [hidden email]
>     Subject: Re: Tika 1.15
>
>     Thank you!
>
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>     Chris Mattmann, Ph.D.
>     Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development
> Offices
> (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     Office: 180-503E, Mailstop: 180-503
>     Email: [hidden email]
>     WWW:  http://sunset.usc.edu/~mattmann/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>     Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department University of
> Southern California, Los Angeles, CA 90089 USA
>     WWW: http://irds.usc.edu/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>
>
>     On 4/26/17, 9:35 AM, "Allison, Timothy B." <[hidden email]> wrote:
>
>         Oh.  Ok.  Will wait, then?
>
>         -----Original Message-----
>         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.
> nasa.gov]
>         Sent: Wednesday, April 26, 2017 11:38 AM
>         To: [hidden email]
>         Subject: Re: Tika 1.15
>
>         I want to see if I can get in the VideoRecognition parser, and
> also the Sentiment one.
>
>         I hope to get it done in the next day or so. Thanks.
>
>         ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>         Chris Mattmann, Ph.D.
>         Principal Data Scientist, Engineering Administrative Office
> (3010) Manager, NSF & Open Source Projects Formulation and Development
> Offices
> (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>         Office: 180-503E, Mailstop: 180-503
>         Email: [hidden email]
>         WWW:  http://sunset.usc.edu/~mattmann/
>         ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>         Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department University of
> Southern California, Los Angeles, CA 90089 USA
>         WWW: http://irds.usc.edu/
>         ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>
>
>         On 4/26/17, 7:54 AM, "Allison, Timothy B."
> <[hidden email]>
> wrote:
>
>             With the added TSD parser, I think I should rerun the
> regression testing.  Given that, I also fixed 2099, and we'll benefit
> from a rerun.
>
>             Anything else before I rerun the regression testing?
>
>             Any problems observed in first run?
>
>
>
>
>
>
>
>
>

RE: Tika 1.15

Allison, Timothy B.
In reply to this post by Tyler Palsulich-2
The other two critical files:

Content/common_token_comparisons_by_mime.xlsx
Content/content_diffs_ignore_exceptions.xlsx


Oh, and the key part, which is less than ideal, is that there has to be a human in the loop...which makes the need for visualizations even more critical.

For example:

1) We now have more exceptions in file type y.  Well, that's ok because we didn't have a parser for file type y before.  

2) We have fewer exceptions in file type x; that should be good, right?  Well, no, because now there are far fewer "common words" in x, which means that the parser became less restrictive and sloppier.  We now have more noise.

3) We now have more "common words" in file type x; that should be a sign of improvement, right?  Not necessarily, because:
        a) we failed to remove a few common HTML markup terms and our HTML parser/detection is failing, so we have a bunch more "span" and "body" words.  That's bad.  (We can fix this as we go forward)
        b) our parsers are repeating sections now.  Doh! (We can fix this with better statistics).
        c) our OCR is hallucinating common words because we're using a heavily dictionary-biased OCR system.  (unlikely, but possible)

The lists go on...
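
The cases above could be turned into a first-pass triage that only tells a human where to look. The sketch below is an assumption-laden illustration: the `MimeStats` fields, the `triage` function, and the flag wording are all hypothetical and do not correspond to tika-eval's actual tables or reports.

```python
from dataclasses import dataclass


@dataclass
class MimeStats:
    """Per-MIME-type counts for run A (old) and run B (new); hypothetical fields."""
    had_parser_in_a: bool
    exceptions_a: int
    exceptions_b: int
    common_tokens_a: int
    common_tokens_b: int


def triage(stats: MimeStats) -> str:
    """Map raw count changes to a hint for the human reviewer."""
    # Case 1: more exceptions only because the type is newly parsed.
    if not stats.had_parser_in_a and stats.exceptions_b > stats.exceptions_a:
        return "expected: new parser, so new exceptions"
    # Case 2: fewer exceptions but also fewer common words.
    if (stats.exceptions_b < stats.exceptions_a
            and stats.common_tokens_b < stats.common_tokens_a):
        return "review: fewer exceptions but fewer common words (possible noise)"
    # Case 3: more common words is not automatically an improvement.
    if stats.common_tokens_b > stats.common_tokens_a:
        return "review: more common words (markup leakage, repeats, or OCR?)"
    return "no flag"


print(triage(MimeStats(False, 0, 40, 0, 1200)))
print(triage(MimeStats(True, 50, 10, 9000, 4000)))
```

A triage like this does not remove the human from the loop; it just orders the review queue.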

In short, my original vision of nightly automated tests has had a run-in with reality and lost.  A human has to make sense of the output/db.

Dumping some reports to xlsx yields good data for the developer who wrote the code, but, I agree, the reports are largely incomprehensible to someone getting started.

So, please, help!




Re: Tika 1.15

Chris Mattmann
In reply to this post by Allison, Timothy B.
Team, check out Polar Insights, which my USC IRDS student Nithin did:

http://polar.usc.edu/html/polar-deep-insights/index.html#/config

Click Download, then Download (the 2 download buttons), then Save, then
click the Query Interface. Something like this?

All code is OSS on http://github.com/USCDataScience/polar-deep-insights/ 

Cheers,
Chris


On 5/2/17, 4:54 AM, "Allison, Timothy B." <[hidden email]> wrote:

    Y.  It is daunting at this point, and please do help!
   
    The key sheets I look at:
   
    exceptions/exceptions_compared_by_mime_type.xlsx
    exceptions/new_exceptions_in_B_by_mime.xlsx
   
    mimes/mime_diffs_A_to_B.xlsx
   
    attachments/attachment_diffs.xlsx
   
    metadata/metadata_value_count_diffs.xlsx
   
    I can dump json, but wouldn't it be easier for you to pull directly from the db?
   
    My vision is to put a gui on the db that would allow you to visualize the reports/see the data and have links to the original (binary) files plus the extract files for both A and B (perhaps with a diff visualization).
   
    Three cheers for d3.
   
   

Re: Tika 1.15

Tyler Palsulich-2
Thanks for the link. It looks like the UI is written with Angular and uses
Elastic + static JSON. See
https://github.com/USCDataScience/polar-deep-insights/wiki/Architecture.

I also like d3. In general, I think we are on the same page that the best
option is a web-based UI.

I see a few options to get data into the frontend:
1. Static JSON
2. JSON from a server (meaning the server runs queries (either built by the
client or the server))
3. Load a local DB (meaning the client runs queries)

From some quick searching, option 3 seems to have poor support. I could be
wrong.

Options 1 and 2 are clearly related. If we have a working application with
static JSON, changing it to use served JSON should be straightforward (from a
Java server, probably). Static JSON will be faster than live queries, but I
don't know how long the queries take. The polar project seems to hard-code
queries and provide an interface to manually enter more.

Static JSON seems easiest to get started. What do you think?
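
To make the static-JSON option concrete, here is a minimal sketch of dumping one query result to JSON. It uses Python's built-in `sqlite3` purely as a stand-in database; the thread does not describe the eval db's engine or schema, so the table `exceptions_by_mime`, its columns, and the counts are all made-up placeholders rather than real results.

```python
import json
import sqlite3


def rows_as_json(conn: sqlite3.Connection, query: str) -> str:
    """Run a query and serialize the rows as a JSON list of objects."""
    cur = conn.execute(query)
    columns = [desc[0] for desc in cur.description]
    return json.dumps([dict(zip(columns, row)) for row in cur.fetchall()], indent=2)


# Stand-in database with placeholder data (not real regression numbers).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exceptions_by_mime (mime TEXT, count_a INTEGER, count_b INTEGER)"
)
conn.executemany(
    "INSERT INTO exceptions_by_mime VALUES (?, ?, ?)",
    [("text/html", 3, 5), ("application/pdf", 12, 9)],
)

report = rows_as_json(
    conn,
    "SELECT mime, count_a, count_b FROM exceptions_by_mime ORDER BY mime",
)
print(report)  # a frontend could load this as a static file
```

Moving from option 1 to option 2 would then mostly mean returning the same JSON from a server endpoint instead of writing it to a file.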

Tyler

On May 2, 2017 6:57 AM, "Chris Mattmann" <[hidden email]> wrote:

> Team, check out Polar Insights, which my USC IRDS student NIthin did:
>
> http://polar.usc.edu/html/polar-deep-insights/index.html#/config
>
> Click Download, then Download (the 2 download buttons), then Save, then
> click the Query Interface. Something like this?
>
> All code is OSS on http://github.com/USCDataScience/polar-deep-insights/
>
> Cheers,
> Chris

RE: Tika 1.15

Allison, Timothy B.
Let's move the conversation to the existing and still open ticket: https://issues.apache.org/jira/browse/TIKA-1334

:)

I'm really excited about this!


-----Original Message-----
From: Tyler Bui-Palsulich [mailto:[hidden email]]
Sent: Tuesday, May 2, 2017 7:20 PM
To: [hidden email]
Subject: Re: Tika 1.15

Thanks for the link. It looks like the UI is written with Angular and uses Elastic + static JSON. See https://github.com/USCDataScience/polar-deep-insights/wiki/Architecture.

I also like d3. In general, I think we are on the same page the best option is a web based UI.

I see a few options to get data into the frontend:
1. Static JSON
2. JSON from a server (meaning the server runs queries (either built by the client or the server))
3. Load a local DB (meaning the client runs queries)

From some quick searching, 3 seems like it has poor support. I could be wrong.

1 and 2 are clearly related. If we have a working application with static JSON, changing it to use served JSON should be straightforward (from a Java server, probably). Static JSON will be faster than live queries, but I don't know how long the queries take. The polar project seems to hard code queries and provide an interface to manually enter more.

Static JSON seems easiest to get started. What do you think?

Tyler


RE: Tika 1.15

Allison, Timothy B.
I reran the eval with some updates, including rc1 of PDFBox 2.0.6, which is now integrated.

http://162.242.228.174/reports/reports_tika_20170515.tar.gz

I need to do some more digging on attachments -- the report hit its max limit.  The decrease in attachments in the few docs I reviewed is explained by a change in the default behavior of macro extraction -- in 1.14 we were extracting macros by default, but we aren't doing this in 1.15.  However, I want to look at more than the first x diffs because there may be other file formats further down the results that weren't included in the report.
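
As an illustration of why looking past the first x diffs matters, here is a minimal sketch that groups attachment-count changes by file type, so that a format appearing only far down the list is not hidden by a limit on reported rows. It is not tika-eval code: the function name, the input layout (file id mapped to MIME type and attachment count), and the sample data are all hypothetical.

```python
from collections import defaultdict


def attachment_diffs_by_type(run_a, run_b):
    """Group per-file attachment-count changes by MIME type.

    run_a and run_b are hypothetical dicts mapping a file id to a
    (mime_type, attachment_count) tuple. Only files present in both
    runs whose counts differ are reported.
    """
    by_type = defaultdict(list)
    for file_id, (mime, count_a) in run_a.items():
        if file_id not in run_b:
            continue  # file missing from the second run; nothing to compare
        count_b = run_b[file_id][1]
        if count_a != count_b:
            by_type[mime].append((file_id, count_a, count_b))
    return dict(by_type)


# Hypothetical sample data standing in for a 1.14 run and a 1.15 run.
run_114 = {
    "a.doc": ("application/msword", 3),
    "b.xls": ("application/vnd.ms-excel", 2),
    "c.pdf": ("application/pdf", 1),
    "d.zip": ("application/zip", 5),
}
run_115 = {
    "a.doc": ("application/msword", 1),
    "b.xls": ("application/vnd.ms-excel", 1),
    "c.pdf": ("application/pdf", 1),
    "d.zip": ("application/zip", 4),
}

diffs = attachment_diffs_by_type(run_114, run_115)
for mime, rows in sorted(diffs.items()):
    # One summary line per file type, however many rows each type has.
    print(mime, len(rows))
```

Summarizing per type first, and only then drilling into individual files, makes it easier to tell whether a drop is confined to formats affected by the macro change or also shows up elsewhere.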

I also want to look at the contents...haven't had a chance.


RE: Tika 1.15

Allison, Timothy B.
Full report on the diffs in attachment counts: http://162.242.228.174/reports/attachment_diffs_complete_20170516.xlsx

Still need to look through contents diffs.

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Tuesday, May 16, 2017 3:11 PM
To: [hidden email]
Subject: RE: Tika 1.15

I reran the eval with some updates, including rc1 of PDFBox 2.0.6, which is now integrated.

http://162.242.228.174/reports/reports_tika_20170515.tar.gz

I need to do some more digging on attachments -- the report hit its max limit.  The decrease in attachments in the few docs I reviewed is explained by a change in the default behavior of macro extraction -- in 1.14 we were extracting macros by default, but we aren't doing this in 1.15.  However, I want to look at more than the first x diffs because there may be other file formats further down the results that weren't included in the report.

I also want to look at the contents...haven't had a chance.
