[jira] [Commented] (TIKA-2730) parseToString fails for a simple mp3

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621008#comment-16621008 ]

Hudson commented on TIKA-2730:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #94 (See [https://builds.apache.org/job/tika-branch-1x/94/])
TIKA-2730 -- allow last frame to be truncated w/o throwing an EOF (tallison: [https://github.com/apache/tika/commit/80cfd6d4a4270f8f3697c6dc083b3dedfc36c86a])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mp3/MpegStream.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/testMP3i18n_truncated.mp3
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java


> parseToString fails for a simple mp3
> ------------------------------------
>
>                 Key: TIKA-2730
>                 URL: https://issues.apache.org/jira/browse/TIKA-2730
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0, 1.20
>
>         Attachments: demo.mp3
>
>
> This is a regression from 1.18. I've attached the mp3 that fails. The exception I get is:
> {noformat}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@cefe6c6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.Tika.parseToString(Tika.java:527)
>     at com.company.TextExtractor.getText(TextExtractor.java:39)
>     Caused by:
>     java.io.EOFException: EOF: tried to skip 361 but could only skip 247
>         at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:166)
>         at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:204)
>         at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         ... 5 more{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
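
For readers following the trace above: the commit summary ("allow last frame to be truncated w/o throwing an EOF") suggests the skip logic now tolerates a short final frame. The following is only a minimal sketch of that idea, not the actual change to MpegStream.skipFrame. The class name LenientSkip and the boolean return convention are assumptions made for illustration; the byte counts are taken from the exception message in the report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: illustrates tolerating a truncated last frame.
final class LenientSkip {

    /**
     * Tries to consume exactly {@code toSkip} bytes from the stream.
     * Returns true if the whole frame was skipped, false if the stream
     * ended first (i.e. the final frame is truncated). No EOFException
     * is thrown for the truncated case.
     */
    static boolean skipFrame(InputStream in, long toSkip) throws IOException {
        byte[] buf = new byte[4096];
        long remaining = toSkip;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                return false; // end of stream reached before the frame ended
            }
            remaining -= n;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Same numbers as the report: 361 bytes wanted, only 247 available.
        System.out.println(skipFrame(new ByteArrayInputStream(new byte[247]), 361)); // false
        System.out.println(skipFrame(new ByteArrayInputStream(new byte[361]), 361)); // true
    }
}
```

A caller could treat a false return as "stop reading frames and keep whatever metadata has been collected so far" rather than failing the whole parse, which is the behaviour the reporter saw in 1.18.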

1.19.1?

Tim Allison
The mp3 regression is bad. In hindsight, the Tika-eval reports were fairly
clear on this but I did some self-hand-waving to excuse away the
numbers...I shouldn’t have.

I want to add some new reports to tika-eval so that this never happens
again.

How long should we wait for 1.19.1 or 1.20?

Best,

    Tim

On Wed, Sep 19, 2018 at 2:29 PM Hudson (JIRA) <[hidden email]> wrote:

> Hudson commented on TIKA-2730:
> ------------------------------
>
> SUCCESS: Integrated in Jenkins build tika-branch-1x #94 (See [https://builds.apache.org/job/tika-branch-1x/94/])
> TIKA-2730 -- allow last frame to be truncated w/o throwing an EOF (tallison: [https://github.com/apache/tika/commit/80cfd6d4a4270f8f3697c6dc083b3dedfc36c86a])

Re: 1.19.1?

Nick Burch-2
On Wed, 19 Sep 2018, Tim Allison wrote:
> The mp3 regression is bad. In hindsight, the Tika-eval reports were
> fairly clear on this but I did some self-hand-waving to excuse away the
> numbers...I shouldn’t have.
>
> I want to add some new reports to tika-eval so that this never happens
> again.
>
> How long should we wait for 1.19.1 or 1.20?

There's a POI xml bug on certain older platforms (POI tries too hard to
lock down the xml settings even if the xml parser doesn't do that...),
maybe worth trying to get a POI 4.0.1 out, then do a Tika 1.19.1 or 1.20
(depending on how many other bugs we spot in the POI wait!)

Nick

Re: 1.19.1?

Tim Allison
Y, and I think I duplicated that bug when I copied/pasted from POI to
Tika, so that's a good reminder to fix that in Tika asap as well as
potentially wait for POI 4.0.1.  Thank you!
On Wed, Sep 19, 2018 at 4:53 PM Nick Burch <[hidden email]> wrote:

> There's a POI xml bug on certain older platforms (POI tries too hard to
> lock down the xml settings even if the xml parser doesn't do that...),
> maybe worth trying to get a POI 4.0.1 out, then do a Tika 1.19.1 or 1.20
> (depending on how many other bugs we spot in the POI wait!)
>
> Nick

Re: 1.19.1?

Chris Mattmann
In reply to this post by Tim Allison
Let’s roll it...

From: Tim Allison <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Wednesday, September 19, 2018 at 12:14 PM
To: "[hidden email]" <[hidden email]>
Subject: 1.19.1?

 

The mp3 regression is bad. In hindsight, the Tika-eval reports were fairly
clear on this but I did some self-hand-waving to excuse away the
numbers...I shouldn’t have.

I want to add some new reports to tika-eval so that this never happens
again.

How long should we wait for 1.19.1 or 1.20?

Best,

    Tim

 


Re: 1.19.1?

Tim Allison
Nick,
  Aside from the problem with users and non-standard XML parsers, were
there any other show-stoppers in POI 4.0.0?  Is there a reason to wait
for POI 4.0.1?
On Fri, Sep 21, 2018 at 12:48 PM Chris Mattmann <[hidden email]> wrote:

>
> Let’s roll it….

Re: 1.19.1?

Nick Burch-2
On Mon, 24 Sep 2018, Tim Allison wrote:
> Aside from the problem with users and non-standard XML parsers, were
> there any other show-stoppers in POI 4.0.0?  Is there a reason to wait
> for POI 4.0.1?

I think, in terms of Tika affecting bugs, it was the xml parser stuff, and
commons compress missing from the pom.

Nick

Re: 1.19.1?

Tim Allison
Given the mp3 issue and some other items, let's go with 1.19.1 rc1
today or tomorrow?
On Mon, Sep 24, 2018 at 3:07 PM Nick Burch <[hidden email]> wrote:

> I think, in terms of Tika affecting bugs, it was the xml parser stuff, and
> commons compress missing from the pom.
>
> Nick

Re: 1.19.1?

Chris Mattmann
Sounds great!

From: Tim Allison <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Tuesday, September 25, 2018 at 9:40 AM
To: "[hidden email]" <[hidden email]>
Subject: Re: 1.19.1?

 

Given the mp3 issue and some other items, let's go with 1.19.1 rc1
today or tomorrow?
