MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

Christian Reuschling-3

Normally, parsers reliably trigger the ContentHandler.startDocument() method in their parse(...
ContentHandler ...) method. This is also true in the case of an error, which normally
throws an Exception.

We wrote and maintain an open source crawler library (leech crawler) based on Tika, in which we work
with special ContentHandlers that deal with the recursive crawling issues. To detect that an error
occurred during the crawl, we need to see an Exception. On the other hand, when there is no error,
we need to detect that an entity was crawled (to count the crawled items, etc.). To do this, we
implemented the startDocument() method inside our ContentHandler decorators.
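
As an illustration of the counting approach described above, here is a minimal sketch of such a decorator. It is not the leech crawler's actual code: the class names CountingContentHandler and CrawlCounterDemo are made up for this example, and it uses only the SAX classes shipped with the JDK (XMLFilterImpl forwards every event to the wrapped handler).

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: counts crawled entities by observing startDocument().
class CountingContentHandler extends XMLFilterImpl {
    private int startedDocuments;

    CountingContentHandler(ContentHandler delegate) {
        // XMLFilterImpl forwards all ContentHandler events to this delegate.
        setContentHandler(delegate);
    }

    @Override
    public void startDocument() throws SAXException {
        startedDocuments++;      // one more entity was seen
        super.startDocument();   // forward the event to the wrapped handler
    }

    int getStartedDocuments() {
        return startedDocuments;
    }
}

class CrawlCounterDemo {
    // Simulates a parser that reports one empty document and returns the count.
    static int countEmptyDocument() {
        CountingContentHandler handler = new CountingContentHandler(new DefaultHandler());
        try {
            handler.startDocument();
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getStartedDocuments();
    }

    public static void main(String[] args) {
        System.out.println("documents seen: " + countEmptyDocument());
    }
}
```

A parser that returns without calling startDocument() leaves this counter untouched, which is exactly why the entity goes uncounted.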


This works like a charm, but inside MP4Parser there are these lines of code:


Lines 146-154, parse() method:

        MovieBox moov = getOrNull(isoFile, MovieBox.class);
        if (moov == null) {
           // Bail out
           return;
        }


        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        ......
        ......


There, when the MP4 file apparently has no content, the parse method returns early at the
'Bail out' comment and, at least from our side, silently.

I don't know whether this is a problem in general (Tika also has plenty of ContentHandler
decorators), but from our point of view Tika signals empty content by invoking
xhtml.startDocument() and xhtml.endDocument() with nothing in between. If this moov == null
situation is meant to be an error, an exception should be thrown.
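
To make the distinction concrete, the following sketch shows how a handler could tell an empty document (start and end with nothing in between) from a parse call that never touched the handler at all. The class DocumentShapeHandler is hypothetical, and treating only non-whitespace characters as content is a simplifying assumption of this example, not a Tika rule.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the "shape" of what a parser emitted.
class DocumentShapeHandler extends DefaultHandler {
    private boolean started;
    private boolean ended;
    private boolean sawText;

    @Override
    public void startDocument() {
        started = true;
    }

    @Override
    public void endDocument() {
        ended = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Assumption: only non-whitespace characters count as content.
        if (!new String(ch, start, length).trim().isEmpty()) {
            sawText = true;
        }
    }

    // startDocument() and endDocument() were called with no text in between.
    boolean isEmptyDocument() {
        return started && ended && !sawText;
    }

    // The parser returned without touching the handler (the 'Bail out' case).
    boolean wasSilentlySkipped() {
        return !started && !ended;
    }
}

class DocumentShapeDemo {
    public static void main(String[] args) {
        DocumentShapeHandler empty = new DocumentShapeHandler();
        empty.startDocument();
        empty.endDocument();
        System.out.println("empty document: " + empty.isEmptyDocument());

        DocumentShapeHandler skipped = new DocumentShapeHandler();
        System.out.println("silently skipped: " + skipped.wasSilentlySkipped());
    }
}
```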


If we are right (and we hope so, because we need this ;) ), we would like to suggest the
following modification:



        MovieBox moov = getOrNull(isoFile, MovieBox.class);
        if (moov == null) {
           // Bail out
           handler.startDocument();
           handler.endDocument();

           return;
        }



Looking forward to your opinions!

Chris



--
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:[hidden email]  http://www.dfki.uni-kl.de/~reuschling/

------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
                  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
______________________________________________________________________________

Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

Nick Burch-2
On Tue, 28 May 2013, Christian Reuschling wrote:

> This works like a charme, but inside MP4Parser, there exists these lines
> of code:
>
> Line 146-154, parse() method:
>
>        MovieBox moov = getOrNull(isoFile, MovieBox.class);
>        if (moov == null) {
>           // Bail out
>           return;
>        }

I think that's a sign that it isn't a valid mp4 file. Do you know where
the file came from / what kind of thing it has in it / can you share an
example?

Nick

Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

Christian Reuschling-3

You are right - I checked this, and it's a .mov file under a URL, but the length of the file is zero.

Nevertheless, wouldn't the appropriate way in this case be either an Exception (as in all other
parsers) or a Tika body of length zero, indicated at least by handler.endDocument()?
From the ContentHandler's point of view, there is nothing in between.

Chris




Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

Nick Burch-2
On Wed, 29 May 2013, Christian Reuschling wrote:
> Nevertheless, in this case an Exception (like in all other parsers) or a
> tika body with length zero, which is indicated at least by
> handler.endDocument() would be the appropriate way, isn't it? - From the
> ContentHandlers point of view, there is nothing in between.

I'm not sure whether we have a properly documented policy on what a parser
should do if it receives a file it can't handle. For files that are
invalid (e.g. corrupt), I believe an exception is the expected result. For the
case where the file seems valid but can't be handled by the parser, I'm not
sure.

Does anyone know if we have a policy on this, and/or where we should
document it?

Nick

Re: MP4Parser triggers .... something between an exception and endDocument() from the ContentHandler's point of view?

Christian Reuschling-3

It would be very interesting if somebody had a principled comment on this thread...




Re: MP4Parser triggers .... something between an exception and endDocument() from the ContentHandler's point of view?

Ray Gauss II-2
I think the Parser interface Javadoc would make sense as a place to document, but I don't know if there is an existing policy.

We'll certainly need to consider things like DelegatingParsers which may be using other parsers to do portions of the work.

Not the principled comment you were looking for, but my 2 cents.

Ray


Re: MP4Parser triggers .... something between an exception and endDocument() from the ContentHandler's point of view?

Nick Burch-2
On Fri, 7 Jun 2013, Ray Gauss II wrote:
> I think the Parser interface Javadoc would make sense as a place to
> document, but I don't know if there is an existing policy.

It might be helpful if some kind soul could take a few hours to review all
the existing parsers, and give a summary of what they seem to do on
invalid or empty documents (eg 5 throw a tika exception, 1 a sax
exception, 8 do start then end, 2 do nothing). I don't know what those
numbers will be, but that may help us work out if there's almost a
standard we can aim for or not!
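
A review like this could be partly automated. The sketch below is only an illustration under stated assumptions: ParseCall is a hypothetical stand-in for a parser's parse(...) method (it is not a Tika type), the category names are made up, and a real survey against Tika would want a separate category for TikaException, which is folded into OTHER_EXCEPTION here to keep the example free of dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a parser's parse(...) call; not a Tika interface.
interface ParseCall {
    void parse(InputStream in, ContentHandler handler) throws Exception;
}

// Made-up categories for what a parser does when given an empty input.
enum EmptyInputBehaviour { SAX_EXCEPTION, OTHER_EXCEPTION, START_THEN_END, START_ONLY, NOTHING }

class EmptyInputSurvey {
    // Feeds a zero-length stream to the parser and classifies what happened.
    static EmptyInputBehaviour classify(ParseCall parser) {
        final boolean[] seen = new boolean[2]; // [0] startDocument, [1] endDocument
        ContentHandler handler = new DefaultHandler() {
            @Override public void startDocument() { seen[0] = true; }
            @Override public void endDocument() { seen[1] = true; }
        };
        try {
            parser.parse(new ByteArrayInputStream(new byte[0]), handler);
        } catch (SAXException e) {
            return EmptyInputBehaviour.SAX_EXCEPTION;
        } catch (Exception e) {
            return EmptyInputBehaviour.OTHER_EXCEPTION;
        }
        if (seen[0] && seen[1]) {
            return EmptyInputBehaviour.START_THEN_END;
        }
        return seen[0] ? EmptyInputBehaviour.START_ONLY : EmptyInputBehaviour.NOTHING;
    }

    public static void main(String[] args) {
        // Toy parsers standing in for real ones.
        ParseCall silent = (in, h) -> { };
        ParseCall emptyDoc = (in, h) -> { h.startDocument(); h.endDocument(); };
        ParseCall failing = (in, h) -> { throw new SAXException("corrupt"); };
        System.out.println("silent:   " + classify(silent));
        System.out.println("emptyDoc: " + classify(emptyDoc));
        System.out.println("failing:  " + classify(failing));
    }
}
```

Running classify over each registered parser and tallying the results would give the kind of summary counts described above.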

Nick

Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

Nick Burch-2
On Wed, 29 May 2013, Nick Burch wrote:
> I'm not sure if we do have a properly documented policy on what a parser
> should do if it receives a file it can't handle. For ones that are
> invalid (eg corrupt), I believe an exception is the expected result. The
> case when the file seems valid, but can't be handled by the parser, not
> sure
>
> Does anyone know if we have a policy on this, and/or where we should document
> it?

I've made a start on documenting this on the wiki:
    https://wiki.apache.org/tika/ErrorsAndExceptions

However, there are a few bits we still need to sort out, such as this case
(parser thinks the file is valid, but just in a format it can't cope
with), or the case of an empty file (what we should/shouldn't output, eg
body tag). Hopefully someone can come up with a good suggestion...!

Nick