[jira] Created: (TIKA-216) Zip bomb prevention


JIRA jira@apache.org
Zip bomb prevention
-------------------

                 Key: TIKA-216
                 URL: https://issues.apache.org/jira/browse/TIKA-216
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Jukka Zitting


It would be good to have a mechanism that automatically detects a "zip bomb", i.e. a compressed document that expands to an excessive amount of extracted text. The classic example is the 42.zip file, which is just 42 kB in size but expands to about 4 *petabytes* when all layers are fully uncompressed.

A simple preventive measure could be a Parser decorator that counts the number of input bytes and the output characters, and fails with a TikaException when the ratio exceeds some configurable limit.
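
A minimal sketch of such a ratio check, using only JDK classes and gzip input for illustration. RatioGuardSketch, ByteCounter, expand and gzip are hypothetical names rather than Tika APIs, and a plain IOException stands in for the TikaException mentioned above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; none of these names exist in Tika.
class RatioGuardSketch {

    /** Counts the bytes read from the wrapped (compressed) stream. */
    static class ByteCounter extends FilterInputStream {
        long count = 0;

        ByteCounter(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /**
     * Decompresses gzip data and returns the number of output bytes.
     * Fails as soon as the output exceeds maxRatio times the input consumed so far.
     */
    static long expand(byte[] compressed, long maxRatio) throws IOException {
        ByteCounter in = new ByteCounter(new ByteArrayInputStream(compressed));
        long out = 0;
        byte[] buf = new byte[8192];
        try (InputStream gz = new GZIPInputStream(in)) {
            int n;
            while ((n = gz.read(buf)) != -1) {
                out += n;
                if (out > in.count * maxRatio) {
                    throw new IOException("Suspected zip bomb: " + out
                            + " output bytes from " + in.count + " input bytes");
                }
            }
        }
        return out;
    }

    /** Helper that gzips a byte array, used here only to build sample input. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }
}
```

With a limit of 100, a megabyte of zeros compressed with gzip should be rejected partway through decompression, while a short text string passes through unchanged.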

As another preventive measure, the decorator could also keep track of the time (and perhaps even memory, if possible) it takes to process the input document. A TikaException would be thrown if processing time exceeds some configurable limit.
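
The time limit could be sketched roughly as follows, assuming the parse is wrapped in a Callable and run on a worker thread. TimeLimitSketch and runWithLimit are hypothetical names, and an IllegalStateException stands in for the TikaException:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not part of Tika.
class TimeLimitSketch {

    /** Runs the task on a worker thread and gives up after limitMillis. */
    static <T> T runWithLimit(Callable<T> task, long limitMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the task honours interruption.
            future.cancel(true);
            throw new IllegalStateException(
                    "Processing exceeded " + limitMillis + " ms", e);
        } catch (ExecutionException e) {
            // Unwrap failures raised by the task itself.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that the caller gets control back after the limit, but a task that ignores interruption would keep running in the background, so this is only a partial safeguard.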

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-216) Zip bomb prevention

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-216.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.4
         Assignee: Jukka Zitting

There is now a SecureContentHandler decorator class that implements a simple zip bomb prevention heuristic. By default the class throws an exception for any input document that produces over 100 output characters per input byte, a compression ratio that no normal document is expected to reach. The detection is only activated once a default threshold of 1M output characters has been reached. This threshold avoids false positives for otherwise normal documents that for some reason start with a sequence of highly compressible data.
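
To illustrate how the threshold and the ratio interact, here is a small standalone sketch. ThresholdedRatioCheck and isSuspicious are hypothetical names, and the logic is an assumption based on the description above, not a copy of the SecureContentHandler code:

```java
// Hypothetical sketch of the heuristic described above; not the Tika implementation.
class ThresholdedRatioCheck {
    private final long threshold; // output characters before the check is activated
    private final long maxRatio;  // allowed output characters per input byte

    ThresholdedRatioCheck(long threshold, long maxRatio) {
        this.threshold = threshold;
        this.maxRatio = maxRatio;
    }

    /** Uses the defaults mentioned above: 1M characters and a ratio of 100. */
    ThresholdedRatioCheck() {
        this(1_000_000L, 100L);
    }

    /** True only when output is past the threshold and the ratio limit is exceeded. */
    boolean isSuspicious(long inputBytes, long outputChars) {
        return outputChars > threshold && outputChars > inputBytes * maxRatio;
    }
}
```

With the defaults, 500,000 characters from 1,000 bytes is still accepted because the threshold has not been reached, whereas 2,000,000 characters from the same 1,000 bytes is flagged.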

The SecureContentHandler decorator class is used together with the CountingInputStream from Commons IO. See below for sample usage:

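    // Assumes stream, handler, parser and metadata are already in scope.
    // CountingInputStream is org.apache.commons.io.input.CountingInputStream.
    CountingInputStream count = new CountingInputStream(stream);
    SecureContentHandler secure = new SecureContentHandler(handler, count);
    try {
        parser.parse(count, secure, metadata);
    } catch (SAXException e) {
        // Rethrows as a TikaException if the secure handler aborted the parse
        secure.throwIfCauseOf(e);
        throw e;
    }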

I added this to the AutoDetectParser, so all clients that use AutoDetectParser or tools based on that are automatically protected against simple zip bombs.

[jira] Commented: (TIKA-216) Zip bomb prevention

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718184#action_12718184 ]

Jukka Zitting commented on TIKA-216:
------------------------------------

For the record, see a similar issue in the Aperture project:
http://sourceforge.net/tracker/?func=detail&aid=2786554&group_id=150969&atid=779500

[jira] Reopened: (TIKA-216) Zip bomb prevention

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting reopened TIKA-216:
--------------------------------


Reopening as the zip file in the Aperture issue is still causing problems for Tika.

[jira] Commented: (TIKA-216) Zip bomb prevention

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730185#action_12730185 ]

Chris A. Mattmann commented on TIKA-216:
----------------------------------------

Hey Jukka, Tika'ers:

Do you see this as a blocker to 0.4? I'd like to cut an RC in the next day or so, but this is still open, so I wanted to check with you and get your thoughts.

My vote is -1 for this being a blocker -- I think we can fix it in 0.5. Please let me know ASAP -- if I don't hear back in the next 48 hours I'm going to go ahead and push this to 0.5. If I do hear back and there is significant support that this can go to 0.5, then I will do so earlier and move on to the RC.

Cheers,
Chris


[jira] Resolved: (TIKA-216) Zip bomb prevention

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-216.
--------------------------------

    Resolution: Fixed

The basic zip bomb prevention mechanism is already included, so I'm resolving this as Fixed for the 0.4 release. The followup issue with droste.zip and similar specially crafted packages is filed as TIKA-259.
