[jira] Created: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents


JIRA jira@apache.org
ParsingReader does not parse metadata for larger MS Office documents
--------------------------------------------------------------------

                 Key: TIKA-262
                 URL: https://issues.apache.org/jira/browse/TIKA-262
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
            Reporter: Daan de Wit


The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daan de Wit updated TIKA-262:
-----------------------------

    Attachment: lipsum.doc

Word document to reproduce the issue.

> ParsingReader does not parse metadata for larger MS Office documents
> --------------------------------------------------------------------
>
>                 Key: TIKA-262
>                 URL: https://issues.apache.org/jira/browse/TIKA-262
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: Daan de Wit
>         Attachments: lipsum.doc
>
>
> The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.


[jira] Updated: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daan de Wit updated TIKA-262:
-----------------------------

    Attachment: tika-0.3_large-ms-office-metadata.patch

Test case.

> ParsingReader does not parse metadata for larger MS Office documents
> --------------------------------------------------------------------
>
>                 Key: TIKA-262
>                 URL: https://issues.apache.org/jira/browse/TIKA-262
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: Daan de Wit
>         Attachments: lipsum.doc, tika-0.3_large-ms-office-metadata.patch
>
>
> The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.


[jira] Updated: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daan de Wit updated TIKA-262:
-----------------------------

    Attachment: OfficeParser.java.patch

It seems that Word reorders the entries so that the content entry comes before the summary information entry in larger documents. Attached is a naive fix to OfficeParser.java that handles this.
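
For illustration only, here is a minimal, self-contained sketch of the ordering problem and one possible way around it. It is not the attached patch and does not use the Tika or POI APIs: the class name, the entry names, and the map-based "container" are all assumptions made for the example. A single pass in container order emits content before metadata when the content entry comes first; handling the summary entry up front avoids that, and the null check covers containers that have no summary entry at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser that walks the entries of a compound document.
public class EntryOrderSketch {

    // Illustrative entry name (an assumption, not taken from the patch).
    public static final String SUMMARY = "SummaryInformation";

    /** Single pass in container order: content may be emitted before metadata is known. */
    public static List<String> singlePass(Map<String, String> entries) {
        List<String> events = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            String kind = SUMMARY.equals(e.getKey()) ? "metadata:" : "content:";
            events.add(kind + e.getValue());
        }
        return events;
    }

    /** Two passes: the summary entry (if any) is handled first, then everything else. */
    public static List<String> summaryFirst(Map<String, String> entries) {
        List<String> events = new ArrayList<>();
        String summary = entries.get(SUMMARY);
        if (summary != null) { // the summary entry may be absent
            events.add("metadata:" + summary);
        }
        for (Map.Entry<String, String> e : entries.entrySet()) {
            if (!SUMMARY.equals(e.getKey())) {
                events.add("content:" + e.getValue());
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Simulates a larger document in which the content entry comes first.
        Map<String, String> large = new LinkedHashMap<>();
        large.put("WordDocument", "body text");
        large.put(SUMMARY, "title=Lipsum");

        System.out.println(singlePass(large));   // [content:body text, metadata:title=Lipsum]
        System.out.println(summaryFirst(large)); // [metadata:title=Lipsum, content:body text]
    }
}
```

The two-pass approach trades a second lookup for a guarantee that metadata events always precede content events, regardless of how the writing application ordered the entries.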

> ParsingReader does not parse metadata for larger MS Office documents
> --------------------------------------------------------------------
>
>                 Key: TIKA-262
>                 URL: https://issues.apache.org/jira/browse/TIKA-262
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: Daan de Wit
>         Attachments: lipsum.doc, OfficeParser.java.patch, tika-0.3_large-ms-office-metadata.patch
>
>
> The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.


[jira] Updated: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daan de Wit updated TIKA-262:
-----------------------------

    Attachment: OfficeParser.java.patch

Cleaned-up patch; the previous file contained code from another patch.

> ParsingReader does not parse metadata for larger MS Office documents
> --------------------------------------------------------------------
>
>                 Key: TIKA-262
>                 URL: https://issues.apache.org/jira/browse/TIKA-262
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: Daan de Wit
>         Attachments: lipsum.doc, OfficeParser.java.patch, OfficeParser.java.patch, tika-0.3_large-ms-office-metadata.patch
>
>
> The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.


[jira] Updated: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daan de Wit updated TIKA-262:
-----------------------------

    Attachment: OfficeParser.java.patch

New patch: it handles the case where summary entries do not exist. All tests pass now.

> ParsingReader does not parse metadata for larger MS Office documents
> --------------------------------------------------------------------
>
>                 Key: TIKA-262
>                 URL: https://issues.apache.org/jira/browse/TIKA-262
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: Daan de Wit
>         Attachments: lipsum.doc, OfficeParser.java.patch, OfficeParser.java.patch, OfficeParser.java.patch, tika-0.3_large-ms-office-metadata.patch
>
>
> The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.


[jira] Resolved: (TIKA-262) ParsingReader does not parse metadata for larger MS Office documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-262.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Good stuff, thanks!

I committed a slightly modified version of the patch (smaller methods inlined, indentation changed to spaces) in revision 795266.

> ParsingReader does not parse metadata for larger MS Office documents
> --------------------------------------------------------------------
>
>                 Key: TIKA-262
>                 URL: https://issues.apache.org/jira/browse/TIKA-262
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: Daan de Wit
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>         Attachments: lipsum.doc, OfficeParser.java.patch, OfficeParser.java.patch, OfficeParser.java.patch, tika-0.3_large-ms-office-metadata.patch
>
>
> The ParsingReader should cause the metadata to be extracted before anything is read from the reader. This does not happen for certain MS Office files; it seems to be related to the size of the document.
