[jira] Created: (TIKA-244) Missing Header/Footer text for Word'97 documents

[jira] Created: (TIKA-244) Missing Header/Footer text for Word'97 documents

JIRA jira@apache.org
Missing Header/Footer text for Word'97 documents
------------------------------------------------

                 Key: TIKA-244
                 URL: https://issues.apache.org/jira/browse/TIKA-244
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
            Reporter: Maxim Valyanskiy
         Attachments: tika-patch

Tika output lacks header/footer text for Word'97 documents. This patch fixes the problem:

diff -u -r apache-tika-0.3/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java apache-tika-0.3-modified/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
--- apache-tika-0.3/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java 2009-02-14 03:07:51.000000000 +0300
+++ apache-tika-0.3-modified/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java 2009-06-09 13:24:56.000000000 +0400
@@ -75,9 +75,14 @@
             } else if ("WordDocument".equals(name)) {
                 setType(metadata, "application/msword");
                 WordExtractor extractor = new WordExtractor(filesystem);
+
+                xhtml.element("p", extractor.getHeaderText());
+
                 for (String paragraph : extractor.getParagraphText()) {
                     xhtml.element("p", paragraph);
                 }
+
+                xhtml.element("p", extractor.getFooterText());
             } else if ("PowerPoint Document".equals(name)) {
                 setType(metadata, "application/vnd.ms-powerpoint");
                 PowerPointExtractor extractor =


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-244) Missing Header/Footer text for Word'97 documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Valyanskiy updated TIKA-244:
----------------------------------

    Attachment: tika-patch

[jira] Resolved: (TIKA-244) Missing Header/Footer text for Word'97 documents

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-244.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.4
         Assignee: Jukka Zitting

Thanks! Patch applied in revision 788595.

I added <div class="header"/> and <div class="footer"/> wrappers around the header and footer texts, and modified the code to output each of those sections only when its text is non-empty.
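
For illustration, the behaviour described above might look roughly like the following. This is a minimal sketch, not the code committed in revision 788595: the class HeaderFooterSketch and the helpers appendSection, escape and render are hypothetical names, a StringBuilder stands in for the XHTML handler used in OfficeParser, and treating whitespace-only text as empty is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use Tika or POI classes.
class HeaderFooterSketch {

    // Emits <div class="..."><p>text</p></div>, or nothing at all when the
    // text is null or contains only whitespace (assumed meaning of "empty").
    static void appendSection(StringBuilder out, String cssClass, String text) {
        if (text == null || text.trim().isEmpty()) {
            return;
        }
        out.append("<div class=\"").append(cssClass).append("\">")
           .append("<p>").append(escape(text)).append("</p>")
           .append("</div>");
    }

    // Minimal XML escaping so the sketch produces well-formed markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Same order as in the patch: header, body paragraphs, footer.
    static String render(String header, List<String> paragraphs, String footer) {
        StringBuilder out = new StringBuilder();
        appendSection(out, "header", header);
        for (String paragraph : paragraphs) {
            out.append("<p>").append(escape(paragraph)).append("</p>");
        }
        appendSection(out, "footer", footer);
        return out.toString();
    }

    public static void main(String[] args) {
        // The footer is empty here, so no footer div is written.
        System.out.println(render("Page header", Arrays.asList("Body text"), ""));
    }
}
```

Compared with the original patch, a document without a header or footer produces no extra empty <p> element, and consumers can tell header and footer text apart from body text by the class attribute.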
