[jira] [Commented] (TIKA-2807) .docx text extract leaves out rich text content-control inside of a text box

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736040#comment-16736040 ]

Hudson commented on TIKA-2807:
------------------------------

FAILURE: Integrated in Jenkins build Tika-trunk #1616 (See [https://builds.apache.org/job/Tika-trunk/1616/])
TIKA-2807 -- extract sdt content from within textbox in docx (tallison: [https://github.com/apache/tika/commit/06cf66cef14863fee0111dddefaebaa051a40c72])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (add) tika-parsers/src/test/resources/test-documents/testWORD_sdtInTextBox.docx
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java


> .docx text extract leaves out rich text content-control inside of a text box
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-2807
>                 URL: https://issues.apache.org/jira/browse/TIKA-2807
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Claudia Mickiewicz
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 2.0.0, 1.21
>
>         Attachments: test-document.docx
>
>
> When parsing a Microsoft Word .docx, a Rich Text Content Control nested inside a Text Box remains unextracted.
> I have attached a .docx file that can be used for testing.
>
> "_rich-text-content-control_inside-text-box_" remains unextracted, while "rich-text-content-control " and "_simple text_" are extracted without any problem.
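
To check independently of Tika whether a given .docx contains the structure described above, the following is a minimal sketch using only the JDK. It is not the fix from the commit and does not use Tika or POI. The class and method names (SdtInTextBoxFinder, sdtTextInsideTextBoxes, hasAncestor, collectText) are hypothetical. It assumes the main part is stored at word/document.xml and that text box bodies appear as w:txbxContent elements in the WordprocessingML main namespace. Word often writes a text box twice (a drawing plus a fallback), so the same content control may be reported more than once.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the text of content controls (w:sdt) nested in text boxes.
class SdtInTextBoxFinder {
    // WordprocessingML main namespace (the usual "w:" prefix).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:sdt element that has a w:txbxContent ancestor. */
    static List<String> sdtTextInsideTextBoxes(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            List<String> found = new ArrayList<>();
            NodeList sdts = doc.getElementsByTagNameNS(W_NS, "sdt");
            for (int i = 0; i < sdts.getLength(); i++) {
                Element sdt = (Element) sdts.item(i);
                if (hasAncestor(sdt, "txbxContent")) {
                    found.add(collectText(sdt));
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    /** True if any ancestor of node is a w:<localName> element. */
    static boolean hasAncestor(Node node, String localName) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (W_NS.equals(p.getNamespaceURI()) && localName.equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    /** Concatenates the contents of all w:t (text run) descendants of root. */
    static String collectText(Element root) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = root.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SdtInTextBoxFinder <file.docx>");
            return;
        }
        // A .docx is a ZIP archive; the body is assumed to be at word/document.xml.
        try (ZipFile zip = new ZipFile(args[0])) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("word/document.xml not found");
                return;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                for (String text : sdtTextInsideTextBoxes(xml)) {
                    System.out.println(text);
                }
            }
        }
    }
}
```

Any string printed by this sketch but missing from the Tika output for the same file would point to the behavior reported here.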



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)