[jira] [Resolved] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2459.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.17

Thank you for opening this and sharing a test file.  We hadn't seen \u0014 and \u0015 together in the same character run before.  This is now fixed.
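
For readers unfamiliar with the characters mentioned above: in the binary .doc format, \u0013, \u0014 and \u0015 mark the beginning of a field, the separator between the field's instruction and its result, and the end of the field. The following is only a minimal sketch of how an extractor might scan text character by character so that several field marks in one run are all handled. It is not the change committed for this issue, and the class and method names (FieldMarkSketch, visibleText) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not Tika's or POI's actual code.
class FieldMarkSketch {
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Drops field instructions and field marks, keeps field results and plain text. */
    public static String visibleText(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still in its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost open field from instruction to result.
                if (!inInstruction.isEmpty()) {
                    inInstruction.pop();
                    inInstruction.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; a stray end mark is ignored.
                if (!inInstruction.isEmpty()) {
                    inInstruction.pop();
                }
            } else if (!inInstruction.contains(Boolean.TRUE)) {
                // Visible only if no enclosing field is in its instruction part.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String run = "Before \u0013 HYPERLINK \"http://example.com\" \u0014link text\u0015 after";
        System.out.println(visibleText(run)); // prints "Before link text after"
    }
}
```

Because the scan keeps its state per character rather than per run, a separator and an end mark that arrive in the same chunk of text are processed in order. A real parser would also need to carry that state across runs and paragraphs, which this sketch leaves out.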

> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
>                 Key: TIKA-2459
>                 URL: https://issues.apache.org/jira/browse/TIKA-2459
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: Windows and Linux
>            Reporter: Dustin Spicuzza
>             Fix For: 1.17
>
>         Attachments: foo2.doc
>
>
> I've got a document whose text can be extracted via org.apache.poi.hwpf.converter.WordToTextConverter but is not fully extracted by Tika. The 'Paragraph one' paragraph is present in POI's output and missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One:    Else
> Two:    Here
> Three:  Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)