[jira] [Commented] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866268#comment-15866268 ]

Tim Allison commented on TIKA-2265:
-----------------------------------

{noformat}
<w:footnote w:type="separator" w:id="0"><w:p w:rsidR="00D2605B" w:rsidRDefault="00D2605B"><w:r><w:separator/></w:r></w:p></w:footnote>

<w:footnote w:type="continuationSeparator" w:id="1"><w:p w:rsidR="00D2605B" w:rsidRDefault="00D2605B"><w:r><w:continuationSeparator/></w:r></w:p></w:footnote>

<w:footnote w:id="2">...actual footnote
{noformat}

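Yep, that's a problem we should fix.  We can't rely on the "w:id" attribute being equal to the footnote number: in this document the separator and continuationSeparator entries take ids 0 and 1, so the first actual footnote has id 2.  Looks like we have to calculate the number dynamically by skipping the separators(?)...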
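
The idea of skipping separators can be sketched roughly as follows. This is only an illustration, not the actual Tika or POI code: the class FootnoteNumbering and the method numberFootnotes are hypothetical names, and the sketch assumes that footnotes.xml lists footnotes in document order, that numbering starts at 1, and that any w:type other than "normal" (or absent) marks a non-numbered entry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or POI.
class FootnoteNumbering {
    // WordprocessingML main namespace (the "w:" prefix in footnotes.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Maps each footnote's w:id to a display number, counting only
    // real footnotes and skipping separator-style entries.
    static Map<String, Integer> numberFootnotes(String footnotesXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(
                            footnotesXml.getBytes(StandardCharsets.UTF_8)));

            NodeList notes = doc.getElementsByTagNameNS(W_NS, "footnote");
            Map<String, Integer> numbers = new LinkedHashMap<>();
            int next = 1; // assumption: numbering starts at 1
            for (int i = 0; i < notes.getLength(); i++) {
                Element note = (Element) notes.item(i);
                // getAttributeNS returns "" when the attribute is absent.
                String type = note.getAttributeNS(W_NS, "type");
                if (!type.isEmpty() && !"normal".equals(type)) {
                    continue; // separator, continuationSeparator, etc.
                }
                numbers.put(note.getAttributeNS(W_NS, "id"), next++);
            }
            return numbers;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse footnotes XML", e);
        }
    }
}
```

For the three entries shown above (ids 0, 1 and 2, wrapped in a w:footnotes root that declares the w: namespace), this maps id "2" to footnote number 1, which is the number the reporter expected.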

Thank you for submitting an example document.

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-2265
>                 URL: https://issues.apache.org/jira/browse/TIKA-2265
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: N/A
>            Reporter: Mike Rodent
>            Assignee: Tim Allison
>            Priority: Minor
>              Labels: newbie
>         Attachments: test.docx, test shorter.docx
>
>
> It seems that a footnote numbered "1" in the real document is output by Tika.parseToString() as "2" in the footnote reference, and as "2" in the corresponding footnote body text.... real footnote "2" becomes "3", "3" becomes "4", etc.  Have not yet looked at source code ... I can't imagine it would be difficult to correct this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)