[jira] [Commented] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868023#comment-15868023 ]

Mike Rodent commented on TIKA-2265:
-----------------------------------

I just had a look at the file in question, i.e. the content.xml you get when you unzip the .docx.

As I understand it, MS Word appears to number footnotes on the fly when they are set to "per-page" numbering, which makes sense: change the pagination and the footnote numbering changes with it.

The trouble is that Tika has no way of working out where the soft page breaks occur.

The .odt file I used when developing my ODF "patch" class (TIKA-2264) turns out not to have "per page" numbering... but I just changed this and extracted the content.xml.  In this case it appears that the "text:note-citation" value only restarts (i.e. to value "1") in the case of a hard break.  So it isn't really "per page" numbering in the true sense in LibreOffice Writer.
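
To make the observation above concrete, here is a minimal sketch of how the "text:note-citation" labels can be listed from an ODF content.xml. It is not the TIKA-2264 patch; the function name, the inline sample document, and its "ftn…" ids are hypothetical, and only the ODF text namespace and element names are taken from the format itself. In the sample, the third note is imagined to follow a hard page break, so its stored label restarts at "1".

```python
import xml.etree.ElementTree as ET

# ODF "text" namespace, as declared in content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical, heavily trimmed content.xml used only for illustration.
ODT_SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:text>
    <text:p>First<text:note text:id="ftn1" text:note-class="footnote">
      <text:note-citation>1</text:note-citation>
      <text:note-body><text:p>Note A</text:p></text:note-body>
    </text:note></text:p>
    <text:p>Second<text:note text:id="ftn2" text:note-class="footnote">
      <text:note-citation>2</text:note-citation>
      <text:note-body><text:p>Note B</text:p></text:note-body>
    </text:note></text:p>
    <text:p>After a hard break<text:note text:id="ftn3" text:note-class="footnote">
      <text:note-citation>1</text:note-citation>
      <text:note-body><text:p>Note C</text:p></text:note-body>
    </text:note></text:p>
  </office:text></office:body>
</office:document-content>"""


def note_citations(content_xml):
    """Return (note-class, citation label) pairs in document order."""
    root = ET.fromstring(content_xml)
    pairs = []
    for note in root.iter(f"{{{TEXT_NS}}}note"):
        note_class = note.get(f"{{{TEXT_NS}}}note-class", "")
        citation = note.find(f"{{{TEXT_NS}}}note-citation")
        label = citation.text if citation is not None and citation.text else ""
        pairs.append((note_class, label))
    return pairs


for note_class, label in note_citations(ODT_SAMPLE):
    print(note_class, label)
```

For a real file, the same function could be fed the content.xml entry read from the .odt with the standard zipfile module.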

Anyway, I suspect it may not be possible to do anything about the "anomalous" per-page footnote numbering with .docx files...
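
As a rough illustration of what can still be recovered without pagination, the sketch below numbers footnote references by their order in word/document.xml instead of using the stored w:id attribute. This is an assumption-laden example, not Tika's implementation: the function name and the inline sample are hypothetical, and the stored ids (2 and 3) were chosen only to mirror the off-by-one reported in this issue. Sequential counting would match continuous numbering, but it cannot reproduce per-page restarts, since those depend on where the soft page breaks fall.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily trimmed word/document.xml used only for illustration.
DOCX_SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>First claim</w:t></w:r><w:r><w:footnoteReference w:id="2"/></w:r></w:p>
    <w:p><w:r><w:t>Second claim</w:t></w:r><w:r><w:footnoteReference w:id="3"/></w:r></w:p>
  </w:body>
</w:document>"""


def sequential_footnote_numbers(document_xml):
    """Pair each stored w:id with a 1-based number assigned in document order."""
    root = ET.fromstring(document_xml)
    refs = root.iter(f"{{{W_NS}}}footnoteReference")
    return [(ref.get(f"{{{W_NS}}}id"), n) for n, ref in enumerate(refs, start=1)]


for stored_id, display_number in sequential_footnote_numbers(DOCX_SAMPLE):
    print(f"stored w:id={stored_id} -> sequential number {display_number}")
```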

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-2265
>                 URL: https://issues.apache.org/jira/browse/TIKA-2265
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: N/A
>            Reporter: Mike Rodent
>            Assignee: Tim Allison
>            Priority: Minor
>              Labels: newbie
>         Attachments: test.docx, test shorter.docx
>
>
> It seems that a footnote numbered "1" in the real document is output by Tika.parseToString() as "2", both in the footnote reference and in the corresponding footnote body text; real footnote "2" becomes "3", "3" becomes "4", etc.  I have not yet looked at the source code, but I can't imagine it would be difficult to correct this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)