[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736461#comment-16736461 ]

ASF GitHub Bot commented on TIKA-1841:
--------------------------------------

dstevenson commented on issue #86: fix for TIKA-1841 contributed by zetisam
URL: https://github.com/apache/tika/pull/86#issuecomment-452113630
 
 
   Is there a reason this is on hold? The current output from XSLF extraction is very hard to parse into anything useful, because it is a flat structure that may or may not contain a notes block.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>            Priority: Major
>
> This issue is loosely related to TIKA-1840.
> I've noticed that the XML output structure for PowerPoint PPT and PPTX files is different.
> The structure for PPTX appears to be as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of each slide.
> For PPT the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes"></div> //list of all notes at the end
> {code}
> In my application, I'm using XPath to get the desired information. Because the XML structure differs, my XPath queries have to distinguish between PPT (old) and PPTX (new) files. It would be nice for Tika to return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes at the end (as this is also omitted for PPTX).
> On the one hand, this generalizes PPT(X) handling; on the other, it can break existing (external) functionality that relies on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm willing to donate my time.
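
Until the two parsers agree, one possible client-side workaround is to regroup the flat PPTX output into the nested layout proposed above, so that a single set of XPath queries works for both formats. The following is only a minimal sketch, not Tika's implementation and not the fix from the pull request. It assumes that every slide starts with a `slide-content` div, as in the PPTX listing above, and that the input has no XML namespace (real Tika output is XHTML, so tag names may need a namespace prefix). The function name `group_flat_slides` and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def group_flat_slides(body):
    """Wrap a flat run of PPTX-style divs into <div class="slide"> parents.

    Assumption: a div with class "slide-content" marks the start of a slide.
    """
    slide_show = ET.Element("div", {"class": "slideShow"})
    current = None
    for child in list(body):
        # Start a new slide wrapper at each slide-content div
        # (or at the first child, if the input starts with something else).
        if child.get("class") == "slide-content" or current is None:
            current = ET.SubElement(slide_show, "div", {"class": "slide"})
        current.append(child)
    return slide_show


# Hypothetical flat input, modelled on the PPTX structure in the issue.
FLAT = """<body>
<div class="slide-content">Slide 1</div>
<div class="slide-master-content" />
<div class="slide-notes">Note 1</div>
<div class="slide-content">Slide 2</div>
<div class="slide-master-content" />
</body>"""

nested = group_flat_slides(ET.fromstring(FLAT))

# The same per-slide queries now work as they would on the nested layout.
slides = nested.findall("./div[@class='slide']")
notes = [s.findtext("./div[@class='slide-notes']") for s in slides]
```

With the sample input, `slides` holds two slide wrappers and `notes` has one entry per slide, with `None` where a slide has no notes block. Regrouping on the client keeps existing consumers of the flat output working, at the cost of depending on the ordering assumption stated above.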



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)