[jira] Created: (TIKA-214) Excel Parsing Issues

[jira] Created: (TIKA-214) Excel Parsing Issues

JIRA jira@apache.org
Excel Parsing Issues
--------------------

                 Key: TIKA-214
                 URL: https://issues.apache.org/jira/browse/TIKA-214
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
         Environment: Debian Etch / Debian Sid
            Reporter: David Weekly


I made a sample Excel 2003 file (which I will attempt to attach) and ran it through Tika 0.3. The output did not correctly identify the sheets, did not include text from the first column of the first sheet, and did not include any supplementary text (chart titles, legends, etc.).

Specific issues with parsing xls.xls (pardon the deliberately random names):
 - "charttabyodawg" (a chart sheet) is improperly used as the sheet label for data that is actually on Sheet1.
 - The data labeled "Sheet1" is actually the data on Sheet2.
 - Sheet2 is not mentioned.
 - The chart title for the chart on "charttabyodawg" is "WhamPuff"; it is not included in the output.
 - The chart title for the inline chart on Sheet1 is "fizzlepuff"; it is not included in the output.
 - The Y-axis label for the inline chart on Sheet1 is "whyaxis"; it is not included in the output.
 - The X-axis label for the inline chart on Sheet1 is "eksaxis"; it is not included in the output.
 - The data label in the inline chart on Sheet1 is "YottaPuff"; it is not included in the output.

Below is the output from Tika v0.3 when run on the attached XLS:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<div class="page">
<h1>charttabyodawg</h1>
<table>
<tbody>
<tr>    <td>1</td>
</tr>
<tr>    <td>2</td>
</tr>
<tr>    <td>300</td>    <td/>   <td/>   <td>1</td>
</tr>
<tr>    <td>baz</td>    <td/>   <td/>   <td>2</td>      <td/>   <td>9</td>
</tr>
<tr>    <td>yadda yam</td>      <td/>   <td/>   <td>300</td>    <td/>   <td>5</td>
</tr>
<tr>    <td/>   <td/>   <td/>   <td/>   <td/>   <td>16</td>
</tr>
</tbody>
</table>
</div>
<div class="page">
<h1>Sheet1</h1>
<table>
<tbody>
<tr>    <td/>
</tr>
<tr>    <td/>
</tr>
<tr>    <td/>
</tr>
<tr>    <td/>
</tr>
<tr>    <td/>   <td/>   <td>dingdong</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
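
To make the sheet mislabeling easier to see, the XHTML above can be reduced to a list of sheet headings and their rows. The following is a minimal sketch that uses only the Python standard library. The function name summarize_pages and the shortened SAMPLE string are illustrative assumptions, not part of Tika or of this report; SAMPLE keeps only a few rows of the output above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML output shown in the report.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Shortened copy of the reported output (hypothetical test fixture).
SAMPLE = """\
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title/></head>
<body>
<div class="page">
<h1>charttabyodawg</h1>
<table><tbody>
<tr><td>1</td></tr>
<tr><td>300</td><td/><td/><td>1</td></tr>
</tbody></table>
</div>
<div class="page">
<h1>Sheet1</h1>
<table><tbody>
<tr><td/></tr>
<tr><td/><td/><td>dingdong</td></tr>
</tbody></table>
</div>
</body>
</html>
"""


def summarize_pages(xhtml):
    """Return [(heading, rows)] for each <div class="page"> in the output.

    Each row is a list of cell strings; empty <td/> cells become "".
    """
    root = ET.fromstring(xhtml.encode("utf-8"))
    pages = []
    for div in root.iterfind(".//x:div[@class='page']", XHTML_NS):
        heading = div.findtext("x:h1", default="", namespaces=XHTML_NS)
        rows = [
            [(td.text or "").strip() for td in tr.iterfind("x:td", XHTML_NS)]
            for tr in div.iterfind(".//x:tr", XHTML_NS)
        ]
        pages.append((heading, rows))
    return pages


for heading, rows in summarize_pages(SAMPLE):
    print(heading, rows)
```

Comparing the printed headings and rows against the workbook opened in Excel shows which sheet name was attached to which data.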

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-214) Excel Parsing Issues

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Weekly updated TIKA-214:
------------------------------

    Attachment: xls.xls

Attached is a sample Excel 2003 file with several unique keywords, useful for testing completeness of textual extraction.
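
A completeness check based on those keywords could look roughly like the sketch below. It is standard-library Python and independent of Tika: the extracted output is passed in as a string, and the function reports which expected keywords are absent. The keyword list is taken from the issue description; the names EXPECTED_KEYWORDS and missing_keywords are hypothetical, and the regex-based tag stripping is a crude assumption that is only adequate for simple output like this.

```python
import re

# Strings that the issue description says should appear in the extracted text.
EXPECTED_KEYWORDS = [
    "Sheet2",
    "WhamPuff",
    "fizzlepuff",
    "whyaxis",
    "eksaxis",
    "YottaPuff",
]


def missing_keywords(extracted, expected=EXPECTED_KEYWORDS):
    """Return the expected keywords that do not occur in the extracted output."""
    # Crude tag removal so that only text content is searched.
    text = re.sub(r"<[^>]+>", " ", extracted)
    return [kw for kw in expected if kw not in text]


# A fragment of the reported Tika 0.3 output, used here as sample input.
reported = (
    "<h1>charttabyodawg</h1><td>baz</td><td>yadda yam</td>"
    "<h1>Sheet1</h1><td>dingdong</td>"
)
print(missing_keywords(reported))
```

An empty result would mean every keyword was extracted; for the output reported in this issue, all six are expected to be listed as missing.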
