Excel Parsing Issues With Tika 0.3

Excel Parsing Issues With Tika 0.3

David Weekly-3
Hello, Tika developers! I'm David Weekly, the founder of PBwiki. We're looking at using Tika to do text extraction for our internal search engine, to help users search content uploaded to their wiki. Right now we use xls2csv and are looking to move to a more sophisticated solution, particularly one that can extract useful strings from more parts of the document, such as legends, axis titles, and chart titles.

So I was a little sad when I ran a sample Excel 2003 file that I made (attached) through Tika 0.3 (which was quite easy to build - awesome work!): the output didn't correctly identify the sheets, did not include text from the first column of the first sheet, and did not include any supplementary text (e.g. titles for charts, legends, etc.).

So this is part "bug report" (the columns of the first sheet should definitely be included!) and part query as to whether or not there is a plan w/Tika to extract more than sheet & cell data from documents.

Specific issues with parsing xls.xls: (pardon the deliberately random names)
 - "charttabyodawg" (a chart sheet) is used as the heading for data that is actually on Sheet1.
 - The data shown under "Sheet1" is actually the data on Sheet2.
 - Sheet2 is not mentioned at all.
 - Chart title for chart on "charttabyodawg" is "WhamPuff" and is not included in the output.
 - Chart title for inline chart on Sheet1 is "fizzlepuff" and is not included in output.
 - Y-axis for inline chart on Sheet1 is "whyaxis" and is not included in output.
 - X-axis for inline chart on Sheet1 is "eksaxis" and is not included in output.
 - Label for data in inline chart on Sheet1 is "YottaPuff" and is not included in output.

Below is the output from Tika v0.3 when run on the attached XLS:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<div class="page">
<h1>charttabyodawg</h1>
<table>
<tbody>
<tr>    <td>1</td>
</tr>
<tr>    <td>2</td>
</tr>
<tr>    <td>300</td>    <td/>   <td/>   <td>1</td>
</tr>
<tr>    <td>baz</td>    <td/>   <td/>   <td>2</td>      <td/>   <td>9</td>
</tr>
<tr>    <td>yadda yam</td>      <td/>   <td/>   <td>300</td>    <td/>   <td>5</td>
</tr>
<tr>    <td/>   <td/>   <td/>   <td/>   <td/>   <td>16</td>
</tr>
</tbody>
</table>
</div>
<div class="page">
<h1>Sheet1</h1>
<table>
<tbody>
<tr>    <td/>
</tr>
<tr>    <td/>
</tr>
<tr>    <td/>
</tr>
<tr>    <td/>
</tr>
<tr>    <td/>   <td/>   <td>dingdong</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
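
The mismatch between the issue list and the output above can also be checked mechanically. The following is a minimal sketch in Python, assuming the XHTML layout shown above (one `div class="page"` per sheet, with the sheet name in `h1`). The `TIKA_OUTPUT` string is a condensed copy of that output (whitespace collapsed, the four empty Sheet1 rows omitted), and the helper names `summarize` and `missing_strings` are hypothetical, not part of Tika.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Condensed copy of the Tika 0.3 output quoted above (an assumption of this
# sketch: whitespace collapsed, empty Sheet1 rows omitted).
TIKA_OUTPUT = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title/></head>
<body>
<div class="page"><h1>charttabyodawg</h1>
<table><tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>300</td><td/><td/><td>1</td></tr>
<tr><td>baz</td><td/><td/><td>2</td><td/><td>9</td></tr>
<tr><td>yadda yam</td><td/><td/><td>300</td><td/><td>5</td></tr>
<tr><td/><td/><td/><td/><td/><td>16</td></tr>
</tbody></table></div>
<div class="page"><h1>Sheet1</h1>
<table><tbody>
<tr><td/><td/><td>dingdong</td></tr>
</tbody></table></div>
</body></html>"""

# Strings the report says should appear somewhere in the extracted text.
EXPECTED = ["charttabyodawg", "Sheet1", "Sheet2", "WhamPuff",
            "fizzlepuff", "whyaxis", "eksaxis", "YottaPuff"]


def summarize(xhtml):
    """Map each sheet heading to the list of its non-empty cell texts."""
    root = ET.fromstring(xhtml.encode("utf-8"))
    sheets = {}
    for page in root.iter(XHTML_NS + "div"):
        if page.get("class") != "page":
            continue
        name = page.findtext(XHTML_NS + "h1", default="")
        sheets[name] = [td.text.strip()
                        for td in page.iter(XHTML_NS + "td")
                        if td.text and td.text.strip()]
    return sheets


def missing_strings(sheets, expected):
    """Return the expected strings found neither as a heading nor as a cell."""
    found = set(sheets)
    for cells in sheets.values():
        found.update(cells)
    return [s for s in expected if s not in found]


if __name__ == "__main__":
    sheets = summarize(TIKA_OUTPUT)
    print("sheets reported:", list(sheets))
    print("missing:", missing_strings(sheets, EXPECTED))
```

Run against the condensed output, the sketch reports only two sheet headings and flags "Sheet2" and all five chart strings as missing, which matches the list of issues above.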


I hope this kind of message is seen as helpful and constructive and not redundant (I did check Jira for similar issues) or whiny or non-specific.

Yours,
 David Weekly

Attachment: xls.xls (20K)

Re: Excel Parsing Issues With Tika 0.3

Jukka Zitting
Hi,

On Sat, Mar 28, 2009 at 6:18 AM, David Weekly <[hidden email]> wrote:
> So this is part "bug report" (the columns of the first sheet should
> definitely be included!)

Agreed. Can you please file a Jira bug report for this? It looks
similar to some of the zero- vs. one-based index issues we faced when
upgrading to POI 3.5.
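
As a generic illustration of how a zero- vs. one-based index mix-up can drop a first column, here is a small Python sketch. It is not Tika or POI code, and it does not claim to be the actual cause of this report; the cell store `CELLS` and the function `render_rows` are hypothetical names invented for the example.

```python
# Hypothetical cell store keyed by zero-based (row, column).
# This is an illustration only, not Tika or POI code.
CELLS = {
    (0, 0): "label", (0, 1): "1",
    (1, 0): "baz",   (1, 1): "2",
}


def render_rows(cells, first_col):
    """Render each row as a list of cell texts, starting at column first_col."""
    n_rows = max(r for r, _ in cells) + 1
    n_cols = max(c for _, c in cells) + 1
    return [[cells.get((r, c), "") for c in range(first_col, n_cols)]
            for r in range(n_rows)]


# Correct: the indices are zero-based, so rendering starts at column 0.
correct = render_rows(CELLS, first_col=0)

# Buggy: the caller assumes one-based indices and starts at column 1,
# so the first column silently disappears from every row.
buggy = render_rows(CELLS, first_col=1)

if __name__ == "__main__":
    print(correct)
    print(buggy)
```

The same kind of shift applied to sheet indices rather than column indices would pair a sheet name with a neighbouring sheet's data.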

> and part query as to whether or not there is a plan
> w/Tika to extract more than sheet & cell data from documents.

Doing so would be very nice. You may want to file a Jira improvement
request for that.

And if you're familiar with Apache POI (or willing to learn it),
patches would of course also be welcome. :-) Otherwise I don't know
when one of us will encounter a similar need.

You may also want to contact the POI project to see if they've already
implemented text extraction improvements that would cover these
features. Last week at the ApacheCon I noticed that they've recently
been improving the out-of-the-box text extraction features in POI.

BR,

Jukka Zitting

Re: Excel Parsing Issues With Tika 0.3

David Weekly-3
TIKA-214 has now been filed, along with the sample XLS file.

https://issues.apache.org/jira/browse/TIKA-214

Should I separately bother the POI folks about this issue?

Incidentally, although it is sad and hacky, it may be worth noting
that concatenating the output of "strings" and "strings -el" does a
decent job of pulling unique strings out (although it also includes
font names, etc.).
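
For readers without the `strings` tool at hand, the workaround can be approximated in a few lines of Python. This is a rough sketch, not a reimplementation of `strings`: it assumes a minimum run length of 4 and treats only bytes 0x20 to 0x7e as printable, and the function names are hypothetical. The first pass corresponds to plain `strings` (single-byte text), the second to `strings -el` (16-bit little-endian text, which Excel files commonly contain).

```python
import re


def ascii_strings(data, min_len=4):
    """Runs of printable ASCII bytes, roughly what plain `strings` reports."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def utf16le_strings(data, min_len=4):
    """Runs of printable ASCII stored as UTF-16LE, roughly `strings -el`."""
    pattern = rb"(?:[\x20-\x7e]\x00){%d,}" % min_len
    return [m.decode("utf-16-le") for m in re.findall(pattern, data)]


def unique_strings(data, min_len=4):
    """Concatenate both passes and drop duplicates, keeping first-seen order."""
    seen = {}
    for s in ascii_strings(data, min_len) + utf16le_strings(data, min_len):
        seen.setdefault(s, None)
    return list(seen)


if __name__ == "__main__":
    # Synthetic bytes for demonstration; for a real file, read it with
    # open(path, "rb").read() and pass the result to unique_strings().
    sample = (b"\x01\x02WhamPuff\x00\x03"
              + "fizzlepuff".encode("utf-16-le")
              + b"\x00\x01WhamPuff\x04")
    print(unique_strings(sample))
```

As with the command-line version, the result includes any readable run of bytes in the file, so font names and similar noise still need to be filtered out afterwards.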

-David

