[jira] Created: (TIKA-250) XLS parser does not extract empty sheet names


JIRA jira@apache.org
XLS parser does not extract empty sheet names
---------------------------------------------

                 Key: TIKA-250
                 URL: https://issues.apache.org/jira/browse/TIKA-250
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Maxim Valyanskiy
            Priority: Minor


ExcelExtractor omits the sheet title when a sheet is empty. The fix is trivial; patch attached.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-250) XLS parser does not extract empty sheet names

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Valyanskiy updated TIKA-250:
----------------------------------

    Attachment: empty.patch



[jira] Commented: (TIKA-250) XLS parser does not extract empty sheet names

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724114#action_12724114 ]

Jukka Zitting commented on TIKA-250:
------------------------------------

The currentSheet.isEmpty() conditional was added explicitly to avoid outputting empty sheets. Most Excel files out there have the three default worksheets, but in the majority of cases only the first sheet contains anything, and it's cleaner if the empty extra sheets aren't included in the output.

Are there real-world cases where the name of an empty sheet is an important part of the extracted text content? I would assume that any essential sheet contains at least some content besides the sheet name.
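
To make the trade-off concrete, here is a minimal sketch of the conditional described above. It is not Tika's ExcelExtractor code: the class SheetTextSketch, the extract method, and the skipEmptySheets flag are hypothetical names, and a workbook is modelled as a plain map from sheet name to cell texts. With the flag set, empty sheets are dropped entirely, including their names; with it cleared, the name is still written even when the sheet has no cells.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the extractor discussed in this thread; not Tika's actual code.
class SheetTextSketch {

    // sheets maps a sheet name to its cell texts, in workbook order.
    static String extract(Map<String, List<String>> sheets, boolean skipEmptySheets) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> sheet : sheets.entrySet()) {
            List<String> cells = sheet.getValue();
            // Plays the role of the isEmpty() check: drop the sheet, name included.
            if (skipEmptySheets && cells.isEmpty()) {
                continue;
            }
            // Emit the sheet name first, then each cell on its own indented line.
            out.append(sheet.getKey()).append('\n');
            for (String cell : cells) {
                out.append("  ").append(cell).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> workbook = new LinkedHashMap<>();
        workbook.put("Sheet1", List.of("some content"));
        workbook.put("Sheet2", List.of()); // empty default sheet
        System.out.print(extract(workbook, true));  // empty sheets skipped
        System.out.print(extract(workbook, false)); // empty sheet names kept
    }
}
```

Under these assumptions, text extracted with the flag set cannot be searched for the name of an empty sheet, which is the behaviour the report is about.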



[jira] Commented: (TIKA-250) XLS parser does not extract empty sheet names

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725112#action_12725112 ]

Maxim Valyanskiy commented on TIKA-250:
---------------------------------------

Yes, there are real cases where we need to know the names of empty sheets. For example, we ran into the following issue: in one workbook each sheet represented a branch of the company, and some sheets were empty simply because the information had not been filled in yet. When we extracted text from the files, the names of those branches were missing, so later searches of our database for those names found nothing.



[jira] Resolved: (TIKA-250) XLS parser does not extract empty sheet names

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-250.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Fair enough, fix committed in revision 801432. Thanks for the patch and the rationale!
