[jira] Created: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

[jira] Created: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org
Refactor Excel extractor to parse per sheet and add hyperlink support
---------------------------------------------------------------------

                 Key: TIKA-132
                 URL: https://issues.apache.org/jira/browse/TIKA-132
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.1-incubating
            Reporter: Niall Pemberton
            Priority: Minor


In the Excel record stream, hyperlink records come at the end of the sheet, after the cell value records. This is a problem for the current streaming implementation of the Excel parser, since the hyperlink is not yet known when its cell is being processed and so cannot be output with it.

Jukka suggested the following on the mailing list:

"How about if the streaming Excel parser maintained a sparse in-memory table of the contents of the sheet that is currently being parsed and would only generate the respective SAX events once the sheet has been parsed? Since we can focus on only the information that's relevant to Tika clients, the memory requirements should be moderate even for huge sheets (i.e. much less than the file size even for a single-sheet file). This should satisfy the low memory footprint requirements reasonably well while allowing us to generate more accurate output."

See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
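
A minimal sketch of the buffering idea quoted above, assuming a simplified model in which cell values and hyperlinks arrive as plain method calls. The class name SheetBuffer and its methods are hypothetical and do not come from Tika or POI; output goes to a String rather than SAX events to keep the example self-contained.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: buffers one sheet's cells and emits them at end of sheet. */
class SheetBuffer {
    // Sparse tables keyed by a packed (row, column) value, so that the
    // natural ordering of the keys is row-major. Only non-empty cells are stored.
    private final TreeMap<Long, String> text = new TreeMap<>();
    private final TreeMap<Long, String> links = new TreeMap<>();

    private static long key(int row, int column) {
        return ((long) row << 32) | (column & 0xFFFFFFFFL);
    }

    /** Called for each cell value record as it streams past. */
    void cell(int row, int column, String value) {
        text.put(key(row, column), value);
    }

    /** Called for each hyperlink record, which arrives after the cell records. */
    void hyperlink(int row, int column, String url) {
        links.put(key(row, column), url);
    }

    /** Called at end of sheet: renders the buffered cells, then clears the buffer. */
    String flush() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Long, String> e : text.entrySet()) {
            String url = links.get(e.getKey());
            if (url != null) {
                // No escaping is done here; real output would need it.
                out.append("<a href=\"").append(url).append("\">")
                   .append(e.getValue()).append("</a>");
            } else {
                out.append(e.getValue());
            }
            out.append('\n');
        }
        text.clear();
        links.clear();
        return out.toString();
    }
}
```

Because the buffer is cleared after every sheet, memory use in this sketch is bounded by the largest single sheet rather than by the whole workbook.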



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated TIKA-132:
---------------------------------

    Attachment: TIKA-132-ExcelExtractor-refactor-v1.patch

Attaching a patch to refactor ExcelExtractor as per Jukka's suggestion. A few points to note:

 - Maintains "linked-lists" of Rows and Cells (each Row/Cell has a reference to the next Row/Cell)
 - Hyperlink support is currently commented out (marked with FIXME) because it depends on unreleased POI features
 - Empty sheets are ignored - is this OK?
 - Links still don't appear in the output when using WriteOutContentHandler, because the link is an "href" attribute of an <a> element - is this correct?

To try out the hyperlink support - uncomment the relevant lines and use a POI version built from the latest subversion trunk.

[jira] Updated: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated TIKA-132:
---------------------------------

    Attachment:     (was: TIKA-132-ExcelExtractor-refactor-v1.patch)

[jira] Updated: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated TIKA-132:
---------------------------------

    Attachment: TIKA-132-ExcelExtractor-refactor-v2.patch

Apologies - attaching a second patch with minor changes:

 - make visibility of methods in new private static inner classes consistent
 - use row/column parameter names rather than rowNo/columnNo as POI does


[jira] Commented: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582353#action_12582353 ]

Jukka Zitting commented on TIKA-132:
------------------------------------

Thanks! Applied your patch as-is in revision 641394.

Good point about empty sheets; I was wondering before how we could avoid exposing them. (A related thought: perhaps we should avoid outputting the <h1> tags for default sheet names like "Sheet 1".)

Re: WriteOutContentHandler; I think that's acceptable, as IMHO in default settings WriteOutContentHandler should output the actual text that's visible in the document. If a client wants to access extra information like the embedded links, it should use the SAX stream.

There are a few improvements I'd like to make:

- I'd replace the processCellValue flow on the "text" variable with a method call, as the case-if-if control flow may be a bit hard to follow, especially if we keep adding functionality to processCellValue.
- We should leverage existing java.util collections instead of creating our own linked lists. For example, a SortedMap of cell coordinates to cell values should fit our needs and reduce the amount of custom code in Tika.
- Cell formatting could be delegated to TikaExcelCell subclasses for better separation of concerns.
- The inner classes could be made package-private top-level classes to avoid bloating ExcelExtractor.

I'll follow up with respective commits directly in svn, but feel free to debate my changes if you prefer other solutions. I'll update svn accordingly until there's consensus.


[jira] Commented: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582432#action_12582432 ]

Jukka Zitting commented on TIKA-132:
------------------------------------

I'm now done streamlining the class. Most notably, I extracted and abstracted the TikaExcelCell class into the Cell interface, with the related implementation classes TextCell and NumberCell and the LinkedCell decorator. These classes have no dependencies on Excel parsing and could also be used by other parser implementations for similar page-by-page rendering. I'll follow up with another issue to generalize the Cell classes.

I'll leave this issue open until POI releases the next version with hyperlink support.
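
The Cell abstraction described in this comment might look roughly like the sketch below. Only the type names Cell, TextCell, NumberCell and LinkedCell come from the comment; the render method, its StringBuilder parameter and the constructors are assumptions made for illustration, and the committed classes may well render to SAX events instead.

```java
/** Hypothetical rendering contract; the interface committed to Tika may differ. */
interface Cell {
    void render(StringBuilder out);
}

/** A cell holding plain text. */
class TextCell implements Cell {
    private final String text;

    TextCell(String text) {
        this.text = text;
    }

    public void render(StringBuilder out) {
        out.append(text);
    }
}

/** A cell holding a numeric value; number formats are ignored in this sketch. */
class NumberCell implements Cell {
    private final double number;

    NumberCell(double number) {
        this.number = number;
    }

    public void render(StringBuilder out) {
        out.append(number);
    }
}

/** Decorator that wraps any other cell in a link (no escaping is done here). */
class LinkedCell implements Cell {
    private final Cell cell;
    private final String link;

    LinkedCell(Cell cell, String link) {
        this.cell = cell;
        this.link = link;
    }

    public void render(StringBuilder out) {
        out.append("<a href=\"").append(link).append("\">");
        cell.render(out);
        out.append("</a>");
    }
}
```

With a decorator, a hyperlink that arrives late can be attached by replacing a buffered cell with new LinkedCell(cell, url), and neither TextCell nor NumberCell needs to know about links.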

[jira] Resolved: (TIKA-132) Refactor Excel extractor to parse per sheet and add hyperlink support

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-132.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2-incubating

POI is now 3.1 and hyperlinks are enabled in Tika. Resolving as Fixed.
