Streaming vs. other features in parsers

Jukka Zitting-3
Hi,

I was looking at implementing link extraction for Excel files, and
found out that the link information is only available at the end of
the file as a special "cell X links to URI Y" record. The parser could
just slap such link records as artificial <a/> tags at the end of the
produced XHTML SAX stream, but properly associating the <a/> tags
with the correct text would require dropping the streaming feature.

Should we consider dropping the streaming parser, or provide an
alternative parser that reads the whole document to provide better
output? Note that some time ago we dropped the non-streaming parser in
favor of the (then better) streaming parser contributed by Niall. If
we decide to maintain alternative parsers, which one should be the
default?

BR,

Jukka Zitting

Re: Streaming vs. other features in parsers

Niall Pemberton
On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[hidden email]> wrote:
> Hi,
>
>  I was looking at implementing link extraction for Excel files, and
>  found out that the link information is only available at the end of
>  the file as a special "cell X links to URI Y" record. The parser could

It's probably academic, but I believe they come at the end of each
sheet, rather than the file.

I didn't think link support was in the latest POI release and was only
added a few weeks ago:
http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java

Not trying to make any point, just wondering whether I got this wrong,
or you found another way, or you tried the latest POI from svn?

>  just slap such link records as artificial <a/> tags at the end of the
>  produced XHTML SAX stream, but properly associating the <a/> tags
>  with the correct text would require dropping the streaming feature.
>
>  Should we consider dropping the streaming parser, or provide an
>  alternative parser that reads the whole document to provide better
>  output? Note that some time ago we dropped the non-streaming parser in
>  favor of the (then better) streaming parser contributed by Niall. If
>  we decide to maintain alternative parsers, which one should be the
>  default?

I think a low-memory-footprint parser still has value, despite this
drawback - I'm pretty sure that where I work lack of hyperlink support
is not an issue. Is there not room for two implementations in Tika?

Niall

>  BR,
>
>  Jukka Zitting

Re: Streaming vs. other features in parsers

Jukka Zitting-3
Hi,

On Thu, Mar 20, 2008 at 4:11 AM, Niall Pemberton
<[hidden email]> wrote:
> On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[hidden email]> wrote:
>  >  I was looking at implementing link extraction for Excel files, and
>  >  found out that the link information is only available at the end of
>  >  the file as a special "cell X links to URI Y" record. The parser could
>
>  It's probably academic, but I believe they come at the end of each
>  sheet, rather than file.

You're right, good point!

PDF parsing can typically be streamed one page at a time, i.e. you
need to parse a whole page to be able to render the output, and this
is something we might want to consider doing also for Excel sheets:

How about if the streaming Excel parser maintained a sparse in-memory
table of the contents of the sheet that is currently being parsed and
would only generate the respective SAX events once the sheet has been
parsed? Since we can focus on only the information that's relevant to
Tika clients, the memory requirements should be moderate even for huge
sheets (i.e. much less than the file size even for a single-sheet
file). This should satisfy the low memory footprint requirements
reasonably well while allowing us to generate more accurate output.
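
To make the idea concrete, here is a minimal sketch of such a per-sheet
buffer. It is only an illustration under assumptions: SheetBuffer and
SheetBufferDemo are hypothetical names, not existing Tika or POI API, the
(row, column) coordinates are assumed to come from whatever record
listener feeds the parser, and the output uses a plain table layout with
no escaping. Only the JDK's SAX interfaces are used.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or POI: buffers one sheet sparsely.
class SheetBuffer {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    // row -> (column -> value); empty cells take no memory.
    private final Map<Integer, Map<Integer, String>> cells = new TreeMap<>();
    private final Map<Integer, Map<Integer, String>> links = new TreeMap<>();

    void addCell(int row, int col, String text) {
        cells.computeIfAbsent(row, r -> new TreeMap<>()).put(col, text);
    }

    // Called for each "cell X links to URI Y" record at the end of the sheet.
    void addLink(int row, int col, String uri) {
        links.computeIfAbsent(row, r -> new TreeMap<>()).put(col, uri);
    }

    // Called once the sheet is complete: emit SAX events, then free the buffer.
    void flush(ContentHandler out) throws SAXException {
        start(out, "table", new AttributesImpl());
        for (Map.Entry<Integer, Map<Integer, String>> row : cells.entrySet()) {
            Map<Integer, String> rowLinks =
                    links.getOrDefault(row.getKey(), Collections.emptyMap());
            start(out, "tr", new AttributesImpl());
            for (Map.Entry<Integer, String> cell : row.getValue().entrySet()) {
                start(out, "td", new AttributesImpl());
                String uri = rowLinks.get(cell.getKey());
                if (uri != null) {
                    AttributesImpl href = new AttributesImpl();
                    href.addAttribute("", "href", "href", "CDATA", uri);
                    start(out, "a", href);
                }
                char[] text = cell.getValue().toCharArray();
                out.characters(text, 0, text.length);
                if (uri != null) {
                    out.endElement(XHTML, "a", "a");
                }
                out.endElement(XHTML, "td", "td");
            }
            out.endElement(XHTML, "tr", "tr");
        }
        out.endElement(XHTML, "table", "table");
        cells.clear();
        links.clear();
    }

    private static void start(ContentHandler out, String name, Attributes atts)
            throws SAXException {
        out.startElement(XHTML, name, name, atts);
    }
}

class SheetBufferDemo {
    // Serializes the events to a string (no escaping; for inspection only).
    static String render(SheetBuffer buffer) {
        StringBuilder xml = new StringBuilder();
        try {
            buffer.flush(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    xml.append('<').append(qName);
                    for (int i = 0; i < atts.getLength(); i++) {
                        xml.append(' ').append(atts.getQName(i))
                           .append("=\"").append(atts.getValue(i)).append('"');
                    }
                    xml.append('>');
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    xml.append("</").append(qName).append('>');
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    xml.append(ch, start, length);
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return xml.toString();
    }

    public static void main(String[] args) {
        SheetBuffer sheet = new SheetBuffer();
        sheet.addCell(0, 0, "Apache");
        sheet.addCell(0, 2, "Tika");
        sheet.addLink(0, 0, "http://www.apache.org/"); // arrives after the cells
        System.out.println(render(sheet));
    }
}
```

The buffer is cleared on every flush, so memory use is bounded by the
non-empty cells of the largest single sheet rather than by the whole
workbook.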

>  I didn't think link support was in the latest POI release and was only
>  added a few weeks ago:
>  http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java
>
>  Not trying to make any point, just wondering whether I got this wrong
>  or you found another way or you tried the latest POI from svn?

I'm using POI trunk.

>  I think a low-memory-footprint parser still has value, despite this
>  drawback - I'm pretty sure that where I work lack of hyperlink support
>  is not an issue. Is there not room for two implementations in Tika?

There certainly is; my main concerns are just the duplicated maintenance
effort and the added configuration complexity.

Would the above sheet-by-sheet streaming option work for your
requirements? Alternatively, we could avoid much duplication by making
the sheet-by-sheet feature a configurable mode of the normal streaming
Excel parser instead of using a separate parser class.
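
As a rough sketch of what a single class with a configurable mode could
look like (all names here, such as ExcelTextExtractor and bufferSheets,
are hypothetical and not existing Tika or POI API, and a list of strings
stands in for the SAX stream):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class; the names are illustrative, not Tika or POI API.
class ExcelTextExtractor {
    private final boolean bufferSheets;                     // the configurable mode
    private final List<String> output = new ArrayList<>();  // stands in for the SAX stream
    private final Map<String, String> cells = new LinkedHashMap<>(); // cell ref -> text
    private final Map<String, String> links = new LinkedHashMap<>(); // cell ref -> URI

    ExcelTextExtractor(boolean bufferSheets) {
        this.bufferSheets = bufferSheets;
    }

    void cell(String ref, String text) {
        if (bufferSheets) {
            cells.put(ref, text);   // hold back until the sheet ends
        } else {
            output.add(text);       // pure streaming: emit immediately
        }
    }

    void link(String ref, String uri) {
        if (bufferSheets) {
            links.put(ref, uri);
        } else {
            // The cell text has already been emitted, so the link can
            // only be appended as an empty, artificial <a/> element.
            output.add("<a href=\"" + uri + "\"/>");
        }
    }

    void endSheet() {
        // In streaming mode both maps are empty and nothing happens here.
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            String uri = links.get(cell.getKey());
            output.add(uri == null
                    ? cell.getValue()
                    : "<a href=\"" + uri + "\">" + cell.getValue() + "</a>");
        }
        cells.clear();
        links.clear();
    }

    List<String> output() {
        return output;
    }
}

class ExcelModeDemo {
    // Feeds the same record sequence through the extractor in either mode.
    static List<String> run(boolean bufferSheets) {
        ExcelTextExtractor extractor = new ExcelTextExtractor(bufferSheets);
        extractor.cell("A1", "Apache");
        extractor.cell("B1", "Tika");
        extractor.link("A1", "http://www.apache.org/"); // link record comes last
        extractor.endSheet();
        return extractor.output();
    }

    public static void main(String[] args) {
        System.out.println("streaming:      " + run(false));
        System.out.println("sheet-by-sheet: " + run(true));
    }
}
```

Both modes share the record handling, so only the point at which output
is emitted differs.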

BR,

Jukka Zitting

Re: Streaming vs. other features in parsers

Niall Pemberton
On Thu, Mar 20, 2008 at 2:56 AM, Jukka Zitting <[hidden email]> wrote:

>  Would the above sheet-by-sheet streaming option work for your
>  requirements?

Sounds good to me. I'll put a patch together.

Niall

Re: Streaming vs. other features in parsers

Niall Pemberton
On Thu, Mar 20, 2008 at 5:05 PM, Niall Pemberton
<[hidden email]> wrote:

>  Sounds good to me. I'll put a patch together.

I've created a JIRA ticket and attached a patch:
  https://issues.apache.org/jira/browse/TIKA-132

Suggestions welcome. If you don't like how it resolves this, I can
work up another patch.

Niall
