[jira] Created: (TIKA-53) XHTML SAX events from parsers

[jira] Created: (TIKA-53) XHTML SAX events from parsers

Cristian Vat (Jira)
XHTML SAX events from parsers
-----------------------------

                 Key: TIKA-53
                 URL: https://issues.apache.org/jira/browse/TIKA-53
             Project: Tika
          Issue Type: Improvement
          Components: general
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting
             Fix For: 0.1-incubator


Tika parsers should produce a sequence of XHTML SAX events instead of a single unstructured String as the parsed document content.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-53) XHTML SAX events from parsers

Cristian Vat (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-53:
------------------------------

    Attachment: TIKA-53.patch

The attached patch (TIKA-53.patch) is my first shot at this.

Most of the parsers just take the String that they used to produce before and wrap it in SAX events equivalent to the following document:

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>...</title>
        </head>
        <body>
          <p>...</p>
        </body>
    </html>
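
As a rough illustration of what emitting that structure through SAX could look like, here is a minimal sketch using only the JDK's org.xml.sax API. The class name XhtmlEventSketch and the emit() helper are hypothetical and are not taken from TIKA-53.patch; the actual patch may structure this differently.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: wraps a title and a text String
// in the html/head/title/body/p structure shown above.
final class XhtmlEventSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static void emit(ContentHandler handler, String title, String text)
            throws SAXException {
        handler.startDocument();
        handler.startPrefixMapping("", XHTML); // default namespace declaration
        start(handler, "html");
        start(handler, "head");
        start(handler, "title");
        chars(handler, title);
        end(handler, "title");
        end(handler, "head");
        start(handler, "body");
        start(handler, "p");
        chars(handler, text); // the String the parser used to return
        end(handler, "p");
        end(handler, "body");
        end(handler, "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        // No attributes are needed for this simple structure.
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void chars(ContentHandler h, String s) throws SAXException {
        char[] c = s.toCharArray();
        h.characters(c, 0, c.length);
    }
}
```

Any ContentHandler can consume these events, for example a DefaultHandler subclass that collects the text or a serializer that writes the XHTML back out.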

The only exception for now is the HTMLParser (surprise!) that uses the XHTML output from Tidy.

The TXTParser class is also slightly more advanced, as it'll avoid reading the full document into memory (assuming ICU4J doesn't do that). Instead it'll read the character stream in small batches and use the characters() SAX event to feed that stream to the given ContentHandler.
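
The batch-wise approach could be sketched roughly as follows. This assumes the input has already been decoded into a java.io.Reader (charset detection with ICU4J is left out), and the class name StreamingTextSketch, the stream() method and the batchSize parameter are hypothetical names, not the actual TXTParser code.

```java
import java.io.IOException;
import java.io.Reader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical sketch, not the real TXTParser: forwards text to a
// ContentHandler one small buffer at a time instead of building a String.
final class StreamingTextSketch {
    static void stream(Reader reader, ContentHandler handler, int batchSize)
            throws IOException, SAXException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        char[] buffer = new char[batchSize];
        int n;
        // read() returns -1 at end of stream; only batchSize chars are held at once.
        while ((n = reader.read(buffer)) != -1) {
            if (n > 0) {
                handler.characters(buffer, 0, n);
            }
        }
    }
}
```

A caller would presumably emit the enclosing start and end element events around this call. SAX allows a handler to receive one run of text as several characters() calls, so a consumer should not assume each call carries a complete paragraph.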

> XHTML SAX events from parsers
> -----------------------------
>
>                 Key: TIKA-53
>                 URL: https://issues.apache.org/jira/browse/TIKA-53
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.1-incubator
>
>         Attachments: TIKA-53.patch
>
>
> Tika parsers should produce a sequence of XHTML SAX events instead of a single unstructured String as the parsed document content.

[jira] Resolved: (TIKA-53) XHTML SAX events from parsers

Cristian Vat (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-53.
-------------------------------

    Resolution: Fixed

Committed the proposed patch with slight modifications in revision 584092.
