[jira] Created: (TIKA-131) Lazy XHTML prefix generation

JIRA jira@apache.org
Lazy XHTML prefix generation
----------------------------

                 Key: TIKA-131
                 URL: https://issues.apache.org/jira/browse/TIKA-131
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting
            Priority: Minor


The XHTMLContentHandler utility class is used by many Tika parsers to generate XHTML output. Among other things, the XHTMLContentHandler automatically generates the following XHTML skeleton:

    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>...</title>
      </head>
      <body>
        ...
      </body>
    </html>

The content of the <title> element (and potentially other metadata in the future) comes from the Metadata.TITLE property of the document being parsed. Unfortunately, that metadata is often not yet available when XHTML generation starts, because the typical usage pattern is:

    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    // parse the document
    xhtml.endDocument();

We can avoid the problem in many cases by postponing the XHTML prefix generation until the parser actually starts producing SAX events. By that point the parser has usually had a chance to set the title in the metadata.
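
To illustrate the idea, here is a minimal sketch of lazy prefix generation using only the JDK's SAX classes. It is not the actual Tika change: LazyXhtmlHandler, Recorder, the plain Map used in place of Tika's Metadata class, and the "title" key are all hypothetical stand-ins chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class LazyPrefixSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for XHTMLContentHandler (assumption, not Tika code). */
    static class LazyXhtmlHandler {
        private final ContentHandler out;
        private final Map<String, String> metadata; // stand-in for Tika's Metadata
        private boolean prefixWritten = false;

        LazyXhtmlHandler(ContentHandler out, Map<String, String> metadata) {
            this.out = out;
            this.metadata = metadata;
        }

        void startDocument() throws SAXException {
            out.startDocument(); // the <html><head>... prefix is NOT written here
        }

        /** Writes the skeleton prefix once, reading the title as late as possible. */
        private void lazyPrefix() throws SAXException {
            if (prefixWritten) {
                return;
            }
            prefixWritten = true;
            start("html");
            start("head");
            start("title");
            String title = metadata.get("title"); // hypothetical key
            if (title != null) {
                out.characters(title.toCharArray(), 0, title.length());
            }
            end("title");
            end("head");
            start("body");
        }

        void characters(String text) throws SAXException {
            lazyPrefix(); // first real content triggers the prefix
            out.characters(text.toCharArray(), 0, text.length());
        }

        void endDocument() throws SAXException {
            lazyPrefix(); // documents with no content still get the skeleton
            end("body");
            end("html");
            out.endDocument();
        }

        private void start(String name) throws SAXException {
            out.startElement(XHTML, name, name, new AttributesImpl());
        }

        private void end(String name) throws SAXException {
            out.endElement(XHTML, name, name);
        }
    }

    /** Records SAX events as a compact string so the result can be inspected. */
    static class Recorder extends DefaultHandler {
        final StringBuilder sb = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            sb.append('<').append(local).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            sb.append("</").append(local).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            sb.append(ch, start, length);
        }
    }

    /** Follows the usage pattern above, setting the title after startDocument(). */
    static String demo() {
        try {
            Map<String, String> metadata = new HashMap<>();
            Recorder recorder = new Recorder();
            LazyXhtmlHandler xhtml = new LazyXhtmlHandler(recorder, metadata);
            xhtml.startDocument();
            metadata.put("title", "Late title"); // discovered while parsing
            xhtml.characters("Hello");
            xhtml.endDocument();
            return recorder.sb.toString();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the title is set after startDocument() but still ends up inside the <title> element, because the prefix is only emitted when the first character data arrives. A title that becomes known only after the first content event would still be missed, which matches the "in many cases" caveat above.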

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-131) Lazy XHTML prefix generation

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-131.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2-incubating

Resolved in revision 638656.
