[jira] Created: (TIKA-347) Make HtmlParser customizable through ParseContext


JIRA jira@apache.org
Make HtmlParser customizable through ParseContext
-------------------------------------------------

                 Key: TIKA-347
                 URL: https://issues.apache.org/jira/browse/TIKA-347
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting
             Fix For: 0.6


In TIKA-304 we added the mapSafeElement() and isDiscardElement() methods to HtmlParser so that subclasses could better customize how incoming HTML elements get mapped to the XHTML output from Tika. This works fairly well, but it requires you either to modify the Tika configuration file or to explicitly inject a custom HtmlParser subclass instance into the CompositeParser instance you're using (AutoDetectParser, etc.).
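
To make the subclass approach concrete, here is a minimal, self-contained sketch. It does not use the real Tika classes: SketchHtmlParser and DivKeepingHtmlParser are hypothetical stand-ins, and the assumed semantics (mapSafeElement() returns the output element name or null, isDiscardElement() says whether the element's content is dropped entirely) are an illustration rather than the exact TIKA-304 implementation.

```java
import java.util.Locale;

// Hypothetical stand-in for HtmlParser; only the two hook methods are modeled.
class SketchHtmlParser {

    // Assumed contract: return the XHTML element name to emit,
    // or null if the element is not considered safe.
    protected String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.equals("p") || lower.equals("h1")) {
            return lower;
        }
        return null;
    }

    // Assumed contract: true if everything inside the element should be dropped.
    protected boolean isDiscardElement(String name) {
        return name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style");
    }

    // Helper for the sketch: reports what would happen to one element.
    public String describe(String name) {
        if (isDiscardElement(name)) {
            return "discard";
        }
        String mapped = mapSafeElement(name);
        return mapped != null ? "emit <" + mapped + ">" : "skip tag, keep content";
    }
}

// Customization by subclassing: additionally let <div> through.
class DivKeepingHtmlParser extends SketchHtmlParser {
    @Override
    protected String mapSafeElement(String name) {
        if ("div".equalsIgnoreCase(name)) {
            return "div";
        }
        return super.mapSafeElement(name);
    }
}
```

The drawback described above still applies to this pattern: an instance of the subclass has to be wired into whatever composite parser is in use.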

Now that we have the ParseContext mechanism available to simplify such customization, it would be nice to allow you to provide a custom "HTML mapper" instance through the parse context and have HtmlParser call that mapper (if available) for the mapSafeElement() and isDiscardElement() operations.
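
The proposal could be sketched roughly as follows. Everything here is a hypothetical stand-in written for illustration (SketchHtmlMapper, SketchParseContext, SketchContextAwareHtmlParser, KeepDivMapper); the interface and method names that ended up in Tika may differ. The point is only the lookup pattern: the parser asks the context for a mapper and falls back to its built-in behavior when none is supplied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mapper interface carrying the two operations named in the issue.
interface SketchHtmlMapper {
    String mapSafeElement(String name);   // output element name, or null
    boolean isDiscardElement(String name);
}

// Hypothetical stand-in for a parse context: a type-keyed bag of objects.
class SketchParseContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Example custom mapper: keeps <div> and <p>, discards <script>.
class KeepDivMapper implements SketchHtmlMapper {
    @Override
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase();
        return (lower.equals("div") || lower.equals("p")) ? lower : null;
    }

    @Override
    public boolean isDiscardElement(String name) {
        return name.equalsIgnoreCase("script");
    }
}

// Hypothetical parser that consults the context before using its defaults.
class SketchContextAwareHtmlParser {
    private static final SketchHtmlMapper DEFAULT_MAPPER = new SketchHtmlMapper() {
        @Override
        public String mapSafeElement(String name) {
            return name.equalsIgnoreCase("p") ? "p" : null;
        }

        @Override
        public boolean isDiscardElement(String name) {
            return name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style");
        }
    };

    String describe(String element, SketchParseContext context) {
        // Use the caller-supplied mapper if available, else the default.
        SketchHtmlMapper mapper = context.get(SketchHtmlMapper.class, DEFAULT_MAPPER);
        if (mapper.isDiscardElement(element)) {
            return "discard";
        }
        String mapped = mapper.mapSafeElement(element);
        return mapped != null ? "emit <" + mapped + ">" : "skip tag, keep content";
    }
}
```

With this shape, a caller only needs to put a mapper into the context passed to the parse call; no configuration file change or parser subclass is required.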

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-347) Make HtmlParser customizable through ParseContext

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-347.
--------------------------------

    Resolution: Fixed

Implemented in revision 890117.

