[jira] Created: (TIKA-58) Replace jtidy html parser with nekohtml based parser


Hudson (Jira)
Replace jtidy html parser with nekohtml based parser
----------------------------------------------------

                 Key: TIKA-58
                 URL: https://issues.apache.org/jira/browse/TIKA-58
             Project: Tika
          Issue Type: Improvement
          Components: general
            Reporter: Sami Siren
            Assignee: Sami Siren
            Priority: Minor


The following patch replaces the JTidy-based HTML parser with a NekoHTML-based SAX parser. It provides only the same functionality as the JTidy-based one (it extracts the title into metadata) and passes all other SAX events through. The speed improvement is around 100%.
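The approach described above (capture the title into metadata, pass every SAX event through) can be sketched roughly as follows. This is not the code from the attached patch: the names TitleExtractionSketch, TitleFilter and extract are hypothetical, a plain Map stands in for Tika's metadata object, and the JDK's built-in SAX parser stands in for NekoHTML so that the example runs without extra dependencies (which also means the input must be well-formed XHTML).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class TitleExtractionSketch {

    /** Hypothetical filter: records the <title> text and forwards every event downstream. */
    static class TitleFilter extends XMLFilterImpl {
        private final Map<String, String> metadata; // stand-in for Tika's metadata (assumption)
        private StringBuilder title;                // non-null only while inside <title>

        TitleFilter(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if ("title".equalsIgnoreCase(qName)) {
                title = new StringBuilder();
            }
            super.startElement(uri, localName, qName, atts); // pass the event through
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (title != null) {
                title.append(ch, start, length);
            }
            super.characters(ch, start, length); // pass the event through
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (title != null && "title".equalsIgnoreCase(qName)) {
                metadata.put("title", title.toString().trim());
                title = null;
            }
            super.endElement(uri, localName, qName); // pass the event through
        }
    }

    /** Parses well-formed XHTML, returns the collected metadata, forwards events to downstream. */
    public static Map<String, String> extract(String xhtml, ContentHandler downstream)
            throws Exception {
        Map<String, String> metadata = new HashMap<>();
        TitleFilter filter = new TitleFilter(metadata);
        filter.setContentHandler(downstream);

        // The JDK parser is used here only as a placeholder for an HTML-tolerant SAX parser.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(filter);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder seen = new StringBuilder();
        DefaultHandler downstream = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length); // proves text events still reach the consumer
            }
        };
        String xhtml = "<html><head><title>Hello</title></head><body><p>World</p></body></html>";
        Map<String, String> metadata = extract(xhtml, downstream);
        System.out.println("title = " + metadata.get("title"));
        System.out.println("downstream text = " + seen);
    }
}
```

Extending XMLFilterImpl keeps the pass-through behavior to a single super call per event, so the downstream handler sees the same stream it would have seen without the filter.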



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-58) Replace jtidy html parser with nekohtml based parser

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated TIKA-58:
---------------------------

    Attachment: TIKA-58.diff


[jira] Commented: (TIKA-58) Replace jtidy html parser with nekohtml based parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534482 ]

Bertrand Delacretaz commented on TIKA-58:
-----------------------------------------

Good idea, in Cocoon 2.1.x we have both JTidy and NekoHTML, and I find myself using Neko all the time.


[jira] Commented: (TIKA-58) Replace jtidy html parser with nekohtml based parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534492 ]

Jukka Zitting commented on TIKA-58:
-----------------------------------

+1 Nice

Some comments:

* I would rather keep the class named HtmlParser instead of NekoHtmlParser, to avoid exposing implementation details.

* Use spaces instead of tabs for indentation.

* Does anyone know if NekoHTML is actively maintained, or if the idea of adopting it in Xerces is still being considered?



[jira] Updated: (TIKA-58) Replace jtidy html parser with nekohtml based parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated TIKA-58:
---------------------------

    Attachment: TIKA-58_2.diff

Modified according to comments from Jukka.


[jira] Resolved: (TIKA-58) Replace jtidy html parser with nekohtml based parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved TIKA-58.
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.1-incubator

Committed with minor modifications.
