[jira] Created: (TIKA-138) Better HTML parsing

[jira] Created: (TIKA-138) Better HTML parsing

JIRA jira@apache.org
Better HTML parsing
-------------------

                 Key: TIKA-138
                 URL: https://issues.apache.org/jira/browse/TIKA-138
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: julien nioche


The current parser used for HTML leaves code in the extracted text.

For instance, in the page http://implicitweb.blogspot.com/, the CSS section

<style id='page-skin-1' type='text/css'><!--
/*
* Blogger Template Style
*
* Sand Dollar
* by Jason Sutter
* Updated by Blogger Team
*//* Variable definitions
====================
<Variable name="textcolor" description="Text Color"
type="color" default="#000"><Variable name="bgcolor" description="Page Background Color"
type="color" default="#f6f6f6"><Variable name="pagetitlecolor" description="Blog Title Color"
type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor" description="Blog Title Background Color"
type="color" default="#DE7008"><Variable name="descriptionColor" description="Blog Description Color"
type="color" default="#9E5205" /><Variable name="descbgcolor" description="Description Background Color"
type="color" default="#F5E39e"><Variable name="titlecolor" description="Post Title Color"
type="color" default="#9E5205"><Variable name="datecolor" description="Date Header Color"
type="color" default="#777777"><Variable name="footercolor" description="Post Footer Color"
....

is found in the extracted text. This is not the case when the same page is saved as plain text from Firefox or OpenOffice.

J.



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-138) Ignore HTML style and script content

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-138:
-------------------------------

    Assignee: Jukka Zitting
     Summary: Ignore HTML style and script content  (was: Better HTML parsing)

Good point. As discussed recently on the mailing list, there are probably some cases where style and script content is useful to a Tika client, but by default the extracted text should match what a browser normally shows.
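
To illustrate the default behaviour described here, the following is a minimal sketch in Python using the standard library's html.parser. It is not Tika's implementation and does not reflect the change made for this issue; the names VisibleTextExtractor and extract_visible_text are hypothetical and exist only for this example. The idea is to keep a counter while inside a style or script element and discard any character data seen there.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <style> and <script> content."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments that a browser would show

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Character data inside <style>/<script> is dropped.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces and collapse runs of whitespace.
        # This is a simplification: it can split a word that spans inline tags.
        return " ".join(" ".join(self._chunks).split())


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, calling extract_visible_text on a page whose head contains a Blogger-style <style> block and whose body contains "<p>Hello <b>world</b></p>" returns only the body text. A client that does want style or script content could make the SKIPPED set configurable.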




[jira] Resolved: (TIKA-138) Ignore HTML style and script content

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-138.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2-incubating

Resolved in revision 645982.




[jira] Closed: (TIKA-138) Ignore HTML style and script content

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche closed TIKA-138.
------------------------------


Thanks! That's great.

