[jira] Created: (TIKA-113) Metadata (such as title) should not be part of content

JIRA jira@apache.org
Metadata (such as title) should not be part of content
------------------------------------------------------

                 Key: TIKA-113
                 URL: https://issues.apache.org/jira/browse/TIKA-113
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 0.2-incubating
            Reporter: Rida Benjelloun


Metadata (such as the title) is added to the content. In my opinion, it would be preferable for toString() on the writer to return only the content of the document and not the metadata. The metadata is already stored in the Metadata object.
Rida.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-113) Metadata (such as title) should not be part of content

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-113:
-------------------------------

    Affects Version/s:     (was: 0.2-incubating)
        Fix Version/s: 0.2-incubating
           Issue Type: Improvement  (was: Wish)

I think the SAX event stream should still contain selected metadata in the <head/> section. For example, the current XHTMLContentHandler outputs the TITLE metadata field (if available) as the <title/> of the generated XML document.

Instead of changing that pattern, we should probably either change WriteOutContentHandler to output only the content of the <body/> element, or add a new ContentHandler utility class with that feature.
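
To illustrate the second option, here is a minimal sketch of a handler that keeps only the text found inside the XHTML <body/> element and ignores everything in <head/>. It uses only the JDK's SAX classes; the class name BodyTextCollector and its extract() helper are hypothetical names chosen for this example and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not a Tika class): collects only text inside the XHTML body. */
class BodyTextCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final StringBuilder text = new StringBuilder();
    // Nesting depth relative to <body>; greater than zero while inside it.
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Start counting at <body>, then count every nested element.
        if (bodyDepth > 0 || (XHTML.equals(uri) && "body".equals(local))) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head/> (such as the title) arrives while bodyDepth == 0 and is dropped.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Parses an XHTML string with a namespace-aware SAX parser and returns the body text. */
    static String extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        BodyTextCollector handler = new BodyTextCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }
}
```

With this sketch, the <title/> stays in the SAX event stream, but toString() on the collector returns only the body text.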

[jira] Commented: (TIKA-113) Metadata (such as title) should not be part of content

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560908#action_12560908 ]

Rida Benjelloun commented on TIKA-113:
--------------------------------------

+1, I agree with Jukka's suggestion.
Rida.

[jira] Commented: (TIKA-113) Metadata (such as title) should not be part of content

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569683#action_12569683 ]

Jukka Zitting commented on TIKA-113:
------------------------------------

A solution based on the current code is:

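    // Destination for the extracted text (left unspecified here)
    Writer writer = ...;
    // Bind the "xhtml" prefix to the XHTML namespace for use in the XPath below
    XPathParser xpath = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
    // Forward only the SAX events matched inside <body/> to the writer
    ContentHandler handler = new MatchingContentHandler(
            new WriteOutContentHandler(writer),
            xpath.parse("/xhtml:html/xhtml:body//*"));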

I'm not sure whether we should codify that into a helper class or a method.

[jira] Resolved: (TIKA-113) Metadata (such as title) should not be part of content

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-113.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Resolved in revision 646748 by implementing a BodyContentHandler class for getting just the XHTML body content.
