[jira] Created: (TIKA-46) Use Metadata in Parser


[jira] Created: (TIKA-46) Use Metadata in Parser

David Pilato (Jira)
Use Metadata in Parser
----------------------

                 Key: TIKA-46
                 URL: https://issues.apache.org/jira/browse/TIKA-46
             Project: Tika
          Issue Type: Improvement
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting


The Parser interface should use the Metadata framework to pass document metadata in and out.
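
As a rough illustration of what "metadata in and out" could look like, here is a minimal sketch. It is not the interface from the attached patches: SketchMetadata, SketchParser, FirstLineTitleParser and the "resourceName" key are hypothetical names invented for this example, and the real Metadata class may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container: a plain string-to-string map.
class SketchMetadata {
    static final String TITLE = "title";                // assumed key name
    static final String RESOURCE_NAME = "resourceName"; // assumed key name

    private final Map<String, String> values = new HashMap<>();

    String get(String name) { return values.get(name); }
    void set(String name, String value) { values.put(name, value); }
}

// Hypothetical parser contract: the metadata object is both input and output.
interface SketchParser {
    String parse(InputStream stream, SketchMetadata metadata) throws IOException;
}

// Toy parser: uses the first line as the title and returns the full text.
class FirstLineTitleParser implements SketchParser {
    @Override
    public String parse(InputStream stream, SketchMetadata metadata) throws IOException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        String title = text.split("\\R", 2)[0].trim();
        if (title.isEmpty()) {
            // Metadata in: fall back to a name supplied by the caller, if any.
            String name = metadata.get(SketchMetadata.RESOURCE_NAME);
            title = name != null ? name : "";
        }
        // Metadata out: record what the parser found.
        metadata.set(SketchMetadata.TITLE, title);
        return text;
    }
}
```

A caller would create one SketchMetadata, optionally set hints such as the resource name, pass it to parse(), and read the title back from the same object afterwards.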

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-46) Use Metadata in Parser

David Pilato (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-46:
------------------------------

    Attachment: TIKA-46-part1.patch

Attached a patch (TIKA-46-part1.patch) for introducing a Metadata object to the Parser interface. This is just the first half of the complete solution, as we still need to find a way to pass the configuration information currently contained in the Content collection.


[jira] Updated: (TIKA-46) Use Metadata in Parser

David Pilato (Jira)
In reply to this post by David Pilato (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-46:
----------------------------------

    Attachment: TIKA-46-part1.mattmann.100707.patch.txt

- Jukka, +1 for this change.

I'm attaching a slight update to your patch that:

1. Removes an extra ';' present in your original patch
2. Changes the literal string "title", used in several places in the updated TestParser, to Metadata.TITLE (for Dublin Core title)

Overall though, +1 for your original patch, and these minor updates. Thanks!


Re: [jira] Updated: (TIKA-46) Use Metadata in Parser

Keith R. Bennett
In reply to this post by David Pilato (Jira)
Jukka -

Do you want to revisit the architecture regarding the information we are currently keeping in the Content object (and will be moving momentarily)? Specifically, the text, xml, and regexp values? Wouldn't there be cases where different parsers would need their own strings to identify a property such as title? Should we support overriding the existing keys with parser implementation-specific keys? Maybe something like this:

defaultText="title"
defaultXML=...
defaultRegExp=...
org.xyz.FooParser=...

... so perhaps the parser would look up its own class name, and fall back to the default if it doesn't find it?
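
The lookup-with-fallback idea above could be sketched as follows. This is only an illustration under assumptions: ParserKeyLookup, FooParser and BarParser are hypothetical classes made up for the example, and java.util.Properties is used simply because the proposal is written as key=value pairs.

```java
import java.util.Properties;

// Hypothetical parser classes, present only so the lookup has something to resolve.
class FooParser { }
class BarParser { }

// Hypothetical helper, not part of Tika: resolves the string a parser should
// use for a property, preferring a parser-specific entry over the default.
final class ParserKeyLookup {
    private final Properties config;

    ParserKeyLookup(Properties config) {
        this.config = config;
    }

    // Tries the parser's fully qualified class name first, then the default key.
    String resolve(Class<?> parserClass, String defaultKeyName) {
        return config.getProperty(parserClass.getName(),
                                  config.getProperty(defaultKeyName));
    }
}
```

With defaultText=title plus an entry keyed by FooParser's class name, resolve(FooParser.class, "defaultText") returns the override, while a parser without its own entry gets "title".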

- Keith


[jira] Updated: (TIKA-46) Use Metadata in Parser

David Pilato (Jira)
In reply to this post by David Pilato (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-46:
------------------------------

    Attachment: TIKA-46-part2.patch

I committed the first patch (with improvements, thanks Chris!) in revisions 582674 and 582678.

Here's the second half of the required changes (TIKA-46-part2.patch), i.e. dropping the Content configuration from the parse() method.

The patch actually removes the Content class entirely and simplifies the tika-config.xml file quite a lot by hardcoding the available metadata in the actual Parser classes. As discussed on the mailing list, this actually makes sense as in many cases the parsers can only support a given set of metadata regardless of configuration. Anyway, we probably need to come up with some configuration mechanism for parsers that could support extensible metadata extraction.
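
To make the "hardcoded metadata" idea concrete, here is a small sketch. It is not code from the patch: HardcodedKeysParser, its "key: value" input format, and the two supported keys are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical parser: the metadata keys it can produce are fixed in the
// class itself instead of being read from a configuration file.
final class HardcodedKeysParser {
    static final Set<String> SUPPORTED_KEYS = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("title", "author")));

    // Reads "key: value" lines and keeps only the keys this parser supports.
    Map<String, String> extractMetadata(String document) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : document.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (SUPPORTED_KEYS.contains(key)) {
                metadata.put(key, line.substring(colon + 1).trim());
            }
        }
        return metadata;
    }
}
```

The trade-off is the one raised in the comment above: the supported set is obvious from the class, but a parser that could extract arbitrary extra fields would need some additional configuration mechanism.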


Re: [jira] Updated: (TIKA-46) Use Metadata in Parser

Jukka Zitting
In reply to this post by Keith R. Bennett
Hi,

On 10/7/07, Keith R. Bennett wrote:
> Do you want to revisit the architecture regarding the information we are
> currently keeping in the Content object (and will be moving momentarily)?

Check out the part2 patch, where I actually removed the Content object
entirely in favor of hardcoding the set of available metadata in the
parser classes.

BR,

Jukka Zitting

[jira] Resolved: (TIKA-46) Use Metadata in Parser

David Pilato (Jira)
In reply to this post by David Pilato (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-46.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.1-incubator

Part2 patch committed in revision 582689.
