[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064963#comment-17064963 ]

Tim Allison commented on TIKA-694:

One of the challenges is that some parsers need to parse the whole file before they have all of the metadata. In general, we try to parse the metadata, or at least add it, as early as possible, because as soon as we hit a body element no more metadata can be written to the XHTML, although the data will still be added to the metadata object.

In short, it is hard.
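
To illustrate the XHTML constraint described above, here is a minimal sketch that uses only the JDK's SAX parser, not Tika itself. It assumes the output is an XHTML event stream in which metadata appears as meta elements before the body; the class name HeadOnlyMetadata, the method headMetadata, and the sample document are hypothetical. The handler collects the meta elements and aborts as soon as the body starts, so anything a parser only learns later never shows up here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects <meta name=... content=...> seen before <body>.
final class HeadOnlyMetadata {

    static Map<String, String> headMetadata(String xhtml) {
        Map<String, String> meta = new LinkedHashMap<>();
        boolean[] stopped = {false};

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                if ("meta".equals(qName)) {
                    String name = atts.getValue("name");
                    if (name != null) {
                        meta.put(name, atts.getValue("content"));
                    }
                } else if ("body".equals(qName)) {
                    // Body reached: stop here, the content is never visited.
                    stopped[0] = true;
                    throw new SAXException("body reached");
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            if (!stopped[0]) {
                // A real parse error, not our deliberate abort.
                throw new IllegalStateException(e);
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return meta;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><meta name=\"title\" content=\"Example\"/></head>"
                + "<body><p>Long document text</p></body></html>";
        System.out.println(headMetadata(xhtml));
    }
}
```

Stopping at the body only avoids work on the consumer side. A parser that has to read the whole file before it knows its metadata has already paid that cost by the time the first event is emitted.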

> On extraction, get properties AND / OR content extraction
> ---------------------------------------------------------
>                 Key: TIKA-694
>                 URL: https://issues.apache.org/jira/browse/TIKA-694
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.0
>         Environment: All OS
>            Reporter: Etienne Jouvin
>            Priority: Minor
>         Attachments: Tika-1.0.zip
> I use Tika to extract properties, and only properties, from Office files.
> The parser goes through the document content, which is not necessary and slows down the process.
> It would be nice to have the choice to extract only the properties or not.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a flag set to boolean true to say that only the properties should be extracted.
> For example, for Office files, I extended the OfficeParser class. In the parse method, I check the flag and, if it is true, skip all of the content extraction.
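
The flag-based workaround in the description can be sketched roughly as follows. This is a self-contained toy, not the code from the attachment and not Tika's API: Context is a small stand-in for a context object keyed by class, and PropertiesOnly, PropertiesOnlySketch and parse are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class PropertiesOnlySketch {

    // Hypothetical marker type: its presence in the context means "skip the body".
    static final class PropertiesOnly {}

    // Minimal stand-in for a context object keyed by class (an assumption,
    // written here so the sketch runs without any library).
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when the key is absent
        }
    }

    // Toy "parser": always copies the properties, and only appends the body
    // text to contentOut when the PropertiesOnly flag is absent.
    static Map<String, String> parse(Map<String, String> properties, String body,
                                     Context context, StringBuilder contentOut) {
        Map<String, String> metadata = new LinkedHashMap<>(properties);
        if (context.get(PropertiesOnly.class) == null) {
            contentOut.append(body);
        }
        return metadata;
    }

    public static void main(String[] args) {
        Context context = new Context();
        context.set(PropertiesOnly.class, new PropertiesOnly());

        Map<String, String> props = new LinkedHashMap<>();
        props.put("title", "Example");

        StringBuilder content = new StringBuilder();
        System.out.println(parse(props, "Long document text", context, content));
        System.out.println("content length: " + content.length());
    }
}
```

This only helps for formats where the properties can be read without touching the body, which is the difficulty raised in the comment above.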
