[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction


Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063260#comment-17063260 ]

Rapster commented on TIKA-694:

Let me comment on this 5 years later :)

Etienne has a strong point: there are plenty of use cases where you only need to extract metadata, not content.
So far, only PDFParser supports this: pass null as the handler and it should work. As a result, extracting metadata from a 1 GB PDF file takes 14 s instead of 84 s, which is far from negligible, especially if working synchronously is your only option.

I don't know every parser, but I know that many of them do not support null handlers. I'm fully aware it would be a lot of work, but it would be worth it ;-)
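
To make the request concrete, here is a minimal sketch of what "supporting a null handler" could mean inside a parser. Everything in it is a hypothetical stand-in: NullTolerantParserSketch is not Tika's Parser interface, a plain Map stands in for Tika's Metadata, and a Consumer<String> stands in for the SAX ContentHandler. It only illustrates the control flow, assuming metadata can be read without walking the document body.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
class NullTolerantParserSketch {
    // Counts how often the expensive content pass ran (for demonstration only).
    int contentPasses = 0;

    // 'handler' stands in for a SAX ContentHandler; null means "metadata only".
    void parse(String title, String body, Consumer<String> handler,
               Map<String, String> metadata) {
        // Assumption: metadata is cheap to read, so it is always extracted.
        metadata.put("title", title);
        metadata.put("length", Integer.toString(body.length()));

        // A null handler skips the content pass instead of failing with an NPE.
        if (handler == null) {
            return;
        }
        contentPasses++;
        handler.accept(body);
    }

    public static void main(String[] args) {
        NullTolerantParserSketch parser = new NullTolerantParserSketch();
        Map<String, String> metadata = new LinkedHashMap<>();

        // Metadata-only call: no handler, so no content pass.
        parser.parse("Report", "long document body", null, metadata);
        System.out.println(metadata + " contentPasses=" + parser.contentPasses);

        // Full call: content is forwarded to the handler.
        StringBuilder text = new StringBuilder();
        parser.parse("Report", "long document body", text::append, metadata);
        System.out.println(text + " contentPasses=" + parser.contentPasses);
    }
}
```

The point of the sketch is that the null check sits before any content work starts, which is where the time saving would come from; whether a given real parser can separate the two passes this cleanly depends on the file format.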

Please consider reopening this ticket

> On extraction, get properties AND / OR content extraction
> ---------------------------------------------------------
>                 Key: TIKA-694
>                 URL: https://issues.apache.org/jira/browse/TIKA-694
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.0
>         Environment: All OS
>            Reporter: Etienne Jouvin
>            Priority: Minor
>         Attachments: Tika-1.0.zip
> I use Tika to extract properties, and only properties, from Office files.
> The parser goes through the document content, which is unnecessary and slows down the process.
> It would be nice to have the choice of whether to extract only the properties.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a flag set to boolean true to say that only the properties should be extracted.
> For example, for Office files, I extended the OfficeParser class. In the parse method, I check the flag and, if it is true, skip all content extraction.
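
The flag-in-context approach described in the quoted report can be sketched as follows. This is not the code from the attached Tika-1.0.zip, which is not reproduced here. All names are hypothetical: MetadataOnly is an invented marker type, SketchContext is a small class-keyed map standing in for ParseContext, and FlagAwareParserSketch stands in for the extended Office parser.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical marker type: its presence in the context means "skip content".
final class MetadataOnly {
    static final MetadataOnly INSTANCE = new MetadataOnly();
    private MetadataOnly() {}
}

// Stand-in for a class-keyed parse context; this is NOT Tika's ParseContext.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Returns null when no value was registered for the key.
        return key.cast(entries.get(key));
    }
}

// Stand-in for the extended parser that checks the flag before extracting content.
class FlagAwareParserSketch {
    static Map<String, String> parse(String title, String body,
                                     Consumer<String> contentSink,
                                     SketchContext context) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", title);

        // Only walk the content when the caller did not ask for metadata only.
        boolean metadataOnly = context.get(MetadataOnly.class) != null;
        if (!metadataOnly) {
            contentSink.accept(body);
        }
        return metadata;
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        context.set(MetadataOnly.class, MetadataOnly.INSTANCE);

        StringBuilder content = new StringBuilder();
        Map<String, String> metadata =
                parse("Budget", "cell data", content::append, context);
        System.out.println(metadata + " contentLength=" + content.length());
    }
}
```

Compared with passing a null handler, a flag in the context leaves the handler argument untouched, so parsers that do not know about the flag would keep working and simply extract everything.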
