[jira] [Commented] (TIKA-3044) add -C/--content cli option using WriteOutContentHandler

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036728#comment-17036728 ]

Alexander Klimetschek commented on TIKA-3044:
---------------------------------------------

Pull request: https://github.com/apache/tika/pull/312

Patch: [https://patch-diff.githubusercontent.com/raw/apache/tika/pull/312.patch]

Includes a unit test.

> add -C/--content cli option using WriteOutContentHandler
> --------------------------------------------------------
>
>                 Key: TIKA-3044
>                 URL: https://issues.apache.org/jira/browse/TIKA-3044
>             Project: Tika
>          Issue Type: New Feature
>          Components: cli
>            Reporter: Alexander Klimetschek
>            Priority: Major
>
> For text extraction, the CLI currently provides the --text and --text-main options. For HTML files, --text returns the body, while --text-main returns only the title. No CLI option currently returns all text content. However, the Tika API provides WriteOutContentHandler, which writes out all character content.
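
To illustrate the idea behind the request, here is a minimal sketch of a SAX handler that writes every character event to a Writer, so that text from both the head (title) and the body ends up in the output. It uses only the JDK's SAX parser. The names AllTextSketch, WriteAllTextHandler, and extractAllText are hypothetical and are not part of Tika or of the pull request; in Tika itself, a WriteOutContentHandler instance would instead be passed as the ContentHandler to a parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not Tika code.
class AllTextSketch {

    // Stand-in for the concept: forward all character content to a Writer,
    // ignoring element boundaries and applying no escaping.
    static class WriteAllTextHandler extends DefaultHandler {
        private final Writer writer;

        WriteAllTextHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException("Error writing text", e);
            }
        }
    }

    // Parses a well-formed XML/XHTML string and returns all of its text content.
    static String extractAllText(String xml) throws Exception {
        StringWriter out = new StringWriter();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new WriteAllTextHandler(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>Title</title></head>"
                + "<body><p>Body text</p></body></html>";
        // Title and body text are both emitted, in document order.
        System.out.println(extractAllText(doc));
    }
}
```

The sketch assumes well-formed input, since it relies on a plain XML parser rather than on Tika's format detection and HTML handling.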



--
This message was sent by Atlassian Jira
(v8.3.4#803005)