[jira] [Comment Edited] (TIKA-3093) Enable tika-server to forward parse results to another endpoint


Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091774#comment-17091774 ]

Tim Allison edited comment on TIKA-3093 at 4/24/20, 5:33 PM:
-------------------------------------------------------------

A strawman proposal...

This relies on the /rmeta style output, e.g. [^test_recursive_embedded.docx.json].

Users could specify mappings in a forward-config.json file like so at server startup.

{noformat}
{
        "url":"http://localhost:8983/solr",
        "method":"(put|post)",
        "onException":"(skip|continue)",
        "fields" : {
                "include_non_mapped":false,
                "mappings" : {
                        "Content-Type" : "mime",
                        "X-TIKA:content" : "content"
                }
        }
}
{noformat}
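
To make the intended behaviour concrete, here is a rough Python sketch of how the "fields" section might be applied to a single metadata object from the /rmeta output. The function name apply_mappings is hypothetical, and the semantics of "include_non_mapped" (unmapped keys are dropped when false, passed through under their original name when true) are an assumption, not something settled in this proposal.

```python
def apply_mappings(metadata, fields_config):
    """Rename metadata keys according to the "mappings" section.

    Assumption: keys without a mapping are dropped unless
    "include_non_mapped" is true, in which case they keep their name.
    """
    mappings = fields_config.get("mappings", {})
    include_non_mapped = fields_config.get("include_non_mapped", False)
    mapped = {}
    for key, value in metadata.items():
        if key in mappings:
            mapped[mappings[key]] = value
        elif include_non_mapped:
            mapped[key] = value
    return mapped


def apply_mappings_to_rmeta(rmeta_output, fields_config):
    """/rmeta returns a list of metadata objects; map each one."""
    return [apply_mappings(m, fields_config) for m in rmeta_output]
```

With the config above, an object such as {"Content-Type": "application/pdf", "X-TIKA:content": "hello", "Author": "x"} would come out as {"mime": "application/pdf", "content": "hello"}.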

They'd PUT their bytes to http://localhost:9998/rmeta_forward.  To inject extra fields into the forwarded document, they could add HTTP headers, e.g. -H "field: id ; doc1" -H "field: myfield ; something_special".
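
As a sketch of the header convention, the server side might turn the repeated "field" header values into key/value pairs roughly like this. The function name parse_injected_fields is hypothetical, and splitting on the first ";" with surrounding whitespace trimmed is an assumption about the format.

```python
def parse_injected_fields(header_values):
    """Parse values of repeated "field" headers, e.g. "id ; doc1".

    Assumption: the first ";" separates the field name from its value,
    so the value itself may contain further semicolons.
    """
    fields = {}
    for raw in header_values:
        name, sep, value = raw.partition(";")
        if not sep or not name.strip():
            raise ValueError("expected 'name ; value', got: %r" % raw)
        fields[name.strip()] = value.strip()
    return fields
```

The two example headers above would yield {"id": "doc1", "myfield": "something_special"}, which could then be merged into the mapped document before forwarding.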

If there's a parse exception and "onException" is "continue", then the stack trace would be stored in the /rmeta output, and the document would still be forwarded.  If it is set to "skip", nothing would be forwarded and the handler would throw an exception back to the client.
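
The two "onException" modes could be sketched as follows. This is only an illustration: handle_parse, the parse and forward callables, and the STACK_TRACE_KEY name are all placeholders, not existing tika-server code or real Tika metadata keys.

```python
import traceback

# Placeholder key; the real /rmeta output uses its own exception key.
STACK_TRACE_KEY = "stack_trace"


def handle_parse(parse, forward, data, on_exception):
    """Parse data and forward the result, honouring "onException".

    parse(data) is assumed to return a list of metadata dicts (as /rmeta
    does); forward(metadata_list) is assumed to send them downstream.
    """
    try:
        metadata_list = parse(data)
    except Exception:
        if on_exception == "skip":
            # Nothing is forwarded; the error goes back to the client.
            raise
        # "continue": keep the stack trace and forward the document anyway.
        metadata_list = [{STACK_TRACE_KEY: traceback.format_exc()}]
    forward(metadata_list)
    return metadata_list
```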


was (Author: [hidden email]):
A strawman proposal...

This relies on the /rmeta style output, e.g. [^test_recursive_embedded.docx.json].

Users could specify mappings in a forward-config.json file like so at server startup.

{noformat}
{
        "url":"http://localhost:8983/solr",
        "method":"(put|post)",
        "onException":"(skip|continue)",
        "fields" : {
                "include_non_mapped":false,
                "mappings" : {
                        "Content-Type" : "mime",
                        "X-TIKA:content" : "content"
                }
        }
}
{noformat}

They'd put their bytes to http://localhost:9998/tika_forward.  In the http headers, they could include fields to inject, e.g. -H "field: id ; doc1" -H "field: myfield ; something_special".

If there's a parse exception and "onException" is "continue", then the stacktrace would be stored in the /rmeta output, and the document would be forwarded.  If set to "skip", the handler would throw an exception back to the client.

> Enable tika-server to forward parse results to another endpoint
> ---------------------------------------------------------------
>
>                 Key: TIKA-3093
>                 URL: https://issues.apache.org/jira/browse/TIKA-3093
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: test_recursive_embedded.docx.json
>
>
> bq. I see the "send the results to a remote network service" thing as probably being separate from the Content Handler.
> The above is from [~nick] on TIKA-2972.
> It would be useful to allow users to forward the results of parsing to another endpoint.  For example, a user could specify a Solr URL/update/json/docs handler or an Elasticsearch /<index>/_doc/<_id> endpoint.
> We may want to allow users to do custom mapping before redirecting to another URL, whitelisting/blacklisting of metadata keys, etc.
> I'd propose using /rmeta as the basis for this.
> cc [~ehatcher] and [~dadoonet].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)