[jira] [Comment Edited] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

Mihir Sharma (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091774#comment-17091774 ]

Tim Allison edited comment on TIKA-3093 at 4/24/20, 5:35 PM:
-------------------------------------------------------------

A strawman proposal...

This relies on the /rmeta style output, e.g. [^test_recursive_embedded.docx.json].

Users could specify mappings in a forward-config.json file like so at server startup.

{noformat}
{
        "url":"http://localhost:8983/solr",
        "method":"(put|post)",
        "onException":"(skip|continue)",
        "fields" : {
                "include_non_mapped":false,
                "mappings" : {
                        "Content-Type" : "mime",
                        "X-TIKA:content" : "content"
                }
        }
}
{noformat}

They'd PUT their bytes to http://localhost:9998/rmeta_forward.  In the HTTP headers, they could include fields to inject, e.g. -H "field: id ; doc1" -H "field: myfield ; something_special".
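
A rough sketch of how the server side might turn those repeated "field" headers into injected fields. This assumes the "name ; value" layout from the example above; the function name parse_field_headers is hypothetical, and whether values may themselves contain ";" is left open (here everything after the first ";" is treated as the value).

```python
def parse_field_headers(header_values):
    """Parse the values of repeated 'field' headers into a dict.

    Each value is assumed to look like 'name ; value' (hypothetical format
    taken from the example in this proposal).
    """
    fields = {}
    for raw in header_values:
        # Split on the first ';' only, so the value may contain ';'.
        name, sep, value = raw.partition(";")
        if not sep or not name.strip():
            raise ValueError(f"malformed field header: {raw!r}")
        fields[name.strip()] = value.strip()
    return fields


injected = parse_field_headers(["id ; doc1", "myfield ; something_special"])
print(injected)  # {'id': 'doc1', 'myfield': 'something_special'}
```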

If there's a parse exception and "onException" is "continue", then the stack trace would be stored in the /rmeta output, and the document would still be forwarded.  If "onException" is "skip", the handler would throw an exception back to the client and the document would not be forwarded.
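
To make the two "onException" modes concrete, here is a minimal sketch. The names ParseFailed and resolve_parse_result are hypothetical, and the metadata key used to hold the stack trace is an assumption for illustration; the real handler would reuse whatever key /rmeta already uses.

```python
class ParseFailed(Exception):
    """Hypothetical error reported back to the client in 'skip' mode."""


# Assumed key for the stored stack trace; illustrative only.
EXCEPTION_KEY = "X-TIKA:EXCEPTION:runtime"


def resolve_parse_result(rmeta, stack_trace, on_exception):
    """Return the /rmeta-style list to forward, or raise in 'skip' mode.

    rmeta is a list of metadata dicts (container document first);
    stack_trace is None when parsing succeeded.
    """
    if stack_trace is None:
        return rmeta
    if on_exception == "continue":
        # Keep the stack trace with the container document and forward anyway.
        container = dict(rmeta[0]) if rmeta else {}
        container[EXCEPTION_KEY] = stack_trace
        return [container] + list(rmeta[1:])
    if on_exception == "skip":
        # Nothing is forwarded; the client sees the failure.
        raise ParseFailed(stack_trace)
    raise ValueError(f"unknown onException value: {on_exception!r}")


docs = [{"Content-Type": "application/pdf"}]
forwarded = resolve_parse_result(docs, "java.io.IOException: boom", "continue")
print(forwarded[0][EXCEPTION_KEY])  # java.io.IOException: boom
```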

If the "fields" element is missing, the document would be sent as is, in the /rmeta format.
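
A minimal sketch of how the "fields" section might be applied to a single /rmeta metadata object. The function name apply_mappings is hypothetical, and two behaviours are assumptions rather than part of the proposal: "include_non_mapped" defaults to false when absent, and injected header fields are added last so they win on a name clash.

```python
def apply_mappings(metadata, fields_config=None, injected=None):
    """Rename metadata keys per the 'fields' config and add injected fields.

    fields_config is the 'fields' element of forward-config.json, or None
    if that element is missing (the metadata is then passed through as is).
    """
    if fields_config is None:
        out = dict(metadata)
    else:
        mappings = fields_config.get("mappings", {})
        # Assumption: unmapped keys are dropped unless explicitly included.
        include_non_mapped = fields_config.get("include_non_mapped", False)
        out = {}
        for key, value in metadata.items():
            if key in mappings:
                out[mappings[key]] = value
            elif include_non_mapped:
                out[key] = value
    # Assumption: fields injected via headers override mapped fields.
    out.update(injected or {})
    return out


fields = {
    "include_non_mapped": False,
    "mappings": {"Content-Type": "mime", "X-TIKA:content": "content"},
}
doc = {"Content-Type": "application/pdf", "X-TIKA:content": "hello", "Author": "someone"}
print(apply_mappings(doc, fields, {"id": "doc1"}))
# {'mime': 'application/pdf', 'content': 'hello', 'id': 'doc1'}
```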



> Enable tika-server to forward parse results to another endpoint
> ---------------------------------------------------------------
>
>                 Key: TIKA-3093
>                 URL: https://issues.apache.org/jira/browse/TIKA-3093
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: test_recursive_embedded.docx.json
>
>
> bq. I see the "send the results to a remote network service" thing as probably being separate from the Content Handler.
> The above is from [~nick] on TIKA-2972.
> It would be useful to allow users to forward the results of parsing to another endpoint.  For example, a user could specify a Solr URL/update/json/docs handler or an Elasticsearch /<index>/_doc/<_id> endpoint.
> We may want to allow users to do custom mapping before redirecting to another URL, whitelisting/blacklisting of metadata keys, etc.
> I'd propose using /rmeta as the basis for this.
> cc [~ehatcher] and [~dadoonet].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)