[jira] [Commented] (TIKA-3221) /rmeta/text endpoint - allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point

Steve Loughran (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237597#comment-17237597 ]

Tim Allison commented on TIKA-3221:
-----------------------------------

I worked on this again a bit today.  There are two more issues.

1) Trivial: the underlying metadata values are currently String[], so we'd have to catch ConcurrentModificationExceptions or, more reliably, change that data structure.

2) More serious: RecursiveParserWrapper adds the metadata to the underlying list only after the handler reaches the end of the document.  I don't see an elegant way to let a reading thread grab the contents of the currently active handler plus those of all the other handlers that may be on hold waiting for their embedded docs to finish.  This would require a substantial rewrite. :(
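
For point 1, a minimal sketch of what "change that data structure" could look like. This is not Tika's Metadata class; `ConcurrentMetadataSketch`, `add`, and `snapshot` are hypothetical names used only for illustration. It assumes one parser thread writing and a second thread reading a copy once the time limit is hit. It does nothing for point 2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for a metadata store; NOT the real Tika Metadata API.
class ConcurrentMetadataSketch {
    // ConcurrentHashMap iterators are weakly consistent, so a reader never
    // gets a ConcurrentModificationException while the parser thread writes.
    private final Map<String, List<String>> values = new ConcurrentHashMap<>();

    // Called by the parsing thread: append a value under the given name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new CopyOnWriteArrayList<>()).add(value);
    }

    // Called by the reading thread: copy whatever has been written so far.
    // Later writes do not affect the returned map.
    Map<String, List<String>> snapshot() {
        Map<String, List<String>> copy = new HashMap<>();
        values.forEach((name, vals) -> copy.put(name, new ArrayList<>(vals)));
        return copy;
    }
}
```

The snapshot is only as complete as what the parser has added so far, which is exactly why point 2 matters: values held back until the end of a document never show up in it.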

> /rmeta/text endpoint - allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3221
>                 URL: https://issues.apache.org/jira/browse/TIKA-3221
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Can we make a change to the
> {code}
> /rmeta/text
> {code}
> endpoint to allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point.
> Motivation:
> I have a massive number of documents that I need to fetch through Apache Tika Server.
> Prior to making a switch to Tika Server, I used a project I created myself, https://github.com/nddipiazza/tika-fork, that created forked Tika VMs and would send work to the VMs through sockets directly.
> This was OK but super complicated so I chose to switch to the Tika jetty server for simplicity's sake.
> Tika Server works great for the most part for this use case... But one feature I had before was that I could say "If I don't get a result within MAX_PARSE_TIMEOUT_MS, then stop parsing at that moment and return the bytes we managed to get up to that point."
> This is because with the massive number of documents I need to parse, I cannot afford to have any parse hang longer than a certain amount of time. But conversely, if I set the timeout to 20 seconds, then I suffer massive gaps with *no* content at all.
> With the rmeta/text method, we recently added the ability to send a writeLimit where we will stop parsing after we reach that number of bytes.
> I'm hoping we can do the same for the time parsed. Perhaps when checking byte size, periodically check the time and quit the parser in the same way.
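
The idea in the last quoted line (check the clock in the same place the size is checked) can be sketched as a plain SAX handler. This is an illustration under stated assumptions, not Tika's implementation: `TimeLimitedTextHandler`, `ParseTimeLimitReachedException`, and the `maxParseMillis` parameter are hypothetical names, the limit is counted in characters rather than bytes, and the clock is only checked when the parser emits text, so a parser that hangs without calling the handler would not be stopped.

```java
import java.util.concurrent.TimeUnit;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical exception type signalling that the time budget ran out.
class ParseTimeLimitReachedException extends SAXException {
    ParseTimeLimitReachedException(String message) {
        super(message);
    }
}

// Hypothetical handler: collects text and aborts on a size or time limit.
class TimeLimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final long deadlineNanos;
    private final int writeLimit; // assumption: counted in chars, not bytes

    TimeLimitedTextHandler(long maxParseMillis, int writeLimit) {
        this.deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxParseMillis);
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Time check: subtracting keeps the comparison safe if nanoTime wraps.
        if (System.nanoTime() - deadlineNanos >= 0) {
            throw new ParseTimeLimitReachedException("max parse time exceeded");
        }
        // Size check: keep what still fits, then abort the parse.
        int room = writeLimit - text.length();
        if (length > room) {
            text.append(ch, start, room);
            throw new SAXException("write limit reached");
        }
        text.append(ch, start, length);
    }

    // Text collected so far; still usable after either exception is thrown.
    String getText() {
        return text.toString();
    }
}
```

A caller would catch the exception thrown out of the parse and return `getText()` as the partial result. As the comment above explains, this covers only the text of the handler that is currently active, not metadata or handlers waiting on embedded documents.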



--
This message was sent by Atlassian Jira
(v8.3.4#803005)