Modified ForkParser

Tim Allison
  I have a proof-of-concept (not anywhere near ready for committing)
modified/plagiarized version of the ForkParser available here:

  First, the ForkParser is simply genius.  Every time I dig into it, I feel
like I'm looking at advanced alien technology.

  I see three drawbacks to our current ForkParser:

1) It requires that the full Tika parser be on the client's classpath; it
then sends that parser and the InputStream to a separate process for the
actual processing.  I think we're lucky just to be able to build tika-app
without too many jar conflicts, and putting all of Tika on a client's
classpath invites more of them.

2) Relatedly, it requires that all of our dependencies be serializable.

3) I don't see an easy way to incorporate the RecursiveParserWrapper,
partly because of my mistakes in implementing it!
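
  To illustrate drawback 2, here is a minimal sketch using plain Java
serialization.  ToyParser and ThirdPartyHelper are hypothetical stand-ins,
not Tika classes; the point is only that a Serializable object holding one
non-serializable field cannot be written to an ObjectOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {

    // Hypothetical stand-in for a third-party dependency that is NOT Serializable.
    static class ThirdPartyHelper {
    }

    // Hypothetical stand-in for a parser: Serializable itself, but it holds
    // a reference to the non-serializable helper above.
    static class ToyParser implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ThirdPartyHelper helper = new ThirdPartyHelper();
    }

    // Returns true if the whole object graph can be written with Java serialization.
    static boolean canSerialize(Object o) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            // Some object reachable from 'o' does not implement Serializable.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("String serializable:    " + canSerialize("hello"));
        System.out.println("ToyParser serializable: " + canSerialize(new ToyParser()));
    }
}
```

  Marking the helper field transient would let the write succeed, but then
the child process would receive a parser with a null helper, so that only
works when the field can be rebuilt on the other side.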

  My current alternative moves most of Tika to the child process, so the
client only needs tika-core and tika-serialization.  The client specifies a
directory where tika-app and optional dependencies live, and the child
process builds a Parser from that.
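
  A rough sketch of the idea, under stated assumptions: the client scans
the configured directory for jars and uses them as the child JVM's
classpath.  This is not the code from the proof of concept; the class
name ChildCommandSketch and the main class org.example.ChildMain are
hypothetical placeholders.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChildCommandSketch {

    // Joins every *.jar directly inside jarDir into one classpath string.
    static String buildClasspath(Path jarDir) throws IOException {
        List<String> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(jarDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toAbsolutePath().toString());
            }
        }
        Collections.sort(jars); // directory order is unspecified, so sort for stability
        return String.join(File.pathSeparator, jars);
    }

    // Builds the command line for the child JVM; mainClass is an assumption.
    static List<String> buildCommand(Path jarDir, String mainClass) throws IOException {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(buildClasspath(jarDir));
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) throws IOException {
        // Demo directory with empty placeholder jars; a real one would hold
        // tika-app and any optional dependencies.
        Path dir = Files.createTempDirectory("tika-bin");
        Files.createFile(dir.resolve("tika-app.jar"));
        Files.createFile(dir.resolve("optional-dep.jar"));

        List<String> cmd = buildCommand(dir, "org.example.ChildMain");
        System.out.println(cmd);
        // new ProcessBuilder(cmd).start() would launch the child process.
    }
}
```

  With this shape, none of the jars in that directory need to be on the
client's own classpath; only the child JVM ever loads them.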

  The current alternative uses the RecursiveParserWrapper as the
(hard-coded) default, but I think we could fairly easily make this
configurable via tika-config.xml (ParserFactory).

  The current alternative uses a TextContentHandler, not XHTML.  Again, I
_think_ we could make this configurable via tika-config.

  My current proof of concept is strictly file-based; that should be easy
enough to fix.

  Anyhow, any and all feedback is welcome.