Configuring parsers and translators

Configuring parsers and translators

Nick Burch-2
Hi All

This came up in TIKA-1623, but I thought it might be better brought out to
the list for discussion

To configure parsers on a per-document basis, such as setting PDF spacing
tolerances, or telling Tesseract what language it should be OCRing for, we
have the *Config objects. You create one of these, use the setters to
configure it for your document, pop it onto the ParseContext, and it's
used when processing your document.
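
As a rough illustration of the per-document pattern described above, the
sketch below uses a small stand-in for the parse context: a map keyed by
class. SimpleContext, OcrConfig and setLanguage are hypothetical names
made up for this example, not Tika classes; in Tika itself the
corresponding step is ParseContext.set(Class, instance) with one of the
real *Config classes.

```java
import java.util.HashMap;
import java.util.Map;

public class PerDocumentConfigSketch {

    /** Hypothetical stand-in for a parse context: one config object per type. */
    static final class SimpleContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical per-document config, configured through setters. */
    static final class OcrConfig {
        private String language = "eng";

        void setLanguage(String language) { this.language = language; }
        String getLanguage() { return language; }
    }

    /** What a parser would do: look up its config, falling back to defaults. */
    static String languageFor(SimpleContext context) {
        return context.get(OcrConfig.class, new OcrConfig()).getLanguage();
    }

    public static void main(String[] args) {
        OcrConfig config = new OcrConfig();
        config.setLanguage("fra");

        SimpleContext context = new SimpleContext();
        context.set(OcrConfig.class, config);

        System.out.println(languageFor(context));             // document with its own config
        System.out.println(languageFor(new SimpleContext())); // document using defaults
    }
}
```

The point of keying on the class is that each parser only ever asks for
its own config type, so unrelated settings can share one context.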

To configure parsers and translators on a per-JVM basis, to apply to all
documents processed, it's a bit less consistent. At least some look for a
properties file with a specific name, usually in the tika namespace, and
grab their settings / keys / etc out of that. At least some expect to find
a *Config with their program path on it, even though that remains constant
between documents. None of them support getting their settings from the
Tika Config
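
For comparison, the properties-file style of per-JVM configuration looks
roughly like the sketch below. The resource name
org/example/example-parser.properties and the programPath key are
placeholders chosen for illustration; each real parser or translator
picks its own file name and keys, which is exactly the inconsistency
described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PerJvmPropertiesSketch {

    /** Loads a named properties file from the classpath; empty if it is absent. */
    static Properties load(String resourceName, ClassLoader loader) {
        Properties props = new Properties();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical resource name and key, for illustration only.
        Properties props = load("org/example/example-parser.properties",
                PerJvmPropertiesSketch.class.getClassLoader());
        String path = props.getProperty("programPath", "");
        System.out.println(path.isEmpty() ? "using default program path" : path);
    }
}
```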


As part of our evolution of parser preferences, we're moving towards
people either being able to set their preferences in code, or being able
to supply a Tika Config xml which sets their parser preferences or
overrides certain bits of the default. The code option works for people
who want to declare certain specific things, the Tika Config one gives the
same functionality but allows a consistent and clean way to set it between
Tika App, Tika Server and Java code.

Another related example is the External Parser support. Because you can
have multiple External Parser instances in your setup, one per format /
program, we look for all the
org/apache/tika/parser/external/tika-external-parsers.xml files on the
classpath, and create parser instances based on definitions in there
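
The "look for all the files on the classpath" step can be sketched with
the standard ClassLoader.getResources call, which returns every resource
with a given name rather than just the first one. This is only a sketch
of the lookup under that assumption, not the actual External Parser
loading code; ClasspathScanSketch and findAll are names made up for the
example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ClasspathScanSketch {

    /** Returns every classpath resource with the given name, possibly none. */
    static List<URL> findAll(String resourceName, ClassLoader loader) {
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "org/apache/tika/parser/external/tika-external-parsers.xml";
        List<URL> found = findAll(name, ClasspathScanSketch.class.getClassLoader());
        System.out.println(found.size() + " definition file(s) found");
        for (URL url : found) {
            // Each file would then be parsed into one or more parser instances.
            System.out.println("  " + url);
        }
    }
}
```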


What do we think about setting executable paths and keys/logins for
parsers like OCR, Strings, Translators etc? Always on ParseContext?
Properties? Custom xml config? Tika config xml? Other? Combination?

Nick

Re: Configuring parsers and translators

Nick Burch-2
Anyone have any thoughts on this?

On Fri, 8 May 2015, Nick Burch wrote:

> [snip]

Re: Configuring parsers and translators

Tyler Palsulich
Hi Nick,

I've been mulling this over since you sent the first message. But, I'm
afraid I don't have a good solution or developed ideas.

I agree, it would be very nice to consolidate all configuration for all
parsers in the server and app.

Is it feasible to put everything into tika-config? Then Parser
implementations would read the config to pull out their own configuration.
Or, would it be better to keep some configuration separate? Documentation
would be an issue if every parser defines its own metadata keys... But, it
might be an improvement since we don't have "free form" properties and
configuration files.

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <[hidden email]> wrote:

> [snip]

Re: Configuring parsers and translators

Mattmann, Chris A (3010)
I think it would be great to have all this in the Tika Config.

The one thing then is to provide an example default config and
to make it *hugely* clear rather than all the levels of indirection
that we currently have going on which makes it super hard when
there is a config error (SPI, swallowing print messages, etc.)


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Saturday, June 6, 2015 at 3:45 PM
To: "[hidden email]" <[hidden email]>
Subject: Re: Configuring parsers and translators

>[snip]


Re: Configuring parsers and translators

Tyler Palsulich
(Devil's advocate hat slightly on.) My one hesitation about putting it all
into tika-config is that the default might get to be a monstrosity --
difficult for new users to use.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
[hidden email]> wrote:

> [snip]

Re: Configuring parsers and translators

Mattmann, Chris A (3010)
Hey Tyler,

I hear you, but balance that against all the hidden things here
and there, and everywhere, that I constantly keep discovering and
having to pore through lines of TikaConfig - service loaders, class
loaders.

When things work right - no problem. When something goes wrong;
HUGE waste of time.

Cheers,
Chris


-----Original Message-----
From: Tyler Palsulich <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Saturday, June 6, 2015 at 3:59 PM
To: "[hidden email]" <[hidden email]>
Subject: Re: Configuring parsers and translators

>[snip]


Re: Configuring parsers and translators

Nick Burch-2
In reply to this post by Tyler Palsulich
On Sat, 6 Jun 2015, Tyler Palsulich wrote:
> (Devil's advocate hat slightly on.) My one hesitation about putting it
> all into tika-config is that the default might get to be a monstrosity
> -- difficult for new users to use.

Assuming you don't want any translators, and have no non-standard paths to
external parsers, and are happy with default parser orderings, then your
default config would be:

<properties/>

(The plan so far remains with using the service loader to find parsers,
detectors and friends, with the config just being used when you want to
override parsers or parser orderings)
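
To make the "empty by default" idea concrete, here is a small sketch
that reads a config document and lists only the parsers it explicitly
mentions. An empty <properties/> yields no overrides, so everything
would still come from the service loader. The <parsers>/<parser class>
layout and the org.example.MyParser class name are assumptions used for
illustration, and overriddenParsers is a made-up helper, not TikaConfig
code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ConfigOverridesSketch {

    /** Returns the class names of parsers explicitly listed in the config XML. */
    static List<String> overriddenParsers(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("parser");
            List<String> classes = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read config", e);
        }
    }

    public static void main(String[] args) {
        // Minimal config: nothing overridden, defaults apply.
        System.out.println(overriddenParsers("<properties/>"));

        // Hypothetical config that names one parser explicitly.
        System.out.println(overriddenParsers(
                "<properties><parsers>"
                + "<parser class=\"org.example.MyParser\"/>"
                + "</parsers></properties>"));
    }
}
```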


My main worry with putting it all into config xml is that we accidentally
end up re-inventing Spring badly...

Nick

RE: Configuring parsers and translators

Allison, Timothy B.
In reply to this post by Mattmann, Chris A (3010)
Tyler, I see your devil's advocate point.  

I strongly agree with Chris about the benefit of centralizing configuration and making it easy to dump and modify the TikaConfig file.

Even though the TikaConfig file might get ugly, it would be far better to have everything nailed down there than searching through service loaders...IMHO.

I opened TIKA-1508 a while ago and haven't had any time to work on it...this just deals with simple parameter settings for parsers, not the far more difficult/interesting stuff that we've discussed with composite parsers.

>> My main worry with putting it all into config xml is that we accidentally end up re-inventing Spring badly...

Yeah, or re-inventing Solr's parameter loading as my example does... :(

I think that basic parameter setting should at least be fairly trivial to code...time allowing...argh.


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[hidden email]]
Sent: Saturday, June 06, 2015 7:01 PM
To: [hidden email]
Subject: Re: Configuring parsers and translators

[snip]


Re: Configuring parsers and translators

Tyler Palsulich
It seems like there are two goals here, both aiming to centralize
configuration:

1. Provide an easy mechanism to configure which parsers to use when
(TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for
example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.

Tyler

On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. <[hidden email]>
wrote:

> Tyler, I see your devil's advocate point.
>
> I strongly agree with Chris about the benefit of centralizing
> configuration and making it easy to dump and modify the TikaConfig file.
>
> Even though the TikaConfig file might get ugly, it would be far better to
> have everything nailed down there than searching through service
> loaders...IMHO.
>
> I opened TIKA-1508 a while ago and haven't had any time to work on
> it...this just deals with simple parameter settings for parsers, not the
> far more difficult/interesting stuff that we've discussed with composite
> parsers.
>
> >> My main worry with putting it all into config xml is that we accidently
> end up re-inventing spring badly...
>
> Yeah, or re-inventing Solr's parameter loading as my example does... :(
>
> I think that basic parameter setting should at least be fairly trivial to
> code...time allowing...argh.
>
>
> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:[hidden email]]
> Sent: Saturday, June 06, 2015 7:01 PM
> To: [hidden email]
> Subject: Re: Configuring parsers and translators
>
> Hey Tyler,
>
> I hear you, but balance that against all the hidden things here
> and there, and everywhere, that I constantly keep discovering and
> having to pour through lines of TikaConfig - service loaders, class
> loaders.
>
> When things work right - no problem. When something goes wrong;
> HUGE waste of time.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [hidden email]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
> -----Original Message-----
> From: Tyler Palsulich <[hidden email]>
> Reply-To: "[hidden email]" <[hidden email]>
> Date: Saturday, June 6, 2015 at 3:59 PM
> To: "[hidden email]" <[hidden email]>
> Subject: Re: Configuring parsers and translators
>
> >(Devil's advocate hat slightly on.) My one hesitation about putting it all
> >into tika-config is that the default might get to be a monstrosity --
> >difficult for new users to use.
> >
> >Tyler
> >
> >On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
> >[hidden email]> wrote:
> >
> >> I think it would be great to have all this in the Tika Config.
> >>
> >> The one thing then is to provide an example default config and
> >> to make it *hugely* clear rather than all the levels of indirection
> >> that we currently have going on which makes it super hard when
> >> there is a config error (SPI, swallowing print messages, etc.)
> >>
> >>
> >> -----Original Message-----
> >> From: Tyler Palsulich <[hidden email]>
> >> Reply-To: "[hidden email]" <[hidden email]>
> >> Date: Saturday, June 6, 2015 at 3:45 PM
> >> To: "[hidden email]" <[hidden email]>
> >> Subject: Re: Configuring parsers and translators
> >>
> >> >Hi Nick,
> >> >
> >> >I've been mulling this over since you sent the first message. But, I'm
> >> >afraid I don't have a good solution or developed ideas.
> >> >
> >> >I agree, it would be very nice to consolidate all configuration for all
> >> >parsers in the server and app.
> >> >
> >> >Is it feasible to put everything into tika-config? Then Parser
> >> >implementations would read the config to pull out their own
> >> >configuration. Or, would it be better to keep some configuration
> >> >separate? Documentation would be an issue if every parser defines its
> >> >own metadata keys... But, it might be an improvement since we don't
> >> >have "free form" properties and configuration files.
> >> >
> >> >Tyler
> >> >
> >> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <[hidden email]> wrote:
> >> >
> >> >> Anyone have any thoughts on this?
> >> >>
> >> >> On Fri, 8 May 2015, Nick Burch wrote:
> >> >> > Hi All
> >> >> >
> >> >> > This came up in TIKA-1623, but I thought it might be better brought
> >> >> > out to the list for discussion
> >> >> >
> >> >> > To configure parsers on a per-document basis, such as setting PDF
> >> >> > spacing tolerances, or telling Tesseract what language it should be
> >> >> > OCRing for, we have the *Config objects. You create one of these,
> >> >> > use the setters to configure it for your document, pop it onto the
> >> >> > Parse context and it's used when processing your document
> >> >> >
> >> >> > To configure parsers and translators on a per-JVM basis, to apply
> >> >> > to all documents processed, it's a bit less consistent. At least
> >> >> > some look for a properties file with a specific name, usually in
> >> >> > the tika namespace, and grab their settings / keys / etc out of
> >> >> > that. At least some expect to find a *Config with their program
> >> >> > path on it, even though that remains constant between documents.
> >> >> > None of them support getting their settings from the Tika Config
> >> >> >
> >> >> >
> >> >> > As part of our evolution of parser preferences, we're moving
> >> >> > towards people either being able to set their preferences in code,
> >> >> > or being able to supply a Tika Config xml which sets their parser
> >> >> > preferences or overrides certain bits of the default. The code
> >> >> > option works for people who want to declare certain specific
> >> >> > things, the Tika Config one gives the same functionality but allows
> >> >> > a consistent and clean way to set it between Tika App, Tika Server
> >> >> > and java code.
> >> >> >
> >> >> > Another related example is the External Parser support. Because you
> >> >> > can have multiple External Parser instances in your setup, one per
> >> >> > format / program, we look for all the
> >> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on
> >> >> > the classpath, and create parser instances based on definitions in
> >> >> > there
> >> >> >
> >> >> >
> >> >> > What do we think about setting executable paths and keys/logins for
> >> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> >> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >> >> >
> >> >> > Nick
> >> >> >
> >> >>
> >>
> >>
>
>

RE: Configuring parsers and translators

Allison, Timothy B.
Agreed.  They are two separate but related issues.  TIKA-1508 should be fairly straightforward.  Should I start coding it?  Any other recommendations/concerns?




RE: Configuring parsers and translators

Nick Burch-2
On Mon, 15 Jun 2015, Allison, Timothy B. wrote:
> Agreed.  They are two separate but related issues.  TIKA-1508 should be
> fairly straightforward.  Should I start coding it?  Any other
> recommendations/concerns?

My personal view is that properties/configuration which apply to all
documents of a type should be set at Parser creation time, either from a
Tika Config object or someone in code doing "Parser p = new FooParser();
p.setblah();". Properties/config which vary from document to document
should be set on the ParseContext

Not sure if we had consensus on that as a policy though?
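
For concreteness, here is a minimal sketch of that split in Java. The classes below (FooConfig, SketchParseContext, FooParser) are hypothetical stand-ins written only for this illustration, not Tika's actual classes, and the executable path and language values are made up. The point is just where each kind of setting lives: the path is set once on the parser, the language travels with the document's context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-document config object (stand-in for a *Config class).
class FooConfig {
    private String language = "eng";
    String getLanguage() { return language; }
    void setLanguage(String language) { this.language = language; }
}

// Hypothetical, simplified context: a type-keyed bag of per-document objects.
class SketchParseContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical parser: the executable path is fixed at creation time,
// while the language is read from the context on every parse call.
class FooParser {
    private String executablePath = "/usr/bin/foo";

    void setExecutablePath(String executablePath) { this.executablePath = executablePath; }

    // Returns the command line it would run, so the effect of both
    // kinds of setting is visible without invoking anything.
    String describeParse(String documentName, SketchParseContext context) {
        FooConfig config = context.get(FooConfig.class, new FooConfig());
        return executablePath + " --lang " + config.getLanguage() + " " + documentName;
    }
}

class ConfigPolicySketch {
    static String run() {
        // "All documents of this type": set once, when the parser is created.
        FooParser parser = new FooParser();
        parser.setExecutablePath("/opt/foo/bin/foo");

        // "This document only": set on the context passed to this parse.
        FooConfig config = new FooConfig();
        config.setLanguage("deu");
        SketchParseContext context = new SketchParseContext();
        context.set(FooConfig.class, config);

        return parser.describeParse("letter.pdf", context);
    }
}
```

With this shape, a Tika Config loader and hand-written code would both end up calling the same setters on the parser, and nothing that is constant across documents needs to be put on the context.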


In terms of TIKA-1508, any chance you could pick two parsers which are
currently configured somehow, and update the issue to show how they are
configured now, and how you'd see them being configured in Tika Config? I
think it might be easier to review with some concrete cases, rather than
the abstract idea we have now.

Nick

Re: Configuring parsers and translators

Mattmann, Chris A (3010)
We also need to be mindful of Tika app and server, where there is no current way to see config other than the Tika config file, and there are multiple conflicting ways to set it...

Sent from my iPhone


Re: Configuring parsers and translators

Konstantin Gribov
In reply to this post by Allison, Timothy B.
I think there's a third concern that should be taken into account: dynamic
configuration (e.g. based on metadata, like a password provider on a
per-document basis).
Currently you can only inject dynamically configurable behavior via
ParseContext, but that adds complexity to recursive parser implementations.
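
To illustrate what "dynamic" means here, a small Java sketch follows. SketchPasswordProvider and the "resourceName" metadata key are assumptions made for this example only; the sketch shows a callback that picks a password from the metadata of whichever document is being parsed, rather than a fixed value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical callback: given one document's metadata, return its password
// (or null if none is known). Not a real Tika interface.
interface SketchPasswordProvider {
    String getPassword(Map<String, String> metadata);
}

class PasswordLookupSketch {
    // Builds a provider that looks the password up by the document's
    // resource name, which is assumed to be present in the metadata.
    static SketchPasswordProvider byResourceName(Map<String, String> passwordsByName) {
        return metadata -> passwordsByName.get(metadata.get("resourceName"));
    }

    // Simulates what a parser would do for a single document: build the
    // metadata for that document and ask the provider.
    static String passwordFor(String resourceName, SketchPasswordProvider provider) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", resourceName);
        return provider.getPassword(metadata);
    }
}
```

In a recursive parse the same provider would presumably have to be handed down and consulted again for each embedded document with that document's own metadata, which is where the extra complexity comes from.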

--
Best regards,
Konstantin Gribov


Re: Configuring parsers and translators

Nick Burch-2
In reply to this post by Mattmann, Chris A (3010)
On Mon, 15 Jun 2015, Mattmann, Chris A (3980) wrote:
> We also need to be mindful of Tika app and server where there is no
> current way to see config other than Tika config file and multiple
> conflicting ways to set it...

Hmm? Both app and server optionally take a config file, don't they? And
both offer an option/flag/endpoint to tell you the parsers they found, the
detectors they found, parser decorations etc.

I'd say that the app and the server are the easiest ways to know what's
going on with your Tika install; it's the pure-Java case where it's harder
to know what you do or don't have!

Or have I misunderstood the use case?

Nick

Re: Configuring parsers and translators

Mattmann, Chris A (3010)
Hey Nick, I guess my point is that the parse context (i.e. config properties for parsers) and custom config files (e.g. x.properties loaded from the classpath) aren't configured from Tika app or server.

Sent from my iPhone


Re: Configuring parsers and translators

Nick Burch-2
On Mon, 15 Jun 2015, Mattmann, Chris A (3980) wrote:
> Hey Nick I guess my point is that parser context aka config properties
> for parsers and custom config files e.g., x.properties loaded from the
> classpath aren't configured from Tika app or server

Ah, good point. In my ideal world, you'd set the "all documents of this
kind" settings (eg paths) in the config, then set the "this document
only" settings (eg pdf column count, pdf inline image settings) via a
command line option to the app / request header to the server, converted
into ParseContext options[1]. That would then be largely the same as for
the pure-Java users.
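
A rough Java sketch of the request-header half of that idea is below. The "x-sketch-pdf-" prefix and the setting names are invented for illustration and are not existing Tika Server headers. The sketch only shows the conversion step: pick out the headers meant as per-document settings and turn them into a plain map that could then be applied to a per-document config object.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class HeaderSettingsSketch {
    // Hypothetical prefix marking a header as a per-document PDF setting.
    static final String PREFIX = "x-sketch-pdf-";

    // Returns setting name -> value for every header carrying the prefix.
    // Header names are matched case-insensitively, so the returned setting
    // names are lower-cased; all other headers are ignored.
    static Map<String, String> perDocumentSettings(Map<String, String> requestHeaders) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : requestHeaders.entrySet()) {
            String name = header.getKey().toLowerCase(Locale.ROOT);
            if (name.startsWith(PREFIX)) {
                settings.put(name.substring(PREFIX.length()), header.getValue().trim());
            }
        }
        return settings;
    }
}
```

A command line option in the app could feed the same map, so both front ends would share one path into the per-document settings.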

Hopefully there aren't too many settings which are debatable as to what
they are!

Not sure how huge a tika config file this would all lead to...

I could see some value in properties files, for things that don't change
between machines but do need configuration, eg the mappings for external
parsers. Since it isn't obvious if you've missed one, I'm not sure we want
to use them heavily for customisations for paths etc


Also, since you mention having been caught out by missing jars or missing
service files, maybe we need to put something on the wiki about how to
check if you have what you expected? (IIRC we log if a parser can't be
found or can't be loaded, so mostly it's about how to enable that)
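
As one possible starting point for such a wiki page, here is a small Java sketch that lists the service registration files visible on the classpath for a given interface. It uses only the standard META-INF/services convention and plain JDK calls; the class name ServiceFileCheck is made up, and this is a diagnostic aid separate from whatever logging Tika itself does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class ServiceFileCheck {
    // Lists every META-INF/services/<interface> file the current class
    // loader can see. An empty list means no jar on the classpath
    // registers an implementation of that interface.
    static List<URL> serviceFilesFor(String serviceInterface) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources("META-INF/services/" + serviceInterface));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling serviceFilesFor("org.apache.tika.parser.Parser") and printing the result would show which jars contribute parser service files, which helps spot a missing jar or a service file lost during repackaging.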

Nick

[1] Do we have tickets for adding these in yet?