Customizing TikaConfig or rather getParser


Michael Wechner
Hi

We are currently using Tika and it works great so far, but now we would
like to have a getParser() method which doesn't depend on the mime-type
but rather on a node path or whatever. For example, we have different XML
documents within the filesystem, but they all have the same mime-type
application/xml, so the only way to differentiate is the path. Also, it
doesn't seem like one should override TikaConfig:

http://incubator.apache.org/tika/apidocs/org/apache/tika/config/TikaConfig.html

How do other people handle such situations?

Thanks

Michael

Re: Customizing TikaConfig or rather getParser

thorsten
On Tue, 2008-08-19 at 16:54 +0200, Michael Wechner wrote:
> Hi
>

Hi Michi,

> We are currently using Tika and it works great so far, but now we would
> like to have a getParser() method which doesn't depend on the mime-type
> but rather on a node path or whatever.

I suppose the doc-type would be a good criterion?

> For example we have different XML
> within the filesystem, but they all have the same mime-type
> application/xml, so the only way to differentiate is the path.

Or their doc-type?

> Also it
> doesn't seem like one should override TikaConfig
>
> http://incubator.apache.org/tika/apidocs/org/apache/tika/config/TikaConfig.html
>
> How do other people handle such situations?

I would reuse the config and create a config file
("/PathTo/myConfig.xml") like the following. I asked whether doc-type is a
possibility since it would make configuration much easier.

Instead of using the plain mime type I would use the doc type:

<parser name="parse-myDocType"
class="org.apache.tika.parser.docType.MyDocTypeParser">
  <mime>myDoctype</mime>
</parser>

and then from your code call
TikaConfig config = new TikaConfig("/PathTo/myConfig.xml");
Parser parser = config.getParser("myDoctype");
...

However, this is more about reusing the current code than finding a
definitive solution, but maybe somebody else has another idea.
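
One way to obtain such a doc-type key is to read the name of the XML root
element before choosing a parser. The following is only a sketch under that
assumption: DocTypeSniffer and rootElementName are hypothetical names, not
part of Tika, and the code relies solely on the JDK's StAX API.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of Tika): derives a doc-type key from XML. */
final class DocTypeSniffer {

    private DocTypeSniffer() {}

    /** Returns the local name of the root element of the given XML text. */
    static String rootElementName(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Sniffing only needs the first start tag, so DTD processing and
        // external entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Skip comments, processing instructions and whitespace
                    // until the first element starts.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        throw new IllegalArgumentException("no root element found");
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><invoice><total>10</total></invoice>";
        System.out.println(rootElementName(xml));
    }
}
```

The returned name could then serve as the lookup key in place of the mime
type, as in the getParser("myDoctype") call above.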

HTH

salu2

> Thanks
>
> Michael
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions


Re: Customizing TikaConfig or rather getParser

Michael Wechner
Thorsten Scherler wrote:
> On Tue, 2008-08-19 at 16:54 +0200, Michael Wechner wrote:
>  
>> Hi
>>
>>    
>
> Hi Michi,
>  

Hello Thorsten :-)
>  
>
> I would reuse the config and create a config file
> ("/PathTo/myConfig.xml") like the following. I asked whether doc-type is a
> possibility since it would make configuration much easier.
>
> Instead of using the plain mime type I would use the doc type:
>  

What exactly do you mean by doc type?
> <parser name="parse-myDocType"
> class="org.apache.tika.parser.docType.MyDocTypeParser">
>   <mime>myDoctype</mime>
> </parser>
>
> and then from your code call
> TikaConfig config = new TikaConfig("/PathTo/myConfig.xml");
> Parser parser = config.getParser("myDoctype");
>  

I think this is where the problem is, I mean the getParser(String) method.

I would like to override this method by implementing my own chain of
responsibility.

Hence I think it would be nice to enhance this by introducing a new method

TikaConfig.getParser(ParserSelector)

(similar to
http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#listFiles(java.io.FileFilter))

and ParserSelector would be an interface

(similar to http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileFilter.html)

WDYT?
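
The proposal might look roughly like the sketch below. Everything in it is
an assumption made for illustration: ParserSelector does not exist in Tika,
DocParser is a stand-in for Tika's Parser so that the example compiles on its
own, and the paths and parser names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Stand-in for Tika's Parser so the sketch compiles without Tika. */
interface DocParser {
    String name();
}

/** Hypothetical interface: returns a parser, or empty to pass the request on. */
interface ParserSelector {
    Optional<DocParser> select(String path, String mimeType);
}

/** Chain of responsibility: asks each selector in order, first answer wins. */
final class SelectorChain implements ParserSelector {
    private final List<ParserSelector> links = new ArrayList<>();

    SelectorChain add(ParserSelector link) {
        links.add(link);
        return this;
    }

    @Override
    public Optional<DocParser> select(String path, String mimeType) {
        for (ParserSelector link : links) {
            Optional<DocParser> hit = link.select(path, mimeType);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}

final class SelectorDemo {
    static SelectorChain build() {
        DocParser sitetree = () -> "sitetree-parser";
        DocParser genericXml = () -> "generic-xml-parser";
        return new SelectorChain()
            // Most specific rule first: select by path.
            .add((path, mime) -> path.startsWith("/config/sitetree")
                    ? Optional.of(sitetree) : Optional.empty())
            // Fallback rule: select by mime type only.
            .add((path, mime) -> "application/xml".equals(mime)
                    ? Optional.of(genericXml) : Optional.empty());
    }

    public static void main(String[] args) {
        SelectorChain chain = build();
        System.out.println(chain.select("/config/sitetree.xml", "application/xml").get().name());
        System.out.println(chain.select("/docs/index.xml", "application/xml").get().name());
    }
}
```

Ordering the selectors from most to least specific lets a path rule take
precedence over the mime-type rule, while the caller only ever sees the
ParserSelector interface.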

Thanks

Michael

> ...
>
> However, this is more about reusing the current code than finding a
> definitive solution, but maybe somebody else has another idea.
>
> HTH
>
> salu2
>
>  
>> Thanks
>>
>> Michael
>>    


Re: Customizing TikaConfig or rather getParser

Jukka Zitting
Hi,

On Mon, Aug 25, 2008 at 9:06 AM, Michael Wechner
<[hidden email]> wrote:
> I think this is where the problem is, I mean the getParser(String) method.
>
> I would like to override this method by implementing my own chain of
> responsibility.

How about the following:

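    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.CompositeParser;
    import org.apache.tika.parser.Parser;

    public class MyCustomParser extends CompositeParser {

        public MyCustomParser() throws TikaException {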
            setConfig(TikaConfig.getDefaultConfig());
            // or whatever config you want
        }

        protected Parser getParser(Metadata metadata) {
            // Custom code to select an appropriate parser
            // based on the input metadata (mime type,
            // document path, whatever) passed by the client.
            // Or fallback to:
            return super.getParser(metadata);
        }

    }

Your client code would then look like:

    private Parser parser = new MyCustomParser();

    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "application/xml");
    // plus whatever other metadata you need in MyCustomParser

    parser.parse(stream, handler, metadata);

One of my design goals for the current Parser interface was that
you can encapsulate this sort of functionality inside it.

BR,

Jukka Zitting

Re: Customizing TikaConfig or rather getParser

Michael Wechner
Jukka Zitting wrote:

> Hi,
>
> On Mon, Aug 25, 2008 at 9:06 AM, Michael Wechner
> <[hidden email]> wrote:
>  
>> I think this is where the problem is, I mean the getParser(String) method.
>>
>> I would like to override this method by implementing my own chain of
>> responsibility.
>>    
>
> How about the following:
>
>     public class MyCustomParser extends CompositeParser {
>
>         public MyCustomParser() throws TikaException {
>             setConfig(TikaConfig.getDefaultConfig());
>             // or whatever config you want
>         }
>
>         protected Parser getParser(Metadata metadata) {
>             // Custom code to select an appropriate parser
>             // based on the input metadata (mime type,
>             // document path, whatever) passed by the client.
>             // Or fallback to:
>             return super.getParser(metadata);
>         }
>
>     }
>
> Your client code would then look like:
>
>     private Parser parser = new MyCustomParser();
>
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>     // plus whatever other metadata you need in MyCustomParser
>
>     parser.parse(stream, handler, metadata);
>
> One of my design goals for the current Parser interface was that
> you can encapsulate this sort of functionality inside it.
>  

thanks for the suggestions. Will give it a try and keep you posted on my
findings.

Thanks

Michael
> BR,
>
> Jukka Zitting
>  


Re: Customizing TikaConfig or rather getParser

Michael Wechner
In reply to this post by Jukka Zitting
Jukka Zitting wrote:

> Hi,
>
> On Mon, Aug 25, 2008 at 9:06 AM, Michael Wechner
> <[hidden email]> wrote:
>  
>> I think this is where the problem is, I mean the getParser(String) method.
>>
>> I would like to override this method by implementing my own chain of
>> responsibility.
>>    
>
> How about the following:
>
>     public class MyCustomParser extends CompositeParser {
>
>         public MyCustomParser() throws TikaException {
>             setConfig(TikaConfig.getDefaultConfig());
>             // or whatever config you want
>         }
>
>         protected Parser getParser(Metadata metadata) {
>             // Custom code to select an appropriate parser
>             // based on the input metadata (mime type,
>             // document path, whatever) passed by the client.
>             // Or fallback to:
>             return super.getParser(metadata);
>         }
>
>     }
>
> Your client code would then look like:
>
>     private Parser parser = new MyCustomParser();
>
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>     // plus whatever other metadata you need in MyCustomParser
>
>     parser.parse(stream, handler, metadata);
>
> One of my design goals for the current Parser interface was that
> you can encapsulate this sort of functionality inside it.
>  

This seems to work for our use case, but it seems to me that the actual
problem is just transferred one step further down.

I think it would be better to separate the actual parser selection (via
chain of responsibility) from passing in metadata.

Cheers

Michael
> BR,
>
> Jukka Zitting
>  


Re: Customizing TikaConfig or rather getParser

Jukka Zitting
Hi,

On Thu, Sep 4, 2008 at 11:31 AM, Michael Wechner
<[hidden email]> wrote:
> This seems to work for our use case, but it seems to me that the actual
> problem is just transferred one step further down.

"There are few problems in computer science that can not be solved by
adding another level of indirection." -Tom Christiansen

> I think it would be better to separate the actual parser selection (via
> chain of responsibility) from passing in metadata.

The way I see it, an application should ideally only deal with a
single Parser instance, that would be smart enough to select the
appropriate parsing mechanism for each incoming document based on the
associated metadata.

The reason for making the Metadata object a modifiable input/output
parameter (instead of just a return value) of the parse() method was
that a client application could feed extra metadata to the parsing
process. In your use case that extra metadata would be the path of the
document.
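
To illustrate the input/output role of the metadata, here is a self-contained
sketch. It does not use Tika: SketchParser, PathAwareParser and the keys
resourcePath and parsedBy are hypothetical stand-ins, with a plain Map playing
the part of Tika's Metadata object.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for a parser whose metadata argument is both input and output. */
interface SketchParser {
    void parse(String content, Map<String, String> metadata);
}

/** Single entry point that picks a delegate from the client-supplied metadata. */
final class PathAwareParser implements SketchParser {
    static final String PATH = "resourcePath";   // input, set by the client
    static final String PARSED_BY = "parsedBy";  // output, set by the delegate

    private final Map<String, SketchParser> byPrefix = new LinkedHashMap<>();
    private final SketchParser fallback;

    PathAwareParser(SketchParser fallback) {
        this.fallback = fallback;
    }

    PathAwareParser register(String pathPrefix, SketchParser parser) {
        byPrefix.put(pathPrefix, parser);
        return this;
    }

    @Override
    public void parse(String content, Map<String, String> metadata) {
        String path = metadata.getOrDefault(PATH, "");
        // The first registered prefix that matches the path decides.
        for (Map.Entry<String, SketchParser> entry : byPrefix.entrySet()) {
            if (path.startsWith(entry.getKey())) {
                entry.getValue().parse(content, metadata);
                return;
            }
        }
        fallback.parse(content, metadata);
    }
}

final class MetadataDemo {
    static Map<String, String> run(String path) {
        SketchParser news = (content, meta) -> meta.put(PathAwareParser.PARSED_BY, "news");
        SketchParser generic = (content, meta) -> meta.put(PathAwareParser.PARSED_BY, "generic");
        PathAwareParser parser = new PathAwareParser(generic).register("/news/", news);

        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/xml");
        metadata.put(PathAwareParser.PATH, path);  // extra input metadata
        parser.parse("<doc/>", metadata);
        return metadata;                           // now also holds output metadata
    }

    public static void main(String[] args) {
        System.out.println(run("/news/2008/08/item.xml").get(PathAwareParser.PARSED_BY));
        System.out.println(run("/misc/other.xml").get(PathAwareParser.PARSED_BY));
    }
}
```

Both documents carry the same content type here; only the path entry in the
metadata makes the single parser instance choose a different delegate.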

BR,

Jukka Zitting

Re: Customizing TikaConfig or rather getParser

Michael Wechner
Jukka Zitting wrote:

> Hi,
>
> On Thu, Sep 4, 2008 at 11:31 AM, Michael Wechner
> <[hidden email]> wrote:
>  
>> This seems to work for our use case, but it seems to me that the actual
>> problem is just transferred one step further down.
>>    
>
> "There are few problems in computer science that can not be solved by
> adding another level of indirection." -Tom Christiansen
>
>  
>> I think it would be better to separate the actual parser selection (via
>> chain of responsibility) from passing in metadata.
>>    
>
> The way I see it, an application should ideally only deal with a
> single Parser instance, that would be smart enough to select the
> appropriate parsing mechanism for each incoming document based on the
> associated metadata.
>  

I am afraid that this makes the parsers less usable, but of course we
could introduce a meta-parser and then re-use the actual data parsers.
But then again one might have to ask why the mime-type is handled as a
special case ;-)
> The reason for making the Metadata object a modifiable input/output
> parameter (instead of just a return value) of the parse() method was
> that a client application could feed extra metadata to the parsing
> process. In your use case that extra metadata would be the path of the
> document.
>  

this is how we are now using it.

Thanks

Michael
> BR,
>
> Jukka Zitting
>  


Re: Customizing TikaConfig or rather getParser

Jukka Zitting
Hi,

On Thu, Sep 4, 2008 at 1:50 PM, Michael Wechner
<[hidden email]> wrote:
> Jukka Zitting wrote:
>> The way I see it, an application should ideally only deal with a
>> single Parser instance, that would be smart enough to select the
>> appropriate parsing mechanism for each incoming document based on the
>> associated metadata.
>
> I am afraid that this makes the parsers less usable, but of course we could
> introduce a meta-parser and then re-use the actual data parsers.

That's pretty much what the AutoDetectParser and CompositeParser
classes are designed to do.

BR,

Jukka Zitting

Re: Customizing TikaConfig or rather getParser

Michael Wechner
Jukka Zitting wrote:

> Hi,
>
> On Thu, Sep 4, 2008 at 1:50 PM, Michael Wechner
> <[hidden email]> wrote:
>  
>> Jukka Zitting wrote:
>>    
>>> The way I see it, an application should ideally only deal with a
>>> single Parser instance, that would be smart enough to select the
>>> appropriate parsing mechanism for each incoming document based on the
>>> associated metadata.
>>>      
>> I am afraid that this makes the parsers less usable, but of course we could
>> introduce a meta-parser and then re-use the actual data parsers.
>>    
>
> That's pretty much what the AutoDetectParser and CompositeParser
> classes are designed to do.
>  

ok, thanks for this info

Michael
> BR,
>
> Jukka Zitting
>