Re: [EXTERNAL] Tika Python questions

Chris Mattmann
Hi,

 

Thanks for your question. Yes, you would do it for Tika-Server the same way you set the byte size property
in Tika-App (through parser configuration, I think). Start the Tika Server yourself with a custom config
file that sets this property, on the default port (making sure any other instances are killed first).
Tika-Python will then use your own Tika Server with the custom config.
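
A minimal sketch of that setup from the Python side, under stated assumptions. The jar and config file names passed in are hypothetical. `--config` and `--port` are assumed to be the tika-server 1.x command-line options, and `TIKA_CLIENT_ONLY` / `TIKA_SERVER_ENDPOINT` are assumed to be the environment variables that tika-python reads to talk to an already running server; verify both against the versions you have installed. The sketch says nothing about which config property raises the limit.

```python
DEFAULT_PORT = 9998  # the default Tika Server port


def tika_server_command(jar_path, config_path, port=DEFAULT_PORT):
    """Build the command line that starts Tika Server with a custom config.

    Assumption: tika-server 1.x accepts --config and --port.
    """
    return ["java", "-jar", jar_path, "--config", config_path, "--port", str(port)]


def client_only_env(port=DEFAULT_PORT, host="localhost"):
    """Environment variables asking tika-python to reuse a running server.

    Assumption: tika-python honours TIKA_CLIENT_ONLY and TIKA_SERVER_ENDPOINT.
    """
    return {
        "TIKA_CLIENT_ONLY": "True",
        "TIKA_SERVER_ENDPOINT": f"http://{host}:{port}",
    }
```

The returned list can be handed to `subprocess.Popen`, and the returned dict applied with `os.environ.update(...)` before `tika` is imported, since the library is assumed to read these variables at import time.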

 

As for catching errors, Tika-Python tries its best to do that, but it does not catch all of them. If you find
something it doesn’t catch, let us know and we will work to fix it.
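
One way to look for problems on the caller's side is to inspect the result that comes back. The sketch below is an illustration, not Tika-Python's own error handling. It assumes the parse result is a dict with `status`, `content` and `metadata` keys, and that parse exceptions, when reported, appear under metadata keys starting with `X-TIKA:EXCEPTION:`. The helper name `parse_problems` is hypothetical.

```python
def parse_problems(parsed):
    """Return a list of human-readable problems found in a parse result.

    Assumption: `parsed` looks like
    {"status": int, "content": str or None, "metadata": dict or None}.
    """
    problems = []

    status = parsed.get("status")
    if status != 200:
        problems.append(f"server returned HTTP status {status}")

    if not (parsed.get("content") or "").strip():
        problems.append("no text content was extracted")

    metadata = parsed.get("metadata") or {}
    for key, value in metadata.items():
        # Assumption: exceptions are reported under X-TIKA:EXCEPTION:* keys.
        if key.startswith("X-TIKA:EXCEPTION:"):
            first_line = (str(value).splitlines() or [""])[0]
            problems.append(f"{key}: {first_line}")

    return problems
```

With tika-python installed, the helper could be applied as `parse_problems(parser.from_file("report.xls"))`, where `report.xls` is a placeholder file name. An empty list means none of the checks fired; it does not prove the file parsed completely.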

 

Thanks,

Chris

 

 

 

 

From: "[hidden email]" <[hidden email]>
Organization: Avident-IT
Date: Tuesday, October 8, 2019 at 6:06 AM
To: "Mattmann, Chris A (US 1761)" <[hidden email]>
Subject: [EXTERNAL] Tika Python questions

 

Hi

I have had the pleasure of testing the Tika-python library. I am testing it out in a new application that is being developed for customers.

It has very good performance, especially for parsing XLSX and XLS files.

 

However, I have two questions:

1. The Tika-Server only handles files up to a maximum byte size. I get this error:

org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1186956, but 1000000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

I have tried the Tika-App jar file from Python, and it does handle files larger than 1000000 bytes.

The Tika documentation says to set MaxBytes to -1 to override the limit and handle larger files.

Is there any way to do this via Tika-Python, that is, to set the maximum file size to unlimited as “Tika-App” handles it?

2. How can errors be caught via the Tika-python library, for example when files are encrypted or corrupt?
 

 

Kind regards

 

HANS MEIJER

 


Re: [EXTERNAL] Tika Python questions

Luís Filipe Nassif
I think it is not related to file size, but to the maximum record size handled by
POI. It is a protection against OutOfMemoryErrors. I increased this limit
to 10M because I was seeing many of them. I do not know if it is configurable
in Tika Server.

Regards,
Luis


Re: [EXTERNAL] Tika Python questions

Tim Allison
Yep, that's why we added those limits.

Hans, if you can send the full stacktrace that will allow me to see
what record type you're running into this with, we may be able to
increase it in POI before the next release.


Re: [EXTERNAL] Tika Python questions

Tim Allison
Thank you for this report!  I just bumped the max record length for a blob
by 10x in POI, which should be released fairly soon.

r1868211
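
To make the numbers concrete, here is a small illustration of the kind of length check involved. It is a sketch only, not POI's code: `RecordTooLarge` and `check_length` are hypothetical stand-ins, the old limit and the requested length come from the reported error message, and the new limit assumes "10x" means ten times the old value.

```python
class RecordTooLarge(Exception):
    """Hypothetical stand-in for POI's RecordFormatException."""


def check_length(length, max_length):
    """Refuse an allocation larger than the limit for the record type."""
    if length > max_length:
        raise RecordTooLarge(
            f"Tried to allocate an array of length {length}, "
            f"but {max_length} is the maximum for this record type."
        )
    return length


OLD_LIMIT = 1_000_000        # limit quoted in the reported error
NEW_LIMIT = OLD_LIMIT * 10   # assumed value after the 10x bump
REQUESTED = 1_186_956        # length from the reported stack trace
```

Under these assumptions, `check_length(REQUESTED, OLD_LIMIT)` raises, while `check_length(REQUESTED, NEW_LIMIT)` returns the length, so the reported blob would fit under the raised limit.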

On Wed, Oct 9, 2019 at 10:20 AM <[hidden email]> wrote:

> Hi,
> This is an "old" excel spreadsheet, .xls, that is causing it. If you would
> like to I can send that as well.
>
> I hope this gives you what you need from the tika-server stacktrace:
> INFO  rmeta/text (autodetecting type)
> WARN  Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation
> org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1186956, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
>         at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
>         at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
>         at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
>         at org.apache.poi.hpsf.Blob.read(Blob.java:33)
>         at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
>         at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
>         at org.apache.poi.hpsf.Property.<init>(Property.java:179)
>         at org.apache.poi.hpsf.Section.<init>(Section.java:241)
>         at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
>         at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
>         at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
>         at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>         at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
>         at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
>         at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
>         at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>         at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
>         at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>         at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>         at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>         at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>         at org.eclipse.jetty.server.Server.handle(Server.java:505)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>         at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
>         at java.lang.Thread.run(Thread.java:748)
>
> /Kind regards
> Hans

Re: [EXTERNAL] Tika Python questions

hans.meijer
Hi,

Sorry to bother you. I do see the commit, but is there any indication of when it will be released?

I assume it will be in a new version of Apache Tika. The current version seems to be 1.22, so would this be in 1.23?

 

Kind regards

Hans

 


Re: [EXTERNAL] Tika Python questions

Tim Allison
Sorry for the late reply. Once POI is released, we’ll probably roll out
1.23, probably in 3-4 weeks?

Fellow devs, WDYT?

On Mon, Oct 14, 2019 at 6:55 AM <[hidden email]> wrote:

> Hi,
>
> Sorry for disturbing, I do see the commit but any hints on when it can be
> released?
>
> I assume it will be a new version of Apache Tika, current version seems to
> be 1.22, so this would be in 1.23?
>
>
>
> Kind regards
>
> Hans
>
>
>
> *Från:* Tim Allison <[hidden email]>
> *Skickat:* den 10 oktober 2019 05:05
> *Till:* [hidden email]
> *Kopia:* <[hidden email]> <[hidden email]>
> *Ämne:* Re: [EXTERNAL] Tika Python questions
>
>
>
> Thank you for this report!  I just bumped the max record length for a blob
> by 10x in POI, which should be released fairly soon.
>
>
>
> r1868211
>
>
>
> On Wed, Oct 9, 2019 at 10:20 AM <[hidden email]> wrote:
>
> Hi,
> This is an "old" excel spreadsheet, .xls, that is causing it. If you would
> like to I can send that as well.
>
> I hope this gives you what you need from the tika-server stacktrace:
> INFO  rmeta/text (autodetecting type)
> WARN  Ignoring unexpected exception while parsing summary entry
> DocumentSummaryInformation
> org.apache.poi.util.RecordFormatException: Tried to allocate an array of
> length 1186956, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>         at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
>         at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
>         at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
>         at org.apache.poi.hpsf.Blob.read(Blob.java:33)
>         at
> org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
>         at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
>         at org.apache.poi.hpsf.Property.<init>(Property.java:179)
>         at org.apache.poi.hpsf.Section.<init>(Section.java:241)
>         at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
>         at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
>         at
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
>         at
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
>         at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
>         at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>         at
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
>         at
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
>         at
> org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
>         at
> org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>         at
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
>         at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>         at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>         at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>         at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>         at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>         at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>         at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>         at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>         at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>         at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>         at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>         at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>         at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>         at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>         at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>         at org.eclipse.jetty.server.Server.handle(Server.java:505)
>         at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>         at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>         at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>         at
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>         at
> org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>         at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
>         at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
>         at java.lang.Thread.run(Thread.java:748)
>
> /Kind regards
> Hans
>
> -----Original message-----
> From: Tim Allison <[hidden email]>
> Sent: October 9, 2019 14:04
> To: Luís Filipe Nassif <[hidden email]>
> Cc: <[hidden email]> <[hidden email]>;
> [hidden email]
> Subject: Re: [EXTERNAL] Tika Python questions
>
> Yep, that's why we added those limits.
>
> Hans, if you can send the full stacktrace that will allow me to see what
> record type you're running into this with, we may be able to increase it in
> POI before the next release.
>
> On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <[hidden email]>
> wrote:
> >
> > I think it is not related to file size, but to the maximum record
> > size handled by POI. It is a protection against OutOfMemoryErrors. I
> > increased this limit to 10M because I was seeing many of these
> > exceptions. I do not know if it is configurable in Tika Server.
> >
> > Regards,
> > Luis
> >
> > On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <[hidden email]>
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > > Thanks for your question. Yes, the same way you set the byte size
> > > > property in Tika-App (I think through parser configuration) is how
> > > > you would do it for Tika-Server. You would just start the Tika
> > > > Server yourself with a custom config file that sets this property,
> > > > on the default port (making sure any other instances were killed
> > > > first). Then Tika-Python will use your own Tika Server with the
> > > > custom config.
> > >
> > >
> > >
> > > > As for catching errors, it will try its best to do that, but it does
> > > > not catch all of them. If you find something it doesn’t catch, let
> > > > us know and we will work to fix it.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Chris
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > From: "[hidden email]" <[hidden email]>
> > > Organization: Avident-IT
> > > Date: Tuesday, October 8, 2019 at 6:06 AM
> > > To: "Mattmann, Chris A (US 1761)" <[hidden email]>
> > > Subject: [EXTERNAL] Tika Python questions
> > >
> > >
> > >
> > > Hi
> > >
> > > > I have had the pleasure of testing the Tika-python library. I am
> > > > testing it out in a new application that is being developed for
> > > > customers.
> > > >
> > > > It has very good performance, especially for parsing XLSX and XLS
> > > > files.
> > >
> > >
> > >
> > > However, I have two questions:
> > > > The Tika-Server handles only files with a maximum byte size. I get
> > > > this error:
> > > > org.apache.poi.util.RecordFormatException: Tried to allocate an
> > > > array of length 1186956, but 1000000 is the maximum for this record
> > > > type.
> > > > If the file is not corrupt, please open an issue on bugzilla to request
> > > > increasing the maximum allowable size for this record type.
> > > > As a temporary workaround, consider setting a higher override value
> > > > with IOUtils.setByteArrayMaxOverride()
> > >
> > > I have tried the Tika-App python (jar file) and it does handle the
> > > file size where files are larger than 1000000.
> > >
> > > In the Tika documentation it says to set MaxBytes to -1 to override
> > > and handle larger files.
> > >
> > > > Is there any way to handle this via Tika-Python, i.e. to set the max
> > > > file size to unlimited, as the “Tika-App” handles it?
> > >
> > >
> > > How is it possible to catch errors via the Tika-python library, like
> > > if files are encrypted, corrupt etc.?
> > >
> > >
> > >
> > >
> > > Kind regards
> > >
> > >
> > >
> > > HANS MEIJER
> > >
> > >
> > >
> > >
>
>
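Chris's suggestion quoted above (start Tika Server yourself with a custom config, then let Tika-Python talk to it) could be sketched roughly as follows. This is only an illustration under assumptions: the jar and config file names are hypothetical, the `--config` and `--port` flags are the ones Tika Server 1.x is generally documented to accept, and `TIKA_CLIENT_ONLY` / `TIKA_SERVER_ENDPOINT` are the environment variables Tika-Python is documented to read. The thread does not confirm which config property, if any, controls the POI record limit, so the contents of the config file are deliberately not shown.

```python
import os


def build_server_command(jar_path, config_path, port=9998):
    """Build a command line that starts Tika Server with a custom config.

    The flag names are an assumption based on Tika Server 1.x usage.
    """
    return [
        "java", "-jar", jar_path,
        "--config", config_path,
        "--port", str(port),
    ]


def client_environment(port=9998, base_env=None):
    """Return environment variables telling Tika-Python to reuse that server."""
    env = dict(os.environ if base_env is None else base_env)
    env["TIKA_CLIENT_ONLY"] = "True"  # ask Tika-Python not to spawn its own server
    env["TIKA_SERVER_ENDPOINT"] = "http://localhost:%d" % port
    return env


# Hypothetical file names, used only for illustration.
command = build_server_command("tika-server-1.22.jar", "my-tika-config.xml")
env = client_environment(base_env={})
print(" ".join(command))
print(env["TIKA_SERVER_ENDPOINT"])
```

The command list can be handed to `subprocess.Popen`, and the environment applied before `tika` is imported, so that the client finds the already running server instead of starting another one.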

Re: [EXTERNAL] Tika Python questions

hans.meijer
Thanks!

I appreciate the answer.

The reason I ask is that my development is somewhat blocked by this issue, so I am very interested in the release.

 

 

Kind regards

Hans

 

From: Tim Allison <[hidden email]>
Sent: October 14, 2019 13:53
To: [hidden email]
Cc: [hidden email]
Subject: Re: [EXTERNAL] Tika Python questions

 

Sorry for the late reply. Once POI is released, we’ll probably roll out 1.23...probably 3-4 weeks?

 

Fellow devs, WDYT?

 

On Mon, Oct 14, 2019 at 6:55 AM <[hidden email]> wrote:

Hi,

Sorry to disturb you. I do see the commit, but are there any hints on when it will be released?

I assume it will be in a new version of Apache Tika. The current version seems to be 1.22, so this would be in 1.23?

 

Kind regards

Hans

 

From: Tim Allison <[hidden email]>
Sent: October 10, 2019 05:05
To: [hidden email]
Cc: <[hidden email]> <[hidden email]>
Subject: Re: [EXTERNAL] Tika Python questions

 

Thank you for this report!  I just bumped the max record length for a blob by 10x in POI, which should be released fairly soon.

 

r1868211
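To make the numbers in this thread concrete, here is a minimal Python sketch (standard library only) that pulls the requested and maximum lengths out of the POI message quoted further down and checks whether a 10x higher limit would have been enough for this file. The function name `parse_record_limit` and the regular expression are illustrative assumptions, not part of POI or Tika.

```python
import re

# Pattern for the POI message seen in this thread; the wording is taken
# from the quoted stack trace and may differ in other POI versions.
_RFE_PATTERN = re.compile(
    r"Tried to allocate an array of length (\d+), "
    r"but (\d+) is the maximum for this record type"
)


def parse_record_limit(message):
    """Return (requested, maximum) as ints, or None if the text does not match."""
    match = _RFE_PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


message = (
    "org.apache.poi.util.RecordFormatException: Tried to allocate an array "
    "of length 1186956, but 1000000 is the maximum for this record type."
)
requested, maximum = parse_record_limit(message)
raised_maximum = maximum * 10  # the 10x bump described in this message
fits_after_bump = requested <= raised_maximum
print(requested, maximum, raised_maximum, fits_after_bump)
```

For the file reported here, the requested 1,186,956 bytes is well under a limit of 10,000,000, so the bump should cover it.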

 

On Wed, Oct 9, 2019 at 10:20 AM <[hidden email]> wrote:

Hi,
It is an "old" Excel spreadsheet (.xls) that is causing it. If you would like, I can send the file as well.

I hope this gives you what you need from the tika-server stacktrace:
INFO  rmeta/text (autodetecting type)
WARN  Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1186956, but 1000000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
        at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
        at org.apache.poi.hpsf.Blob.read(Blob.java:33)
        at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
        at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
        at org.apache.poi.hpsf.Property.<init>(Property.java:179)
        at org.apache.poi.hpsf.Section.<init>(Section.java:241)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
        at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
        at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
        at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:505)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
        at java.lang.Thread.run(Thread.java:748)

/Kind regards
Hans

-----Original message-----
From: Tim Allison <[hidden email]>
Sent: October 9, 2019 14:04
To: Luís Filipe Nassif <[hidden email]>
Cc: <[hidden email]> <[hidden email]>; [hidden email]
Subject: Re: [EXTERNAL] Tika Python questions

Yep, that's why we added those limits.

Hans, if you can send the full stacktrace that will allow me to see what record type you're running into this with, we may be able to increase it in POI before the next release.

On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <[hidden email]> wrote:

>
> I think it is not related to file size, but to the maximum record
> size handled by POI. It is a protection against OutOfMemoryErrors. I
> increased this limit to 10M because I was seeing many of these
> exceptions. I do not know if it is configurable in Tika Server.
>
> Regards,
> Luis
>
> On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <[hidden email]>
> wrote:
>
> > Hi,
> >
> >
> >
> > Thanks for your question. Yes, the same way you set the byte size
> > property in Tika-App (I think through parser configuration) is how
> > you would do it for Tika-Server. You would just start the Tika
> > Server yourself with a custom config file that sets this property,
> > on the default port (making sure any other instances were killed
> > first). Then Tika-Python will use your own Tika Server with the
> > custom config.
> >
> >
> >
> > As for catching errors, it will try its best to do that, but it does
> > not catch all of them. If you find something it doesn’t catch, let
> > us know and we will work to fix it.
> >
> >
> >
> > Thanks,
> >
> > Chris
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > From: "[hidden email]" <[hidden email]>
> > Organization: Avident-IT
> > Date: Tuesday, October 8, 2019 at 6:06 AM
> > To: "Mattmann, Chris A (US 1761)" <[hidden email]>
> > Subject: [EXTERNAL] Tika Python questions
> >
> >
> >
> > Hi
> >
> > I have had the pleasure of testing the Tika-python library. I am
> > testing it out in a new application that is being developed for customers.
> >
> > It has very good performance, especially for parsing XLSX and XLS files.
> >
> >
> >
> > However, I have two questions:
> > The Tika-Server handles only files with a maximum byte size. I get
> > this error:
> > org.apache.poi.util.RecordFormatException: Tried to allocate an
> > array of length 1186956, but 1000000 is the maximum for this record type.
> > If the file is not corrupt, please open an issue on bugzilla to request
> > increasing the maximum allowable size for this record type.
> > As a temporary workaround, consider setting a higher override value
> > with IOUtils.setByteArrayMaxOverride()
> >
> > I have tried the Tika-App python (jar file) and it does handle the
> > file size where files are larger than 1000000.
> >
> > In the Tika documentation it says to set MaxBytes to -1 to override
> > and handle larger files.
> >
> > Is there any way to handle this via Tika-Python, i.e. to set the max
> > file size to unlimited, as the “Tika-App” handles it?
> >
> >
> > How is it possible to catch errors via the Tika-python library, like
> > if files are encrypted, corrupt etc.?
> >
> >
> >
> >
> > Kind regards
> >
> >
> >
> > HANS MEIJER
> >
> >
> >
> >
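Regarding the question above about catching errors such as encrypted or corrupt files, one possible client-side check is sketched below. It assumes the result of a Tika-Python call is a dict with `status`, `content` and `metadata` keys, and that the server answers with a non-200 HTTP status when it cannot process a file; both points should be verified against the installed version. The helper name `classify_parse_result` and the sample dicts are hypothetical.

```python
def classify_parse_result(parsed):
    """Turn a Tika-Python style result dict into (ok, reason)."""
    status = parsed.get("status")
    content = parsed.get("content")
    if status is None:
        return False, "no response from server"
    if status != 200:
        return False, "server returned HTTP %d" % status
    if not content or not content.strip():
        return False, "no text extracted"
    return True, "ok"


# With the real library the dict would come from something like:
#   from tika import parser
#   parsed = parser.from_file("workbook.xls")
# Here two hand-written dicts stand in for real responses.
good = {"status": 200, "content": "Sheet1\nA1 B1", "metadata": {}}
bad = {"status": 422, "content": None, "metadata": None}
print(classify_parse_result(good))
print(classify_parse_result(bad))
```

Note that in the stack trace quoted in this thread the server logged the exception as a WARN and ignored it, so a case like that may still come back with status 200; such problems would only be visible in the server log, not through a check like this one.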