partial file parsing

partial file parsing

Baranee
Hi Tika-dev community,

I'm new to Tika. We are using AutoDetectParser (from Tika 0.9) to parse files and send the parsed contents to Solr. We are seeing severe performance problems when some large .xlsx, .docx and .pptx files are parsed. We have therefore decided to parse files only partially, for example the first 10 paragraphs of a document, the first 1000 words, or the first 2 MB of content.

Please let me know whether there is any way to tell Tika to parse only part of a file.

Regards,
Baranee

TikaInputStream customization

Baranee
Can anyone please let me know how to customize TikaInputStream to read only the first 1000 bytes from a given InputStream?

Regards,
Baranee

Re: TikaInputStream customization

Jukka Zitting
Hi,

On Wed, Jun 6, 2012 at 12:30 PM, K, Baraneetharan
<[hidden email]> wrote:
> Can anyone please let me know how to customize TikaInputStream to read only
> the first 1000 bytes from a given InputStream?

You can use the BoundedInputStream [1] class from Commons IO:

    TikaInputStream.get(new BoundedInputStream(stream, 1000));

However, see the concern in TIKA-307 [2]. Passing a truncated stream
to Tika may produce unexpected results.

[1] http://commons.apache.org/io/api-release/org/apache/commons/io/input/BoundedInputStream.html
[2] https://issues.apache.org/jira/browse/TIKA-307
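
To illustrate what a bounded stream does, here is a minimal sketch using only the JDK. The names BoundedReadSketch and LimitedInputStream are hypothetical and exist only for this example; it is not the Commons IO implementation, only the same idea: report end-of-stream after a fixed number of bytes. It uses readAllBytes(), so it assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class BoundedReadSketch {

    /** Hypothetical helper: exposes at most `limit` bytes of the wrapped stream. */
    static class LimitedInputStream extends FilterInputStream {
        private long remaining;

        LimitedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1; // pretend the stream has ended once the limit is used up
            }
            int b = super.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            // Never ask the underlying stream for more than the remaining budget.
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            // Skipped bytes count against the limit as well.
            long skipped = super.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public boolean markSupported() {
            return false; // keep the sketch simple: no mark/reset bookkeeping
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new LimitedInputStream(new ByteArrayInputStream(data), 4)) {
            String head = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
            System.out.println(head); // prints "0123"
        }
    }
}
```

Note that .docx, .xlsx and .pptx files are ZIP containers, and the ZIP central directory sits at the end of the file, so a stream cut off after the first N bytes is likely to be unreadable for those formats. This is one instance of the truncation concern mentioned above.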

BR,

Jukka Zitting

Re: TikaInputStream customization

Baranee
Thanks Jukka for your reply.

Can you please tell me how to use the beforeRead() method in TikaInputStream to set a read limit for reading bytes from a stream?

Baranee

Re: TikaInputStream customization

Jukka Zitting
Hi,

On Wed, Jun 6, 2012 at 2:15 PM, Baranee <[hidden email]> wrote:
> Can you please tell me how to use the beforeRead() method in TikaInputStream to
> set a read limit for reading bytes from a stream?

http://people.apache.org/~hossman/#xyproblem

Why do you want to use TikaInputStream like this?

BR,

Jukka Zitting
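
For the goal stated in the first post (indexing only the first part of a document's text), an alternative to truncating the input is to limit the output. Tika parsers report extracted text as SAX events, so a content handler can stop the parse once it has collected enough characters. As far as I know, Tika's BodyContentHandler offers a constructor taking a write limit for this purpose; whether it is available and sufficient in Tika 0.9 should be checked against that release. The sketch below shows only the underlying idea with the JDK's own SAX parser. WriteLimitSketch, LimitedTextHandler and LimitReachedException are hypothetical names, not Tika classes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker exception used to abort the parse on purpose. */
    static class LimitReachedException extends SAXException {
        LimitReachedException() {
            super("character limit reached");
        }
    }

    /** Collects text content and aborts once `limit` characters are stored. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                // Keep only what still fits, then stop the parser.
                text.append(ch, start, room);
                throw new LimitReachedException();
            }
            text.append(ch, start, length);
        }
    }

    /** Returns at most `limit` characters of the text content of `xml`. */
    static String extract(String xml, int limit) throws Exception {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (LimitReachedException e) {
            // Expected: the handler stopped the parse after reaching the limit.
        }
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>Hello world</p><p>second paragraph</p></doc>";
        System.out.println(extract(xml, 8)); // prints "Hello wo"
    }
}
```

A limit of this kind bounds how much text is handed to Solr, but it may not remove the parsing cost entirely: a parser that loads the whole document before emitting any events will still do that work.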