Parsing incomplete PDF and Office files

Milos Kovacevic
Hello,

I would like to download just the first few kilobytes of a PDF (or DOC) file
and extract the text from them. I do not want to download the whole file and
then parse it, only the truncated first N KB. Is this possible with Tika? If
not, how should I do it?

Regards, Milos

Re: Parsing incomplete PDF and Office files

Jukka Zitting
Hi,

On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <[hidden email]> wrote:
> I would like to download just the first few kilobytes of a PDF (or DOC) file
> and extract the text from them. I do not want to download the whole file and
> then parse it, only the truncated first N KB. Is this possible with Tika? If
> not, how should I do it?

That's currently not possible, but AFAIK there is support for
page-by-page streaming in PDFBox (for PDF documents that support that,
not all of them do). It would be nice if Tika could leverage that
functionality in PDFBox.

However, I'm not sure how well that would work with truncated streams.
I guess the reasonable approach would be to stream as much text as can
be parsed, and then fail with a TikaException if the input stream ends
unexpectedly. Your application would then need to be aware of this
error condition and handle it appropriately.
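
To illustrate the "stream what you can, then fail" pattern described above,
here is a minimal sketch. It does not use Tika or PDFBox: the record format,
the TextSink callback, and every class and method name in it are made up for
illustration, and EOFException stands in for the TikaException mentioned
above. It only shows how an application could keep the text emitted before a
truncated stream ran out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class TruncatedParseSketch {

    /** Hypothetical streaming callback; this is not a Tika or PDFBox type. */
    interface TextSink {
        void text(String chunk);
    }

    /**
     * Toy parser for a made-up format: each record is one length byte
     * followed by that many bytes of UTF-8 text. Every complete record is
     * emitted immediately; EOFException is thrown if the stream ends in
     * the middle of a record.
     */
    static void parse(InputStream in, TextSink sink) throws IOException {
        int len;
        while ((len = in.read()) != -1) {
            byte[] buf = in.readNBytes(len); // Java 11+
            if (buf.length < len) {
                throw new EOFException("stream ended inside a record");
            }
            sink.text(new String(buf, StandardCharsets.UTF_8));
        }
    }

    /** Parses only the first maxBytes of data and returns the text emitted so far. */
    static String extractPrefix(byte[] data, int maxBytes) {
        int n = Math.min(maxBytes, data.length);
        StringBuilder out = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(data, 0, n)) {
            parse(in, out::append);
        } catch (EOFException e) {
            // Expected for truncated input: keep whatever was collected.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Builds a two-record sample document ("Hello, " and "world"). */
    static byte[] sample() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : new String[] {"Hello, ", "world"}) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.write(b.length);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = sample();
        System.out.println("full:      [" + extractPrefix(data, data.length) + "]");
        System.out.println("truncated: [" + extractPrefix(data, 10) + "]");
    }
}
```

The important design point is that the sink receives text as soon as it is
parsed, so the caller still has the partial result when the exception arrives.
A parser that buffers all output until the end of the document would lose it.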

BR,

Jukka Zitting

Re: Parsing incomplete PDF and Office files

Jonathan Koren
On a related note, does Tika support full text extraction of PDFs?

On Nov 13, 2008, at 1:52 PM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <[hidden email]> wrote:
>> I would like to download just the first few kilobytes of a PDF (or DOC) file
>> and extract the text from them. I do not want to download the whole file and
>> then parse it, only the truncated first N KB. Is this possible with Tika? If
>> not, how should I do it?
>
> That's currently not possible, but AFAIK there is support for
> page-by-page streaming in PDFBox (for PDF documents that support that,
> not all of them do). It would be nice if Tika could leverage that
> functionality in PDFBox.
>
> However, I'm not sure how well that would work with truncated streams.
> I guess the reasonable approach would be to stream as much text as can
> be parsed, and then fail with a TikaException if the input stream ends
> unexpectedly. Your application would then need to be aware of this
> error condition and handle it appropriately.
>
> BR,
>
> Jukka Zitting

--
Jonathan Koren
[hidden email]
http://www.soe.ucsc.edu/~jonathan/



Re: Parsing incomplete PDF and Office files

Milos Kovacevic
In reply to this post by Jukka Zitting
Hello,


> That's currently not possible, but AFAIK there is support for
> page-by-page streaming in PDFBox (for PDF documents that support that,
> not all of them do). It would be nice if Tika could leverage that
> functionality in PDFBox.
>

Could you please give an example of how to parse a PDF page by page?
Thanks, Milos

Re: Parsing incomplete PDF and Office files

Jukka Zitting
In reply to this post by Jonathan Koren
Hi,

On Fri, Nov 14, 2008 at 1:22 AM, Jonathan Koren <[hidden email]> wrote:
> On a related note, does Tika support full text extraction of PDFs?

Yes. See http://incubator.apache.org/tika/formats.html (to be moved to
lucene.apache.org) for all the supported formats.

BR,

Jukka Zitting

Re: Parsing incomplete PDF and Office files

Jukka Zitting
In reply to this post by Milos Kovacevic
Hi,

On Fri, Nov 14, 2008 at 8:32 AM, Milos Kovacevic <[hidden email]> wrote:
> Could you please give an example of how to parse a PDF page by page?

You'll want to contact [hidden email] for that.

I know that PDFBox is able to parse linear PDF documents (i.e. ones
that are internally stored in a page-by-page order), but AFAIK that
streaming capability is currently not used in the higher level
features like the PDFTextStripper class (even though it already does
use an event model).
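
Since only linear (linearized) PDFs are stored in page-by-page order, an
application working with a downloaded prefix could first check whether the
file claims to be linearized. The sketch below is a rough heuristic, not
PDFBox code: as far as I know the PDF specification requires the
linearization parameter dictionary (the one with the /Linearized key) to sit
within the first 1024 bytes of the file, so it simply searches that prefix.
The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LinearizedCheck {

    /**
     * Heuristic: returns true if the given file prefix starts with a PDF
     * header and mentions /Linearized within its first 1024 bytes. It does
     * not parse the dictionary or validate the rest of the file.
     */
    static boolean looksLinearized(byte[] head) {
        int n = Math.min(head.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data is safe here.
        String s = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        return s.startsWith("%PDF-") && s.contains("/Linearized");
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLinearized(head));
    }
}
```

A positive result only says that the file was written as linearized; a file
that was later modified by an incremental update may no longer be usable
that way, so the check is a hint rather than a guarantee.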

BR,

Jukka Zitting