I am actually working on PDF parsing technology, and have posted the
following message to two open source PDF projects (PDFBox and iText). If
there is interest from Nutch developers in what responses I have
received, and how a collaborative solution may be reached, let me know.
The results were actually quite impressive, as he managed to deal with
columns, etc., using what he referred to as an "Intelligent text
extraction" algorithm, which uses positions to preserve text flow. He
used Jpedal as his underlying PDF library.
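To make the idea of using positions to preserve text flow a bit more
concrete, here is a minimal sketch in Java. It is not Tamir's algorithm
and does not use the Jpedal or PDFBox APIs. The Fragment type, the minGap
threshold and the method names are all assumptions made up for
illustration, and y is assumed to grow downward from the top of the page.
The sketch groups text fragments into columns wherever a wide enough
horizontal gap appears, then reads each column top to bottom.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnAwareTextOrder {

    /** Hypothetical text fragment: its text, left edge, top edge, and width. */
    static final class Fragment {
        final String text;
        final double x, y, width;

        Fragment(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Groups fragments into columns, starting a new column when the
     *  horizontal gap to everything seen so far exceeds minGap. */
    static List<List<Fragment>> splitIntoColumns(List<Fragment> fragments, double minGap) {
        List<Fragment> byX = new ArrayList<>(fragments);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        List<List<Fragment>> columns = new ArrayList<>();
        double rightEdge = Double.NEGATIVE_INFINITY;
        for (Fragment f : byX) {
            if (columns.isEmpty() || f.x - rightEdge > minGap) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(f);
            rightEdge = Math.max(rightEdge, f.x + f.width);
        }
        return columns;
    }

    /** Emits text column by column (left to right), each column top to bottom. */
    static String extract(List<Fragment> fragments, double minGap) {
        StringBuilder out = new StringBuilder();
        for (List<Fragment> column : splitIntoColumns(fragments, minGap)) {
            column.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                                  .thenComparingDouble(f -> f.x));
            for (Fragment f : column) {
                out.append(f.text).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two columns, listed in the interleaved order a naive extractor might produce.
        List<Fragment> page = List.of(
            new Fragment("Left column, line 1", 50, 100, 200),
            new Fragment("Right column, line 1", 320, 100, 200),
            new Fragment("Left column, line 2", 50, 115, 200),
            new Fragment("Right column, line 2", 320, 115, 200));
        System.out.print(extract(page, 30));
    }
}
```

A full-width heading would bridge the gap and merge both columns in this
sketch, which is one reason a real implementation needs much more than
this.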
Unfortunately his program was written with an old version of Jpedal and
does not run with the new Jpedal. This is because the
PDFGenericGrouping class he used was changed to PDFGroupingAlgorithms
and moved to the non-GPL Jpedal. The new class also changed some of the
old class's members from public to private, and deleted some members,
which would make rewriting his app necessary.
Fast forward to 2005: Christian Leinberger, a colleague of Tamir's,
writes a paper entitled Ideas for extracting data from an unstructured
http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf. Christian indicated that he is using
the open source, BSD-licensed PDFBox as his library for experimenting with
algorithms that can be used to extract text reliably out of
I have contacted these guys and hopefully they will be willing to share
their developments with the PDF community.
As more and more content gets "pushed" into PDF, it loses its meaning to
anyone other than a human reader or a printer. Machines do not
have the ability to read and parse it reliably in a generic context, and
it requires sophisticated AI algorithms based on ontologies, or other
big words, to get it out. If you're lucky, you can hack through it and
get what you need. Something to think about the next time you push
content into a PDF, or even HTML. PDF is a great way to present content
for printing, but it !@#$, pardon my French, as a primary mechanism for
presenting data that may need to be used by a machine somewhere
Getting it out has turned into big business for companies who have
developed technology to get into the PDF and get important data out of
it and into another format, usually XML. This is a growing space and I
hope that there are some more developers interested in solving the
problem created by PDF-crazy folks who have managed to shove valuable
data into PDF while failing to maintain that same data in another more
usable format (e.g. XML, or at least tagged PDF). It is best that
this is done in an open format, because the value of such technology is
very high, it is complicated to produce, and very useful to the general
RE: FW: Good reading/research on PDF text extraction
What developments have been done so far to enable Nutch to parse PDFs?
Have you read through Tamir's whitepaper?
PS. Here are some comments from Ben Litchfield, developer of the open
source PDFBox (Java), followed by some comments from Tamir, who wrote the
PDF extraction algorithm:
Are you saying you want to head this type of project up and are looking
for help, or are you requesting this functionality be added to existing
projects?
I have worked on a couple of different 'custom' text extraction projects
using PDFBox and need to organize those changes before I can commit them
to the PDFBox project. Right now they are very specific/custom so I need
to extract the generic parts out and make them part of the core PDFBox.
Just need to find the time to do it.
Certainly if Christian Leinberger has made some progress I would be
willing to work with him to add some features to the PDFBox core.
I agree that this is important functionality and requires not just
simple text extraction but advanced AI concepts.
I am requesting this functionality be added to existing projects. I am
saying I am available to code, discuss, document, test, support, or
otherwise do whatever else I can do to get some good technology in the
public domain in this area.
>Certainly if Christian Leinberger has made some progress I would be
>willing to work with him to add some features to the PDFBox core.
Hopefully they will get back to us all. I would like to see the results.
I would also like to ask Ben, et al., whether PDFBox supports reading
"tagged" PDF, and if so, in which classes?