Nutch Parsing PDFs, and general PDF extraction


Nutch Parsing PDFs, and general PDF extraction

Richard Braman
I see that there is a class for parsing PDFs in Nutch using PDFBox:
 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html (org.apache.nutch.parse.pdf, Nutch 0.7.1 API)
but I don't see it in the source of the 0.7.1 download.
 
I see it on CVS here:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/src/java/net/nutch/parse/pdf/
but my Nutch doesn't seem to run the PDF parse class: my log file shows it
fetching PDFs, but saying Nutch is unable to parse content type
application/pdf.
Why is this?  Was it left out because of performance?
 
IMO, the class used by Nutch (shown below) won't cut it for most PDFs,
though, as the PDF structure is usually too complicated.  Please see
some of the resources I cited in my last posts, including
http://www.tamirhassan.dsl.pipex.com/final.pdf
and
http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf,
as they bring up some good algorithms for parsing PDF.  90% of PDFs are
unstructured: they don't contain any XML content that describes how the
pages flow.  The content could be in any order, and that might make
searching for literals return inaccurate results.
 
The other 10% of PDFs use tagging, and Nutch could use this to parse
through the tagged ones quite easily using PDFBox.
 
We need to have Nutch/Lucene parsing PDFs; it is one of the features of
Google that users value, and there is simply too much PDF content to
ignore.  In addition it would be nice to have Nutch able to show the
PDF as HTML like Google does.  I think Tamir's paper is a good read on
this because he does a good analysis of Google's functionality here,
and his original objective was to format the PDF as HTML, which
requires correctly parsing the PDF.
 
Some more references
http://snowtide.com/home/PDFTextStream/techtips/easy_lucene_integration
http://www.jguru.com/faq/view.jsp?EID=862443
/* Copyright (c) 2004 The Nutch Organization. All rights reserved. */
/* Use subject to the conditions in http://www.nutch.org/LICENSE.txt. */

package net.nutch.parse.pdf;

import org.pdfbox.encryption.DocumentEncryption;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;
import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;

import net.nutch.protocol.Content;
import net.nutch.util.LogFormatter;
import net.nutch.parse.Parser;
import net.nutch.parse.Parse;
import net.nutch.parse.ParseData;
import net.nutch.parse.ParseImpl;
import net.nutch.parse.Outlink;
import net.nutch.parse.ParseException;

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Properties;
import java.util.logging.Logger;

import java.io.ByteArrayInputStream;
import java.io.IOException;

/*********************************************
 * parser for mime type application/pdf.
 * It is based on org.pdfbox.*. We have to see how well it does the job.
 *
 * @author John Xing
 *
 * Note on 20040614 by Xing:
 * Some codes are stacked here for convenience (see inline comments).
 * They may be moved to more appropriate places when new codebase
 * stabilizes, especially after code for indexing is written.
 *
 *********************************************/
public class PdfParser implements Parser {

  public static final Logger LOG =
    LogFormatter.getLogger("net.nutch.parse.pdf");

  public PdfParser () {
    // redirect org.apache.log4j.Logger to java's native logger, in order
    // to, at least, suppress annoying log4j warnings.
    // Note on 20040614 by Xing:
    // log4j is used by pdfbox. This snippet'd better be moved
    // to a common place shared by all parsers that use log4j.
    org.apache.log4j.Logger rootLogger =
      org.apache.log4j.Logger.getRootLogger();
    rootLogger.setLevel(org.apache.log4j.Level.INFO);
    org.apache.log4j.Appender appender = new org.apache.log4j.WriterAppender(
      new org.apache.log4j.SimpleLayout(),
      net.nutch.util.LogFormatter.getLogStream(
        this.LOG, java.util.logging.Level.INFO));
    rootLogger.addAppender(appender);
  }

  public Parse getParse(Content content) throws ParseException {
    // check that contentType is one we can handle
    String contentType = content.getContentType();
    if (contentType != null && !contentType.startsWith("application/pdf"))
      throw new ParseException(
        "Content-Type not application/pdf: "+contentType);

    // in memory representation of pdf file
    PDDocument pdf = null;
    String text = null;
    String title = null;

    try {
      byte[] raw = content.getContent();
      String contentLength = content.get("Content-Length");
      if (contentLength != null
            && raw.length != Integer.parseInt(contentLength)) {
        throw new ParseException("Content truncated at "+raw.length
          +" bytes. Parser can't handle incomplete pdf file.");
      }

      PDFParser parser = new PDFParser(
        new ByteArrayInputStream(raw));
      parser.parse();
      pdf = parser.getPDDocument();

      if (pdf.isEncrypted()) {
        DocumentEncryption decryptor = new DocumentEncryption(pdf);
        //Just try using the default password and move on
        decryptor.decryptDocument("");
      }

      // collect text
      PDFTextStripper stripper = new PDFTextStripper();
      text = stripper.getText(pdf);

      // collect title
      PDDocumentInformation info = pdf.getDocumentInformation();
      title = info.getTitle();
      // more useful info, currently not used. please keep them for future use.
      // pdf.getPageCount();
      // info.getAuthor()
      // info.getSubject()
      // info.getKeywords()
      // info.getCreator()
      // info.getProducer()
      // info.getTrapped()
      // formatDate(info.getCreationDate())
      // formatDate(info.getModificationDate())

    } catch (ParseException e) {
      throw e;
    } catch (CryptographyException e) {
      throw new ParseException("Error decrypting document. "+e);
    } catch (InvalidPasswordException e) {
      throw new ParseException("Can't decrypt document. "+e);
    } catch (Exception e) { // run time exception
      throw new ParseException("Can't be handled as pdf document. "+e);
    } finally {
      try {
        if (pdf != null)
          pdf.close();
      } catch (IOException e) {
        // nothing to do
      }
    }

    if (text == null)
      text = "";
    if (title == null)
      title = "";

    // collect outlink
    Outlink[] outlinks = new Outlink[0];

    // collect meta data
    Properties metadata = new Properties();
    metadata.putAll(content.getMetadata()); // copy through

    ParseData parseData = new ParseData(title, outlinks, metadata);
    return new ParseImpl(text, parseData);
    // any filter?
    //return HtmlParseFilters.filter(content, parse, root);
  }

  // format date
  // currently not used. please keep it for future use.
  private String formatDate(Calendar date) {
    String retval = null;
    if (date != null) {
      SimpleDateFormat formatter = new SimpleDateFormat();
      retval = formatter.format(date.getTime());
    }
    return retval;
  }
}

Richard Braman
mailto:[hidden email]
561.748.4002 (voice)

http://www.taxcodesoftware.org/
Free Open Source Tax Software

 

Re: Nutch Parsing PDFs, and general PDF extraction

luti
Richard Braman wrote:

> but my nutch doesn't seem to run the pdf parse class as my log file
> shows it fetching pdfs, but saying nutch is unable to parse content type
> application/pdf

Can you send the complete error message?

RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
060228 045534 fetch okay, but can't parse
http://www.irs.gov/pub/irs-pdf/f1040sab.pdf?portlet=3, reason:
failed(2,203): Content-Type not text/html: application/pdf



RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
I don't have the plugin configured.  What's the code for doing that?




Re: Nutch Parsing PDFs, and general PDF extraction

Jérôme Charron
In reply to this post by Richard Braman
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html (org.apache.nutch.parse.pdf, Nutch 0.7.1 API)
> but I don't see it in the source of the 0.7.1 download.
>
> I see it on CVS here:
> http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/src/java/net/nutch/parse/pdf/

First of all, the Nutch source code is no longer hosted on SourceForge, but on
Apache:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/

The class packages have also been changed to org.apache.nutch.

> but my nutch doesn't seem to run the pdf parse class as my log file
> shows it fetching pdfs, but saying nutch is unable to parse content type
> application/pdf
> Why is this?  Was it left out because of performance?

Have you activated the parse-pdf plugin in conf/nutch-default.xml or
conf/nutch-site.xml?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
In reply to this post by Richard Braman
Should I add this to nutch-site.xml?

<plugin
   id="parse-pdf"
   name="Pdf Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">


   <runtime>
      <library name="parse-pdf.jar">
         <export name="*"/>
      </library>
      <library name="PDFBox-0.7.2-log4j.jar"/>
      <library name="log4j-1.2.9.jar"/>
   </runtime>

   <extension id="org.apache.nutch.parse.pdf"
              name="PdfParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="org.apache.nutch.parse.pdf.PdfParser"
                      class="org.apache.nutch.parse.pdf.PdfParser"
                      contentType="application/pdf"
                      pathSuffix=""/>

   </extension>

</plugin>



Re: Nutch Parsing PDFs, and general PDF extraction

luti
No, you should add it to plugin.includes (in nutch-site.xml), e.g.:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|prefix)|parse-(text|html|pdf)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>


RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
Putting the well-formed version of the plugin code you provided generated
the following exception:

060228 083159 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.parse.Parser does not exist.
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
        at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
        at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.parse.Parser does not exist.
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
        ... 4 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.parse.Parser does not exist.
        at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
        at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)



Re: Nutch Parsing PDFs, and general PDF extraction

Jérôme Charron
> Putting the well-formed version of the plugin code you provided generated
> the following exception:

Is the nutch-extensionpoints plugin activated?

RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
I don't know; it seems to be working now.



RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
In reply to this post by Richard Braman
Thanks for the help.  I don't know what happened, but it is working now.
Did any other contributors read what I sent about parsing PDFs?
I don't think Nutch is capable of this, based on the text stripper code
in parse-pdf:
 
http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1
 
It's time to implement some real PDF parsing technology.
Any other takers?
 
 

-----Original Message-----
From: Jérôme Charron [mailto:[hidden email]]
Sent: Tuesday, February 28, 2006 9:49 AM
To: [hidden email]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction


In the attached files, nutch-default.xml contains:
protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)
No parse-pdf is specified...
(The nutch-extensionpoints plugin is not mandatory since the
plugin.autoactivation property is true.  The plugins needed by other ones
that are manually activated will be automatically activated.)
Are there some plugins in your plugins folder? (build/plugins)


On 2/28/06, Richard Braman <[hidden email]> wrote:

In nutch-default:
 
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
 
I moved it into nutch-default from nutch-site in an effort to fix the
error, which didn't work.  I want this feature to be the default.
 
Rich
 
-----Original Message-----
From: Jérôme Charron [mailto:[hidden email]]
Sent: Tuesday, February 28, 2006 9:27 AM
To: [hidden email]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction


Could you please send me the value of the plugin.includes property (in
nutch-default.xml and nutch-site.xml)?


On 2/28/06, Richard Braman <[hidden email]> wrote:

Note: a quick search of the archive didn't reveal the code to do that.
Please provide.





Re: Nutch Parsing PDFs, and general PDF extraction

John X
On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote:

> It's time to implement some real PDF parsing technology.
> Any other takers?

Nutch is about search and it relies on 3rd party libraries
to extract text from various mimetypes, including application/pdf.
Whether nutch can correctly extract text from a pdf file largely
depends on the pdf parsing library it uses, currently PDFBox.
It won't be very difficult to switch to other libraries.
However it seems hard to find a free/open implementation
that can parse every pdf file in the wild. There is an alternative:
use nutch's parse-ext with a command line pdf parser/converter,
which can just be an executable.

John
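
To make the parse-ext route concrete, here is a minimal, untested sketch
(the class name and command flags are just an example, not Nutch code; the
real integration would go through the parse-ext plugin's own configuration
rather than hand-rolled code like this) of what the external conversion
amounts to: shell out to a converter such as xpdf's pdftotext and read the
extracted text from its stdout.

import java.io.BufferedReader;
import java.io.InputStreamReader;

/** Hypothetical illustration of extracting PDF text via an external command. */
public class ExternalPdfText {

    public static String extract(String pdfPath) throws Exception {
        // pdftotext writes to stdout when the output file is given as "-";
        // -layout tries to preserve the physical layout (columns, tables).
        Process p = Runtime.getRuntime().exec(
            new String[] { "pdftotext", "-layout", "-enc", "UTF-8", pdfPath, "-" });
        BufferedReader in = new BufferedReader(
            new InputStreamReader(p.getInputStream(), "UTF-8"));
        StringBuffer text = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            text.append(line).append('\n');
        }
        in.close();
        p.waitFor();
        return text.toString();
    }
}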

RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
It is possible to come up with better parsing algorithms than simply
doing a stripper.getText(), which is what Nutch does right now.  I am
not recommending switching from PDFBox.  I think most important is
that the algorithm used does the best job possible in preserving the
flow of text.  If the text doesn't flow correctly, search results may
be altered, which is why, if Nutch is about search, it must be able to
parse PDF correctly.  Ben Litchfield, the developer of PDFBox, has
noted that he has developed some better parsing technology, and hopes
to share it with us soon.

Another thing to consider: if the PDF is "tagged" then it carries an
XML markup that describes the flow of text, which was designed to be
used for accessibility under Section 508.  I think Ben also noted that
PDFBox did not support PDF tags.
http://www.planetpdf.com/enterprise/article.asp?ContentID=6067

A better parsing strategy may involve the following pseudocode (a rough
Java sketch of it follows below):

Determine whether the pdf contains tagged content

        If so,
                parse tagged content so that the returned text flows
                correctly

        If not,

                Determine whether the pdf contains bounding boxes that
                indicate that content is contained in tabular format.

                If not,
                        parse using stripper.getText

                If so, implement an algorithm to extract text from the
                pdf preserving the flow of text

An additional feature may include saving the PDF as HTML as Nutch crawls
the web.


Examples of such algorithms may be found at:
www.tamirhassan.com/final.pdf
http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf


This is something Google does very well, and something Nutch must match
to compete.



Re: Nutch Parsing PDFs, and general PDF extraction

Jérôme Charron
> This is something google does very well, and something nutch must match
> to compete.

Richard, it seems you are a real pdf guru, so any code contribution to nutch
is welcome.
;-)

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
I am not a PDF guru, but I have amassed quite a bit of information on
the topic.  I have pinged around asking the PDF mavens of the world what
the issues are with parsing PDF, and have been reading up on the subject
to get a better understanding.  I have contributed all of this to mailing
lists, but coding this is not something I would feel comfortable doing at
this point.  Maybe it would be best for a coordinated Lucene-Nutch-PDFBox
development effort to produce some good code to do this.  I am trying to
get some dialog going.

Here is some code I was asked to debug by another interested developer
that uses PDFBox to extract PDF tabular data; it seems to have some bugs
in it that I am trying to figure out.


        try
        {
            int i = 1;
            String wordsep = null;
            String str = null;
            boolean flag = false;
            Writer output = null;
            PDDocument document = null;
            document = PDDocument.load( "53 Nostro Ofc Cofc Daily Position_AUS.pdf" );

            PDDocumentOutline root = document.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = root.getFirstChild();
            PDOutlineItem item1 = item.getNextSibling();

            while( item1 != null )
            {
                System.out.println( "Item:" + item.getTitle() );
                System.out.println( "Item1:" + item1.getTitle() );
                output = new OutputStreamWriter( new FileOutputStream( "simple"+i+".txt" ) );
                PDFTextStripperByArea stripper = null;
                stripper = new PDFTextStripperByArea();
                List reg = stripper.getRegions();
                System.out.println( reg.size() );

                // PDFTextStripper stripper = null;
                // stripper = new PDFTextStripper();
                wordsep = stripper.getWordSeparator();
                // stripper.setSortByPosition(true);

                stripper.setStartBookmark( item );
                stripper.setLineSeparator( "\n" );
                stripper.setWordSeparator( "  " );
                stripper.setPageSeparator( "\n\n\n\n" );
                stripper.setWordSeparator( "   " );
                stripper.setEndBookmark( item1 );
                // str = stripper.getText(document);
                // output.write( str, 0, str.length() );

                stripper.writeText( document, output );
                i++;
                item = item.getNextSibling();
                item1 = item1.getNextSibling();
            }

            PDOutlineItem child = item.getFirstChild();
            PDOutlineItem child1 = new PDOutlineItem();
            while( child != null )
            {
                child1 = child;
                child = child.getNextSibling();
            }
            System.out.println( "Item:" + item.getTitle() );
            System.out.println( "Item1:" + child1.getTitle() );
            output = new OutputStreamWriter( new FileOutputStream( "simple"+i+".txt" ) );
            PDFTextStripperByArea stripper = null;
            stripper = new PDFTextStripperByArea();

            System.out.println( "The word separator is" + flag );

            // stripper.setSortByPosition(true);

            stripper.setLineSeparator( "\n" );
            stripper.setPageSeparator( "\n\n\n\n" );
            stripper.setWordSeparator( "  " );
            stripper.setStartBookmark( item );
            stripper.setEndBookmark( child1 );
            // str = stripper.getText(document);

            stripper.setShouldSeparateByBeads( stripper.shouldSeparateByBeads() );
            stripper.writeText( document, output );

            output.close();
            document.close();
        }
        catch( Exception ex )
        {
            System.out.println( ex );
        }
    }
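
For comparison, region-based extraction with PDFTextStripperByArea is
normally driven along the lines of the sketch below.  The file name, region
name, and rectangle are invented for the example, and the method names are
taken from the PDFBox javadocs; they are worth double-checking against the
0.7.x jar in use here.

import java.awt.geom.Rectangle2D;
import java.util.Iterator;

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDPage;
import org.pdfbox.util.PDFTextStripperByArea;

/** Hypothetical example: pull text out of one named rectangular region per page. */
public class RegionExtractExample {

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load( "example.pdf" );
        try {
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            // Region coordinates are in PDF user space (x, y, width, height).
            stripper.addRegion( "leftColumn", new Rectangle2D.Float( 40, 50, 250, 700 ) );
            Iterator pages = document.getDocumentCatalog().getAllPages().iterator();
            while (pages.hasNext()) {
                PDPage page = (PDPage) pages.next();
                stripper.extractRegions( page );
                System.out.println( stripper.getTextForRegion( "leftColumn" ) );
            }
        } finally {
            document.close();
        }
    }
}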



RE: Nutch Parsing PDFs, and general PDF extraction

Ben Litchfield-3
In reply to this post by Richard Braman

To chime in and give my comments:

It is true that better search engine results could be obtained by first
analysing each PDF page and converting it to some other structure
(XML/HTML) before the indexing process.  But the cost of converting PDF
to text is already resource intensive, and some users may not want to pay
the additional cost to analyze each page.

While PDFs are unstructured, most documents give pretty good results with
the default text extraction.  Usually the extracted text is already in
reading order.

An extremely small percentage of PDFs actually include tagged information.

Converting a PDF to HTML is something that needs to get implemented in
PDFBox; then it is trivial for Nutch to include it.

Overall, the easiest thing to do would be to implement good PDF->HTML
conversion capabilities in PDFBox, so that Nutch just uses the resulting
HTML for indexing and for preview mode.  Until that is done there is not
much the Nutch developers can do.

Ben



RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman
Hi Ben,

> But the cost of converting PDF to text is already resource intensive,
> and some users may not want to pay the additional cost to analyze each
> page.

Agreed.  For Nutch it could be a simple config parameter to turn that on
or off.  PDF parsing is already optional; maybe there could be alternative
parsing strategies when parsing is turned on, to choose one of the parsing
methods (simple, complex1, complex2, etc.).

> While PDFs are unstructured, most documents give pretty good results
> with the default text extraction.  Usually the extracted text is
> already in reading order.

Except when there are text and columns; then it goes haywire.  For example,
parsing tax instructions always fails, and that content is always laid
out in columns.  Many newspaper articles have the same problem.

> An extremely small percentage of PDFs actually include tagged information.

Agreed, but that may change with Section 508, at least for government,
which is still the largest volume of PDFs on the net.
Is this hard to support with PDFBox?

> Overall, the easiest thing to do would be to implement good PDF->HTML
> conversion capabilities in PDFBox, so that Nutch just uses the resulting
> HTML for indexing and for preview mode.  Until that is done there is not
> much the Nutch developers can do.

Agreed.  I want nutch-dev to know what's going on because I do think this
functionality is important for Nutch's future.  Maybe they have some
insights into parsing methods, as many of these devs are experts with
ontologies.

Ben, maybe we should move this onto the PDFBox dev list, and anyone who is
interested (Nutch developers or not) can get in on it.  I would think
Nutch should assign this to someone on their team given the importance
of the functionality.

Rich

