Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox


Sharp, Jonathan

Every so often I need to index new batches of scanned PDFs and occasionally Adobe's OCR can't recognize the text in a couple of these documents. In these situations I would like to type in a small amount of text onto the document and have it be extracted by Solr CELL.  

Adobe Pro 9 has a number of different ways to add text directly to a PDF file:

*Typewriter
*Sticky Note
*Callout boxes
*Text boxes

I tried indexing documents with each of these text additions with Solr 1.4.1 + Solr CELL but can't extract the text in any of these boxes.

If someone has modified their Solr CELL installation to use more recent versions of Tika (above 0.4) or PDFBox (above 0.7.3), and/or can comment on whether newer versions can pull the text out of any of these text boxes, I'd appreciate that very much.

-Jon




---------------------------------------------------------------------
SECURITY/CONFIDENTIALITY WARNING:  
This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender.

---------------------------------------------------------------------


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

Alessandro Benedetti-4
Hi Jon,
Over the last few days we have run into the same problem.
Using stock Solr 1.4.1 (Tika 0.4), we can't extract content from some PDF
files, and for others Solr throws an exception during indexing.
You must:
Update the Tika libraries (in /contrib/extraction/lib) to the tika-core 0.8
snapshot and tika-parsers 0.8.
Update PDFBox and all related libraries.
After that, you have to patch Solr 1.4.1 following this patch:
https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
This is the first way to solve the problem.

Using Solr 1.4.1 (with the Tika 0.8 snapshot and updated PDFBox), no exception
is thrown during indexing, but no content is extracted.
Using the latest Solr trunk (with the Tika 0.8 snapshot and updated PDFBox),
everything seems to work, but we don't know how stable it is!
I hope this gives you a clear picture of the issue.
Best Regards



2010/7/26 Sharp, Jonathan <[hidden email]>



--
--------------------------

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

David Thibault-2
Alessandro & all,

I was having the same issue with Tika crashing on certain PDFs.  I also noticed the bug where no content was extracted after upgrading Tika.  

When I went to the SOLR issue you link to below, I applied all the patches, downloaded the Tika 0.8 jars, restarted tomcat, posted a file via curl, and got the following error:
SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
at java.lang.Thread.run(Thread.java:619)

This is really weird because I DID apply the SolrResourceLoader patch that adds the getClassLoader method.  I even verified it by opening up the JARs and looking at the class file in Eclipse...I can see the SolrResourceLoader.getClassLoader() method.

Does anyone know why it can't find the method?  After patching the source I ran ant clean dist in the base directory of the Solr source tree and everything appeared to compile (BUILD SUCCESSFUL).  Then I copied all the jars from dist/ and all the library dependencies from contrib/extraction/lib/ into my SOLR_HOME. After restarting tomcat, everything in the logs looked good.
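
One way to narrow down a NoSuchMethodError like the one above is to ask the running JVM where it actually loaded the class from, and whether that copy has the method. The following is only a minimal sketch, not part of Solr: the class name ClassOriginCheck and its two helper methods are made up for illustration, and it relies solely on standard Java reflection calls. It says something about Solr only if it runs inside the same webapp (for example from a temporary JSP or servlet), because a standalone run uses a different classpath than the servlet container.

```java
import java.security.CodeSource;

public class ClassOriginCheck {

    /** Returns the location (jar, war or directory) the class was loaded from. */
    public static String originOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // JDK classes loaded by the bootstrap loader have no code source.
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** True if the loaded class exposes a public no-argument method with this name. */
    public static boolean hasPublicMethod(Class<?> cls, String name) {
        try {
            cls.getMethod(name);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults are placeholders so the sketch runs without Solr on the classpath.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";

        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from: " + originOf(cls));
        System.out.println("has public " + methodName + "(): "
                + hasPublicMethod(cls, methodName));
    }
}
```

Called with org.apache.solr.core.SolrResourceLoader and getClassLoader as arguments, it would print the jar or war the class came from. If that location is not the freshly patched build, the container is most likely still serving an older copy of the class.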

I'm stumped.  It would be very nice to have a Solr implementation using the newest versions of PDFBox & Tika and actually have content being extracted...=)

Best,
Dave


-----Original Message-----
From: Alessandro Benedetti [mailto:[hidden email]]
Sent: Tuesday, July 27, 2010 6:09 AM
To: [hidden email]
Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

Tommaso Teofili
I attached a patch for Solr 1.4.1 release on
https://issues.apache.org/jira/browse/SOLR-1902 that made things work for
me.
In my case, this strange behaviour was caused by copying the patched jars
and war into the dist directory but forgetting to update the war in the
example/webapps directory (that is, the one inside Jetty).
Hope this helps.
Tommaso
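
A quick way to check for the stale-war situation described above is to compare a checksum of the freshly built war with the one the servlet container is actually serving. The sketch below is a hypothetical helper, not anything shipped with Solr: the class name WarChecksum and its methods are assumptions made for illustration, and it uses only the Java standard library (Java 7 or later).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WarChecksum {

    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    public static String sha256Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Reads the whole file into memory and hashes it; fine for war-sized files. */
    public static String sha256Hex(Path file) throws IOException {
        return sha256Hex(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java WarChecksum <built.war> <deployed.war>");
            System.exit(2);
        }
        String built = sha256Hex(Paths.get(args[0]));
        String deployed = sha256Hex(Paths.get(args[1]));
        System.out.println("built:    " + built);
        System.out.println("deployed: " + deployed);
        System.out.println(built.equals(deployed)
                ? "identical"
                : "DIFFERENT: the deployed war is not the one you built");
    }
}
```

Run it with the path of the war produced by the build and the path of the war under the container's webapps directory. Differing digests mean the container is not serving the patched build.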

2010/7/27 David Thibault <[hidden email]>


RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

David Thibault-2
Yesterday I did get this working with version 4.0 from trunk.  I haven't fully tested it yet, but the content doesn't come through blank anymore, so that's good.  Would it be more stable to stick with 1.4.1 and your patch to get to Tika 0.8, or to stick with the 4.0 trunk version?

Best,
Dave

-----Original Message-----
From: Tommaso Teofili [mailto:[hidden email]]
Sent: Wednesday, July 28, 2010 3:31 AM
To: [hidden email]
Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

Alessandro Benedetti-4
In my opinion, the 1.4.1 version with the patch is more stable,
at least until 4.0 is released.

2010/7/28 David Thibault <[hidden email]>

> > >
> > > Adobe Pro 9 has a number of different ways to add text directly to a
> PDF
> > > file:
> > >
> > > *Typewriter
> > > *Sticky Note
> > > *Callout boxes
> > > *Text boxes
> > >
> > > I tried indexing documents with each of these text additions with Solr
> > > 1.4.1 + Solr CELL but can't extract the text in any of these boxes.
> > >
> > > If someone has modified their Solr CELL installation to use more recent
> > > versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can can
> > comment
> > > on whether newer versions can pull the text out of any of these various
> > text
> > > boxes I'd appreciate that very much.
> > >
> > > -Jon
> > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > SECURITY/CONFIDENTIALITY WARNING:
> > > This message and any attachments are intended solely for the individual
> > or
> > > entity to which they are addressed. This communication may contain
> > > information that is privileged, confidential, or exempt from disclosure
> > > under applicable law (e.g., personal health information, research data,
> > > financial information). Because this e-mail has been sent without
> > > encryption, individuals other than the intended recipient may be able
> to
> > > view the information, forward it to others or tamper with the
> information
> > > without the knowledge or consent of the sender. If you are not the
> > intended
> > > recipient, or the employee or person responsible for delivering the
> > message
> > > to the intended recipient, any dissemination, distribution or copying
> of
> > the
> > > communication is strictly prohibited. If you received the communication
> > in
> > > error, please notify the sender immediately by replying to this message
> > and
> > > deleting the message and any accompanying files from your system. If,
> due
> > to
> > > the security risks, you do not wish to receive further communications
> via
> > > e-mail, please reply to this message and inform the sender that you do
> > not
> > > wish to receive further e-mail from the sender.
> > >
> > > ---------------------------------------------------------------------
> > >
> > >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Personal Page: http://tigerbolt.altervista.org
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
> >
>
>


--
--------------------------

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

David Thibault-2
Thanks, I'll try that then. I kind of figured that'd be the answer, but after fighting with Solr & ExtractingRequestHandler for 2 days I also just wanted to be done with it once it started working with 4.0...=)  However, stability would be better in the long run.

Best,
Dave


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

Tommaso Teofili
I had the same feeling :-) and so I went for the trunk to get things
working quickly, but I also have to consider which version is best,
since I am going to deploy it in an enterprise environment in the near
future and choosing the right version is an important step.
I am quite new to Solr, but I agree with Alessandro that a slightly
patched release should in theory be more stable than the trunk, which
gets many updates weekly (and even daily).
Cheers,
Tommaso


RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

David Thibault-2
In reply to this post by Tommaso Teofili
Tommaso,

I used your patch and tried it with the 1.4.1 solr.war from a fresh 1.4.1 distribution, and it still gave me that NoSuchMethodError.  However, when I tried it with the newly-patched-and-compiled apache-solr-1.4.2-dev.war file it works.  I think I tried that before and it didn't work.
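
For anyone hitting the same NoSuchMethodError, here is a minimal sketch of a check that reports whether the class the JVM actually loads has the method, and which jar or directory it came from. It uses only standard Java reflection. MethodProbe and probe are hypothetical names made up for this example, and the check is only meaningful when run with the same classpath as the webapp (for example the WEB-INF/lib and WEB-INF/classes of the deployed war).

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

public class MethodProbe {

    /**
     * Reports whether the named class exposes a public no-argument method
     * with the given name, and where the class was loaded from.
     */
    public static String probe(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);

            // The code source is null for classes loaded by the bootstrap loader.
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String origin = (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();

            try {
                Method m = cls.getMethod(methodName);
                return "FOUND " + m.getReturnType().getName() + " "
                        + methodName + "() in " + origin;
            } catch (NoSuchMethodException e) {
                return "MISSING " + methodName + "() in " + origin;
            }
        } catch (ClassNotFoundException e) {
            return "CLASS NOT FOUND " + className;
        }
    }

    public static void main(String[] args) {
        // Defaults match the class and method named in the stack trace.
        String cls = args.length > 0 ? args[0] : "org.apache.solr.core.SolrResourceLoader";
        String method = args.length > 1 ? args[1] : "getClassLoader";
        System.out.println(probe(cls, method));
    }
}
```

A MISSING result pointing at the stock war would match what I saw: the patched extraction jar calls getClassLoader(), but the SolrResourceLoader that wins on the classpath is the unpatched one.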

In any case, thanks for the patch and the advice.  Looks like now it's working for me.

Best,
Dave
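
P.S. Tommaso's earlier note about the forgotten war under example/webapps suggests a second check: list every war under a directory tree with a checksum, so a stale copy stands out. The sketch below does that with the standard library only. WarCopies and digests are hypothetical names, and the root directory and the ".war" suffix are assumptions you would adjust to your own layout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class WarCopies {

    /** Returns the SHA-256 digest of a file as a lowercase hex string. */
    public static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Maps every regular file under root whose name ends with suffix to its digest. */
    public static Map<Path, String> digests(Path root, String suffix)
            throws IOException, NoSuchAlgorithmException {
        Map<Path, String> out = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p) && p.getFileName().toString().endsWith(suffix)) {
                    out.put(p, sha256(p));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Root defaults to the current directory; pass the Solr checkout or the
        // servlet container's base directory as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        digests(root, ".war").forEach((p, d) -> System.out.println(d + "  " + p));
    }
}
```

Copies that print the same digest are byte-identical. A war whose digest differs from the freshly built one in dist/ is a candidate for the stale copy that still lacks the patch.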



