OCR not working occasionally

Zheng Lin Edwin Yeo
Hi,

I'm facing an issue where Tesseract OCR occasionally fails to extract the
words from a PDF file attached to an EML file, so they are not indexed into
Solr. Most of the time, however, the extraction works.

What could cause OCR extraction to fail for a file in an email attachment?

I'm using Solr 6.4.2.

Regards,
Edwin

Re: OCR not working occasionally

Rick Leir-2
Hi Edwin
The PDF file format can store text as an image, in which case you need OCR to get the text. More commonly, however, text is not stored as an image in the PDF, and then you should not use OCR to get it.
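
The distinction above can be sketched as a small check: if the text layer that comes out of a PDF is nearly empty, the pages are probably scanned images and OCR is worth trying. This is only a rough illustration. The function name `needs_ocr` and the threshold of 20 characters per page are assumptions made for the example, and the text layer itself would come from whatever extractor is in use.

```python
def needs_ocr(text_layer: str, page_count: int, min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF needs OCR from how sparse its text layer is.

    text_layer is the plain text already extracted without OCR.
    min_chars_per_page is an arbitrary, illustrative threshold.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only visible characters so whitespace-only layers count as empty.
    visible = sum(1 for ch in text_layer if not ch.isspace())
    return visible / page_count < min_chars_per_page


if __name__ == "__main__":
    print(needs_ocr("", page_count=3))            # empty text layer
    print(needs_ocr("word " * 100, page_count=2))  # dense text layer
```

A scanned PDF with an empty text layer falls below the threshold, while a normal text PDF does not.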

Do you get an error message when you have a failure?
Cheers -- Rick


Re: OCR not working occasionally

Zheng Lin Edwin Yeo
Hi Rick,

Thanks for your reply.
I saw the error message below for the file that failed.
Can I index such files together with the other files that store
text as an image, in the same indexing threads?


2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2330f07c
2017-03-19 01:02:26.610 ERROR (updateExecutor-2-thread-4-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.99.1:8984/solr/collection1_shard1_replica1: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 </title>
</head>
<body>
<h2>HTTP ERROR: 404</h2>
<p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
<pre>    Not Found</pre></p>
<hr />
</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
    at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
    at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

2017-03-19 01:02:26.657 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening [Searcher@77e108d5[collection1_shard1_replica2] main]
2017-03-19 01:02:26.658 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-03-19 01:02:26.658 INFO  (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener QuerySenderListener sending requests to Searcher@77e108d5[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
2017-03-19 01:02:26.658 INFO  (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener QuerySenderListener done.
2017-03-19 01:02:26.659 INFO  (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher Searcher@77e108d5[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
2017-03-19 01:02:26.659 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica2]  webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=files-update-processor&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://192.168.99.1:8983/solr/collection1_shard1_replica2/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=} 0 49
2017-03-19 01:02:26.662 WARN  (qtp1543727556-139) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.p.DistributedUpdateProcessor Error sending update to http://192.168.99.1:8984/solr
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.99.1:8984/solr/collection1_shard1_replica1: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 </title>
</head>
<body>
<h2>HTTP ERROR: 404</h2>
<p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
<pre>    Not Found</pre></p>
<hr />
</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
    at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
    at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
2017-03-19 01:02:26.662 INFO  (qtp1543727556-139) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica2]  webapp=/solr path=/update params={commit=true}{commit=} 0 66
2017-03-19 01:02:43.019 INFO  (qtp1543727556-21) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request [collection1_shard1_replica2]  webapp=/solr path=/admin/file params={wt=json&_=1489885363012} status=0 QTime=4
2017-03-19 01:02:45.453 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /select,class = solr.SearchHandler,attributes = {enable=true, startup=lazy, name=/select, class=solr.SearchHandler},args = {defaults={echoParams=explicit,rows=10,wt=json,indent=true,df=text,fl=id, content, content_type, content_cat, content_subcat, creation_date, subject, userid, author, entity, location, geolocation, visibility, accesslevel, accessgroup, reference, crossreference, resourcename, importance, tag, popularity, language_s, score}}}
2017-03-19 01:02:45.461 INFO  (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request [collection1_shard1_replica2]  webapp=/solr path=/select params={q=*:*&indent=true&wt=json&_=1489885365450} hits=3 status=0 QTime=8
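
A log like the one above is easier to read once the ERROR and WARN entries are separated from the INFO noise. The following is a minimal sketch of that, not a Solr tool: `find_problems` and `Problem` are names invented for this example, and the timestamp and message patterns are assumed from the excerpt alone. It tolerates entries that email has wrapped across several lines. For what it is worth, the problem entries in this excerpt report an HTTP 404 from `collection1_shard1_replica1` on port 8984 and do not mention Tika or Tesseract.

```python
import re
from typing import List, NamedTuple, Optional

# Each entry is assumed to start with a timestamp such as "2017-03-19 01:02:26.610 ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ", re.MULTILINE)


class Problem(NamedTuple):
    timestamp: str
    level: str
    server: Optional[str]       # URL after "Error from server at", if present
    http_status: Optional[int]  # code after "HTTP ERROR:", if present


def find_problems(log_text: str) -> List[Problem]:
    """Return the ERROR and WARN entries found in a pasted log excerpt."""
    starts = [m.start() for m in TIMESTAMP.finditer(log_text)]
    problems = []
    for begin, end in zip(starts, starts[1:] + [len(log_text)]):
        # Collapse whitespace so entries wrapped by a mail client read as one line.
        entry = " ".join(log_text[begin:end].split())
        parts = entry.split(" ", 3)  # date, time, level, rest
        if len(parts) < 3 or parts[2] not in ("ERROR", "WARN"):
            continue
        server = re.search(r"Error from server at (\S+): ", entry)
        status = re.search(r"HTTP ERROR: (\d{3})", entry)
        problems.append(Problem(
            timestamp=f"{parts[0]} {parts[1]}",
            level=parts[2],
            server=server.group(1) if server else None,
            http_status=int(status.group(1)) if status else None,
        ))
    return problems
```

Calling `find_problems` on the pasted text and printing each result gives one line per failing entry, which makes it clear which node and which HTTP status to look at first.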


Regards,
Edwin



Re: OCR not working occasionally

Zheng Lin Edwin Yeo
These are my settings in the PDFParser.properties file
inside tika-parsers-1.13.jar:

enableAutoSpace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly true
checkExtractAccessPermission false
allowExtractionForAccessibility true
ifXFAExtractOnlyXFA false
catchIntermediateIOExceptions true
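
A listing like this can be checked programmatically before blaming OCR. The sketch below parses the same key/value lines and reports whether inline image extraction is switched on, which is the flag most relevant to scanned content in this list. The helper name `parse_flags` is an assumption for the example. Whitespace between key and value is valid in the Java properties format, and the parser also accepts `=` or `:`.

```python
import re

SETTINGS = """\
enableAutoSpace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly true
checkExtractAccessPermission false
allowExtractionForAccessibility true
ifXFAExtractOnlyXFA false
catchIntermediateIOExceptions true
"""


def parse_flags(text: str) -> dict:
    """Parse 'key value' lines into a dict of booleans."""
    flags = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the two comment styles used in .properties files.
        if not line or line.startswith(("#", "!")):
            continue
        parts = re.split(r"[\s=:]+", line, maxsplit=1)
        value = parts[1] if len(parts) > 1 else ""
        flags[parts[0]] = value.strip().lower() == "true"
    return flags


if __name__ == "__main__":
    flags = parse_flags(SETTINGS)
    print("extractInlineImages:", flags.get("extractInlineImages", False))
```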

Regards,
Edwin



Re: OCR not working occasionally

Zheng Lin Edwin Yeo
I found that this solution from Tim Allison on Stack Overflow works:

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin
