Solr indexing with Tika DIH local vs network share

Solr indexing with Tika DIH local vs network share

neilb
Hi, I am trying to set up Solr for our project so that it can return full-text
search results for PDF documents. I am able to run the sample Tika DIH example
locally on my Windows server machine. It indexes all PDF documents
recursively under the "baseDir" set in the config XML. At present "baseDir" points to a local
folder on the same machine that holds around 10K PDF files. This whole setup
works as expected.

The next step is to import PDF documents located on a network share. I created
another core with very similar configuration files, except that this time
baseDir points to the network share ("\\myserver\pdfshare"). I have had no success
indexing these documents in the newly created core. I also tried mapping the
network share to a local drive and updating the config accordingly, but still no
success.
When I copy all the PDF files from the network share to the local folder that the
example core with the sample Tika DIH config points to, I am able to index all of
them.

So I am not sure why the Tika config with the network path is not able to index the
files. In the log I can see the following entries, but they don't explain
anything. Can someone guide me in resolving the issue?

2019-03-26 13:58:37.250 DEBUG (Scheduler-1147580192) [   ] o.e.j.i.FillInterest onFail FillInterest@419eacc8{AC.ReadCB@1ad637ed{HttpConnection@1ad637ed::SocketChannelEndPoint@6190d407{/10.206.11.68:51486<->/10.205.53.163:8983,OPEN,fill=FI,flush=-,to=120010/120000}{io=1/1,kio=1,kro=1}->HttpConnection@1ad637ed[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7d81e85c{s=START}]=>HttpChannelOverHttp@10e588cc{r=2,c=false,a=IDLE,uri=null,age=0}}}
java.util.concurrent.TimeoutException: Idle timeout expired: 120010/120000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:1.8.0_201]
        at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_201]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) [?:1.8.0_201]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_201]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_201]


Is it possible that Solr is not able to access the network share? Is there
any way that I can run Solr.cmd under a different user (one who has access to
the network share) in a Windows environment?
Please let me know if you need any more details about the issue.


Thanks in advance




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr indexing with Tika DIH local vs network share

Erick Erickson
Not quite an answer to your specific question, but… There
are a number of reasons why it’s better to run your Tika
process outside of Solr and DIH. Here’s the long form:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Ignore the RDBMS parts. It’s somewhat old, but should be
easy to adapt.

Best,
Erick

> On Mar 26, 2019, at 8:27 AM, neilb <[hidden email]> wrote:

Re: Solr indexing with Tika DIH local vs network share

neilb
Hi Erick, thanks a lot for your suggestions. I will look into it. But to
answer my own question: I was a little impatient and was checking the indexing
status every minute. What I found is that after a few hours, the status started
updating with a document count, and the indexing process finished in around 5 hours.
Do you see anything wrong with the current setup of Solr and Tika DIH? All I am
looking for is PDF full-text search results, integrated into a web app
dashboard using AJAX queries. Also, this particular article
<http://lets-share.senktas.net/2017/11/solr-as-a-service.html> was helpful
in getting Solr running as a Windows service with a 4G memory configuration under
the LocalSystem account.

Thanks again!




Re: Solr indexing with Tika DIH local vs network share

Erick Erickson
I suspect that your autocommit settings in solrconfig.xml
are something like

hard commit: has openSearcher set to “false”
soft commit: has the interval set to -1 (never)

That means that until an external commit is executed, you won’t see any documents. Try setting your soft commit interval to something like, say, 5 minutes (or even one minute). That would reduce the interval before docs become searchable.

I think DIH issues a commit at the end of the run, so if I’m right, that would be why you didn’t see anything for so long.

Here’s more than you want to know about all this: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

I _still_ recommend you move the Tika processing off of Solr. 4G of memory is easily exceeded with the right (well, wrong) PDF document. And since Tika is running inside Solr, that’ll mean Solr has an OOM, and at that point you really don’t know the state of Solr and must restart. Running Tika in a different process will insulate Solr from this kind of thing.
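
A minimal sketch of what "Tika in a different process" could look like, offered as an illustration rather than a definitive implementation. It is not the SolrJ approach from the article linked earlier in the thread; it assumes instead that a standalone Tika Server is listening on its default port 9998 and that Solr accepts JSON documents on the core's /update handler. The core name "pdfcore", the field names "id" and "text", and the batch size are all made up for the example and would need to match your own schema.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints; adjust both to your own setup.
TIKA_URL = "http://localhost:9998/tika"  # Tika Server running as its own process
SOLR_UPDATE_URL = "http://localhost:8983/solr/pdfcore/update"  # "pdfcore" is hypothetical


def find_pdfs(base_dir):
    """Return every PDF under base_dir (recursively), in sorted order."""
    return sorted(
        p for p in Path(base_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_text(pdf_path, tika_url=TIKA_URL):
    """Send one file to Tika Server and return the extracted plain text."""
    request = urllib.request.Request(
        tika_url,
        data=Path(pdf_path).read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8", errors="replace")


def build_update_request(docs, commit_within_ms=60000, solr_url=SOLR_UPDATE_URL):
    """Build a JSON update request asking Solr to commit within the given time."""
    return urllib.request.Request(
        f"{solr_url}?commitWithin={commit_within_ms}",
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_directory(base_dir, batch_size=50):
    """Extract text from each PDF under base_dir and send it to Solr in batches."""
    batch = []
    for pdf in find_pdfs(base_dir):
        try:
            text = extract_text(pdf)
        except OSError as err:
            # Covers unreadable files as well as urllib's URLError/HTTPError.
            # A document that Tika cannot handle is skipped; Solr is unaffected.
            print(f"skipping {pdf}: {err}")
            continue
        # "id" and "text" are assumed field names, not taken from the attached schema.
        batch.append({"id": str(pdf), "text": text})
        if len(batch) >= batch_size:
            urllib.request.urlopen(build_update_request(batch)).close()
            batch = []
    if batch:
        urllib.request.urlopen(build_update_request(batch)).close()
```

Calling index_directory(r"\\myserver\pdfshare") from an account that can read the share would walk the same files DIH does. If a pathological PDF exhausts memory, it is the Tika process that dies, and it can be restarted without touching Solr.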

Best,
Erick


> On Mar 29, 2019, at 8:36 AM, neilb <[hidden email]> wrote:

Re: Solr indexing with Tika DIH local vs network share

neilb
Hi Erick, I am using the solrconfig.xml from the samples, and it has very few
entries. I have attached my config files to this reply for review.

Thanks
solrconfig.xml
<http://lucene.472066.n3.nabble.com/file/t494741/solrconfig.xml>  
tika-data-config.xml
<http://lucene.472066.n3.nabble.com/file/t494741/tika-data-config.xml>  
managed-schema
<http://lucene.472066.n3.nabble.com/file/t494741/managed-schema.managed-schema>  






Re: Solr indexing with Tika DIH local vs network share

Erick Erickson
So just try adding the autoCommit and autoSoftCommit settings. All of the example configs have these entries, and you can copy/paste/change them.
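
For reference, the entries in question look roughly like the fragment below, which goes inside the <config> element of solrconfig.xml. Treat the values as assumptions to adjust rather than recommendations: the 15-second hard commit is the default as recalled from the stock sample configs, and the one-minute soft commit is only the illustrative interval mentioned earlier in the thread. Check both against the solrconfig.xml shipped with your Solr version.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes to disk but does not open a new searcher,
       so it does not by itself make documents visible. -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: makes newly indexed documents searchable.
       60000 ms = one minute; -1 would disable it. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With a soft commit interval like this, the document count should start climbing within about a minute of the import starting, instead of only after DIH finishes.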

> On Mar 29, 2019, at 10:35 AM, neilb <[hidden email]> wrote:

Re: Solr indexing with Tika DIH local vs network share

neilb
Thank you Erick, this is very helpful!


