Idle Timeout while DIH indexing and implicit sharding in 7.4

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Idle Timeout while DIH indexing and implicit sharding in 7.4

Вадим Иванов
Hello gurus, 
I am using solrCloud with DIH for indexing my data.
Testing 7.4.0 with implicitly sharded collection  I have noticed that any
indexing
longer then 2 minutes always failing with many timeout records in log coming
from all replicas in collection.

Such as:
x:Mycol_s_0_replica_t40 RequestHandlerBase
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout
expired: 120001/120000 ms
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 120000/120000 ms
        at
org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1075)
        at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
        at
org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWra
pper.java:74)
...
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired:
120000/120000 ms
        at
org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$
201(ScheduledThreadPoolExecutor.java:180)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Sch
eduledThreadPoolExecutor.java:293)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
49)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
24)
        ... 1 more
        Suppressed: java.lang.Throwable: HttpInput failure
                at
org.eclipse.jetty.server.HttpInput.failed(HttpInput.java:821)
                at
org.eclipse.jetty.server.HttpConnection$BlockingReadCallback.failed(HttpConn
ection.java:649)
                at
org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:134)

Resulting indexing status:
  "statusMessages":{
    "Total Requests made to DataSource":"1",
    "Total Rows Fetched":"2828323",
    "Total Documents Processed":"2828323",
    "Total Documents Skipped":"0",
    "Full Dump Started":"2018-09-12 14:28:21",
    "":"Indexing completed. Added/Updated: 2828323 documents. Deleted 0
documents.",
    "Committed":"2018-09-12 14:33:41",
    "Time taken":"0:5:19.507",
    "Full Import failed":"2018-09-12 14:33:41"}}

Nevertheless all these documents seems indexed fine and searchable.
If the same collection not sharded  or sharded as " compositeId"   indexing
done without any errors.
Type of replicas - nrt or tolg doesn't matter.
Small Indexing (taking less than 2 minutes) run smoothly.

Testing environment - 1 node, Collection with 6 shards, 1 replica for each
shard
Collection:
/admin/collections?action=CREATE&name=Mycol
        &numShards=6
        &router.name=implicit
        &shards=s_0,s_1,s_2,s_3,s_4,s_5
        &router.field=sf_shard
        &collection.configName=Mycol
        &maxShardsPerNode=10
        &nrtReplicas=0&tlogReplicas=1


I have never noticed such behavior before on my prod configuration (solr
6.3.0)
Seems like bug in new version, but I could not find any jira on issue.

Any ideas, please...

--
BR
Vadim Ivanov

Reply | Threaded
Open this post in threaded view
|

RE: Idle Timeout while DIH indexing and implicit sharding in 7.4

Вадим Иванов
Hi,
I've put some more tests on the issue and managed to find out more details.
Time out occurs when while long indexing some documents in the beginning is
going to one shard and then for a long time (more than 120 sec) no data at
all is going to that shard.
Connection to that core, opened in the beginning of indexing, goes to  idle
timeout :( .
If no data at all going to the shard during indexing - no timeout occurs on
that shard.
If Indexing finishes earlier than 120 sec - no timeout occurs on that shard.
Unfortunately, in our use-case there are lot of long  indexing up to 30
minutes with uneven shard distribution of documents.
Any suggestion how to mitigate issue?
--
BR
Vadim Ivanov


-----Original Message-----
From: Вадим Иванов [mailto:[hidden email]]
Sent: Wednesday, September 12, 2018 4:29 PM
To: [hidden email]
Subject: Idle Timeout while DIH indexing and implicit sharding in 7.4

Hello gurus, 
I am using solrCloud with DIH for indexing my data.
Testing 7.4.0 with implicitly sharded collection  I have noticed that any
indexing
longer then 2 minutes always failing with many timeout records in log coming
from all replicas in collection.

Such as:
x:Mycol_s_0_replica_t40 RequestHandlerBase
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout
expired: 120001/120000 ms
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 120000/120000 ms
        at
org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1075)
        at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
        at
org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWra
pper.java:74)
...
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired:
120000/120000 ms
        at
org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$
201(ScheduledThreadPoolExecutor.java:180)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Sch
eduledThreadPoolExecutor.java:293)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
49)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
24)
        ... 1 more
        Suppressed: java.lang.Throwable: HttpInput failure
                at
org.eclipse.jetty.server.HttpInput.failed(HttpInput.java:821)
                at
org.eclipse.jetty.server.HttpConnection$BlockingReadCallback.failed(HttpConn
ection.java:649)
                at
org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:134)

Resulting indexing status:
  "statusMessages":{
    "Total Requests made to DataSource":"1",
    "Total Rows Fetched":"2828323",
    "Total Documents Processed":"2828323",
    "Total Documents Skipped":"0",
    "Full Dump Started":"2018-09-12 14:28:21",
    "":"Indexing completed. Added/Updated: 2828323 documents. Deleted 0
documents.",
    "Committed":"2018-09-12 14:33:41",
    "Time taken":"0:5:19.507",
    "Full Import failed":"2018-09-12 14:33:41"}}

Nevertheless all these documents seems indexed fine and searchable.
If the same collection not sharded  or sharded as " compositeId"   indexing
done without any errors.
Type of replicas - nrt or tolg doesn't matter.
Small Indexing (taking less than 2 minutes) run smoothly.

Testing environment - 1 node, Collection with 6 shards, 1 replica for each
shard
Collection:
/admin/collections?action=CREATE&name=Mycol
        &numShards=6
        &router.name=implicit
        &shards=s_0,s_1,s_2,s_3,s_4,s_5
        &router.field=sf_shard
        &collection.configName=Mycol
        &maxShardsPerNode=10
        &nrtReplicas=0&tlogReplicas=1


I have never noticed such behavior before on my prod configuration (solr
6.3.0)
Seems like bug in new version, but I could not find any jira on issue.

Any ideas, please...

--
BR
Vadim Ivanov

Reply | Threaded
Open this post in threaded view
|

Re: Idle Timeout while DIH indexing and implicit sharding in 7.4

Mikhail Khludnev-2
Hello, Vadim.
My guess (and only guess) that bunch of updates coming into a shard causes
a heavy merge that blocks new updates in its' order. This can be verified
with logs or threaddump from the problematic node. The probable measures
are: try to shuffle updates to load other shards for a while and let
parallel merge to pack that shard. And just wait a little by increasing
timeout in jetty.
Let us know what you will encounter.

On Thu, Sep 13, 2018 at 3:54 PM Vadim Ivanov <
[hidden email]> wrote:

> Hi,
> I've put some more tests on the issue and managed to find out more details.
> Time out occurs when while long indexing some documents in the beginning is
> going to one shard and then for a long time (more than 120 sec) no data at
> all is going to that shard.
> Connection to that core, opened in the beginning of indexing, goes to  idle
> timeout :( .
> If no data at all going to the shard during indexing - no timeout occurs on
> that shard.
> If Indexing finishes earlier than 120 sec - no timeout occurs on that
> shard.
> Unfortunately, in our use-case there are lot of long  indexing up to 30
> minutes with uneven shard distribution of documents.
> Any suggestion how to mitigate issue?
> --
> BR
> Vadim Ivanov
>
>
> -----Original Message-----
> From: Вадим Иванов [mailto:[hidden email]]
> Sent: Wednesday, September 12, 2018 4:29 PM
> To: [hidden email]
> Subject: Idle Timeout while DIH indexing and implicit sharding in 7.4
>
> Hello gurus,
> I am using solrCloud with DIH for indexing my data.
> Testing 7.4.0 with implicitly sharded collection  I have noticed that any
> indexing
> longer then 2 minutes always failing with many timeout records in log
> coming
> from all replicas in collection.
>
> Such as:
> x:Mycol_s_0_replica_t40 RequestHandlerBase
> java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout
> expired: 120001/120000 ms
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 120000/120000 ms
>         at
>
> org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1075)
>         at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
>         at
>
> org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWra
> pper.java:74)
> ...
> Caused by: java.util.concurrent.TimeoutException: Idle timeout expired:
> 120000/120000 ms
>         at
> org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
>         at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$
> 201(ScheduledThreadPoolExecutor.java:180)
>         at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Sch
> eduledThreadPoolExecutor.java:293)
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
> 49)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
> 24)
>         ... 1 more
>         Suppressed: java.lang.Throwable: HttpInput failure
>                 at
> org.eclipse.jetty.server.HttpInput.failed(HttpInput.java:821)
>                 at
>
> org.eclipse.jetty.server.HttpConnection$BlockingReadCallback.failed(HttpConn
> ection.java:649)
>                 at
> org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:134)
>
> Resulting indexing status:
>   "statusMessages":{
>     "Total Requests made to DataSource":"1",
>     "Total Rows Fetched":"2828323",
>     "Total Documents Processed":"2828323",
>     "Total Documents Skipped":"0",
>     "Full Dump Started":"2018-09-12 14:28:21",
>     "":"Indexing completed. Added/Updated: 2828323 documents. Deleted 0
> documents.",
>     "Committed":"2018-09-12 14:33:41",
>     "Time taken":"0:5:19.507",
>     "Full Import failed":"2018-09-12 14:33:41"}}
>
> Nevertheless all these documents seems indexed fine and searchable.
> If the same collection not sharded  or sharded as " compositeId"   indexing
> done without any errors.
> Type of replicas - nrt or tolg doesn't matter.
> Small Indexing (taking less than 2 minutes) run smoothly.
>
> Testing environment - 1 node, Collection with 6 shards, 1 replica for each
> shard
> Collection:
> /admin/collections?action=CREATE&name=Mycol
>         &numShards=6
>         &router.name=implicit
>         &shards=s_0,s_1,s_2,s_3,s_4,s_5
>         &router.field=sf_shard
>         &collection.configName=Mycol
>         &maxShardsPerNode=10
>         &nrtReplicas=0&tlogReplicas=1
>
>
> I have never noticed such behavior before on my prod configuration (solr
> 6.3.0)
> Seems like bug in new version, but I could not find any jira on issue.
>
> Any ideas, please...
>
> --
> BR
> Vadim Ivanov
>
>

--
Sincerely yours
Mikhail Khludnev
Reply | Threaded
Open this post in threaded view
|

RE: Idle Timeout while DIH indexing and implicit sharding in 7.4

Вадим Иванов
Hello
Mikhail, thank you  for support.
I have already tested this case a lot to be sure what is happening under the hood.
As you proposed - I 've shuffled data coming from sql to solr to see how solr reacts:
I have 6 shards s0 ... s5
shard - is the routing field in my collection. (router.name=implicit&router.field=shard)
Му sql query looks like this

Select
        id
        ....
        , Case when  100 > RowNumber then 's5'
        else 's_' + cast(RowNumber % 4 as varchar)
         end  as shard
from ...

Оnly first 99 rows goes to shard s5, and all the rest data spreads evenly between s0 ... s3.
After 120 sec of indexing I receive IdleTimeout  from shard leader of s5
s4 receives no data and seems do not open connection at all - so no Timeout occurs
s0...s3 receives data and  no Timeout occurs

When I tweak IdleTimeout  in /opt/solr-7.4.0/server/etc/jetty-http.xml It helps,
But I have concerns about icreasing it from 120sec to 30 min.
Is it safe? What consequences could be?

I have noticed that IdleTimeout  in jetty 9.3.8 (coming with Solr 6.3.0) was 50 sec
And no such behavior was observed in Solr 6.3. So default was increased significantly in 9.4.10 for some reason.
Maybe someone could shed some light on the reasons. What was changed in document routing behavior and why?
Maybe there was discussion about it that I could not find?

--
BR Vadim

-----Original Message-----
From: Mikhail Khludnev [mailto:[hidden email]]
Sent: Friday, September 14, 2018 12:10 PM
To: solr-user
Subject: Re: Idle Timeout while DIH indexing and implicit sharding in 7.4

Hello, Vadim.
My guess (and only guess) that bunch of updates coming into a shard causes
a heavy merge that blocks new updates in its' order. This can be verified
with logs or threaddump from the problematic node. The probable measures
are: try to shuffle updates to load other shards for a while and let
parallel merge to pack that shard. And just wait a little by increasing
timeout in jetty.
Let us know what you will encounter.

On Thu, Sep 13, 2018 at 3:54 PM Vadim Ivanov <
[hidden email]> wrote:

> Hi,
> I've put some more tests on the issue and managed to find out more details.
> Time out occurs when while long indexing some documents in the beginning is
> going to one shard and then for a long time (more than 120 sec) no data at
> all is going to that shard.
> Connection to that core, opened in the beginning of indexing, goes to  idle
> timeout :( .
> If no data at all going to the shard during indexing - no timeout occurs on
> that shard.
> If Indexing finishes earlier than 120 sec - no timeout occurs on that
> shard.
> Unfortunately, in our use-case there are lot of long  indexing up to 30
> minutes with uneven shard distribution of documents.
> Any suggestion how to mitigate issue?
> --
> BR
> Vadim Ivanov
>
>
> -----Original Message-----
> From: Вадим Иванов [mailto:[hidden email]]
> Sent: Wednesday, September 12, 2018 4:29 PM
> To: [hidden email]
> Subject: Idle Timeout while DIH indexing and implicit sharding in 7.4
>
> Hello gurus,
> I am using solrCloud with DIH for indexing my data.
> Testing 7.4.0 with implicitly sharded collection  I have noticed that any
> indexing
> longer then 2 minutes always failing with many timeout records in log
> coming
> from all replicas in collection.
>
> Such as:
> x:Mycol_s_0_replica_t40 RequestHandlerBase
> java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout
> expired: 120001/120000 ms
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 120000/120000 ms
>         at
>
> org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1075)
>         at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
>         at
>
> org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWra
> pper.java:74)
> ...
> Caused by: java.util.concurrent.TimeoutException: Idle timeout expired:
> 120000/120000 ms
>         at
> org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
>         at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$
> 201(ScheduledThreadPoolExecutor.java:180)
>         at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Sch
> eduledThreadPoolExecutor.java:293)
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
> 49)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
> 24)
>         ... 1 more
>         Suppressed: java.lang.Throwable: HttpInput failure
>                 at
> org.eclipse.jetty.server.HttpInput.failed(HttpInput.java:821)
>                 at
>
> org.eclipse.jetty.server.HttpConnection$BlockingReadCallback.failed(HttpConn
> ection.java:649)
>                 at
> org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:134)
>
> Resulting indexing status:
>   "statusMessages":{
>     "Total Requests made to DataSource":"1",
>     "Total Rows Fetched":"2828323",
>     "Total Documents Processed":"2828323",
>     "Total Documents Skipped":"0",
>     "Full Dump Started":"2018-09-12 14:28:21",
>     "":"Indexing completed. Added/Updated: 2828323 documents. Deleted 0
> documents.",
>     "Committed":"2018-09-12 14:33:41",
>     "Time taken":"0:5:19.507",
>     "Full Import failed":"2018-09-12 14:33:41"}}
>
> Nevertheless all these documents seems indexed fine and searchable.
> If the same collection not sharded  or sharded as " compositeId"   indexing
> done without any errors.
> Type of replicas - nrt or tolg doesn't matter.
> Small Indexing (taking less than 2 minutes) run smoothly.
>
> Testing environment - 1 node, Collection with 6 shards, 1 replica for each
> shard
> Collection:
> /admin/collections?action=CREATE&name=Mycol
>         &numShards=6
>         &router.name=implicit
>         &shards=s_0,s_1,s_2,s_3,s_4,s_5
>         &router.field=sf_shard
>         &collection.configName=Mycol
>         &maxShardsPerNode=10
>         &nrtReplicas=0&tlogReplicas=1
>
>
> I have never noticed such behavior before on my prod configuration (solr
> 6.3.0)
> Seems like bug in new version, but I could not find any jira on issue.
>
> Any ideas, please...
>
> --
> BR
> Vadim Ivanov
>
>

--
Sincerely yours
Mikhail Khludnev