[jira] [Commented] (SOLR-10397) Port 'autoAddReplicas' feature to the policy rules framework and make it work with non-shared filesystems



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089405#comment-16089405 ]

Shalin Shekhar Mangar commented on SOLR-10397:
----------------------------------------------

Thanks Dat.

bq. OverseerFailover is not guaranteed ( we should tackle this problem in another issue )

I've opened SOLR-11085 for improving resiliency of actions against overseer failures.

bq. AutoAddReplicas is triggered by NodeLost event, so when we switch autoAddReplicas from off to on nothing happen. I think this is ok.

I'm inclined to remove this quirk in how autoAddReplicas used to work; I don't think we need to support it. Please ensure that both the deprecation of the cluster property and this behavior change are documented in CHANGES.txt under the upgrade notes section.

A few things I noticed in the patch:
# Typo in the AutoAddReplicasIntergrationTest class name ("Intergration" instead of "Integration")
# Same typo in HdfsAutoAddReplicasIntergrationTest
# There is a large block of code commented out in SharedFSAutoReplicaFailoverTest. Please remove it if it is no longer needed.
# TestPolicy.testMoveReplicasInMultipleCollections does not seem like a very useful test; all it verifies is that some operation is returned. It should assert that only the hinted collections' replicas are moved, and that no operation is returned when the lost node hosts no replicas belonging to those collections.
# minor nit -- {{autoAddReplicas != null && autoAddReplicas.equals("false")}} can be simplified to {{!Boolean.parseBoolean(autoAddReplicas)}}
# Typo ("Waitting") in the comments on the waitForState calls in AutoAddReplicasIntergrationTest.testSimple
# The return value of {{waitForAllActiveAndLiveReplicas}} should be asserted to be true in the tests; otherwise the test silently succeeds even when the wait times out.
# I am seeing some thread leak failures in HdfsAutoAddReplicasIntergrationTest:
{code}
NOTE: reproduce with: ant test  -Dtestcase=HdfsAutoAddReplicasIntergrationTest -Dtests.seed=EF1C283E3B67B9EE -Dtests.locale=mk-MK -Dtests.timezone=Etc/GMT-2 -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Test ignored.

com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   1) Thread[id=685, name=ForkJoinPool.commonPool-worker-0, state=TIMED_WAITING, group=TGRP-HdfsAutoAddReplicasIntergrationTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   2) Thread[id=686, name=ForkJoinPool.commonPool-worker-7, state=WAITING, group=TGRP-HdfsAutoAddReplicasIntergrationTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   3) Thread[id=687, name=ForkJoinPool.commonPool-worker-1, state=WAITING, group=TGRP-HdfsAutoAddReplicasIntergrationTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

        at __randomizedtesting.SeedInfo.seed([EF1C283E3B67B9EE]:0)
{code}
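On the {{Boolean.parseBoolean}} nit above, here is a standalone illustration of the suggested simplification (not code from the patch; the variable names are mine). One caveat worth noting: the two forms differ when the property is unset, because {{Boolean.parseBoolean(null)}} returns false:

```java
public class AutoAddReplicasFlag {
    public static void main(String[] args) {
        String prop = "false";
        // Original check: true only when the property is the exact string "false".
        boolean original = prop != null && prop.equals("false");
        // Suggested simplification: true for anything that does not parse as "true".
        boolean simplified = !Boolean.parseBoolean(prop);
        System.out.println(original + " " + simplified); // true true

        // The two differ when the property is unset (null):
        String unset = null;
        System.out.println(unset != null && unset.equals("false")); // false
        System.out.println(!Boolean.parseBoolean(unset));           // true
    }
}
```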

> Port 'autoAddReplicas' feature to the policy rules framework and make it work with non-shared filesystems
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10397
>                 URL: https://issues.apache.org/jira/browse/SOLR-10397
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Cao Manh Dat
>              Labels: autoscaling
>             Fix For: 7.0
>
>         Attachments: SOLR-10397.1.patch, SOLR-10397.2.patch, SOLR-10397.patch
>
>
> Currently 'autoAddReplicas=true' can be specified in the Collection Create API to automatically add replicas when a replica becomes unavailable. I propose to move this feature to the autoscaling cluster policy rules design.
> This will include the following:
> * Trigger support for ‘nodeLost’ event type
> * Modification of existing implementation of ‘autoAddReplicas’ to automatically create the appropriate ‘nodeLost’ trigger.
> * Any such auto-created trigger must be marked internally such that setting ‘autoAddReplicas=false’ via the Modify Collection API deletes or disables the corresponding trigger.
> * Support for non-HDFS filesystems while retaining the optimization afforded by HDFS i.e. the replaced replica can point to the existing data dir of the old replica.
> * Deprecate/remove the feature of enabling/disabling ‘autoAddReplicas’ across the entire cluster using cluster properties in favor of using the suspend-trigger/resume-trigger APIs.
> This will retain backward compatibility for the most part and keep a common use-case easy to enable as well as make it available to more people (i.e. people who don't use HDFS).
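The auto-created 'nodeLost' trigger described in the issue could be expressed through the autoscaling API roughly as follows (a hypothetical sketch based on the Solr 7.x autoscaling API shape; the trigger name, waitFor value, and action classes are illustrative assumptions, not taken from the patch):

```json
{
  "set-trigger": {
    "name": ".auto_add_replicas",
    "event": "nodeLost",
    "waitFor": "30s",
    "enabled": true,
    "actions": [
      {"name": "compute_plan", "class": "solr.ComputePlanAction"},
      {"name": "execute_plan", "class": "solr.ExecutePlanAction"}
    ]
  }
}
```

Under this design, disabling autoAddReplicas on a collection would disable or delete the corresponding trigger, and the suspend-trigger/resume-trigger APIs replace the old cluster-wide property.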



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
