-
Notifications
You must be signed in to change notification settings - Fork 755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CLUSTER SETSLOT block and unblock error when all replicas are down #879
Fix CLUSTER SETSLOT block and unblock error when all replicas are down #879
Conversation
In CLUSTER SETSLOT propagation logic, if the replicas are down, the client will get block during command processing and then unblock with `NOREPLICAS Not enough good replicas to write`. The reason is that all replicas are down (or some are down), but myself->num_replicas is including all replicas, so the client will get block and always get timeout. We should only wait for those normal replicas, otherwise the waiting propagation will always timeout since there are not enough replicas. Signed-off-by: Binbin <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #879 +/- ##
============================================
+ Coverage 70.40% 70.59% +0.19%
============================================
Files 112 112
Lines 61465 61510 +45
============================================
+ Hits 43275 43425 +150
+ Misses 18190 18085 -105
|
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin I am not sure why this is a bug? We require strong consistency on purpose when replicating the slot migration states. Dropping the synchronization/wait opens up race conditions and reduces reliability. Is there a scenario where such a strong consistency is not desired? |
i think it should wait the online replica, but you also got a good point here, so in this case, you are suggesting user should fix the offline replica first and then fix the migration? there may a case that a replica is remain fail but we are not forgetting it, so it is always here. |
we are struck in this must or not situation. I just thought a good point about why we should use online replica. Migrations is some kind of admin operation, if the replica is failed, we should know it before the migration. We can fix it or just leave it. If we still performing the migration, we know the drawback, and we will wait the online replicas. Of course there might be some conner case, but i think we can live with it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes LGTM overall. Just some nits.
I think you have a good point. The admin can easily check if there are replicas that are down for an extended period of time. If they decide to move forward anyways, we should not block it. If a replica failed right before the replication and didn't get included in the replication, it would also unlikely win the election. This PR makes sense to me. Thanks! |
Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
Signed-off-by: Binbin <[email protected]>
#879) In CLUSTER SETSLOT propagation logic, if the replicas are down, the client will get block during command processing and then unblock with `NOREPLICAS Not enough good replicas to write`. The reason is that all replicas are down (or some are down), but myself->num_replicas is including all replicas, so the client will get block and always get timeout. We should only wait for those online replicas, otherwise the waiting propagation will always timeout since there are not enough replicas. The admin can easily check if there are replicas that are down for an extended period of time. If they decide to move forward anyways, we should not block it. If a replica failed right before the replication and was not included in the replication, it would also unlikely win the election. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
#879) In CLUSTER SETSLOT propagation logic, if the replicas are down, the client will get block during command processing and then unblock with `NOREPLICAS Not enough good replicas to write`. The reason is that all replicas are down (or some are down), but myself->num_replicas is including all replicas, so the client will get block and always get timeout. We should only wait for those online replicas, otherwise the waiting propagation will always timeout since there are not enough replicas. The admin can easily check if there are replicas that are down for an extended period of time. If they decide to move forward anyways, we should not block it. If a replica failed right before the replication and was not included in the replication, it would also unlikely win the election. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
In CLUSTER SETSLOT propagation logic, if the replicas are down, the
client will get block during command processing and then unblock
with
NOREPLICAS Not enough good replicas to write
.The reason is that all replicas are down (or some are down), but
myself->num_replicas is including all replicas, so the client will
get block and always get timeout.
We should only wait for those online replicas, otherwise the waiting
propagation will always timeout since there are not enough replicas.
The admin can easily check if there are replicas that are down for an
extended period of time. If they decide to move forward anyways, we
should not block it. If a replica failed right before the replication and
was not included in the replication, it would also unlikely win the election.