Fix CLUSTER SETSLOT block and unblock error when all replicas are down #879

Merged

Conversation

enjoy-binbin (Member) commented Aug 8, 2024

In the CLUSTER SETSLOT propagation logic, if replicas are down, the
client gets blocked during command processing and is later unblocked
with `NOREPLICAS Not enough good replicas to write`.

The reason is that some (or all) replicas are down, but
myself->num_replicas counts all replicas, so the client blocks
and always times out.

We should only wait for the online replicas; otherwise the propagation
wait will always time out, since there are never enough replicas to
acknowledge it. The admin can easily check whether any replicas have been
down for an extended period of time. If they decide to move forward
anyway, we should not block them. And if a replica fails right before the
replication and is therefore not included in it, it is also unlikely
to win an election.

Signed-off-by: Binbin <[email protected]>
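
To make the idea concrete, here is a minimal, self-contained C sketch, not the merged patch: count only the replicas that are not flagged FAIL, and use that count when deciding how many acknowledgments to wait for. The struct layout, flag value, and the `countOnlineReplicas()` helper are illustrative; the real definitions live in `src/cluster_legacy.h`.

```c
/* Minimal mock of the cluster node structure; the real definition
 * lives in src/cluster_legacy.h. Field names follow the Valkey
 * convention (flags, num_replicas, replicas); the flag value and the
 * countOnlineReplicas() helper are illustrative. */
#define CLUSTER_NODE_FAIL (1 << 3)
#define nodeFailed(n) ((n)->flags & CLUSTER_NODE_FAIL)

typedef struct clusterNode {
    int flags;
    int num_replicas;              /* all known replicas, dead or alive */
    struct clusterNode **replicas;
} clusterNode;

/* Count only the replicas not flagged FAIL. Waiting on this number,
 * rather than on num_replicas, is what keeps the SETSLOT propagation
 * from blocking until a guaranteed NOREPLICAS timeout. */
static int countOnlineReplicas(clusterNode *myself) {
    int online = 0;
    for (int i = 0; i < myself->num_replicas; i++) {
        if (!nodeFailed(myself->replicas[i])) online++;
    }
    return online;
}
```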
@enjoy-binbin enjoy-binbin requested a review from PingXie August 8, 2024 15:49

codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.59%. Comparing base (109cc21) to head (7a41f69).
Report is 34 commits behind head on unstable.

Additional details and impacted files
```diff
@@             Coverage Diff              @@
##           unstable     #879      +/-   ##
============================================
+ Coverage     70.40%   70.59%   +0.19%     
============================================
  Files           112      112              
  Lines         61465    61510      +45     
============================================
+ Hits          43275    43425     +150     
+ Misses        18190    18085     -105     
```

| Files | Coverage Δ |
|---|---|
| src/cluster_legacy.c | 85.51% <100.00%> (+0.17%) ⬆️ |

... and 25 files with indirect coverage changes

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Aug 9, 2024
Signed-off-by: Binbin <[email protected]>
PingXie (Member) commented Aug 13, 2024

@enjoy-binbin I am not sure why this is a bug. We require strong consistency on purpose when replicating the slot migration states. Dropping the synchronization/wait opens up race conditions and reduces reliability. Is there a scenario where such strong consistency is not desired?

enjoy-binbin (Member, Author) commented
I think it should wait for the online replicas, but you also have a good point here. So in this case, you are suggesting the user should fix the offline replica first and then retry the migration?

There may be a case where a replica remains failed but we never forget it, so it stays around forever.

enjoy-binbin (Member, Author) commented
We are stuck in this must-or-not situation, but I just thought of a good reason why we should use the online replicas.

Migration is a kind of admin operation; if a replica has failed, we should know it before the migration, and we can either fix it or leave it.

If we still perform the migration, we know the drawback, and we will wait for the online replicas.

Of course there might be some corner cases, but I think we can live with them.

PingXie (Member) left a comment
The changes LGTM overall. Just some nits.

PingXie (Member) commented Aug 20, 2024

> We are stuck in this must-or-not situation, but I just thought of a good reason why we should use the online replicas.
>
> Migration is a kind of admin operation; if a replica has failed, we should know it before the migration, and we can either fix it or leave it.
>
> If we still perform the migration, we know the drawback, and we will wait for the online replicas.
>
> Of course there might be some corner cases, but I think we can live with them.

I think you have a good point. The admin can easily check whether any replicas have been down for an extended period of time. If they decide to move forward anyway, we should not block them. If a replica failed right before the replication and didn't get included in it, it would also be unlikely to win an election. This PR makes sense to me. Thanks!
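
In code terms, the agreed behavior might look like the hedged sketch below; it reuses the illustrative `countOnlineReplicas()` helper sketched earlier. The `client` type and `blockForReplicaAcks()` are hypothetical stand-ins for the real blocking path in `src/cluster_legacy.c`, not actual Valkey functions.

```c
/* Hypothetical declarations so the sketch is self-contained; the real
 * types and blocking call live in the Valkey source. */
typedef struct client client;
void blockForReplicaAcks(client *c, int num_acks, long timeout_ms);

/* Hedged sketch of the agreed behavior, not the merged diff. */
void propagateSetSlot(client *c, clusterNode *myself, long timeout_ms) {
    int online = countOnlineReplicas(myself); /* helper sketched above */
    if (online > 0) {
        /* Wait for acks from the online replicas only, instead of
         * myself->num_replicas, which also counts failed nodes. */
        blockForReplicaAcks(c, online, timeout_ms);
    }
    /* online == 0: all replicas are down and the admin has chosen to
     * proceed, so continue without blocking rather than timing out
     * with NOREPLICAS. */
}
```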

Signed-off-by: Binbin <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin requested a review from PingXie August 22, 2024 06:30
@enjoy-binbin enjoy-binbin merged commit 5d97f51 into valkey-io:unstable Aug 23, 2024
47 checks passed
@enjoy-binbin enjoy-binbin deleted the fix_cluster_setslot_block branch August 23, 2024 08:21
madolson pushed a commit that referenced this pull request Sep 2, 2024: Fix CLUSTER SETSLOT block and unblock error when all replicas are down (#879)
madolson pushed a commit that referenced this pull request Sep 3, 2024: Fix CLUSTER SETSLOT block and unblock error when all replicas are down (#879)
Labels
release-notes This issue should get a line item in the release notes
Status: Done