Fix CLUSTER SETSLOT block and unblock error when all replicas are down #879

Merged

Conversation

enjoy-binbin (Member) commented Aug 8, 2024

In the CLUSTER SETSLOT propagation logic, if replicas are down, the
client gets blocked during command processing and is later unblocked
with `NOREPLICAS Not enough good replicas to write`.

The reason is that some (or all) replicas are down, but
myself->num_replicas counts all replicas, so the client blocks
and always times out.

We should only wait for the online replicas; otherwise the propagation
wait will always time out, since there are never enough replicas to
acknowledge it. The admin can easily check whether any replicas have been
down for an extended period of time. If they decide to move forward
anyway, we should not block them. And if a replica fails right before the
replication and is therefore not included in it, it is also unlikely
to win an election.

Signed-off-by: Binbin <[email protected]>
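
To make the idea concrete, here is a minimal, self-contained C sketch, not the merged patch: count only the replicas that are not flagged FAIL, and use that count when deciding how many acknowledgments to wait for. The struct layout, flag value, and the `countOnlineReplicas()` helper are illustrative; the real definitions live in `src/cluster_legacy.h`.

```c
/* Minimal mock of the cluster node structure; the real definition
 * lives in src/cluster_legacy.h. Field names follow the Valkey
 * convention (flags, num_replicas, replicas); the flag value and the
 * countOnlineReplicas() helper are illustrative. */
#define CLUSTER_NODE_FAIL (1 << 3)
#define nodeFailed(n) ((n)->flags & CLUSTER_NODE_FAIL)

typedef struct clusterNode {
    int flags;
    int num_replicas;              /* all known replicas, dead or alive */
    struct clusterNode **replicas;
} clusterNode;

/* Count only the replicas not flagged FAIL. Waiting on this number,
 * rather than on num_replicas, is what keeps the SETSLOT propagation
 * from blocking until a guaranteed NOREPLICAS timeout. */
static int countOnlineReplicas(clusterNode *myself) {
    int online = 0;
    for (int i = 0; i < myself->num_replicas; i++) {
        if (!nodeFailed(myself->replicas[i])) online++;
    }
    return online;
}
```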
@enjoy-binbin enjoy-binbin requested a review from PingXie August 8, 2024 15:49

codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.59%. Comparing base (109cc21) to head (7a41f69).
Report is 34 commits behind head on unstable.

Additional details and impacted files
```diff
@@             Coverage Diff              @@
##           unstable     #879      +/-   ##
============================================
+ Coverage     70.40%   70.59%   +0.19%     
============================================
  Files           112      112              
  Lines         61465    61510      +45     
============================================
+ Hits          43275    43425     +150     
+ Misses        18190    18085     -105     
```

| Files | Coverage Δ |
|---|---|
| src/cluster_legacy.c | 85.51% <100.00%> (+0.17%) ⬆️ |

... and 25 files with indirect coverage changes

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Aug 9, 2024
Signed-off-by: Binbin <[email protected]>
PingXie (Member) commented Aug 13, 2024

@enjoy-binbin I am not sure why this is a bug. We require strong consistency on purpose when replicating the slot migration states. Dropping the synchronization/wait opens up race conditions and reduces reliability. Is there a scenario where such strong consistency is not desired?

enjoy-binbin (Member, Author) commented
I think it should wait for the online replicas, but you also have a good point here. So in this case, you are suggesting the user should fix the offline replica first and then retry the migration?

There may be a case where a replica remains failed but we never forget it, so it stays around forever.

enjoy-binbin (Member, Author) commented
We are stuck in this must-or-not situation, but I just thought of a good reason why we should use the online replicas.

Migration is a kind of admin operation; if a replica has failed, we should know it before the migration, and we can either fix it or leave it.

If we still perform the migration, we know the drawback, and we will wait for the online replicas.

Of course there might be some corner cases, but I think we can live with them.

PingXie (Member) left a comment
The changes LGTM overall. Just some nits.

PingXie (Member) commented Aug 20, 2024

> We are stuck in this must-or-not situation, but I just thought of a good reason why we should use the online replicas.
>
> Migration is a kind of admin operation; if a replica has failed, we should know it before the migration, and we can either fix it or leave it.
>
> If we still perform the migration, we know the drawback, and we will wait for the online replicas.
>
> Of course there might be some corner cases, but I think we can live with them.

I think you have a good point. The admin can easily check whether any replicas have been down for an extended period of time. If they decide to move forward anyway, we should not block them. If a replica failed right before the replication and didn't get included in it, it would also be unlikely to win an election. This PR makes sense to me. Thanks!
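
In code terms, the agreed behavior might look like the hedged sketch below; it reuses the illustrative `countOnlineReplicas()` helper sketched earlier. The `client` type and `blockForReplicaAcks()` are hypothetical stand-ins for the real blocking path in `src/cluster_legacy.c`, not actual Valkey functions.

```c
/* Hypothetical declarations so the sketch is self-contained; the real
 * types and blocking call live in the Valkey source. */
typedef struct client client;
void blockForReplicaAcks(client *c, int num_acks, long timeout_ms);

/* Hedged sketch of the agreed behavior, not the merged diff. */
void propagateSetSlot(client *c, clusterNode *myself, long timeout_ms) {
    int online = countOnlineReplicas(myself); /* helper sketched above */
    if (online > 0) {
        /* Wait for acks from the online replicas only, instead of
         * myself->num_replicas, which also counts failed nodes. */
        blockForReplicaAcks(c, online, timeout_ms);
    }
    /* online == 0: all replicas are down and the admin has chosen to
     * proceed, so continue without blocking rather than timing out
     * with NOREPLICAS. */
}
```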

Signed-off-by: Binbin <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin requested a review from PingXie August 22, 2024 06:30
@enjoy-binbin enjoy-binbin merged commit 5d97f51 into valkey-io:unstable Aug 23, 2024
47 checks passed
@enjoy-binbin enjoy-binbin deleted the fix_cluster_setslot_block branch August 23, 2024 08:21
madolson pushed a commit that referenced this pull request Sep 2, 2024: Fix CLUSTER SETSLOT block and unblock error when all replicas are down (#879)
madolson pushed a commit that referenced this pull request Sep 3, 2024: Fix CLUSTER SETSLOT block and unblock error when all replicas are down (#879)
Labels
release-notes This issue should get a line item in the release notes
Status: Done