[BUG] cluster rebalance --cluster-weight <node>=0 fails with clusterManagerMoveSlot: NOREPLICAS error #899
Comments
Thanks for the report. I think #879 can fix it; I will check it later. Edit: I see the PR does not cover this case, I will handle it later.
@PingXie in this case, the primary is cluster-allow-replica-migration no and its replica is cluster-allow-replica-migration yes. During the migration:
1. The primary calls blockClientForReplicaAck, waiting for its replica's ack.
2. The replica reconfigures itself as a replica of another shard due to replica migration and disconnects from the old primary.
3. The old primary never gets the chance to receive the ack, so it times out with a NOREPLICAS error.
I can't think of a good way to handle this; do you have any ideas? A tcl test that can reproduce this:
Right, I don't think #879 would help with this issue. I think one option could be waiting for …
I did a bit more investigation and it is failing on the second call to clusterManagerSetSlot in clusterManagerMoveSlot, where it sets the slot on the source node. Would it be OK to add NOREPLICAS to the list of acceptable errors?
@jdork0 you are right. There is no harm in ignoring the error in this case. In fact, I think this situation is essentially the same as the case where both the source primary and its replicas become replicas of the target shard. The last …
I am guessing we can't simply ignore the NOREPLICAS error here? Otherwise we will lose the CLUSTER SETSLOT command in the source node's view (though it may be harmless, since the destination node will gossip the message eventually). OK, let's ignore the NOREPLICAS error on the source node side.
Right. We should only ignore it on the source side.
…USTER SETSLOT (#928)

This fixes #899. In that issue, the primary is cluster-allow-replica-migration no and its replica is cluster-allow-replica-migration yes. During the slot migration:
1. The primary calls blockClientForReplicaAck, waiting for its replica's ack.
2. The replica reconfigures itself as a replica of another shard due to replica migration and disconnects from the old primary.
3. The old primary never gets the chance to receive the ack, so it times out with a NOREPLICAS error.
In this case, the replicas might automatically migrate to another primary, resulting in the client being unblocked with the NOREPLICAS error. Since the configuration will eventually propagate itself, we can safely ignore this error on the source node.

Signed-off-by: Binbin <[email protected]>
Describe the bug
Testing with the 8.0.0-rc1 build, during a rebalance launched from valkey-cli to move all the slots off one master, an error is seen if that master is configured with 'cluster-allow-replica-migration no'.
If 'cluster-allow-replica-migration yes' is set, the command succeeds.
This is different behaviour from valkey 7.2.6, where the command succeeds with 'cluster-allow-replica-migration no'.
To reproduce
Create an 8 node cluster, 4 primaries, 4 replicas:
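Something along these lines, assuming 8 local nodes on ports 7000-7007 (the exact hosts and ports here are placeholders, not my actual setup):

```
# ports 7000-7007 are placeholders for the 8 nodes
valkey-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 \
  127.0.0.1:7004 127.0.0.1:7005 127.0.0.1:7006 127.0.0.1:7007 --cluster-replicas 1
```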
Pick one of the primaries (in my case, node 4), get its node id, and set 'cluster-allow-replica-migration no':
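For illustration, with port 7003 standing in for that primary (the actual port is not shown here):

```
# 7003 is a placeholder for the chosen primary's port
valkey-cli -p 7003 cluster myid
valkey-cli -p 7003 config set cluster-allow-replica-migration no
```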
Then rebalance the cluster so that it moves all slots off that master:
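Roughly the following, where <node-id> is the id returned by CLUSTER MYID above and 127.0.0.1:7000 is any reachable cluster node (both placeholders):

```
# weight 0 makes the rebalance move every slot away from that node
valkey-cli --cluster rebalance 127.0.0.1:7000 --cluster-weight <node-id>=0
```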
Despite the error, the cluster check appears OK, I think:
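The check was along these lines (output omitted; host and port are placeholders):

```
valkey-cli --cluster check 127.0.0.1:7000
```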
Expected behavior
I expect this should not error, as it works in valkey 7.2.6.
Additional information
All server logs are attached.
logs.tar.gz