ControlPlane node setup failing with "etcdserver: can only promote a learner member" #3152
Comments
Yes, in my opinion the suggested extra check inside the wait would be helpful. The edge case where an etcd member promotion succeeds on the backend but the client receives a timeout can be handled within PollUntilContextTimeout. By preventing pointless promotion attempts, this check would make the MemberPromote method more dependable.
The code should re-check whether the member is already promoted before attempting the promotion. If this issue is triaged and accepted, I would like to work on it.
/sig cluster-lifecycle
/transfer kubeadm
I don't think it's a bug per se, so it's more of a feature / improvement.
Friendly ping @niranjandarshann: If you’d like to contribute to this, you might want to put the plan for the implementation code on your agenda, as we’re nearing the code freeze on March 20th.
/assign @neolit123 @HirazawaUi I will work on it with Bernard; I am trying to find some spare time for it. The implementation seems straightforward (famous last words...)
Hello
I am using CAPI to manage setting up new Kubernetes clusters.
In my environment I am encountering an issue where I try to set up a 3-node control plane but get stuck after adding the second node.
In the kcp logs this error repeats
The etcd member table looks like this
If I look at the cloud-init-output log for node obj-store-01-wchl9, it terminates with an error
So it appears that we had added an etcd member as a learner and then were attempting to promote it.
One of the calls to promote timed out on the client but succeeded on the backend, causing all future calls to promote to fail.
I found a recent PR, kubernetes/kubernetes#127491, that addresses a similar issue, but it would not address the problem I'm seeing.
In the latest kubeadm the MemberPromote now looks like this
What I would like to do is add an additional check within the wait.PollUntilContextTimeout func so that it checks whether the member has been promoted before every attempt to promote. If the member has already been promoted, it should return early, i.e. reuse this code from earlier in the function
Ideally the etcd cluster wouldn't return a timeout message when the operation actually succeeded, but I would like to add this extra check regardless, as I think it's fairly low risk. How would I go about adding it?
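To make the idea concrete, here is a minimal, self-contained sketch of the re-check pattern. It is not the kubeadm implementation: the memberClient interface, the member struct, the flakyClient fake and the promoteWithRecheck helper are all hypothetical names made up for illustration, and a plain retry loop stands in for wait.PollUntilContextTimeout so the example runs with the standard library only. In real code the list and promote calls would go through the etcd client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// member mirrors only the two member fields this sketch needs (hypothetical type).
type member struct {
	ID        uint64
	IsLearner bool
}

// memberClient is a hypothetical, narrowed-down stand-in for the etcd client.
type memberClient interface {
	MemberList(ctx context.Context) ([]member, error)
	MemberPromote(ctx context.Context, id uint64) error
}

var errNotLearner = errors.New("etcdserver: can only promote a learner member")

// isLearner reports whether the member with the given ID is still a learner.
func isLearner(ctx context.Context, c memberClient, id uint64) (bool, error) {
	members, err := c.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range members {
		if m.ID == id {
			return m.IsLearner, nil
		}
	}
	return false, fmt.Errorf("member %x not found", id)
}

// promoteWithRecheck retries the promotion until ctx expires. Before every
// attempt it re-checks the member list and returns early if the member is
// already a voting member, e.g. because an earlier attempt succeeded on the
// backend even though the client saw a timeout.
func promoteWithRecheck(ctx context.Context, c memberClient, id uint64, interval time.Duration) error {
	var lastErr error
	for {
		learner, err := isLearner(ctx, c, id)
		if err == nil && !learner {
			return nil // already promoted, nothing left to do
		}
		if err == nil {
			if err = c.MemberPromote(ctx, id); err == nil {
				return nil
			}
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return fmt.Errorf("promotion did not complete: %w (last error: %v)", ctx.Err(), lastErr)
		case <-time.After(interval):
		}
	}
}

// flakyClient simulates the reported scenario: the first promote call takes
// effect on the backend but the client receives a timeout.
type flakyClient struct {
	learner      bool
	promoteCalls int
}

func (f *flakyClient) MemberList(ctx context.Context) ([]member, error) {
	return []member{{ID: 1, IsLearner: f.learner}}, nil
}

func (f *flakyClient) MemberPromote(ctx context.Context, id uint64) error {
	f.promoteCalls++
	if !f.learner {
		return errNotLearner
	}
	f.learner = false               // promotion succeeds on the backend...
	return context.DeadlineExceeded // ...but the client sees a timeout
}

// runDemo returns how many promote calls were made and the final error.
func runDemo() (int, error) {
	c := &flakyClient{learner: true}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := promoteWithRecheck(ctx, c, 1, 10*time.Millisecond)
	return c.promoteCalls, err
}

func main() {
	calls, err := runDemo()
	fmt.Println("promote calls:", calls, "error:", err)
}
```

With the fake client above, the first promote attempt times out, the next iteration sees that the member is no longer a learner, and the helper returns without issuing a second promote call. Without the re-check, every later attempt would hit the "can only promote a learner member" error until the overall timeout expired.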