Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vSphere: ignore canceled VMs when scheduling #1465

Merged
merged 1 commit into from
Mar 26, 2025

Conversation

mansam
Copy link
Contributor

@mansam mansam commented Mar 22, 2025

Now that the migration runner attempts to schedule as many VMs as it can in a single reconcilation,
it has exposed a bug in the scheduler where canceled VMs are rescheduled endlessly in certain pathological cases, preventing the reconciliation from terminating.

(for instance, if a VM is canceled before it has been marked started, it will always appear ready to schedule although there is nothing for the plan controller to do with it.)

The endless rescheduling causes the controller to infinite loop and be unable to reconcile resources, necessitating that the offending Migration resource be deleted and the controller pod be restarted.

Checking for the Canceled condition on the VM is adequate to prevent the problem. (Requesting cancelation of the VM via the Migration resource will cause the VM to be marked with the Canceled condition in memory as the Plan is reconciled, which will cause it to be ignored by the scheduler, and on the following reconcile it will be cleaned up as intended.)

Now that the migration runner attempts to schedule
as many VMs as it can in a single reconcilation,
it has exposed a bug in the scheduler where canceled
VMs are rescheduled endlessly in certain pathological
cases, preventing the reconciliation from terminating.

(for instance, if a VM is canceled before
it has been marked started, it will always appear
ready to schedule although there is nothing for
the plan controller to do with it.)

The endless rescheduling causes the controller to
infinite loop and be unable to reconcile resources,
necessitating that the offending Migration resource
be deleted and the controller pod be restarted.

Checking for the `Canceled` condition on the VM is adequate
to prevent the problem. (Requesting cancelation of the VM
via the Migration resource will cause the VM to be
marked with the Canceled condition in memory as the Plan
is reconciled, which will cause it to be ignored by the
scheduler, and on the following reconcile it will be
cleaned up as intended.)

Signed-off-by: Sam Lucidi <[email protected]>
@mansam mansam requested review from mnecas and yaacov as code owners March 22, 2025 01:56
@codecov-commenter
Copy link

codecov-commenter commented Mar 22, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 15.14%. Comparing base (f1fe5d0) to head (ae2f4b3).
Report is 65 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/plan/scheduler/vsphere/scheduler.go 0.00% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1465      +/-   ##
==========================================
- Coverage   15.45%   15.14%   -0.32%     
==========================================
  Files         112      113       +1     
  Lines       23377    24351     +974     
==========================================
+ Hits         3613     3687      +74     
- Misses      19479    20376     +897     
- Partials      285      288       +3     
Flag Coverage Δ
unittests 15.14% <0.00%> (-0.32%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@mnecas mnecas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mnecas mnecas added this to the 2.8.1 milestone Mar 25, 2025
@mnecas mnecas merged commit 634de67 into kubev2v:main Mar 26, 2025
18 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants