Cohort Fair Sharing Status and Metrics. #4561
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
test/integration/singlecluster/scheduler/fairsharing/fair_sharing_test.go
/retest
Great work :)
/lgtm
LGTM label has been added. Git tree hash: 5b934726f16a0701f2b045901913dbd95a3d8eee
var ancestors []kueue.CohortReference
for cohort != nil && cohort.HasParent() {
	ancestors = append(ancestors, cohort.Name)
	cohort = cohort.Parent()
}
Non-blocking comment: this is one of the things we could have a helper function for, but it could be a follow-up.
A similar loop is done in the FS scheduling algorithm.
cc @gabesaba
This is ticketed already: #4644
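To illustrate the refactor being discussed, here is a minimal sketch of such an ancestor-collecting helper. All names (`Cohort`, `Ancestors`, the field layout) are hypothetical stand-ins for Kueue's actual hierarchy types; the real helper is what the ticket tracks.

```go
package main

import "fmt"

// Cohort is a minimal stand-in for a node in the cohort hierarchy.
// The real type lives in Kueue's pkg/hierarchy; this is illustrative only.
type Cohort struct {
	Name   string
	parent *Cohort
}

func (c *Cohort) HasParent() bool { return c.parent != nil }
func (c *Cohort) Parent() *Cohort { return c.parent }

// Ancestors collects the names of a cohort and its ancestors, walking
// upward until it reaches the root, mirroring the loop in this PR.
func Ancestors(cohort *Cohort) []string {
	var ancestors []string
	for cohort != nil && cohort.HasParent() {
		ancestors = append(ancestors, cohort.Name)
		cohort = cohort.Parent()
	}
	return ancestors
}

func main() {
	root := &Cohort{Name: "root"}
	mid := &Cohort{Name: "mid", parent: root}
	leaf := &Cohort{Name: "leaf", parent: mid}
	fmt.Println(Ancestors(leaf)) // → [leaf mid]
}
```

Extracting the loop this way would let both the status/metrics code here and the fair-sharing scheduling algorithm share one implementation.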
util.ExpectObjectToBeDeleted(ctx, k8sClient, cohortBank, true)
})

ginkgo.It("admits workloads respecting fair share", func() {
Can you please run this test locally in a loop, say 50 times, just to ensure it is not flaky? We have had many flakes recently, and I want to be extra cautious about adding more before the release.
Caught a race condition while running the stress test:
WARNING: DATA RACE
Read at 0x00c000f9c240 by goroutine 331:
runtime.mapdelete_faststr()
/Users/mykhailo_bobrovskyi/go/pkg/mod/golang.org/[email protected]/src/internal/runtime/maps/runtime_faststr_swiss.go:396 +0x8c
sigs.k8s.io/kueue/pkg/hierarchy.(*CycleChecker).HasCycle()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/hierarchy/cycle.go:34 +0x68
sigs.k8s.io/kueue/pkg/cache.(*Cache).ClusterQueueAncestors()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/cache/cache.go:730 +0x130
sigs.k8s.io/kueue/pkg/controller/core.(*cohortCqHandler).Generic()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/controller/core/cohort_controller.go:215 +0x74
sigs.k8s.io/controller-runtime/pkg/source.(*channel[go.shape.b7c155c91576830ee655d804b5e7b1fc9d1b717385cca83dc56c659142a3fa38,go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2.1()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:213 +0xac
sigs.k8s.io/controller-runtime/pkg/source.(*channel[go.shape.b7c155c91576830ee655d804b5e7b1fc9d1b717385cca83dc56c659142a3fa38,go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:214 +0xc0
Previous write at 0x00c000f9c240 by goroutine 223:
runtime.mapaccess2()
/Users/mykhailo_bobrovskyi/go/pkg/mod/golang.org/[email protected]/src/internal/runtime/maps/runtime_swiss.go:117 +0x2dc
sigs.k8s.io/kueue/pkg/hierarchy.(*CycleChecker).HasCycle()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/hierarchy/cycle.go:41 +0xcc
sigs.k8s.io/kueue/pkg/cache.(*Cache).Snapshot()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/cache/snapshot.go:114 +0x2ac
sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/scheduler/scheduler.go:191 +0x280
sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule-fm()
<autogenerated>:1 +0x44
sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff.func1()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/util/wait/backoff.go:43 +0x54
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x48
k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0x94
sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/util/wait/backoff.go:42 +0x10c
sigs.k8s.io/kueue/pkg/util/wait.UntilWithBackoff()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/util/wait/backoff.go:34 +0xc8
sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).Start.gowrap1()
/Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/scheduler/scheduler.go:146 +0x4c
cc @gabesaba
Is this a preexisting failure or a new one?
No, this is a new one. The problem is that we are using a regular map in CycleChecker (see here). I've changed it to a SyncMap.
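For context, here is a minimal sketch of the kind of fix described above: the memoization map inside the cycle checker is read by the cohort controller while the scheduler's snapshot writes to it, so replacing the plain map with a `sync.Map` makes the concurrent access safe. The `node` type and method bodies below are illustrative stand-ins, not Kueue's actual `CycleChecker` implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a minimal stand-in for a cohort in the hierarchy.
type node struct {
	name   string
	parent *node
}

// CycleChecker memoizes cycle-detection results per node. A sync.Map
// replaces the plain map that raced when the scheduler's Snapshot and
// the cohort controller called HasCycle concurrently.
type CycleChecker struct {
	checked sync.Map // node name -> bool (true means a cycle was found)
}

func (c *CycleChecker) HasCycle(n *node) bool {
	if v, ok := c.checked.Load(n.name); ok {
		return v.(bool)
	}
	// Walk upward; reaching nil means no cycle, revisiting a node means a cycle.
	seen := map[string]bool{}
	cycle := false
	for cur := n; cur != nil; cur = cur.parent {
		if seen[cur.name] {
			cycle = true
			break
		}
		seen[cur.name] = true
	}
	c.checked.Store(n.name, cycle)
	return cycle
}

func main() {
	var cc CycleChecker
	root := &node{name: "root"}
	child := &node{name: "child", parent: root}
	fmt.Println(cc.HasCycle(child)) // → false

	a := &node{name: "a"}
	b := &node{name: "b", parent: a}
	a.parent = b // introduce a cycle
	fmt.Println(cc.HasCycle(a)) // → true
}
```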
Gabe, can you double-check this part?
I see; in any case the code should not be panicking.
Yeah, the downside of sync.Map is that it penalizes the performance of single-threaded code like the scheduler, but I think this should be fine as the operations are very fast. Anyway, I will wait for the green light.
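The alternative trade-off mentioned here can be sketched concretely: a plain map behind a mutex keeps single-threaded reads cheap (no interface boxing on every access, as `sync.Map` incurs) at the cost of explicit locking. All names below are illustrative, not part of Kueue's codebase.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedSet guards a plain map with an RWMutex: the alternative to
// sync.Map for callers that are read-heavy and mostly single-threaded,
// such as the scheduler's hot path. Names are illustrative only.
type lockedSet struct {
	mu   sync.RWMutex
	seen map[string]bool
}

func newLockedSet() *lockedSet {
	return &lockedSet{seen: make(map[string]bool)}
}

// Has takes only a read lock, so concurrent readers do not block each other.
func (s *lockedSet) Has(k string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.seen[k]
}

// Add takes the write lock, excluding readers for the duration of the store.
func (s *lockedSet) Add(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[k] = true
}

func main() {
	s := newLockedSet()
	s.Add("cohort-a")
	fmt.Println(s.Has("cohort-a"), s.Has("cohort-b")) // → true false
}
```

Either approach fixes the data race; `sync.Map` is simpler to drop in, while the mutex-guarded map keeps the fast path closer to a plain map access.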
@mbobrovskyi can you confirm the test is no longer flaky after rebase?
17m40s: 500 runs so far, 0 failures
/lgtm
/approve
thanks!
lgtm, left minor comments.
/lgtm
LGTM label has been added. Git tree hash: f31444450f0d8a229102912fb079e6538f7b9cc9
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mbobrovskyi, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
/release-note-edit
What type of PR is this?
/kind feature
What this PR does / why we need it:
Cohort Fair Sharing Status and Metrics
Which issue(s) this PR fixes:
Fixes #4554
Special notes for your reviewer:
Does this PR introduce a user-facing change?