Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cohort Fair Sharing Status and Metrics. #4561

Conversation

mbobrovskyi
Copy link
Contributor

@mbobrovskyi mbobrovskyi commented Mar 12, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Cohort Fair Sharing Status and Metrics

Which issue(s) this PR fixes:

Fixes #4554

Special notes for your reviewer:

Does this PR introduce a user-facing change?

[FSxHC] Add Cohort Fair Sharing Status and Metrics

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 12, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 12, 2025
Copy link

netlify bot commented Mar 12, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 190f414
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67d80eab6055c100081db57f
😎 Deploy Preview https://deploy-preview-4561--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch 3 times, most recently from 53a483a to 1c63924 Compare March 12, 2025 15:40
@mbobrovskyi mbobrovskyi marked this pull request as ready for review March 12, 2025 15:42
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2025
@k8s-ci-robot k8s-ci-robot requested a review from gabesaba March 12, 2025 15:42
@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch 3 times, most recently from 922a2d2 to 9213a19 Compare March 12, 2025 17:38
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 12, 2025
@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch 4 times, most recently from d497fd6 to 2f0636a Compare March 13, 2025 05:23
@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch 4 times, most recently from 6d34d75 to 7b59156 Compare March 13, 2025 12:33
@mbobrovskyi
Copy link
Contributor Author

/retest

@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch from f560554 to 0c79adc Compare March 14, 2025 12:21
@gabesaba
Copy link
Contributor

Great work :)

/lgtm
/assign @mimowo

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 5b934726f16a0701f2b045901913dbd95a3d8eee

Comment on lines +734 to +738
var ancestors []kueue.CohortReference
for cohort != nil && cohort.HasParent() {
ancestors = append(ancestors, cohort.Name)
cohort = cohort.Parent()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking comment: It is one of the things we could have a helper function for, but it could be a follow up.
Similar loop was done in the FS scheduling algorithm.

cc @gabesaba

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ticketed already: #4644

util.ExpectObjectToBeDeleted(ctx, k8sClient, cohortBank, true)
})

ginkgo.It("admits workloads respecting fair share", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please run this test locally in a loop say 50 times, just to ensure it is not flaky? We have recently many flakes, and I want to be extra cautious about adding more before release.

Copy link
Contributor Author

@mbobrovskyi mbobrovskyi Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught a race condition while running the stress test:

WARNING: DATA RACE
Read at 0x00c000f9c240 by goroutine 331:
  runtime.mapdelete_faststr()
      /Users/mykhailo_bobrovskyi/go/pkg/mod/golang.org/[email protected]/src/internal/runtime/maps/runtime_faststr_swiss.go:396 +0x8c
  sigs.k8s.io/kueue/pkg/hierarchy.(*CycleChecker).HasCycle()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/hierarchy/cycle.go:34 +0x68
  sigs.k8s.io/kueue/pkg/cache.(*Cache).ClusterQueueAncestors()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/cache/cache.go:730 +0x130
  sigs.k8s.io/kueue/pkg/controller/core.(*cohortCqHandler).Generic()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/controller/core/cohort_controller.go:215 +0x74
  sigs.k8s.io/controller-runtime/pkg/source.(*channel[go.shape.b7c155c91576830ee655d804b5e7b1fc9d1b717385cca83dc56c659142a3fa38,go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2.1()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:213 +0xac
  sigs.k8s.io/controller-runtime/pkg/source.(*channel[go.shape.b7c155c91576830ee655d804b5e7b1fc9d1b717385cca83dc56c659142a3fa38,go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:214 +0xc0

Previous write at 0x00c000f9c240 by goroutine 223:
  runtime.mapaccess2()
      /Users/mykhailo_bobrovskyi/go/pkg/mod/golang.org/[email protected]/src/internal/runtime/maps/runtime_swiss.go:117 +0x2dc
  sigs.k8s.io/kueue/pkg/hierarchy.(*CycleChecker).HasCycle()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/hierarchy/cycle.go:41 +0xcc
  sigs.k8s.io/kueue/pkg/cache.(*Cache).Snapshot()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/cache/snapshot.go:114 +0x2ac
  sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/scheduler/scheduler.go:191 +0x280
  sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule-fm()
      <autogenerated>:1 +0x44
  sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff.func1()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/util/wait/backoff.go:43 +0x54
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x48
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0x94
  sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/util/wait/backoff.go:42 +0x10c
  sigs.k8s.io/kueue/pkg/util/wait.UntilWithBackoff()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/util/wait/backoff.go:34 +0xc8
  sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).Start.gowrap1()
      /Users/mykhailo_bobrovskyi/Projects/epam/kueue/pkg/scheduler/scheduler.go:146 +0x4c

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @gabesaba
is it a preexisting failure or new?

Copy link
Contributor Author

@mbobrovskyi mbobrovskyi Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is a new one. The problem is that we are using a regular map in CycleChecker (see here). I've changed it to a SyncMap.

Copy link
Contributor

@mimowo mimowo Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gabe can you double check this part?

Copy link
Contributor

@mimowo mimowo Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, in any case the code should not be panicing.

Yeah, the downside of sync map is that it penalizes performance of code which is single-threaded like scheduler, but I think this should be fine as the operations are very fast. Anyway I will wait for green light.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbobrovskyi can you confirm the test is no longer flaky after rebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

17m40s: 500 runs so far, 0 failures

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
thanks!

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, left minor comments

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2025
@k8s-ci-robot k8s-ci-robot requested review from gabesaba and mimowo March 14, 2025 13:09
@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch from 395d98f to 1863ca7 Compare March 14, 2025 14:51
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2025
@mbobrovskyi mbobrovskyi force-pushed the feature/cohort-fair-sharing-status-and-metrics branch from 1863ca7 to 190f414 Compare March 17, 2025 11:59
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2025
@gabesaba
Copy link
Contributor

Great work :)

/lgtm /assign @mimowo

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f31444450f0d8a229102912fb079e6538f7b9cc9

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mbobrovskyi, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2025
@k8s-ci-robot k8s-ci-robot merged commit 971a2aa into kubernetes-sigs:main Mar 17, 2025
19 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.11 milestone Mar 17, 2025
@mbobrovskyi mbobrovskyi deleted the feature/cohort-fair-sharing-status-and-metrics branch March 17, 2025 13:11
@gabesaba
Copy link
Contributor

/release-note-edit

[FSxHC] Add Cohort Fair Sharing Status and Metrics

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cohort Fair Sharing Status and Metrics
5 participants