Multiple Scheduler Bugs: Deployment Update Resource Allocation and GPU Utilization #303
Comments
Hi @michael-nammi. Instructions for interacting with me using comments are available here.
@michael-nammi It would be great if you could provide the yaml for each deployment in your test process; that would speed up our troubleshooting.
Here are the yaml files of the deployments.

Test for bug 1

Steps: apply deployment-a and deployment-b below, then update deployment-b to the third manifest (3 replicas, each requesting 2 GPUs and 8192 gpumem). A sketch of the corresponding kubectl commands follows the manifests.
# Bug 1, deployment-a: 1 replica requesting 2 GPUs with 16384 gpumem
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 2
              nvidia.com/gpumem: 16384
# Bug 1, deployment-b as initially created: 1 replica requesting 1 GPU with 4096 gpumem
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 4096
# Bug 1, deployment-b after the update: 3 replicas, each requesting 2 GPUs with 8192 gpumem
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 2
              nvidia.com/gpumem: 8192
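A minimal sketch of the kubectl commands for the bug 1 test, assuming the three manifests above are saved as deployment-a.yaml, deployment-b.yaml, and deployment-b-updated.yaml (assumed file names, not from the original report):

# Assumed file names for the three bug 1 manifests above.
kubectl apply -f deployment-a.yaml           # deployment-a: 2 GPUs, 16384 gpumem
kubectl apply -f deployment-b.yaml           # initial deployment-b: 1 GPU, 4096 gpumem
kubectl apply -f deployment-b-updated.yaml   # update deployment-b: 3 replicas, 2 GPUs and 8192 gpumem each
kubectl get pods -l app=gpu -o wide          # check whether the updated replicas are scheduled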
Test for bug 2

Steps: apply the deployment-a manifest below, which requests GPU utilization above the 100% maximum (nvidia.com/gpucores: 120). A sketch of the corresponding kubectl commands follows the manifest.

# Bug 2, deployment-a: 1 replica requesting 1 GPU with 4096 gpumem and 120 gpucores
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 4096
              nvidia.com/gpucores: 120
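A minimal sketch of the bug 2 test, assuming the manifest above is saved as deployment-a-gpucores.yaml (an assumed file name):

kubectl apply -f deployment-a-gpucores.yaml   # requests 120 gpucores on a single GPU
kubectl get pods -l app=gpu -o wide           # expected: the pod stays Pending; observed: it is scheduled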
Environment
Kubernetes version: v1.27.9
HAMi version: v2.3.9
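For completeness, a sketch of how this information can be collected (assuming HAMi was installed with Helm; the release name and namespace are not from the original report):

kubectl version               # reports the Kubernetes client and server versions
helm list -A | grep -i hami   # shows the installed HAMi chart version, if deployed via Helm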
Bug 1: Possible Scheduler Bug When Updating Deployment with Insufficient Resources
I encountered a scheduler bug when updating a Deployment's resource requirements beyond the available capacity in a Kubernetes cluster with heterogeneous GPU and memory resources.
Steps to reproduce the issue: apply deployment-a and deployment-b, then update deployment-b to 3 replicas requesting 2 GPUs and 8192 gpumem each, using the manifests and commands in the comment above.
Expected Behavior
The update should fail because there is not enough memory and GPUs available in the cluster to satisfy the requirements of 3 replicas of Deployment B with the specified resources.
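For reference, the updated deployment-b alone requests 3 × 2 = 6 GPUs and 3 × 8192 = 24576 gpumem, on top of deployment-a's 2 GPUs and 16384 gpumem, which is more than the cluster can provide.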
Actual Behavior
The update fails, but the node resource allocation is incorrectly reported:
Prometheus Metrics
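One way to cross-check the reported allocation alongside the Prometheus metrics (a sketch; <node-name> is a placeholder for the affected GPU node):

kubectl describe node <node-name> | grep -A 10 "Allocated resources"   # per-resource requests/limits accounted on the node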
Bug 2: Incorrect GPU Utilization
I encountered a scheduler bug when a Deployment requests GPU utilization (nvidia.com/gpucores) above 100% in a Kubernetes cluster with heterogeneous GPU and memory resources.
Steps to reproduce the issue: apply the bug 2 deployment-a manifest (1 GPU, 4096 gpumem, 120 gpucores) from the comment above.
Expected Behavior
The deployment should fail to be scheduled due to the GPU utilization requirement exceeding the maximum limit of 100%.
Actual Behavior
The deployment is incorrectly scheduled with the following resource allocation: