
Multiple Scheduler Bugs: Deployment Update Resource Allocation and GPU Utilization #303

Open
michael-nammi opened this issue May 10, 2024 · 3 comments

Comments

@michael-nammi

michael-nammi commented May 10, 2024

Environment

Kubernetes version: v1.27.9
HAMi version: v2.3.9

Bug 1: Possible Scheduler Bug When Updating Deployment with Insufficient Resources

Encountered a scheduler bug when updating a Deployment's resource requirements beyond the available capacity in a Kubernetes cluster with heterogeneous memory and GPU resources.

Steps to reproduce the issue

  1. Pre-conditions:
    • Node 1: 4GiB Memory, 1 GPU
    • Node 2: 4GiB Memory, 1 GPU
    • Node 3: 16GiB Memory, 2 GPUs (each GPU with 16GiB)
  2. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 16GiB
    • GPU requirement: 2
  3. Create Deployment B:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
  4. Delete Deployment A
  5. Modify Deployment B (see the kubectl sketch after this list)
    • Change replicas to 3
    • Change memory requirement to 8GiB
    • Change GPU requirement to 2
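
The update in step 5 can be made either by editing the live object or by re-applying a modified manifest. A minimal sketch, assuming the manifest from the comment below is saved as deployment-b.yaml (the filename is an assumption):

# Option 1: edit the live Deployment and change replicas/limits interactively
kubectl edit deployment deployment-b

# Option 2: re-apply an updated manifest (deployment-b.yaml is a placeholder filename)
kubectl apply -f deployment-b.yaml

# Watch the rollout; with insufficient capacity the extra replicas should stay Pending
kubectl rollout status deployment/deployment-b
kubectl get pods -l app=gpu -o wide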

Expected Behavior

The update should fail because there is not enough memory and GPUs available in the cluster to satisfy the requirements of 3 replicas of Deployment B with the specified resources.

  • Node 1: 4GiB Memory occupied by the pre-existing replica of Deployment B
  • Node 2: Unchanged (idle)
  • Node 3: 2 replicas of Deployment B fully occupy the memory of both GPUs (8GiB per replica on each GPU, i.e. 16GiB per GPU)

Actual Behavior

The update fails, but the node resource allocation is reported incorrectly (a kubectl sketch for inspecting the reported allocation follows this list):

  • Node 1: 4GiB Memory
  • Node 2: Unchanged (idle)
  • Node 3: allocated GPU memory is reported as 8GiB and 12GiB, which is inconsistent with the expected result of both GPUs being fully allocated (16GiB each)
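
A minimal way to cross-check the reported allocation against the scheduler's view, using only standard kubectl commands (the node name is a placeholder):

# Allocated extended resources (nvidia.com/gpu, nvidia.com/gpumem) as seen by Kubernetes
kubectl describe node <node-3> | grep -A 10 "Allocated resources"

# Which replicas actually landed on which node
kubectl get pods -l app=gpu -o wide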

Prometheus Metrics

[Prometheus metrics screenshot showing the GPU memory allocation reported for Node 3]
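
One rough way to pull the underlying numbers directly, assuming the HAMi scheduler's metrics endpoint is reachable; the address below is a placeholder and the grep filter is illustrative rather than an exact HAMi metric name:

# Dump the raw Prometheus metrics and filter for GPU-related series
curl -s http://<scheduler-host>:<metrics-port>/metrics | grep -i gpu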

Bug 2: Incorrect GPU Utilization

Encountered a scheduler bug when creating a Deployment whose GPU-core requirement exceeds 100% of a single GPU in a Kubernetes cluster with heterogeneous memory and GPU resources; a verification sketch follows the reproduction steps below.

Steps to reproduce the issue

  1. Pre-conditions:
    • Node 1: 4GiB Memory, 1 GPU (Max Utilization: 100%)
    • Node 2: 4GiB Memory, 1 GPU (Max Utilization: 100%)
    • Node 3: 16GiB Memory, 2 GPUs (each GPU with 16GiB Memory and Max Utilization: 100%)
  2. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
    • GPUcores requirement: 120 (i.e., more than 100% GPU utilization, taking 100 as the maximum)
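
As mentioned above, a quick verification sketch for checking whether the pod was rejected (expected) or scheduled anyway (actual), using standard kubectl commands:

# If the gpucores request were rejected, the pod should stay Pending
kubectl get pods -l app=gpu -o wide

# Look for a FailedScheduling event (expected) versus a normal Scheduled event (actual)
kubectl get events --sort-by=.lastTimestamp | grep -i schedul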

Expected Behavior

The deployment should fail to be scheduled due to the GPU utilization requirement exceeding the maximum limit of 100%.

  • Node 1: Memory should remain unallocated (4GiB)
  • Node 2: Memory should remain unallocated (4GiB)
  • Node 3: Both GPUs should remain unallocated (16GiB + 100%, and 16GiB + 100%)

Actual Behavior

The deployment is incorrectly scheduled with the following resource allocation:

  • Node 1: Unchanged (4GiB Memory idle)
  • Node 2: Unchanged (4GiB Memory idle)
  • Node 3: Resources are reported incorrectly:
    • First GPU: Appears as if 4GiB Memory + 100% Utilization has been allocated to Deployment A (should be no allocation)
    • Second GPU: Unallocated (16GiB Memory and 100% Utilization idle)

Hi @michael-nammi,
Thanks for opening an issue!
We will look into it as soon as possible.


@wawa0210
Member

@michael-nammi It would be great if you could provide the YAML for each deployment used in the test process; that would speed up our troubleshooting.

@michael-nammi
Author

Here are the YAML files for the deployments:

Test for bug 1

Steps:
  1. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 16GiB
    • GPU requirement: 2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 2
            nvidia.com/gpumem: 16384
  2. Create Deployment B:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 1
            nvidia.com/gpumem: 4096
  3. Delete Deployment A

    • kubectl delete deployment deployment-a
  4. Modify Deployment B

    • Change replicas to 3
    • Change memory requirement to 8GiB
    • Change GPU requirement to 2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 2
            nvidia.com/gpumem: 8192
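
To run this step end-to-end, the updated manifest can be re-applied and the rollout observed; with only two 16GiB GPUs on Node 3, the third replica is expected to stay Pending (the filename is a placeholder):

kubectl apply -f deployment-b.yaml
kubectl rollout status deployment/deployment-b --timeout=120s
kubectl get pods -l app=gpu -o wide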

Test for bug 2

Steps:
  1. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
    • GPUcores requirement: 120
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 1
            nvidia.com/gpumem: 4096
            nvidia.com/gpucores: 120
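
After applying this manifest the expected outcome is a FailedScheduling event, since nvidia.com/gpucores: 120 exceeds the 100% maximum; a quick check (the filename is a placeholder):

kubectl apply -f deployment-a-gpucores.yaml
kubectl get pods -l app=gpu
kubectl describe pod -l app=gpu | grep -A 5 Events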
