Co-Locating vLLM w/ training to achieve higher throughput and GPU utilization #3162
Conversation
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Nice! Can vLLM release GPU memory to prevent GPU OOM in this setting? Does this one work on 32B QwQ models?
@toslali-ibm Thanks for your contribution! Very nice!
Hello @shirinyamani, thank you! Essentially, there is no centralized vLLM server; instead, vLLM processes run directly on each device (see here), sharing the GPU with training workloads. Each device handles its own batch for generation. The key idea is that you don't need a separate GPU dedicated solely to vLLM, which would otherwise sit idle when no generation is taking place. As shown in our experiment, this colocation can achieve higher throughput using N-1 GPUs.
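To make this concrete, here is a minimal sketch of what each training process would do in a colocated setup: it builds its own in-process vLLM engine and caps the engine's memory share so training fits on the same GPU. The model name and memory fraction are illustrative assumptions, not the PR's actual values.

```python
# Hedged sketch: each training process creates its own in-process vLLM engine
# and caps its memory share so generation and training coexist on one GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B",       # illustrative model choice
    gpu_memory_utilization=0.3,    # assumed fraction; the rest is left for training
    max_model_len=1024,
)

# Each device generates completions only for its own local batch of prompts.
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```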
Since we are on this topic, I have one more question, maybe for both @binary-husky and @toslali-ibm: `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor_parallel_size 4 --pipeline-parallel-size 2`. Now the issue is that if we want to use the AsyncLLMEngine, it does not have
@shirinyamani hi, this is @fabianlim; I work with @toslali-ibm. To answer your questions:
I have made some progress on
Nice @binary-husky, my understanding is that your async feature works for grad_accum > 1, but currently we compare with your approach when grad_accum = 1. So my feeling is that our methods are orthogonal; I believe they also apply with grad_accum > 1. Is it currently true that we require
cc: @shirinyamani
I recently conducted another experiment with a larger batch size of 24 and observed a 2.5x speedup when using N-1 GPUs: 7 GPUs (with colocated vLLM and training) compared to 8 GPUs (where 1 GPU was dedicated to remote vLLM and 7 to training).
The next PRs we will focus on:
If this PR is successfully merged, we can move forward on (a) and (b). Additionally, @binary-husky's nice work on the async feature for grad_accum > 1 seems orthogonal to colocated vLLM, meaning it can be applied to both the remote vLLM client and the colocated vLLM client to achieve more gains! Looking forward to your feedback!
Thanks for this work!
As this feature is experimental, I wouldn't wait until it's merged before moving on to the next steps, as it's these steps that will tell us whether it really makes sense to merge. If you want to keep the elements separate, you can always make a PR on this branch.
Thanks @qgallouedec
To scale, we're incorporating TP. The key idea is that each TP device initializes an LLM() with tp=N and distributed_executor_backend="external_launcher". Each process holds 1/N of the model weights, and work is sharded such that all processes receive the same input and generate outputs collectively. There's an example of this setup in the official vLLM repo. I'm currently building a small PoC to demonstrate TP combined with vllm_colocation.
I think this can be the initial setup, but it is also possible to create mini shards. Say you have 4 processes and each mini shard consists of 2 processes: you get 2 mini shards, [rank 0, rank 1] and [rank 2, rank 3], and each mini shard is responsible for running 1 instance of vLLM with TP=2 (see the sketch below). @fabianlim has a PoC of this, which we will incorporate on top of vllm_coloc here.
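As referenced above, here is a hedged sketch of how ranks could be grouped into mini shards; the helper name and grouping logic are illustrative, not the actual PoC code.

```python
# Hedged sketch: partition world ranks into contiguous mini shards of size tp_size;
# each mini shard would then back one vLLM instance launched with TP=tp_size.
def mini_shard_ranks(rank: int, world_size: int, tp_size: int) -> list[int]:
    assert world_size % tp_size == 0, "world size must be divisible by the TP size"
    shard_id = rank // tp_size
    return list(range(shard_id * tp_size, (shard_id + 1) * tp_size))

# With 4 processes and tp_size=2: ranks 0-1 form shard [0, 1], ranks 2-3 form shard [2, 3].
assert mini_shard_ranks(rank=1, world_size=4, tp_size=2) == [0, 1]
assert mini_shard_ranks(rank=3, world_size=4, tp_size=2) == [2, 3]
```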
Hello @qgallouedec, I have a working PoC for TP using external_launcher for a 32B model ( The script sets TP=4, and each process initializes its own LLM() instance with tensor_parallel_size=4 and distributed_executor_backend="external_launcher". Each process holds 1/N of the model weights and participates in .generate(), collaborating to produce the output. The crucial part is ensuring all shards receive the same input. Please let me know your comments/questions. Note: TP w/ external launcher works on vllm==0.7.3; 0.8.0 and onward has a bug associated with it, and I created a bug report. CC @fabianlim
Click to see `poc.py` (run it via `accelerate launch --num_processes=4 poc.py`; use vllm==0.7.3):
"""
Each process instantiates an LLM() w/ tp=N and distributed_executor_backend="external_launcher".
- Each process holds 1/N of the model weights
- Each process does .generate() — work together to generate the output
- The key part is ensuring that all processes receive the same input, because vLLM expects to generate jointly using participating shards
"""
from vllm import LLM, SamplingParams
from accelerate import Accelerator
from accelerate.utils import gather_object
# === Setup distributed environment ===
accelerator = Accelerator()
device = accelerator.device
tp_size = accelerator.num_processes
rank = accelerator.process_index
print(f"\n----------\nDevice: {device} | Tensor Parallel Size: {tp_size} | Process Rank: {rank}\n----------\n")
# === Each worker has local prompts ===
local_prompts = [
    f"Prompt 1 from worker {rank}, How is it going for you today?",
    f"Prompt 2 from worker {rank}, What is the weather like in Boston usually?",
]

# === Gather all prompts across workers ===
all_prompts = gather_object(local_prompts)

# === Initialize vLLM ===
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",
    tensor_parallel_size=tp_size,
    distributed_executor_backend="external_launcher",
    device="cuda",
)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# === Generate outputs collectively — all ranks must call this with same inputs ===
outputs = llm.generate(
    prompts=all_prompts,
    sampling_params=sampling_params,
    use_tqdm=(rank == 0),
)

# === Print results from all ranks ===
for i, output in enumerate(outputs):
    prompt = all_prompts[i]
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r} --- Generated: {generated_text!r}\n")

# === Print results from rank 0 — shows all outputs ===
if rank == 0:
    print(f"\n==== Final Output (TP={tp_size}) ====\n")
    for i, output in enumerate(outputs):
        prompt = all_prompts[i]
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r} --- Generated: {generated_text!r}\n")

# === Slice outputs to keep only local shard's portion if TP > 1 ===
if tp_size > 1:
    process_index = rank
    tp_slice = slice(
        process_index * len(local_prompts),
        (process_index + 1) * len(local_prompts),
    )
    local_outputs = outputs[tp_slice]
else:
    local_outputs = outputs

# === Print generations for this rank's original prompts ===
for i, output in enumerate(local_outputs):
    prompt = local_prompts[i]
    generated_text = output.outputs[0].text
    print(f"\n\n\n----Local generations --- Rank {rank} -- Prompt: {prompt!r} --> Generated: {generated_text!r}\n")
I've created a draft PR showcasing TP and vLLM sleep with vllm_colocation. When training a 14B model ( Below are the details of the experiment.
Install TRL from PR
Run GRPO trainer in coloc mode:
Run GRPO trainer in server mode:
Config
Click to see `config.yaml`:
# Model arguments
model_name_or_path: Qwen/Qwen2.5-14B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data training arguments
dataset_name: open-r1/OpenR1-Math-220k
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"
dataset_prompt_column: "problem"
# GRPO trainer config
bf16: true
use_vllm: true
vllm_tp: true # remove this in server mode
vllm_gpu_memory_utilization: 0.4 # remove this in server mode
vllm_enable_prefix_caching: false
vllm_max_model_len: 1024
do_eval: false
eval_strategy: "no"
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
learning_rate: 2.0e-05
log_completions: false
log_level: info
logging_first_step: true
logging_steps: 5
logging_strategy: steps
lr_scheduler_type: cosine
max_grad_norm: 0.2
max_prompt_length: 512
max_completion_length: 512
max_steps: 10
num_generations: 4
num_train_epochs: 1
overwrite_output_dir: true
per_device_train_batch_size: 1
reward_funcs:
- accuracy
- format
reward_weights:
- 1.0
- 1.0
save_strategy: "no"
report_to: none
seed: 42
temperature: 0.7
warmup_ratio: 0.1
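For reference, here is a hedged sketch of how the vLLM-related fields above could be set through TRL's Python API instead of a YAML recipe. The reward function is a placeholder, the argument names should be checked against your TRL version, and `vllm_tp` (introduced by this PR) is left out because its final name may change.

```python
# Hedged sketch: mapping a few of the YAML fields above onto GRPOConfig/GRPOTrainer.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")
dataset = dataset.rename_column("problem", "prompt")  # GRPO expects a "prompt" column

def format_reward(completions, **kwargs):
    # Placeholder reward for illustration: favors completions containing a <think> block.
    return [1.0 if "<think>" in c else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="qwen14b-grpo-coloc",
    use_vllm=True,
    vllm_gpu_memory_utilization=0.4,  # shrink vLLM's share so training fits alongside it
    vllm_max_model_len=1024,
    max_prompt_length=512,
    max_completion_length=512,
    num_generations=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    max_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```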
What does this PR do?
Fixes #3064
Addresses:
#3114
#2971
#2922
#2887
Motivation:
Colocating vLLM processes with training workloads enables higher throughput and more efficient GPU utilization. Our test (see section below) shows ~2× faster GRPO training with N-1 GPUs, i.e., using 7 GPUs for both vLLM and training, compared to 8 GPUs with current TRL (7 GPUs for training plus a dedicated GPU for an isolated vLLM server).
Enabler:
vLLM (version >0.7.3) introduced support for an external launcher, allowing vLLM processes to run alongside other workloads on the same GPU.
Benefits:
Implementation Notes:
- `get_vllm_client()`: a proxy that returns the appropriate client based on the training setup and configuration (a stub sketch of this dispatch is shown after the notes below)
- `VLLMColocationClient`: runs vLLM in-process for faster inference and better GPU utilization
- `VLLMNoOpClient`: a no-op fallback for non-main processes in distributed training when the default `VLLMClient` is used
Notes:
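As referenced in the Implementation Notes, here is a hedged stub sketch of the dispatch pattern. The class names come from the bullets above; the bodies and the selection arguments are assumptions, not the PR's actual code.

```python
# Hedged stub sketch of the client-selection pattern described above.
class VLLMClient:
    """Default client: talks to a dedicated vLLM server (e.g. one started via `trl vllm-serve`)."""
    def generate(self, prompts):
        raise NotImplementedError("would send prompts to the remote vLLM server")

class VLLMColocationClient:
    """Runs vLLM in-process so generation shares each GPU with training."""
    def generate(self, prompts):
        raise NotImplementedError("would call an in-process vllm.LLM instance")

class VLLMNoOpClient:
    """No-op fallback for non-main ranks when the remote VLLMClient is used."""
    def generate(self, prompts):
        return []  # non-main ranks receive completions via broadcast instead

def get_vllm_client(vllm_colocation: bool, is_main_process: bool):
    """Return the client matching the training setup (selection logic is assumed)."""
    if vllm_colocation:
        return VLLMColocationClient()
    return VLLMClient() if is_main_process else VLLMNoOpClient()
```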
Testing vllm colocation
To run and test the trainer w/ vllm colocation enabled (when training the `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` model with the `open-r1/OpenR1-Math-220k` dataset), execute the following bash script (experiment.sh).
Click to view experiment.sh
Click to view config.yaml
To run the default trainer for comparison: set `CUDA_VISIBLE_DEVICES` to 1,2,3,4,5,6,7 and `vllm_colocation` to False in the experiment.sh script.
Click to view vllm_server.sh
CC @fabianlim
Requirements
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.