Multi-GPU vLLM inference with tensor parallelism, colocating policy model + ref model + vLLM engine on the same node #514
Comments
Hi @nhannguyen2709, that sounds promising; it would be great if you could share more details.
This is very cool @nhannguyen2709! Just FYI, there's also a PR on TRL to enable multi-node support. Ultimately, we'd like to settle on a single implementation (easier to maintain), so perhaps you can also take a look at that PR and comment on whether it differs substantially from your approach?
@edbeeching @lewtun I've attached a diagram here depicting my implementation. The setup is 8 accelerate processes and vLLM with a tensor parallel size of 4.
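As a rough illustration of the colocated setup (not the exact code from my fork; the model name, memory fraction, and sampling settings below are placeholders), each tensor-parallel group of 4 GPUs hosts one vLLM engine that shares its devices with the training processes:

```python
# Sketch: a vLLM engine spanning 4 of the 8 GPUs, sharing them with training.
# Model name and gpu_memory_utilization are placeholder values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder policy checkpoint
    tensor_parallel_size=4,            # engine spans 4 of the 8 GPUs
    gpu_memory_utilization=0.3,        # leave room for training state on the same GPUs
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024, n=8)

# Rollout generation: prompts are gathered from the accelerate processes,
# generated in one batch, then scattered back for the GRPO loss computation.
outputs = llm.generate(["What is 2 + 2?"], sampling_params)
print(outputs[0].outputs[0].text)
```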
Hello @lewtun @edbeeching,
I've created a custom fork based on the faster GRPO trainer PR with some improvements that allow large-scale training on a single node. To summarize, I've done the following:
(1) The policy model, reference model, and vLLM engines now live on the same node.
(2) All GPUs can be used to generate rollouts, and the vLLM tensor_parallel_size can be set to values > 1.
(3) The policy model and optimizer states are offloaded to CPU before rollout generation and reloaded to GPU afterwards (see the sketch after this list). I've tested the offloading strategies with both DeepSpeed ZeRO-2 and ZeRO-3.
(4) Training with num_iterations > 1 is supported.
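To give an idea of point (3), here is a simplified sketch of the offload/reload pattern around rollout generation. It uses plain torch moves rather than the actual DeepSpeed ZeRO hooks from my fork, so treat it as an illustration of the idea only:

```python
# Simplified illustration of point (3): move policy weights and optimizer
# state to CPU before rollout generation and bring them back afterwards.
# The fork itself relies on DeepSpeed ZeRO-2/ZeRO-3 offloading; this plain
# torch version only shows the pattern.
import torch


def offload_to_cpu(model, optimizer):
    model.to("cpu")
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to("cpu")
    torch.cuda.empty_cache()  # free GPU memory for the vLLM engine


def reload_to_gpu(model, optimizer, device="cuda"):
    model.to(device)
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)


# Usage around the rollout step (names are illustrative):
# offload_to_cpu(policy_model, optimizer)
# rollouts = generate_with_vllm(prompts)
# reload_to_gpu(policy_model, optimizer)
```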
I've been able to do full fine-tuning with Qwen 7B and LoRA fine-tuning with Qwen 14B on a single 8xH100 node.
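For the Qwen 14B LoRA run, the adapter configuration was along these lines (rank, alpha, dropout, and target modules here are representative assumptions rather than the exact values I used):

```python
# Illustrative LoRA setup for the 14B run; r, lora_alpha, lora_dropout, and
# target_modules are representative guesses, not the exact values used.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```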
If you're interested, I'm willing to open a PR and share more detailed training logs plus evaluations on AIME 24-25.