
Error with latest setup.py trl.extras.vllm_client - Server is not up yet #543

Closed

ECMGit opened this issue Mar 24, 2025 · 8 comments

@ECMGit

ECMGit commented Mar 24, 2025

Hi team,

The latest main commit with the updated trl errors out: I can no longer run grpo.py on 8 GPUs.

My previous build, based on commit 8782fa6, works fine.

Error log:

```
[rank1]:[W324 20:25:15.507902367 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W324 20:25:15.508125161 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W324 20:25:15.517874031 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W324 20:25:15.525969750 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W324 20:25:15.596575310 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[INFO|trainer.py:748] 2025-03-24 20:25:15,587 >> Using auto half precision backend
2025-03-24 20:25:15 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
[rank4]:[W324 20:25:15.812969611 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2025-03-24 20:25:17 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:19 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:21 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:23 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:25 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:27 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
```

Can anyone help me with that?

@qgallouedec
Member

What code do you use for training? Did you set up a vLLM server with `trl vllm-serve`?

@ECMGit
Author

ECMGit commented Mar 24, 2025

Thanks for your response!

I am using the GRPO training commands for DeepSeek R1 and Simple RL:

```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
```

```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
```

Do I need to set up a vLLM server with `trl vllm-serve` before running the commands above?

@qgallouedec
Member

Yes, check the latest doc: https://huggingface.co/docs/trl/en/grpo_trainer#speed-up-training-with-vllm-powered-generation.
We probably have some examples to update in Open-R1 as well.
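
For reference, a minimal sketch of the documented split (one GPU for the generation server, the rest for training). The `--vllm_server_host` / `--vllm_server_port` overrides below are assumptions based on recent GRPOConfig fields and may not exist in every trl version; the model name and paths are placeholders taken from the commands above:

```
# Terminal 1: dedicate one GPU to the vLLM generation server
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model <model_name>

# Terminal 2: train on the remaining GPUs. If the server is not on the
# default localhost:8000, point the trainer at it explicitly; the last two
# flags are assumed GRPOConfig fields (verify against your trl version).
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
  --vllm_server_host 127.0.0.1 --vllm_server_port 8000
```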

@ECMGit
Author

ECMGit commented Mar 24, 2025

Thanks! The following setup now works for me.

Use GPU 0 for the vLLM server and the others for training if you have 8 GPUs total:

```
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model <model_name>
```

Then, in another session:

DeepSeek R1 distill (needs around 100+ hrs):

```
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
```

Simple-RL approach (about 3 hrs; reward functions: accuracy, format):
```
  CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
```
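
As an extra sanity check, it helps to confirm the server is actually reachable from the training shell before launching. A minimal sketch, assuming the server listens on localhost:8000 and exposes a `/health/` endpoint (both are assumptions; use whatever host/port your `trl vllm-serve` instance prints at startup):

```
# Poll the vLLM server until it responds, then start training.
# localhost:8000 and the /health/ path are assumptions; adjust them to
# match what your `trl vllm-serve` instance reports when it starts.
until curl -sf http://localhost:8000/health/ > /dev/null; do
  echo "vLLM server not reachable yet, retrying in 2s..."
  sleep 2
done
echo "vLLM server is up; launching training."
```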


@wjmZZZ

wjmZZZ commented Mar 26, 2025

> What code do you use for training? Did you set up a vLLM server with `trl vllm-serve`?

I ran the vLLM server as instructed:

```
CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve --model Qwen2.5-7B-Instruct --tensor_parallel_size 2
```

[screenshot of the running vLLM server]

However, when running the GRPO training script, I'm still encountering the same issue:

```
[INFO|trainer.py:746] 2025-03-26 05:53:13,046 >> Using auto half precision backend
2025-03-26 05:53:13 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-26 05:53:15 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-26 05:53:17 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
```

I found the issue. It was caused by my proxy configuration, which prevented the client from reaching the vLLM server address.

@MZ-Sun

MZ-Sun commented Mar 26, 2025

> I ran the vLLM server as instructed: `CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve --model Qwen2.5-7B-Instruct --tensor_parallel_size 2` [...] I found the issue. It was caused by my proxy configuration, which prevented the client from reaching the vLLM server address.

Are you also on a cluster?

@SWEENEYHE

I ran into the same problem and solved it with `unset http_proxy` and `unset https_proxy`.
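
For completeness, a minimal sketch of that workaround: either drop the proxy for the training shell or exclude the vLLM server's host from it (the `localhost,127.0.0.1` list is an assumption for wherever your `trl vllm-serve` instance is listening):

```
# Option 1: drop the proxy entirely for this shell before launching training
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY

# Option 2: keep the proxy, but make sure requests to the vLLM server
# bypass it (extend the host list to match where the server listens)
export no_proxy="localhost,127.0.0.1${no_proxy:+,$no_proxy}"
export NO_PROXY="$no_proxy"
```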
