
Error with latest setup.py trl.extras.vllm_client - Server is not up yet #543

Closed

ECMGit opened this issue Mar 24, 2025 · 8 comments

@ECMGit

ECMGit commented Mar 24, 2025

Hi team,

The latest main commit with the updated trl errors out: I can no longer run grpo.py on 8 GPUs.

My previous build, based on commit 8782fa6, works fine.

Error log:

```
[rank1]:[W324 20:25:15.507902367 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W324 20:25:15.508125161 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W324 20:25:15.517874031 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W324 20:25:15.525969750 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W324 20:25:15.596575310 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[INFO|trainer.py:748] 2025-03-24 20:25:15,587 >> Using auto half precision backend
2025-03-24 20:25:15 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
[rank4]:[W324 20:25:15.812969611 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2025-03-24 20:25:17 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:19 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:21 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:23 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:25 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-24 20:25:27 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
```

Can anyone help me with that?

@qgallouedec
Member

What code do you use for training? Did you set up a vLLM server with `trl vllm-serve`?

@ECMGit
Author

ECMGit commented Mar 24, 2025

Thanks for your response!

I am using the GRPO training commands for DeepSeek R1 and Simple RL:

```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
```

```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
```

Do I need to set up a vLLM server with `trl vllm-serve` before running the commands above?

@qgallouedec
Member

Yes, check the latest doc: https://huggingface.co/docs/trl/en/grpo_trainer#speed-up-training-with-vllm-powered-generation.
We probably have some examples to update in Open-R1 as well.
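
For reference, a minimal sketch of the documented split (one GPU for the generation server, the rest for training). The `--vllm_server_host` / `--vllm_server_port` overrides below are assumptions based on recent GRPOConfig fields and may not exist in every trl version; the model name and paths are placeholders taken from the commands above:

```
# Terminal 1: dedicate one GPU to the vLLM generation server
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model <model_name>

# Terminal 2: train on the remaining GPUs. If the server is not on the
# default localhost:8000, point the trainer at it explicitly; the last two
# flags are assumed GRPOConfig fields (verify against your trl version).
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
  --vllm_server_host 127.0.0.1 --vllm_server_port 8000
```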

@ECMGit
Author

ECMGit commented Mar 24, 2025

Thanks! The following setup now works for me.

Use GPU 0 for the vLLM server and the others for training if you have 8 GPUs total:

```
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model <model_name>
```

Then, in another session:

DeepSeek R1 distill (needs around 100+ hrs):

```
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
```

Simple-RL approach (about 3 hrs; reward functions: accuracy, format):
```
  CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  --num_processes=7 src/open_r1/grpo.py \
  --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
```
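
As an extra sanity check, it helps to confirm the server is actually reachable from the training shell before launching. A minimal sketch, assuming the server listens on localhost:8000 and exposes a `/health/` endpoint (both are assumptions; use whatever host/port your `trl vllm-serve` instance prints at startup):

```
# Poll the vLLM server until it responds, then start training.
# localhost:8000 and the /health/ path are assumptions; adjust them to
# match what your `trl vllm-serve` instance reports when it starts.
until curl -sf http://localhost:8000/health/ > /dev/null; do
  echo "vLLM server not reachable yet, retrying in 2s..."
  sleep 2
done
echo "vLLM server is up; launching training."
```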


@wjmZZZ

wjmZZZ commented Mar 26, 2025

> What code do you use for training? Did you set up a vLLM server with `trl vllm-serve`?

I ran the vLLM server as instructed:

```
CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve --model Qwen2.5-7B-Instruct --tensor_parallel_size 2
```

[screenshot of the running vLLM server]

However, when running the GRPO training script, I'm still encountering the same issue:

```
[INFO|trainer.py:746] 2025-03-26 05:53:13,046 >> Using auto half precision backend
2025-03-26 05:53:13 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-26 05:53:15 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
2025-03-26 05:53:17 - INFO - trl.extras.vllm_client - Server is not up yet. Retrying in 2.0 seconds...
```

I found the issue. It was caused by my proxy configuration, which prevented the client from reaching the vLLM server address.

@MZ-Sun

MZ-Sun commented Mar 26, 2025

> I ran the vLLM server as instructed: `CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve --model Qwen2.5-7B-Instruct --tensor_parallel_size 2` [...] I found the issue. It was caused by my proxy configuration, which prevented the client from reaching the vLLM server address.

Are you also on a cluster?

@SWEENEYHE

I ran into the same problem and solved it with `unset http_proxy` and `unset https_proxy`.
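
For completeness, a minimal sketch of that workaround: either drop the proxy for the training shell or exclude the vLLM server's host from it (the `localhost,127.0.0.1` list is an assumption for wherever your `trl vllm-serve` instance is listening):

```
# Option 1: drop the proxy entirely for this shell before launching training
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY

# Option 2: keep the proxy, but make sure requests to the vLLM server
# bypass it (extend the host list to match where the server listens)
export no_proxy="localhost,127.0.0.1${no_proxy:+,$no_proxy}"
export NO_PROXY="$no_proxy"
```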
