When I run the following command to perform GRPO training on the Qwen2.5-Math-7B model with the DigitalLearningGmbH/MATH-lighteval dataset, an error is reported.
Please help me, I've been stuck on this for several days...
Command:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes=3 src/open_r1/grpo.py --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
Error log:
[INFO|image_processing_auto.py:301] 2025-03-14 20:50:32,284 >> Could not locate the image processor configuration file, will try to use the model config instead.
INFO 03-14 20:50:41 config.py:542] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 03-14 20:50:41 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B', speculative_config=None, tokenizer='/root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:3, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-14 20:50:41,275 >> loading file chat_template.jinja
[rank2]:[W314 20:50:41.696109819 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[INFO|tokenization_utils_base.py:2313] 2025-03-14 20:50:41,619 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:1093] 2025-03-14 20:50:41,728 >> loading configuration file /root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B/generation_config.json
[INFO|configuration_utils.py:1140] 2025-03-14 20:50:41,728 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048
}
INFO 03-14 20:50:41 cuda.py:230] Using Flash Attention backend.
INFO 03-14 20:50:42 model_runner.py:1110] Starting to load model /root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:05, 1.98s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.13s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:06<00:02, 2.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00, 2.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00, 2.20s/it]
INFO 03-14 20:50:52 model_runner.py:1115] Loading model weights took 0.0000 GB
INFO 03-14 20:50:53 worker.py:267] Memory profiling takes 0.94 seconds
INFO 03-14 20:50:53 worker.py:267] the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.70) = 27.57GiB
INFO 03-14 20:50:53 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 27.57GiB.
INFO 03-14 20:50:54 executor_base.py:110] # CUDA blocks: 32269, # CPU blocks: 4681
INFO 03-14 20:50:54 executor_base.py:115] Maximum concurrency for 4096 tokens per request: 126.05x
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/paddlejob/workspace/open-r1/src/open_r1/grpo.py", line 280, in
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/root/paddlejob/workspace/open-r1/src/open_r1/grpo.py", line 214, in main
[rank0]: trainer = GRPOTrainer(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 476, in init
[rank0]: self.llm = LLM(
[rank0]: ^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/utils.py", line 1051, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 242, in init
[rank0]: self.llm_engine = self.engine_class.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 484, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 276, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 121, in initialize_cache
[rank0]: self.collective_rpc("initialize_cache",
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/utils.py", line 2220, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/worker/worker.py", line 306, in initialize_cache
[rank0]: self._init_cache_engine()
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/worker/worker.py", line 311, in _init_cache_engine
[rank0]: self.cache_engine = [
[rank0]: ^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/worker/worker.py", line 312, in
[rank0]: CacheEngine(self.cache_config, self.model_config,
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/worker/cache_engine.py", line 69, in init
[rank0]: self.gpu_cache = self._allocate_kv_cache(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/vllm/worker/cache_engine.py", line 103, in allocate_kv_cache
[rank0]: layer_kv_cache = torch.zeros(alloc_shape,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: r.nvmlDeviceGetNvLinkRemoteDeviceType INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":33, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType
[rank0]:[W314 20:50:56.487939548 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0314 20:50:57.848000 17103 openr1/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17271 closing signal SIGTERM
W0314 20:50:57.849000 17103 openr1/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 17272 closing signal SIGTERM
E0314 20:50:59.416000 17103 openr1/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 17270) of binary: /root/paddlejob/workspace/open-r1/openr1/bin/python3
Traceback (most recent call last):
File "/root/paddlejob/workspace/open-r1/openr1/bin/accelerate", line 10, in
sys.exit(main())
^^^^^^
File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1182, in launch_command
deepspeed_launcher(args)
File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 861, in deepspeed_launcher
distrib_run.run(args)
File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/paddlejob/workspace/open-r1/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/open_r1/grpo.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-03-14_20:50:57
host : yq01-inf-hic-k8s-a100-aa24-0495.yq01.baidu.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 17270)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
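The root cause in the traceback is that the libnvidia-ml.so.1 under /home/opt/gpuproxy/lib64 does not export nvmlDeviceGetNvLinkRemoteDeviceType, which PyTorch's c10/cuda/driver_api.cpp asserts on. Below is a minimal diagnostic sketch (my addition, assuming the same library resolves at runtime) to check whether the NVML library picked up by the loader exports that symbol:

# Hedged diagnostic sketch: verify whether the NVML library resolved by the loader
# exports the symbol PyTorch asserts on (nvmlDeviceGetNvLinkRemoteDeviceType).
import ctypes

lib = ctypes.CDLL("libnvidia-ml.so.1")  # may resolve to /home/opt/gpuproxy/lib64/libnvidia-ml.so.1
if hasattr(lib, "nvmlDeviceGetNvLinkRemoteDeviceType"):
    print("symbol found: this NVML build should satisfy PyTorch's driver_api assert")
else:
    print("symbol missing: this NVML build (likely the gpuproxy shim) triggers the assert")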
My device:
4x A100 40G or 8x A100 40G; the error appears regardless of the number of GPUs (4 or 8).
CUDA:
Some of the installed packages:
mdurl 0.1.2
mistral_common 1.5.3
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.19.0
multidict 6.1.0
multiprocess 0.70.16
murmurhash 1.0.12
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.3
nltk 3.9.1
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
open-r1 0.1.0.dev0 /root/paddlejob/workspace/open-r1
openai 1.66.3
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python-headless 4.11.0.86
outlines 0.1.11
outlines_core 0.1.26
packaging 24.2
pandas 2.2.3
parameterized 0.9.0
partial-json-parser 0.2.1.1.post5
pathvalidate 3.2.3
pfzy 0.3.4
pillow 11.1.0
pip 25.0.1
platformdirs 4.3.6
pluggy 1.5.0
portalocker 3.1.1
preshed 3.0.9
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.0.2
prompt_toolkit 3.0.50
propcache 0.3.0
protobuf 3.20.3
psutil 7.0.0
py-cpuinfo 9.0.0
py-spy 0.4.0
pyarrow 19.0.1
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycodestyle 2.12.1
pycountry 24.6.1
pydantic 2.10.6
pydantic_core 2.27.2
pyflakes 3.2.0
Pygments 2.19.1
pytablewriter 1.2.1
pytest 8.3.5
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2025.1
PyYAML 6.0.2
pyzmq 26.3.0
ray 2.43.0
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rich 13.9.4
rouge_score 0.1.2
rpds-py 0.23.1
rsa 4.9
ruff 0.10.0
sacrebleu 2.5.1
safetensors 0.5.3
scikit-learn 1.6.1
scipy 1.15.2
sentencepiece 0.2.0
sentry-sdk 2.22.0
setproctitle 1.3.5
setuptools 76.0.0
six 1.17.0
smart-open 6.4.0
smmap 5.0.2
sniffio 1.3.1
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.5.1
starlette 0.46.1
sympy 1.13.1
tabledata 1.3.4
tabulate 0.9.0
tcolorpy 0.1.7
termcolor 2.3.0
thinc 8.2.5
threadpoolctl 3.6.0
tiktoken 0.9.0
tokenizers 0.21.1
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
tqdm 4.67.1
transformers 4.49.0
triton 3.1.0
trl 0.16.0.dev0
typepy 1.3.4
typer 0.9.4
typing_extensions 4.12.2
tzdata 2025.1
urllib3 2.3.0
uvicorn 0.34.0
uvloop 0.21.0
virtualenv 20.29.3
vllm 0.7.2
wandb 0.19.8
wasabi 1.1.3
watchfiles 1.0.4
wcwidth 0.2.13
weasel 0.3.4
websockets 15.0.1
wrapt 1.17.2
xformers 0.0.28.post3
xgrammar 0.1.15
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0
Two config files:
# recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
# Model arguments
model_name_or_path: /root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data training arguments
dataset_name: DigitalLearningGmbH/MATH-lighteval
dataset_config: default
system_prompt: "You are a helpful AI Assistant, designed to provided well-reasoned and detailed responses. You FIRST think about the reasoning process as an internal monologue and then provide the user with the answer. The reasoning process MUST BE enclosed within <think> and </think> tags."
# GRPO trainer config
bf16: true
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.7
do_eval: true
eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen-2.5-7B-Simple-RL
hub_strategy: every_save
learning_rate: 3.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 5
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 1024
max_steps: -1
num_generations: 3
num_train_epochs: 1
output_dir: data/Qwen-2.5-7B-Simple-RL
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: true
report_to:
reward_funcs:
reward_weights:
save_strategy: "no"
seed: 42
warmup_ratio: 0.1
# recipes/accelerate_configs/zero2.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
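For completeness, here is a hypothetical standalone sketch (my assumption, mirroring the vllm_* settings above and the engine config printed in the log; it is not code from open-r1 or TRL) that could be used to reproduce the crash outside GRPOTrainer, since the traceback fails while vLLM allocates its KV cache:

# Hypothetical repro sketch (vLLM 0.7.2): build the same LLM engine the trainer builds,
# so the KV-cache allocation step that hits the NVML assert can be tested in isolation.
from vllm import LLM

llm = LLM(
    model="/root/paddlejob/workspace/open-r1/models/Qwen2.5-Math-7B",
    dtype="bfloat16",
    gpu_memory_utilization=0.7,   # vllm_gpu_memory_utilization in the GRPO config
    max_model_len=4096,           # max_seq_len reported in the engine log
    enable_prefix_caching=True,
    device="cuda:3",              # the device that vllm_device: auto resolved to in the log
)
# If construction succeeds, the NVML/driver issue is not reproducible in isolation.
print(llm.generate(["1 + 1 ="])[0].outputs[0].text)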