
Almost successfully reproducing open-r1/OpenR1-Qwen-7B based on Qwen/Qwen2.5-Math-7B-Instruct. Here are the training configurations. #545

Open
kiseliu opened this issue Mar 25, 2025 · 5 comments

Comments

@kiseliu

kiseliu commented Mar 25, 2025

After encountering several errors and incorrect results, I'd like to share my experience reproducing open-r1/OpenR1-Qwen-7B based on Qwen/Qwen2.5-Math-7B-Instruct.

The training commands below are configured for a node of 8 x H100s (80GB).

1. Modify the config file of Qwen/Qwen2.5-Math-7B-Instruct

After downloading the model Qwen/Qwen2.5-Math-7B-Instruct, modify its config.json to match https://huggingface.co/open-r1/OpenR1-Qwen-7B/blob/main/config.json
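The specific fields to change can be read off the linked OpenR1-Qwen-7B config.json. Here is a minimal sketch of applying such overrides to a locally downloaded config; the override value shown is a placeholder, so verify the actual keys and values against the linked file:

```python
import json

def patch_config(path, overrides):
    """Apply key overrides (taken from the reference OpenR1-Qwen-7B
    config.json) to a locally downloaded model config.json."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update(overrides)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Placeholder override -- copy the real values from the
# OpenR1-Qwen-7B config.json linked above before using this.
OVERRIDES = {"max_position_embeddings": 32768}
```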

2. Modify the training recipes correctly

If you follow the official installation steps and run the following training command:

accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Qwen-7B/sft/config.yaml

The first issue you’ll encounter is #366.

The effective fix is to modify the corresponding recipe, https://github.com/huggingface/open-r1/blob/main/recipes/OpenR1-Qwen-7B/sft/config.yaml, by changing line 29:

- use_liger_kernel: true
+ use_liger: true

Other workarounds lead to further issues.
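The rename can also be applied programmatically. A minimal sketch, assuming the only change needed in the recipe is the `use_liger_kernel` → `use_liger` key rename:

```python
def fix_liger_flag(yaml_text: str) -> str:
    """Rename the SFT recipe flag `use_liger_kernel` to `use_liger`,
    which is what the pinned dependencies expect (see issue #366).
    Assumes this key rename is the only change required."""
    return yaml_text.replace("use_liger_kernel:", "use_liger:")
```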

3. Modify the sft.py file

Following issue #494, we can change sft.py (https://github.com/huggingface/open-r1/blob/main/src/open_r1/sft.py) as follows:

- tokenizer.pad_token = tokenizer.eos_token
+ if tokenizer.pad_token is None:
+    tokenizer.pad_token = tokenizer.eos_token
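The guard above only falls back to the eos token when no pad token is defined, so models that ship a dedicated pad token keep it. A minimal sketch of the same logic with a stand-in tokenizer object (not the real transformers class):

```python
class StubTokenizer:
    """Stand-in for a Hugging Face tokenizer; only the two
    attributes relevant to the patch above."""
    def __init__(self, pad_token, eos_token):
        self.pad_token = pad_token
        self.eos_token = eos_token

def ensure_pad_token(tokenizer):
    # Only fall back to eos when no pad token is defined. The old
    # unconditional assignment overwrote an existing pad token,
    # which breaks padding/masking for models that define one.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```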

4. The reproducing results on AIME24, MATH-500

I tested my saved checkpoint (3219 total training steps), and the performance is:

  • AIME24: 46.7
  • MATH-500: 92.4

I tested open-r1/OpenR1-Qwen-7B, and the performance is:

  • AIME24: 50.0
  • MATH-500: 92.8

It seems that the official model was saved at step 3150.

@Tim-Siu

Tim-Siu commented Mar 25, 2025

Thanks for sharing!

I wonder if replacing DataCollatorForLanguageModeling with DataCollatorForCompletionOnlyLM would further boost the performance.

Right now, the OpenR1 recipes train on both prompts and responses, which doesn't seem like common practice.

@kiseliu
Author

kiseliu commented Mar 25, 2025

@Tim-Siu I'm not sure the performance would improve.

The current version takes only 12 hours for SFT on one node of 8 x H100s (80GB).

If we replace DataCollatorForLanguageModeling with DataCollatorForCompletionOnlyLM, we can no longer set packing=True, so I expect training would take much longer, perhaps 20+ hours.
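For context on the collator difference being discussed: completion-only training masks prompt tokens out of the loss by setting their labels to -100, the ignore index used by transformers' cross-entropy loss. A minimal sketch of that masking (a simplified illustration, not the actual DataCollatorForCompletionOnlyLM code):

```python
IGNORE_INDEX = -100  # label value ignored by transformers' CE loss

def mask_prompt_labels(input_ids, prompt_len):
    """Build labels that contribute loss only on response tokens,
    the behavior DataCollatorForCompletionOnlyLM provides; the plain
    LM collator instead trains on prompt and response alike."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
```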

@VulDetect-llm

How can I run the 7B model on multiple GPUs? A single A800 runs into OOM errors.

@juliancodaforno

Thank you so much!! I have two questions:

  1. In terms of efficiency, did you change any other parameters? Following your steps exactly, I get an expected runtime of 60 hours on 8 x A100 (80GB), but that should be only ~2.5x your reported 12 hours (since you used H100s), so it's still quite far off. I guess you changed attn_implementation to flash_attention_2, right? As a side note, you can increase per_device_train_batch_size to 2 on 8 GPUs of 80GB (don't forget to lower gradient_accumulation_steps to 1) and it still runs without OOM; that gives me an expected runtime of ~23 hours, which is more in line with the H100-to-A100 gap.

  2. Did you try replicating other models, such as Qwen2.5-1.5B-Instruct? I can't replicate the results there.

@kiseliu
Author

kiseliu commented Mar 26, 2025


@juliancodaforno

  1. I didn’t change any other parameters. The attn_implementation I used was the default value (sdpa).

    Yes, setting per_device_train_batch_size=2 and gradient_accumulation_steps=1 should work without causing OOM errors.

  2. I previously attempted to run the code for Qwen2.5-1.5B-Instruct, but I set it up incorrectly, and I don't plan to retry it.
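For reference, the batch-size swap discussed above keeps the global batch size unchanged. The recipe values used below (per_device_train_batch_size=1, gradient_accumulation_steps=2) are illustrative assumptions, not read from the recipe yaml:

```python
def effective_batch_size(per_device, grad_accum, num_gpus):
    # Global (effective) batch size in data-parallel training.
    return per_device * grad_accum * num_gpus

# Doubling the per-device batch while halving gradient accumulation
# leaves the global batch size unchanged (illustrative values):
assert effective_batch_size(1, 2, 8) == effective_batch_size(2, 1, 8)
```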
