
Almost successfully reproducing open-r1/OpenR1-Qwen-7B based on Qwen/Qwen2.5-Math-7B-Instruct. Here are the training configurations. #545

Open
kiseliu opened this issue Mar 25, 2025 · 5 comments

Comments

@kiseliu

kiseliu commented Mar 25, 2025

After encountering several errors and incorrect results, I'd like to share my experience reproducing open-r1/OpenR1-Qwen-7B based on Qwen/Qwen2.5-Math-7B-Instruct.

The training commands below are configured for a node of 8 x H100s (80GB).

1. Modify the config file of Qwen/Qwen2.5-Math-7B-Instruct

After downloading the model Qwen/Qwen2.5-Math-7B-Instruct, modify its config.json to match https://huggingface.co/open-r1/OpenR1-Qwen-7B/blob/main/config.json
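The specific fields to change can be read off the linked OpenR1-Qwen-7B config.json. Here is a minimal sketch of applying such overrides to a locally downloaded config; the override value shown is a placeholder, so verify the actual keys and values against the linked file:

```python
import json

def patch_config(path, overrides):
    """Apply key overrides (taken from the reference OpenR1-Qwen-7B
    config.json) to a locally downloaded model config.json."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update(overrides)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Placeholder override -- copy the real values from the
# OpenR1-Qwen-7B config.json linked above before using this.
OVERRIDES = {"max_position_embeddings": 32768}
```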

2. Modify the training recipes correctly

If you follow the official installation steps and run the following training command:

accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Qwen-7B/sft/config.yaml

The first issue you’ll encounter is #366.

The effective fix is to modify the corresponding recipe, https://github.com/huggingface/open-r1/blob/main/recipes/OpenR1-Qwen-7B/sft/config.yaml, by changing line 29:

- use_liger_kernel: true
+ use_liger: true

Other workarounds lead to further issues.
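The rename can also be applied programmatically. A minimal sketch, assuming the only change needed in the recipe is the `use_liger_kernel` → `use_liger` key rename:

```python
def fix_liger_flag(yaml_text: str) -> str:
    """Rename the SFT recipe flag `use_liger_kernel` to `use_liger`,
    which is what the pinned dependencies expect (see issue #366).
    Assumes this key rename is the only change required."""
    return yaml_text.replace("use_liger_kernel:", "use_liger:")
```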

3. Modify the sft.py file

Following issue #494, we can change sft.py (https://github.com/huggingface/open-r1/blob/main/src/open_r1/sft.py) as follows:

- tokenizer.pad_token = tokenizer.eos_token
+ if tokenizer.pad_token is None:
+    tokenizer.pad_token = tokenizer.eos_token
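The guard above only falls back to the eos token when no pad token is defined, so models that ship a dedicated pad token keep it. A minimal sketch of the same logic with a stand-in tokenizer object (not the real transformers class):

```python
class StubTokenizer:
    """Stand-in for a Hugging Face tokenizer; only the two
    attributes relevant to the patch above."""
    def __init__(self, pad_token, eos_token):
        self.pad_token = pad_token
        self.eos_token = eos_token

def ensure_pad_token(tokenizer):
    # Only fall back to eos when no pad token is defined. The old
    # unconditional assignment overwrote an existing pad token,
    # which breaks padding/masking for models that define one.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```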

4. The reproducing results on AIME24, MATH-500

I tested my saved checkpoint (3219 total training steps), and the performance is:

  • AIME24: 46.7
  • MATH-500: 92.4

I tested open-r1/OpenR1-Qwen-7B, and the performance is:

  • AIME24: 50.0
  • MATH-500: 92.8

It seems that the official model was saved at step 3150.

@Tim-Siu

Tim-Siu commented Mar 25, 2025

Thanks for sharing!

I wonder if replacing DataCollatorForLanguageModeling with DataCollatorForCompletionOnlyLM would further boost the performance.

Right now, the OpenR1 recipes train on both prompts and responses, which doesn't seem like common practice.

@kiseliu
Author

kiseliu commented Mar 25, 2025

@Tim-Siu I'm not sure the performance would improve.

The current version takes only 12 hours for SFT on one node of 8 x H100s (80GB).

If we replace DataCollatorForLanguageModeling with DataCollatorForCompletionOnlyLM, we can no longer set packing=True, so I expect training would take much longer, perhaps 20+ hours.
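For context on the collator difference being discussed: completion-only training masks prompt tokens out of the loss by setting their labels to -100, the ignore index used by transformers' cross-entropy loss. A minimal sketch of that masking (a simplified illustration, not the actual DataCollatorForCompletionOnlyLM code):

```python
IGNORE_INDEX = -100  # label value ignored by transformers' CE loss

def mask_prompt_labels(input_ids, prompt_len):
    """Build labels that contribute loss only on response tokens,
    the behavior DataCollatorForCompletionOnlyLM provides; the plain
    LM collator instead trains on prompt and response alike."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
```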

@VulDetect-llm

How can I run the 7B model on multiple GPUs? A single A800 runs into OOM errors.

@juliancodaforno

Thank you so much!! I have two questions:

  1. In terms of efficiency, did you change any other parameters? Following your steps exactly, I get an expected runtime of 60 hours on 8 x A100 (80GB), but that should be only ~2.5x your reported 12 hours (since you used H100s), so it's still quite far off. I guess you changed attn_implementation to flash_attention_2, right? As a side note, you can increase per_device_train_batch_size to 2 on 8 GPUs of 80GB (don't forget to lower gradient_accumulation_steps to 1) and it still runs without OOM; that gives me an expected runtime of ~23 hours, which is more in line with the H100-to-A100 gap.

  2. Did you try replicating other models, such as Qwen2.5-1.5B-Instruct? I can't replicate the results there.

@kiseliu
Author

kiseliu commented Mar 26, 2025


@juliancodaforno

  1. I didn’t change any other parameters. The attn_implementation I used was the default value (sdpa).

    Yes, setting per_device_train_batch_size=2 and gradient_accumulation_steps=1 should work without causing OOM errors.

  2. I previously attempted to run the code for Qwen2.5-1.5B-Instruct, but I set it up incorrectly, and I don't plan to retry it.
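For reference, the batch-size swap discussed above keeps the global batch size unchanged. The recipe values used below (per_device_train_batch_size=1, gradient_accumulation_steps=2) are illustrative assumptions, not read from the recipe yaml:

```python
def effective_batch_size(per_device, grad_accum, num_gpus):
    # Global (effective) batch size in data-parallel training.
    return per_device * grad_accum * num_gpus

# Doubling the per-device batch while halving gradient accumulation
# leaves the global batch size unchanged (illustrative values):
assert effective_batch_size(1, 2, 8) == effective_batch_size(2, 1, 8)
```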
