Simplified-AdEMAMix

This is the official implementation of the Sim-AdEMAMix optimizer. To use it, copy simplified_AdEMAMix.py into your codebase and construct the optimizer as follows (here T is the total number of training steps in the run):

from simplified_AdEMAMix import SimAdEMAMix

optim = SimAdEMAMix(lr=1e-4, betas=(0.99, 0.95), alpha=0.0, min_beta1=0.9, beta1_warmup=T, weight_decay=0.0)

By default, the optimizer maintains momentum in theory style (not EMA style) with bias correction turned off, which generally seems to help in practice with cosine decay. The optimal value of $\alpha$ depends strongly on the batch size and, from theory, should scale down linearly as the batch size increases. In our runs, the optimal $\alpha$ was close to 0 ($\approx 0.05$) at a batch size of 1M tokens and close to 100 at 32k tokens. At higher batch sizes (i.e., in the curvature-dominated regime rather than the noise-dominated one), $\alpha$ should be set to a small multiple of $1-\beta_1$ (inspired by Nesterov).
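As a rough illustration of the two momentum conventions mentioned above, the sketch below assumes that "theory style" means accumulating raw gradients (`m = beta * m + g`) while "EMA style" means `m = beta * m + (1 - beta) * g`; that reading is an assumption, so check simplified_AdEMAMix.py for the exact update. The helper names `momentum_buffers` and `large_batch_alpha` are hypothetical and not part of the optimizer.

```python
def momentum_buffers(grads, beta):
    """Return the final (theory-style, EMA-style) buffers for a gradient sequence."""
    m_theory = 0.0  # assumed theory-style update: m <- beta * m + g
    m_ema = 0.0     # EMA-style update:            m <- beta * m + (1 - beta) * g
    for g in grads:
        m_theory = beta * m_theory + g
        m_ema = beta * m_ema + (1.0 - beta) * g
    return m_theory, m_ema


def large_batch_alpha(beta1, multiple):
    """Large-batch heuristic from the text: alpha as a small multiple of 1 - beta1."""
    return multiple * (1.0 - beta1)


beta = 0.9
m_theory, m_ema = momentum_buffers([1.0] * 200, beta)
# Both buffers start at zero, so the theory-style buffer is the EMA-style
# buffer scaled by 1 / (1 - beta).
ratio = m_theory / m_ema
print(f"theory / EMA buffer ratio: {ratio:.3f}")

# The multiple (2.0 here) is purely illustrative and should be tuned.
print(f"alpha for beta1=0.99: {large_batch_alpha(0.99, multiple=2.0):.3f}")
```

Under this reading, the theory-style buffer is larger by a factor of $1/(1-\beta_1)$, which is one way to see why the learning rate has to be retuned whenever $\beta_1$ changes.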

To tune $\eta$, $\beta_1$, $\beta_2$ and min_beta1, start from an optimal Adam run with hyperparameters $\eta^{adam}$, $\beta_1^{adam}$ and $\beta_2^{adam}$. We recommend searching for the optimal Sim-AdEMAMix hyperparameters around min_beta1 = $\beta_1^{adam}$, $\beta_1$ higher than min_beta1 (for min_beta1 = 0.9, try 0.95, 0.99, 0.999), $\beta_2 = \beta_2^{adam}$ and $\eta = \eta^{adam} \sqrt{(1-\text{min beta1})(1-\beta_1)}$; the optimal $\eta$ is therefore coupled to the value of $\beta_1$. Note also that $\beta_1$ should generally decrease as the batch size increases.
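The recipe above can be written as a small helper. This is a minimal sketch, not part of simplified_AdEMAMix.py: the function name `suggest_sim_ademamix_hparams` is hypothetical, and the returned keys simply mirror the keyword arguments used in the usage snippet. The values it returns are starting points for a sweep, not guaranteed optima.

```python
import math


def suggest_sim_ademamix_hparams(lr_adam, beta1_adam, beta2_adam, beta1):
    """Map tuned Adam hyperparameters to a Sim-AdEMAMix starting point."""
    if not beta1_adam < beta1 < 1.0:
        raise ValueError("beta1 should lie strictly between beta1_adam and 1")
    min_beta1 = beta1_adam  # min_beta1 is set to Adam's beta1
    # Learning rate is rescaled by sqrt((1 - min_beta1) * (1 - beta1)).
    lr = lr_adam * math.sqrt((1.0 - min_beta1) * (1.0 - beta1))
    return {"lr": lr, "betas": (beta1, beta2_adam), "min_beta1": min_beta1}


# Illustrative Adam baseline (assumed values, not taken from the experiments).
for beta1 in (0.95, 0.99, 0.999):
    hp = suggest_sim_ademamix_hparams(
        lr_adam=1e-3, beta1_adam=0.9, beta2_adam=0.95, beta1=beta1
    )
    print(f"beta1={beta1}: lr={hp['lr']:.3e}, betas={hp['betas']}")
```

Because the learning rate depends on $\beta_1$, each candidate $\beta_1$ in a sweep gets its own rescaled learning rate rather than reusing a single value.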
