Thank you for your excellent work and for open-sourcing all of this code.
I've recently found that when restarting the pre-training task from a halfway checkpoint (e.g. checkpoint_last.pt), the EMA decay remains at its initial value and is not updated in the model.
It seems that the if condition on line 360 of EAT/models/EAT_pretraining.py (lines 356 to 380 at e1ad547) does not behave correctly during the restart procedure. When restarting, this function is called while the checkpoint is loaded, at which point self.num_updates = 0 and num_updates is greater than 1. As a result, self.decay stays at the initial value 0.9998, i.e. the value used at the very beginning of pre-training.
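To make this concrete, here is a toy reproduction of the behaviour (simplified, with made-up names and hyperparameters on my side; it is not the actual EAT/fairseq code, only the shape of the condition I'm referring to):

```python
# Toy reproduction (hypothetical names and values, not the actual EAT code):
# the guard that is meant to skip redundant calls at step 1 also matches the
# call made while restoring a checkpoint, so the annealed decay is never set.

def get_annealed_rate(start, end, curr_step, total_steps):
    """Linearly anneal the decay from `start` to `end` over `total_steps`."""
    if curr_step >= total_steps:
        return end
    return start + (end - start) * (curr_step / total_steps)


class ToyPretrainingModel:
    def __init__(self, ema_decay=0.9998, ema_end_decay=0.99999, ema_anneal_end_step=100_000):
        self.decay = ema_decay                # current EMA decay
        self.ema_decay = ema_decay            # initial decay from the config
        self.ema_end_decay = ema_end_decay    # final decay after annealing
        self.ema_anneal_end_step = ema_anneal_end_step
        self.num_updates = 0

    def set_num_updates(self, num_updates):
        # During checkpoint loading: self.num_updates == 0 and num_updates > 1,
        # so this branch is taken and self.decay keeps its initial value.
        if self.num_updates == 0 and num_updates > 1:
            pass
        else:
            self.decay = get_annealed_rate(
                self.ema_decay, self.ema_end_decay, num_updates, self.ema_anneal_end_step
            )
        self.num_updates = num_updates


model = ToyPretrainingModel()
model.set_num_updates(50_000)  # call made while restoring checkpoint_last.pt
print(model.decay)             # prints 0.9998, not the annealed value for step 50,000
```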
I'm not sure how severe the consequences are, since the EMA decay is corrected after the first batch. But after a few trials, it seems to me that restarting from a checkpoint ends up slightly worse than training without interruption.
Hope this will help.
By the way, I think only the EMA decay should be updated when loading the checkpoint; self.ema.step should not be called there, since it will be called anyway when the first batch is fed into the model.
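Roughly what I have in mind (just a sketch; the helper name and the config fields such as ema_anneal_end_step are assumptions on my part, not the actual EAT API):

```python
# Rough sketch (hypothetical helper, not a patch against EAT_pretraining.py):
# when set_num_updates fires during checkpoint loading, only re-anneal the
# decay for the restored step; leave the actual EMA step to the first batch.

def restore_ema_decay(ema, cfg, num_updates):
    """Recompute the annealed EMA decay for `num_updates` without stepping the EMA."""
    if cfg.ema_decay != cfg.ema_end_decay:
        if num_updates >= cfg.ema_anneal_end_step:
            decay = cfg.ema_end_decay
        else:
            # linear anneal between the initial and final decay
            decay = cfg.ema_decay + (cfg.ema_end_decay - cfg.ema_decay) * (
                num_updates / cfg.ema_anneal_end_step
            )
        ema.set_decay(decay)
    # intentionally no ema.step() here -- it will run once the first batch arrives
```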
Thank you so much for pointing this out @jianganbai.
I'll look into the issue with EMA updates when resuming training from a checkpoint (it might take some time, as I'm currently tied up with other deadlines 😢).
Really appreciate your findings—I'll keep the issue open until it's resolved.