To install EasyDeL in a Kaggle or Colab environment, follow these steps:
pip uninstall torch-xla -y -q # Remove the pre-installed torch-xla, which conflicts with JAX's TPU runtime
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu -qU # Install PyTorch for model conversion
pip install git+https://github.com/erfanzar/easydel -qU # Install EasyDeL from the latest source
pip install -U "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html -qU # Install JAX for TPUs
pip install tensorflow tensorflow-datasets -qU # Install TensorFlow and datasets for training
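After the installs, a quick sanity check confirms that JAX can see the TPU and that EasyDeL imports cleanly. This is a minimal sketch using only standard JAX calls; it assumes the package imports under the name easydel once installed from source:
# Run in a notebook cell after the installs above
import jax
import easydel  # noqa: F401 -- verifies the source install imports cleanly

print("JAX devices:", jax.devices())  # should list TPU devices on a TPU runtime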
To set up TPU hosts for multi-host or multi-slice environments, install eopod:
pip install eopod
If you encounter an error where eopod is not found, add the local bin path to your shell profile:
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc # Apply changes immediately
Next, configure eopod with your Google Cloud project details:
eopod configure --project-id YOUR_PROJECT_ID --zone YOUR_ZONE --tpu-name YOUR_TPU_NAME
Use eopod to install the necessary packages on all TPU slices:
eopod run pip install torch --index-url https://download.pytorch.org/whl/cpu -qU # Required for model conversion
eopod run pip install git+https://github.com/erfanzar/easydel -qU # Install EasyDeL from the latest source
eopod run pip install -U "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html -qU # Install JAX for TPUs
eopod run pip install tensorflow tensorflow-datasets -qU # Required for training and data loaders
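To verify that every TPU host sees its chips before moving on, a short script like the one below can be run on all workers (for example via eopod run). This is a minimal sketch that uses only standard JAX calls:
# check_tpu.py -- run on every host, e.g. with: eopod run python check_tpu.py
import jax

# Each host reports its index, its local chip count, and the global device count
# (on Cloud TPU pods the global count is discovered automatically by JAX).
print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local devices, {jax.device_count()} total")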
EasyDeL supports distributed execution with Ray, particularly for multi-host and multi-slice TPU environments. For GPUs, manual configuration is required, but TPUs can leverage eformer, an EasyDeL utility for cluster management.
For a 2x v4-64 setup (two slices), run the following, replacing TPU-VERSION, TPU-SLICES, and NUM_SLICES with v4, 64, and 2, and listing the internal IP of every host in each slice:
eopod run "python -m eformer.escale.tpexec.tpu_patcher --tpu-version TPU-VERSION --tpu-slice TPU-SLICES --num-slices NUM_SLICES --internal-ips INTERNAL_IP1-SLICE1,INTERNAL_IP2-SLICE1,INTERNAL_IP3-SLICE1,INTERNAL_IP4-SLICE1,INTERNAL_IP1-SLICE2,INTERNAL_IP2-SLICE2,INTERNAL_IP3-SLICE2,INTERNAL_IP4-SLICE2 --self-job"
For a v4-256 TPU:
eopod run "python -m eformer.escale.tpexec.tpu_patcher --tpu-version v4 --tpu-slice 256 --num-slices 1 --internal-ips <comma-separated-TPU-IPs> --self-job"
Once Ray is configured, you can use eformer.escale.tpexec instead of eopod for executing distributed code and benefiting from Ray's capabilities.
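Before switching to Ray-based execution, it helps to confirm that all hosts have actually joined the cluster. The check below is a minimal sketch using only the public ray API; the address="auto" argument assumes a Ray cluster is already running:
import ray

# Attach to the running cluster instead of starting a new local one
ray.init(address="auto")

# Aggregate resources across every joined TPU host; TPU entries appear
# here once all slices are registered
print(ray.cluster_resources())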
Before training, log in to Hugging Face and Weights & Biases:
eopod run "python -c 'from huggingface_hub import login; login(token=\"<API-TOKEN-HERE>\")'"
eopod run python -m wandb login <API-TOKEN-HERE>
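The same logins can also be performed from Python if you prefer to script them. This is a minimal sketch using the documented huggingface_hub and wandb login helpers, with placeholder tokens:
from huggingface_hub import login
import wandb

login(token="<API-TOKEN-HERE>")      # authenticate with the Hugging Face Hub
wandb.login(key="<API-TOKEN-HERE>")  # authenticate with Weights & Biases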
Run the following command to fine-tune a model using EasyDeL's DPO framework:
eopod run python -m easydel.scripts.finetune.dpo \
--repo_id meta-llama/Llama-3.1-8B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--dataset_split "train[:90%]" \
--refrence_model_repo_id meta-llama/Llama-3.3-70B-Instruct \
--attn_mechanism vanilla \
--beta 0.08 \
--loss_type sigmoid \
--max_length 2048 \
--max_prompt_length 1024 \
--ref_model_sync_steps 128 \
--total_batch_size 16 \
--learning_rate 1e-6 \
--learning_rate_end 6e-7 \
--log_steps 50 \
--shuffle_train_dataset \
--report_steps 1 \
--progress_bar_type tqdm \
--num_train_epochs 3 \
--auto_shard_states \
--optimizer adamw \
--scheduler linear \
--do_last_save \
--save_steps 1000 \
--use_wandb
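To make the --beta and --loss_type sigmoid flags concrete, the objective they select is the standard sigmoid DPO loss. The sketch below shows the textbook formula, not EasyDeL's internal implementation; the function and argument names are illustrative:
import jax
import jax.numpy as jnp

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.08):
    # Log-ratios of the policy against the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # beta controls how sharply the preference margin is enforced
    logits = beta * (chosen_logratios - rejected_logratios)
    # Sigmoid (logistic) DPO loss, averaged over the batch
    return -jnp.mean(jax.nn.log_sigmoid(logits))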
With these steps complete, EasyDeL is installed and configured for training large models on TPUs across multi-host and multi-slice environments.