Example Overview
The PyTorch minGPT example is based on source code forked from the minGPT GitHub repository.
Setup
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment including the $PYTHON environment variable. The guide will walk you through the process of setting up your system to run the model on Gaudi.
Clone Habana Gaudi-tutorials
In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. You can run the hl-smi utility to determine the SynapseAI version.
git clone https://github.com/HabanaAI/Gaudi-tutorials /path/to/Gaudi-tutorials
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/
Install Habana DeepSpeed
Please follow the instructions provided in the Gaudi DeepSpeed User Guide to install DeepSpeed on Gaudi.
pip install git+https://github.com/HabanaAI/[email protected]
Memory Consumption Under Different DeepSpeed Technologies
Preparations
- Make sure there are available Gaudi devices (HPUs). In this tutorial we use 8 Gaudi devices.
- To report memory usage at each training phase, add --dump-memory to the command line.
- To limit the number of training steps (e.g. to 4 steps), add --steps 4 to the command line.
Run minGPT with different DeepSpeed technologies
- Use a larger model than the default gpt-nano model, so that the memory changes between training phases are easier to see.
Change the model type from gpt-nano to gpt2
--- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
+++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
@@ -146,7 +146,7 @@ for a, b in zip(x,y):
from mingpt.model import GPT
model_config = GPT.get_default_config()
-model_config.model_type = 'gpt-nano'
+model_config.model_type = 'gpt2'
model_config.vocab_size = train_dataset.get_vocab_size()
model_config.block_size = train_dataset.get_block_size()
- Run minGPT with DeepSpeed ZeRO0
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory
The memory consumption in each training phase, together with the peak memory consumption, will look like the following (in MB):
Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
---|---|---|---|---|---|---|---|
0 | 328 | 328 | 328 | 1726(max 1735) | 1726 | 1402 | 1735 |
1 | 1726(max 2700) | 1726 | 1726 | 2051(max 2384) | 2051 | 2051 | 2700 |
2 | 2051 | 2051 | 2051 | 2051(max 2384) | 2051 | 1726 | 2384 |
3 | 1726 | 1726 | 1726 | 2051(max 2384) | 2051 | 1726 | 2384 |
- Run minGPT with DeepSpeed ZeRO1
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory
The memory consumption in each training phase, together with the peak memory consumption, will look like the following (in MB):
Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
---|---|---|---|---|---|---|---|
0 | 166 | 166 | 166 | 830(max 1056) | 830 | 835 | 1056 |
1 | 672 | 672 | 672 | 695(max 997) | 695 | 672(max 857) | 997 |
2 | 672 | 672 | 672 | 695(max 997) | 695 | 672(max 857) | 997 |
3 | 672 | 672 | 672 | 695(max 997) | 695 | 672(max 857) | 997 |
- Run minGPT with DeepSpeed ZeRO1 and Activation Checkpointing
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1_ac.json --use_hpu --steps 4 --dump-memory --activation-checkpoint
The memory consumption in each training phase, together with the peak memory consumption, will look like the following (in MB):
Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
---|---|---|---|---|---|---|---|
0 | 166 | 166 | 166 | 581(max 758) | 581 | 423(max 586) | 758 |
1 | 423 | 423 | 423 | 446(max 755) | 446 | 423(max 608) | 755 |
2 | 423 | 423 | 423 | 446(max 758) | 446 | 423(max 608) | 758 |
3 | 423 | 423 | 423 | 446(max 758) | 446 | 423(max 608) | 758 |
- Run minGPT with DeepSpeed ZeRO2
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero2.json --use_hpu --steps 4 --dump-memory
The memory consumption in each training phase, together with the peak memory consumption, will look like the following (in MB):
Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
---|---|---|---|---|---|---|---|
0 | 166 | 166 | 166 | 660(max 993) | 660 | 682 | 993 |
1 | 520 | 520 | 520 | 663(max 935) | 663 | 523(max 708) | 935 |
2 | 523 | 523 | 523 | 568(max 935) | 568 | 523(max 708) | 935 |
3 | 523 | 523 | 523 | 568(max 935) | 568 | 523(max 708) | 935 |
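The runs above differ mainly in the configuration file passed to --deepspeed_config. The actual contents of ds_config.json, ds_config_zero1.json and ds_config_zero2.json are not shown here, so the following is only a minimal sketch of how such configurations could be generated. The helper name make_ds_config and the batch size are hypothetical; the keys train_micro_batch_size_per_gpu, steps_per_print and zero_optimization.stage are standard DeepSpeed configuration keys, but the tutorial's own files may set additional options.

```python
import json


def make_ds_config(zero_stage, micro_batch_size=1):
    """Build a minimal DeepSpeed config dict for a given ZeRO stage.

    Hypothetical helper for illustration; the tutorial's real config
    files may contain more settings (precision, optimizer, and so on).
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError(f"ZeRO stage must be 0, 1, 2 or 3, got {zero_stage}")
    return {
        # Assumed value; pick a batch size that suits the model.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "steps_per_print": 1,
        # Stage 0 disables ZeRO partitioning; stages 1 and 2 partition
        # optimizer states and gradients respectively.
        "zero_optimization": {"stage": zero_stage},
    }


if __name__ == "__main__":
    for stage in (0, 1, 2):
        print(f"# ZeRO{stage}")
        print(json.dumps(make_ds_config(stage), indent=2))
```

Only the stage value changes between the three dictionaries, which is why the command lines above are otherwise identical.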
Conclusions:
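- ZeRO0 (essentially the default DDP) uses the most memory (peak of 2700 MB).
- ZeRO1 and ZeRO2 use less memory than ZeRO0 (peaks of 1056 MB and 993 MB, respectively).
- Adding activation checkpointing to ZeRO1 reduces memory consumption even further (peak of 758 MB).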
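To make the comparison concrete, the short sketch below takes the highest "Max memory" value from each of the four tables above and computes the reduction relative to ZeRO0. The dictionary and function names are illustrative only; the numbers are copied from the tables.

```python
# Highest "Max memory" value (MB) across the 4 steps of each run,
# copied from the tables above.
peak_mb = {
    "ZeRO0": 2700,
    "ZeRO1": 1056,
    "ZeRO1 + activation checkpointing": 758,
    "ZeRO2": 993,
}


def reduction_vs_baseline(peaks, baseline="ZeRO0"):
    """Return the percentage reduction in peak memory versus the baseline."""
    base = peaks[baseline]
    return {name: 100.0 * (base - mb) / base for name, mb in peaks.items()}


if __name__ == "__main__":
    for name, pct in reduction_vs_baseline(peak_mb).items():
        print(f"{name}: {peak_mb[name]} MB peak, {pct:.1f}% below ZeRO0")
```

For these measurements the reductions come out at roughly 60.9% for ZeRO1, 63.2% for ZeRO2, and 71.9% for ZeRO1 with activation checkpointing.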
Use ZeRO to solve the Out-Of-Memory issue
Because memory on an HPU device is limited, a big model may fail to run on HPU with the default configuration (e.g. ZeRO0).
- Create a very big model with minGPT
Change the model type from gpt-nano to gpt2-xl
--- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
+++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
@@ -146,7 +146,7 @@ for a, b in zip(x,y):
from mingpt.model import GPT
model_config = GPT.get_default_config()
-model_config.model_type = 'gpt-nano'
+model_config.model_type = 'gpt2-xl'
model_config.vocab_size = train_dataset.get_vocab_size()
model_config.block_size = train_dataset.get_block_size()
- Run minGPT with DeepSpeed ZeRO0
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory
There will be an OOM error from the HPU software stack like the one below:
...
RuntimeError: FATAL ERROR :: MODULE:BRIDGE Exception in Launch thread...
FATAL ERROR :: MODULE:DEVMEM Allocation failed for size::40960000 (39.0625)MB
- Run minGPT with DeepSpeed ZeRO1
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory
By applying ZeRO (e.g. ZeRO1), the model can run successfully on HPU.
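A rough back-of-the-envelope estimate helps explain why partitioning makes the difference. The sketch below rests on several assumptions that are not confirmed by this tutorial: the gpt2 and gpt2-xl layer counts and embedding widths are those used in minGPT's model definitions, the parameter count is approximated as 12 * n_layer * n_embd^2 (ignoring embeddings, biases and layer norms), and model states follow the mixed-precision Adam accounting from the ZeRO paper (2 bytes of parameters, 2 bytes of gradients and 12 bytes of optimizer state per parameter). Activations and temporary buffers are not included, so the figures are illustrative and will not match measured values.

```python
# Assumed minGPT model shapes (n_layer, n_embd); not taken from this tutorial.
MODEL_SHAPES = {
    "gpt2": (12, 768),
    "gpt2-xl": (48, 1600),
}


def approx_params(n_layer, n_embd):
    """Approximate transformer parameter count, ignoring embeddings and biases."""
    return 12 * n_layer * n_embd ** 2


def model_state_bytes(num_params, zero_stage, num_devices):
    """Per-device model-state memory under ZeRO, mixed-precision Adam assumed."""
    params = 2.0 * num_params   # 16-bit parameters
    grads = 2.0 * num_params    # 16-bit gradients
    optim = 12.0 * num_params   # fp32 parameters, momentum and variance
    if zero_stage >= 1:
        optim /= num_devices    # ZeRO1 partitions optimizer states
    if zero_stage >= 2:
        grads /= num_devices    # ZeRO2 also partitions gradients
    if zero_stage >= 3:
        params /= num_devices   # ZeRO3 also partitions parameters
    return params + grads + optim


if __name__ == "__main__":
    num_devices = 8  # this tutorial uses 8 Gaudi devices
    for name, (n_layer, n_embd) in MODEL_SHAPES.items():
        n = approx_params(n_layer, n_embd)
        for stage in (0, 1, 2):
            gib = model_state_bytes(n, stage, num_devices) / 2 ** 30
            print(f"{name}: ZeRO{stage} model states ~ {gib:.1f} GiB per device")
```

Under these assumptions, ZeRO1 on 8 devices cuts the per-device model states from 16 bytes per parameter to 5.5 bytes per parameter, which is consistent with a model that fails under ZeRO0 fitting under ZeRO1.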