Enabling DeepSpeed on Gaudi

This tutorial provides example training scripts that demonstrate different DeepSpeed optimization technologies on HPU. It focuses on the memory optimization technologies, including the Zero Redundancy Optimizer (ZeRO) and Activation Checkpointing.

Example Overview

The PyTorch minGPT example is based on source code forked from the minGPT GitHub repository.

Setup

Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. The guide will walk you through the process of setting up your system to run the model on Gaudi.

Clone the Habana Gaudi-tutorials Repository

In the Docker container, clone this repository and switch to the branch that matches your SynapseAI version. You can run the hl-smi utility to determine the SynapseAI version.

git clone https://github.com/HabanaAI/Gaudi-tutorials /path/to/Gaudi-tutorials
cd Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/

Install Habana DeepSpeed

Please follow the instructions provided in the Gaudi DeepSpeed User Guide to install DeepSpeed on Gaudi.

pip install git+https://github.com/HabanaAI/[email protected]
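
As a quick sanity check (not part of the original instructions), you can verify that the package imports and reports its version:

$PYTHON -c "import deepspeed; print(deepspeed.__version__)"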

Memory Consumption Under Different DeepSpeed Technologies

Preparations

  1. Make sure there are available Gaudi devices (HPUs). In this tutorial we use 8 Gaudi devices.
  2. To dump the memory consumption at each training phase, add --dump-memory to the command line (see the sketch after this list).
  3. To limit the number of training steps (e.g. to 4 steps), add --steps 4 to the command line.
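
The numbers in the tables below can be collected with the HPU memory APIs from the Habana PyTorch bridge. A minimal sketch of how --dump-memory could be implemented (a hypothetical helper; the actual code in demo_ds.py may differ):

import habana_frameworks.torch.hpu as hthpu

# Hypothetical helper: report current and peak device memory in MB.
def dump_memory(phase):
    mb = 1024 * 1024
    allocated = hthpu.memory_allocated() // mb    # currently allocated
    peak = hthpu.max_memory_allocated() // mb     # peak since last reset
    print(f"{phase}: {allocated} MB (max {peak} MB)")

Calling such a helper before and after the forward pass, the backward pass, and the optimizer step would produce the per-phase columns shown below.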

Run minGPT with different DeepSpeed technologies

  1. Create a bigger model instead of the default gpt-nano model. This makes the memory variation across the different training phases more obvious.

Change the model type from gpt-nano to gpt2

--- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
+++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
@@ -146,7 +146,7 @@ for a, b in zip(x,y):
 from mingpt.model import GPT

 model_config = GPT.get_default_config()
-model_config.model_type = 'gpt-nano'
+model_config.model_type = 'gpt2'
 model_config.vocab_size = train_dataset.get_vocab_size()
 model_config.block_size = train_dataset.get_block_size()

  2. Run minGPT with DeepSpeed ZeRO0
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory
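
For reference, a minimal ZeRO0 configuration might look like the sketch below (illustrative values; the actual ds_config.json in the repository may set additional fields, and the ds_config_zero1.json variant used later mainly changes the stage value):

{
  "train_batch_size": 16,
  "zero_optimization": {
    "stage": 0
  }
}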

The memory consumption at the different training phases and the maximum memory consumption will look like the following (all values in MB):

| Step | Before forward  | After forward | Before backward | After backward  | Before step | After step | Max memory |
|------|-----------------|---------------|-----------------|-----------------|-------------|------------|------------|
| 0    | 328             | 328           | 328             | 1726 (max 1735) | 1726        | 1402       | 1735       |
| 1    | 1726 (max 2700) | 1726          | 1726            | 2051 (max 2384) | 2051        | 2051       | 2700       |
| 2    | 2051            | 2051          | 2051            | 2051 (max 2384) | 2051        | 1726       | 2384       |
| 3    | 1726            | 1726          | 1726            | 2051 (max 2384) | 2051        | 1726       | 2384       |

  3. Run minGPT with DeepSpeed ZeRO1
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory

The memory consumption at the different training phases and the maximum memory consumption will look like the following (all values in MB):

| Step | Before forward | After forward | Before backward | After backward | Before step | After step    | Max memory |
|------|----------------|---------------|-----------------|----------------|-------------|---------------|------------|
| 0    | 166            | 166           | 166             | 830 (max 1056) | 830         | 835           | 1056       |
| 1    | 672            | 672           | 672             | 695 (max 997)  | 695         | 672 (max 857) | 997        |
| 2    | 672            | 672           | 672             | 695 (max 997)  | 695         | 672 (max 857) | 997        |
| 3    | 672            | 672           | 672             | 695 (max 997)  | 695         | 672 (max 857) | 997        |

  4. Run minGPT with DeepSpeed ZeRO1 and Activation Checkpointing
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1_ac.json --use_hpu --steps 4 --dump-memory --activation-checkpoint
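
Activation checkpointing trades compute for memory: only the inputs of each transformer block are kept, and the block's activations are recomputed during the backward pass. A minimal sketch of how the forward pass could be wrapped (assuming DeepSpeed's checkpointing API; the --activation-checkpoint handling in demo_ds.py may differ):

import deepspeed

# Sketch: recompute each block's activations during backward instead of
# caching them for the whole forward pass.
def forward_with_checkpointing(blocks, x):
    for block in blocks:
        x = deepspeed.checkpointing.checkpoint(block, x)
    return x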

The memory consumption at the different training phases and the maximum memory consumption will look like the following (all values in MB):

| Step | Before forward | After forward | Before backward | After backward | Before step | After step    | Max memory |
|------|----------------|---------------|-----------------|----------------|-------------|---------------|------------|
| 0    | 166            | 166           | 166             | 581 (max 758)  | 581         | 423 (max 586) | 758        |
| 1    | 423            | 423           | 423             | 446 (max 755)  | 446         | 423 (max 608) | 755        |
| 2    | 423            | 423           | 423             | 446 (max 758)  | 446         | 423 (max 608) | 758        |
| 3    | 423            | 423           | 423             | 446 (max 758)  | 446         | 423 (max 608) | 758        |

  5. Run minGPT with DeepSpeed ZeRO2
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero2.json --use_hpu --steps 4 --dump-memory
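
ZeRO2 partitions the gradients across the workers in addition to the optimizer states. A minimal sketch of what ds_config_zero2.json might contain (illustrative values; the repository's file may differ):

{
  "train_batch_size": 16,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true
  }
}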

The memory consumption at the different training phases and the maximum memory consumption will look like the following (all values in MB):

| Step | Before forward | After forward | Before backward | After backward | Before step | After step    | Max memory |
|------|----------------|---------------|-----------------|----------------|-------------|---------------|------------|
| 0    | 166            | 166           | 166             | 660 (max 993)  | 660         | 682           | 993        |
| 1    | 520            | 520           | 520             | 663 (max 935)  | 663         | 523 (max 708) | 935        |
| 2    | 523            | 523           | 523             | 568 (max 935)  | 568         | 523 (max 708) | 935        |
| 3    | 523            | 523           | 523             | 568 (max 935)  | 568         | 523 (max 708) | 935        |

Conclusions:

  1. ZeRO0 (essentially the default DDP) consumes the most memory.
  2. ZeRO1 and ZeRO2 consume less memory than ZeRO0.
  3. With Activation Checkpointing, memory consumption decreases even further.

Use ZeRO to solve the Out-Of-Memory issue

Due to the limited memory on the HPU device, running a big model with the default configuration (e.g. ZeRO0) may fail.
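
A back-of-envelope estimate shows why (a sketch with assumed, approximate figures): gpt2-xl has roughly 1.5 billion parameters, and training it in FP32 with Adam needs the weights, the gradients, and two optimizer states, about 16 bytes per parameter before any activations or framework workspace. ZeRO1 reduces this by partitioning the optimizer states across the 8 workers:

# Approximate, illustrative figures only.
params = 1.5e9                       # ~gpt2-xl parameter count
fp32 = 4                             # bytes per FP32 value

weights = params * fp32              # model weights
grads = params * fp32                # gradients
adam = 2 * params * fp32             # Adam exp_avg + exp_avg_sq states

gb = 1024 ** 3
print(f"ZeRO0 per device: {(weights + grads + adam) / gb:.1f} GB")      # ~22.4
print(f"ZeRO1 per device: {(weights + grads + adam / 8) / gb:.1f} GB")  # ~12.6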

  1. Create a very big model with minGPT

Change the model type from gpt-nano to gpt2-xl

--- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
+++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
@@ -146,7 +146,7 @@ for a, b in zip(x,y):
 from mingpt.model import GPT

 model_config = GPT.get_default_config()
-model_config.model_type = 'gpt-nano'
+model_config.model_type = 'gpt2-xl'
 model_config.vocab_size = train_dataset.get_vocab_size()
 model_config.block_size = train_dataset.get_block_size()

  2. Run minGPT with DeepSpeed ZeRO0
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory

An OOM error from the HPU software stack will appear, similar to the following:

...
RuntimeError: FATAL ERROR :: MODULE:BRIDGE Exception in Launch thread...
FATAL ERROR :: MODULE:DEVMEM Allocation failed for size::40960000 (39.0625)MB

  3. Run minGPT with DeepSpeed ZeRO1
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory

By applying a ZeRO technology such as ZeRO1, the model can run successfully on the HPU.
