
Pre-Training the BERT 1.5B model with DeepSpeed

In this post, we show you how to run Habana’s DeepSpeed enabled BERT1.5B model from our Model-References repository.

This BERT model is based on Habana’s existing BERT model, scaled up to a larger size. The focus of this example is pre-training on the Wikipedia dataset. Once trained, the model can be fine-tuned on different datasets for tasks such as question answering, translation, or text generation.

This specific model is based on the standard BERT architecture and contains 48 encoder layers, a hidden size of 1,600, and 25 attention heads. The combination of these parameters results in a roughly 1.5-billion-parameter model.
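
To see where the 1.5 billion figure comes from, the back-of-the-envelope estimate below counts the parameters of a standard BERT encoder with these dimensions. It is only a sketch: it assumes the usual BERT conventions (a 30,522-token WordPiece vocabulary, 512 positions, and a feed-forward size of 4x the hidden size); the authoritative values are in the model’s bert_1.5b_config.json.

# Rough parameter-count estimate for a BERT-style encoder.
# Assumes the standard BERT layout: 30,522-token vocabulary, 512 positions,
# and an intermediate (feed-forward) size of 4x the hidden size.
hidden, layers, vocab, positions = 1600, 48, 30522, 512

embeddings  = (vocab + positions + 2) * hidden                 # token + position + segment embeddings
attention   = 4 * (hidden * hidden + hidden)                   # Q, K, V and output projections (+ biases)
ffn         = 2 * (hidden * 4 * hidden) + 4 * hidden + hidden  # two feed-forward projections (+ biases)
layer_norms = 4 * hidden                                       # two LayerNorms per layer

per_layer = attention + ffn + layer_norms
total = embeddings + layers * per_layer
print(f"~{total / 1e9:.2f}B parameters")                       # prints ~1.53B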

Models of this size may no longer fit in a single device’s memory, causing out-of-memory errors. In this example, we use Habana’s fork of the DeepSpeed library to manage memory use, including optimizer states and gradients, to ensure that the model can execute on Gaudi with the best performance.

DeepSpeed is an open-source deep learning optimization library for PyTorch that is designed to reduce computing power and memory use and to train large, distributed models with better parallelism. DeepSpeed includes the Zero Redundancy Optimizer (ZeRO) for training models.  The details of the ZeRO optimizer will be discussed later in this document.  The DeepSpeed library is implemented between the user’s model and the PyTorch framework, so minimal changes are needed to the existing PyTorch model.  For more information on how Habana uses the ZeRO optimizer, you can refer to our previous blog on Memory-Efficient training here.

Figure 1: Platform Software stack with DeepSpeed

You can follow these simple steps to get up and running on first-gen Gaudi® or Gaudi®2. The first step is to set up the environment: provision an instance with Gaudi devices, install the SynapseAI software stack and the additional software requirements (including Habana’s fork of the DeepSpeed library), and download the appropriate dataset for training. The second step is to run the model itself.

Set Up the Environment

Set up your cloud computing environment to get access to the Gaudi accelerator. There are two options available in the cloud today:

  • Amazon EC2 DL1 instances, built on first-gen Gaudi
  • The Intel Developer Cloud, built on Gaudi®2

Both instances have eight Gaudi devices for use. Ensure that you have ~1TB of storage in the instance to be able to download the dataset and have space for execution; the dataset download may take several hours. It is recommended to select Ubuntu 20.04 as the base OS and include the full Habana SynapseAI software stack and drivers in the image. Once you have set up the environment, pull the PyTorch Docker image from Habana’s Vault by following the instructions below. See the Habana Installation Guide here for more details.

docker pull vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest
~$ docker pull vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest
latest: Pulling from gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0
846c0b181fff: Pull complete
6ae2e14c2539: Downloading [=====>                                             ]  24.85MB/234.5MB
3fe89580045c: Download complete
1ac28d0b180f: Download complete
c993c6ef6fe8: Downloading [==============================================>    ]  27.71MB/30.02MB
6d0e3696a459: Downloading [==============================>                    ]  2.105MB/3.479MB
05d721556cd2: Waiting
a39276db8326: Pulling fs layer
930126633dfb: Waiting
---
ba5ed4a169e5: Download complete
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host \
vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest
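
Once the container is running, you can optionally confirm that the Gaudi devices are visible to PyTorch. This is a minimal sanity check, assuming the habana_frameworks packages bundled in the pytorch-installer image:

# Optional sanity check inside the container: confirm the Gaudi (HPU) devices are visible.
import habana_frameworks.torch.hpu as hthpu

print("HPU available:", hthpu.is_available())   # expected: True on a Gaudi instance
print("HPU count    :", hthpu.device_count())   # expected: 8 on the instances described above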

Next,  clone Habana’s Model-References repository to get access to the BERT 1.5B model and install the associated software to run the model.

git clone -b 1.7.1 https://github.com/HabanaAI/Model-References
export PYTHONPATH=/root/Model-References:$PYTHONPATH
export PYTHON=/usr/bin/python3.8
cd /root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/
pip install -r ./requirements.txt
root@ubuntu2004:~# cd ~
root@ubuntu2004:~# git clone -b 1.7.1 https://github.com/HabanaAI/Model-References
Cloning into 'Model-References'...
remote: Enumerating objects: 15256, done.
remote: Counting objects: 100% (15255/15255), done.
remote: Compressing objects: 100% (6660/6660), done.
remote: Total 15256 (delta 8238), reused 15132 (delta 8146), pack-reused 1
Receiving objects: 100% (15256/15256), 101.59 MiB | 8.08 MiB/s, done.
Resolving deltas: 100% (8238/8238), done.

Then install Habana’s DeepSpeed fork that includes additional optimizations for performance and functionality on the Gaudi HPU.

pip install git+https://github.com/HabanaAI/DeepSpeed@1.7.1
root@ubuntu2004:~/Model-References/PyTorch/nlp/pretraining/deepspeed-bert# pip install git+https://github.com/HabanaAI/DeepSpeed@1.7.1
Collecting git+https://github.com/HabanaAI/DeepSpeed@1.7.1
  Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed:
Successfully installed deepspeed-0.7.0+309ca18 hjson-3.1.0 psutil-5.9.4 py-cpuinfo-9.0.0 pydantic-1.10.4

Finally, download and prepare the dataset for training. The script below downloads the Wikipedia dataset as a starting point for running the model. The download and processing require over 600GB of storage, so ensure that the cloud computing instance you select is created with ~1TB of storage to cover both the dataset and runtime execution. The model provides a simple script that downloads the dataset, establishes baseline weights for the model (instead of starting from a set of random values), and formats and shards the text files needed for training.

cd ./data
bash create_datasets_from_start.sh
root@ubuntu2004:~/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/data# ./create_datasets_from_start.sh
Checkout WikiExtractor repository
Cloning into 'wikiextractor'...
remote: Enumerating objects: 771, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 771 (delta 17), reused 24 (delta 14), pack-reused 741
Receiving objects: 100% (771/771), 1.31 MiB | 3.11 MiB/s, done.
Resolving deltas: 100% (450/450), done.
---
Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Running: ['wget', 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2', '--output-document=/root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2']
  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Resolving proxy-us.intel.com (proxy-us.intel.com)... 10.1.192.48
Connecting to proxy-us.intel.com (proxy-us.intel.com)|10.1.192.48|:912... connected.
Proxy request sent, awaiting response... 200 OK
Length: 20580559456 (19G) [application/octet-stream]
Saving to: ‘/root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2’
 
pretraining/deepspeed-bert/data/download/   17%[---------------->                                              ] 3.35G   4.23MB/s    eta 55m 10s

Running the Model

Now that the environment has been set up, we can run the full pre-training. For simplicity, Habana has provided a full run script that executes an 8-card example with the BERT 1.5B model. This script contains the --deepspeed run command argument and a pointer to the DeepSpeed configuration .json file. Before running this script, let’s look at some of the key changes in the model and review what the DeepSpeed engine is doing with it.

What are DeepSpeed and ZeRO?

DeepSpeed is an optimization library that manages model gradients, optimizer states, system memory, and the distribution of the workload across the Gaudi accelerators. It abstracts away the difficult aspects of large-scale training, such as parallelization, mixed precision, and gradient accumulation, by taking advantage of the ZeRO optimizer.

Using ZeRO in a DeepSpeed model is quick and easy; all that is needed is to change a few settings in the DeepSpeed configuration JSON. No model code changes are needed to enable the benefits of ZeRO.
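
As a rough illustration of what such a configuration looks like, the sketch below builds a minimal ZeRO Stage 1 / BF16 config as a Python dictionary and writes it out as JSON. The batch-size values are placeholders chosen for the example; the authoritative settings for this model are in scripts/deepspeed_config_bert_1.5b.json in Model-References.

import json

# Illustrative sketch only; the real settings ship in
# scripts/deepspeed_config_bert_1.5b.json in Model-References.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder value
    "gradient_accumulation_steps": 4,      # placeholder value
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 1                         # ZeRO Stage 1: partition optimizer states
    },
    "bf16": {
        "enabled": True                    # BF16 mixed-precision training
    },
}

with open("my_ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)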

Setting up a model for DeepSpeed

There are several steps for enabling DeepSpeed in Habana’s BERT 1.5B model. For more details on how to convert a model to use the DeepSpeed library, you can refer to the DeepSpeed Getting Started guide here. First, initialize the DeepSpeed model execution and distribution by calling the deepspeed.initialize() and deepspeed.init_distributed() functions. Examples from the BERT 1.5B model are shown below:

model, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=args,
    model_parameters=None if optimizer else optimizer_grouped_parameters,
    model=model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
)

An important part of the init_distributed() call is that the dist_backend variable is set to “hccl”, which selects the Habana Collective Communications Library (HCCL). This call replaces the PyTorch DistributedDataParallel() function used in a non-DeepSpeed model.

if args.use_hpu:
    import habana_frameworks.torch.hpu
    import habana_frameworks.torch.distributed.hccl
    device = torch.device("hpu")
    dist_backend = "hccl"

    deepspeed.init_distributed(dist_backend=dist_backend, init_method=init_method)

Additionally, the backward training pass is changed from loss.backward() to model.backward(loss), and the optimizer step is changed from optimizer.step() to model.step(), taking advantage of the optimizer built into the DeepSpeed engine. For BERT 1.5B, the LANS optimizer is used.
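
To make the change concrete, here is a minimal sketch contrasting the two patterns. The helper arguments are illustrative placeholders rather than code from run_pretraining.py, and engine stands for the object returned by deepspeed.initialize():

def train_step_plain(model, optimizer, loss_fn, inputs, labels):
    # Original PyTorch pattern
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def train_step_deepspeed(engine, loss_fn, inputs, labels):
    # DeepSpeed pattern: the engine owns the optimizer (LANS for BERT 1.5B),
    # gradient accumulation, and any loss scaling.
    loss = loss_fn(engine(inputs), labels)
    engine.backward(loss)   # replaces loss.backward()
    engine.step()           # replaces optimizer.step() and optimizer.zero_grad()
    return loss.item()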

To manage the runtime of the DeepSpeed model, there are two new files that are also used to set the parameters of the execution:

ds_config.json – the file that sets the configuration parameters for the DeepSpeed execution: hyperparameters, the ZeRO stage, and other DeepSpeed-specific runtime variables. For Habana’s BERT 1.5B model, this file is deepspeed_config_bert_1.5b.json; it sets ZeRO Stage 1, enables BF16 mixed precision, and configures other variables.

run_deepspeed.sh – the run script that calls the model, points to deepspeed_config_bert_1.5b.json, and initiates the run on Habana by setting the --use_hpu option in the run commands. For Habana’s BERT 1.5B model, this file is called run_bert_1.5b_8x.sh, and it calls the run_pretraining.py script.

Running Habana’s BERT 1.5B Model

Once the initial steps have been completed, you can call the run script:

bash ./scripts/run_bert_1.5b_8x.sh

This will execute the model on eight Gaudi HPUs and create a set of checkpoints for use in later fine-tuning steps.
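
The run_pretraining.py script already handles resumption through its --resume_from_checkpoint flag, but the checkpoints can also be reloaded directly through the DeepSpeed engine’s load_checkpoint() API. Below is a minimal sketch, assuming model_engine is the object returned by deepspeed.initialize() and using the --output_dir value from the run command as the checkpoint directory:

# Minimal sketch: reloading a DeepSpeed checkpoint for resumption or fine-tuning.
# model_engine is assumed to be the engine returned by deepspeed.initialize();
# the directory below is the --output_dir used by the run script.
checkpoint_dir = "./results/bert_1.5b/checkpoints"
load_path, client_state = model_engine.load_checkpoint(
    checkpoint_dir,
    tag=None,                        # None loads the tag recorded in the 'latest' file
    load_optimizer_states=True,      # restore the partitioned (ZeRO) optimizer state
    load_lr_scheduler_states=True,   # restore the LR scheduler position
)
print(f"Resumed from {load_path}")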

Beginning of run script:

root@ubuntu2004:~/Model-References/PyTorch/nlp/pretraining/deepspeed-bert# bash ./scripts/run_bert_1.5b_8x.sh
[runner.py:517:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_python --no_local_rank python -u ./run_pretraining.py --use_hpu --disable_progress_bar --optimizer=lans --use_lr_scheduler --resume_from_checkpoint --do_train --bert_model=bert-base-uncased --config_file=./scripts/bert_1.5b_config.json --json-summary=./results/bert_1.5b/dllogger.json --output_dir=./results/bert_1.5b/checkpoints --seed=12439 --input_dir=/data/pytorch/bert_pretraining/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/books_wiki_en_corpus --max_seq_length 128 --max_predictions_per_seq=20 --max_steps=155000 --steps_this_run=-1 --num_steps_per_checkpoint=200 --learning_rate=0.0015 --warmup_proportion=0.05 --constant_proportion=0.25 --scheduler_degree=1.0 --log_freq=10 --deepspeed --deepspeed_config=./scripts/deepspeed_config_bert_1.5b.json
[INFO] [launch.py:139:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[INFO] [launch.py:145:main] nnodes=1, num_local_procs=8, node_rank=0
[INFO] [launch.py:158:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[INFO] [launch.py:159:main] dist_world_size=8
Distributed training with backend=hccl, device=hpu, local_rank=3
Distributed training with backend=hccl, device=hpu, local_rank=5
Distributed training with backend=hccl, device=hpu, local_rank=7
Distributed training with backend=hccl, device=hpu, local_rank=1
Distributed training with backend=hccl, device=hpu, local_rank=0
Distributed training with backend=hccl, device=hpu, local_rank=2
Distributed training with backend=hccl, device=hpu, local_rank=4
Distributed training with backend=hccl, device=hpu, local_rank=6
[INFO] [comm.py:628:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
Using LANS
Using PolyWarmUpScheduler with args={'warmup': 0.05, 'total_steps': 155000.0, 'degree': 1.0, 'constant': 0.25}

End of run script:

[INFO] [engine.py:3268:_save_zero_checkpoint] zero checkpoint saved deepspeed-run/1.7.1-85/bert_1.5b/8/2023-01-04_17-52/checkpoints/20/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint 20 is ready now!
DLL 2023-01-04 18:43:28.908788 -  e2e_train_time : 2313.8388447761536  training_sequences_per_second : 63.06757577335175  final_loss : 11.55532455444336  raw_train_time : 1948.3862903118134
[INFO] [launch.py:322:main] Process 12605 exits successfully.
[INFO] [launch.py:322:main] Process 12609 exits successfully.
[INFO] [launch.py:322:main] Process 12608 exits successfully.
[INFO] [launch.py:322:main] Process 12610 exits successfully.
[INFO] [launch.py:322:main] Process 12607 exits successfully.
[INFO] [launch.py:322:main] Process 12606 exits successfully.
[INFO] [launch.py:322:main] Process 12604 exits successfully.
[INFO] [launch.py:322:main] Process 12603 exits successfully.
/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/__init__.py:94: UserWarning: habana_frameworks.torch.core.get_device_count is deprecated. Please use habana_frameworks.torch.hpu.device_count
  warnings.warn("habana_frameworks.torch.core.get_device_count is deprecated. "
LOG_FILE   = /deepspeed-bert_1.5b-8-p1-2023-01-04_17-52.log
TOPOLOGY   = pretraining
BATCH_SIZE =
WORLD_SIZE = 8
Pretraining phase1 results
8 card(s) average sentences per second:302.5176
average dps: 37.8147

Results

Below are the throughput and overall memory consumption for the BERT 1.5B-parameter model running on 8, 16, 32, 64 and 128 devices. If the ZeRO optimizer were not used, the model would not fit in Gaudi’s memory and would cause an OOM (out-of-memory) error.

Figure 3: Gaudi Max Memory consumption for BERT1.5B model
Table 1: BERT 1.5B LANS Pre-Training Phase 1 Throughput on First-Gen Gaudi

Next Steps

You are invited to experiment with several options for running DeepSpeed-based models on Gaudi. You can use one of the following examples:

  • A simple CIFAR-based model using DeepSpeed that shows how to get started, available in the Habana Model-References GitHub here
  • A DeepSpeed Tutorial with a configurable model to change the size and experiment with the effect of ZeRO Stage 1 and Stage 2 and Activation Checkpointing here
  • The model from this paper: BERT 1.5B on 8 Gaudi and 5B on 128 Gaudi here
  • HuggingFace GPT2 example here