BLOOM 176B Inference on Habana Gaudi2

With the DeepSpeed Inference support added in Habana's SynapseAI 1.8.0 release, users can run inference on large language models, including BLOOM 176B.

Large Language Models (LLMs) have become popular in the field of AI, especially since the introduction of OpenAI's GPT-3 in 2020. LLMs can perform a range of natural language processing tasks by generating text that is often indistinguishable from human-written text. Users have been using ChatGPT, OpenAI's publicly available interface to its GPT models, to create AI-generated emails, stories, recipes, and even film scripts. The caveat is that the source code and the trained checkpoints are not publicly available; users can only interact with the models through OpenAI's interfaces.

BLOOM (BigScience Large Open-science Open-access Multilingual language model) is an open-source initiative to bring a large language model of similar scale to GPT-3 to the public. It uses a transformer architecture similar to GPT-3's and was trained on 384 GPUs on the Jean Zay supercomputer provided by the French government.

Habana has enabled DeepSpeed inference capabilities to run the 176B-parameter BLOOM model on eight Gaudi2 devices. You'll need Habana's fork of DeepSpeed to run this model. Additionally, we make use of HPU Graphs to reduce time spent on the host.

To get started you can clone the Habana Model-References repository.

git clone https://github.com/HabanaAI/Model-References 
cd Model-References/PyTorch/nlp/bloom

In this folder you will find modeling_bloom.py, which was adapted from the Hugging Face implementation. Additionally, you will find graph_utils.py, which uses the HPU Graph API to optimize the inference graph for HPU. Please check the documentation for information about using HPU Graphs and the API's current limitations.
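
As a concrete illustration, here is a minimal sketch of wrapping a module with HPU Graphs via the wrap_in_hpu_graph helper from the Habana PyTorch bridge; the linear layer is just a stand-in, and graph_utils.py in the repository may use the API differently.

import torch
import habana_frameworks.torch as ht  # registers the 'hpu' device

# Any torch.nn.Module moved to the HPU device can be wrapped; a small
# linear layer stands in for the real model here.
model = torch.nn.Linear(1024, 1024).to("hpu")

# wrap_in_hpu_graph records the module's forward pass the first time it
# runs and replays the cached graph on later calls with same-shaped
# inputs, reducing per-step host (CPU) launch overhead.
model = ht.hpu.wrap_in_hpu_graph(model)

x = torch.randn(8, 1024).to("hpu")
out = model(x)  # first call records the graph; later calls replay it
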
Next, install the requirements with pip and download the model checkpoints.

$PYTHON -m pip install -r requirements.txt
mkdir checkpoints
$PYTHON utils/fetch_weights.py --weights ./checkpoints --model bigscience/bloom

Note that the checkpoints for this model are about 330 GB, so ensure that the instance has enough storage. Alternatively, you can run a smaller version of the model (bloom-3b or bloom-7b1), but you should expect somewhat less coherent output in your sentence-completion queries than with the full 176B-parameter model.
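
Since the download is large, a quick pre-flight check along these lines (plain Python, not part of the repository scripts) can confirm the instance has room before fetching the weights:

import shutil

REQUIRED_GB = 330  # approximate size of the bigscience/bloom checkpoints

# Check free space on the filesystem that will hold the checkpoints.
free_gb = shutil.disk_usage(".").free / 1024**3
if free_gb < REQUIRED_GB:
    raise SystemExit(
        f"Only {free_gb:.0f} GB free; ~{REQUIRED_GB} GB needed for BLOOM 176B."
    )
print(f"{free_gb:.0f} GB free - enough room for the full checkpoints.")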

Once the checkpoints are downloaded, we can run sentence-completion tasks on different-sized versions of the model.

The size of BLOOM 176B requires the DeepSpeed library to ensure that the model fits and runs on a minimum of eight Gaudi2 accelerators. Smaller versions of the BLOOM model, such as BLOOM 7B1, can run on a single first-gen Gaudi or Gaudi2. Inference with DeepSpeed enables model parallelism for large transformer models. To initialize the model for inference with DeepSpeed, we first wrap the model definition in a deepspeed.OnDevice() call, which lets us declare the data type and the use of meta tensors (tensors that carry shape and dtype but no storage, so the full weights are never materialized on a single device).

# Construct the model on the meta device so no real weights are allocated yet
with deepspeed.OnDevice(dtype=dtype, device='meta'):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)

We then call deepspeed.init_inference(). The injection policy argument identifies the layers whose outputs need an all-reduce to accumulate results across the Gaudi accelerators. For more information about injection policies and DeepSpeed inference, you can consult the official DeepSpeed docs.

# Shard the model across all devices; the injection policy marks the
# BloomBlock sub-layers whose outputs must be all-reduced across cards
model = deepspeed.init_inference(model, mp_size=args.world_size, dtype=dtype,
                                 injection_policy={code.BloomBlock:
                                 ('mlp.dense_4h_to_h', 'self_attention.dense')},
                                 args=args, enable_cuda_graph=args.use_graphs,
                                 checkpoint=f.name)

Install the Habana DeepSpeed library, then use the command below to call the model with a text prompt. By default, the inference script applies a greedy search on HPU and ignores end-of-sentence tokens, selecting the most probable next token at each step until it reaches the user-specified --max_length. We have found that ignoring end-of-sentence tokens is more performant, as it allows the device to run continuously without needing to sync with the CPU after each token. Due to the size of the model, it will take a few minutes to load the weights into memory.

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.8.0
deepspeed --num_gpus 8 ./bloom.py --weights ./checkpoints --model bloom --max_length 128 --dtype bf16 "Does he know about phone hacking" 
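
To make the decoding strategy concrete, the sketch below shows a simplified greedy loop in the same spirit; it is illustrative only, not the actual bloom.py implementation, and it assumes a Hugging Face-style causal LM whose forward pass returns .logits. Because the loop never tests for the end-of-sentence token, it runs a fixed number of steps and avoids a host sync after each token.

import torch

def greedy_generate(model, input_ids, max_length):
    """Greedy search that ignores EOS: always decodes up to max_length tokens."""
    tokens = input_ids
    while tokens.shape[-1] < max_length:
        logits = model(tokens).logits             # [batch, seq, vocab]
        next_token = logits[:, -1, :].argmax(-1)  # most probable next token
        # No check against the EOS id here, so the loop never needs to
        # sync with the host to test for early stopping.
        tokens = torch.cat([tokens, next_token.unsqueeze(-1)], dim=-1)
    return tokens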

At the end of your console output, you should see output similar to the following (exact timings will vary).

Init took 141.525s
Starting inference...
------------------------------------------------------
step:0 time:8.433s tokens:122 tps:14.466 hpu_graphs:13
------------------------------------------------------
Q0.0: Does he know about phone hacking
A0.0: Does he know about phone hacking?
- No.
- Good.
- What about the rest of the team?
- No.
Good.
This is going to be a closed shop.
I want you to keep it that way.
Understood?
Tony, I think we should meet.
Yeah, of course.
- Tomorrow?
- Yeah, good.
The House of Commons is expected to vote on the new anti-terrorism bill today.
The legislation has been controversial and some have argued that it infringes on civil liberties.
The government claims the bill is necessary to keep the country safe.
The vote is expected to be close.
Back to

Next Steps

We welcome users to start running inference with the BLOOM model. To run the full BLOOM 176B model and other large language models using DeepSpeed, you can access Gaudi2 on the Intel Developer Cloud. To run the smaller BLOOM 7B1 or 3B models, you can access first-gen Gaudi on the Amazon EC2 DL1 instance.
