As demonstrated in the benchmark presented in this post, this will enable you to run inference faster than with any GPU currently available on the market. As models get bigger and bigger, deploying them into production to run inference has become increasingly challenging. Both hardware and software have seen a lot of innovations to address these challenges, so let’s dive in to see how to efficiently overcome them!
BLOOM is a 176-billion-parameter autoregressive model that was trained to complete sequences of text. It can handle 46 different languages and 13 programming languages. Designed and trained as part of the BigScience initiative, BLOOM is an open-science project that involved a large number of researchers and engineers all over the world. More recently, another model with the exact same architecture was released: BLOOMZ, which is a fine-tuned version of BLOOM on several tasks leading to better generalization and zero-shot capabilities.
Such large models raise new challenges in terms of memory and speed for both training and inference. Even in 16-bit precision, one instance requires 352 GB to fit! You will probably struggle to find any device with so much memory at the moment, but state-of-the-art hardware like Habana Gaudi2 does make it possible to perform inference on BLOOM and BLOOMZ models with low latencies.
Gaudi2 is the second-generation AI hardware accelerator designed by Habana Labs. A single server contains 8 accelerator devices (called Habana Processing Units, or HPUs) with 96GB of memory each, which provides room to make very large models fit in. However, hosting the model is not very interesting if the computation is slow. Fortunately, Gaudi2 shines on that aspect: it differs from GPUs in that its architecture enables the accelerator to perform General Matrix Multiplication (GeMM) and other operations in parallel, which speeds up deep learning workflows. These features make Gaudi2 a great candidate for LLM training and inference.
Habana’s SDK, SynapseAI™, supports PyTorch and DeepSpeed for accelerating LLM training and inference. The SynapseAI graph compiler will optimize the execution of the operations accumulated in the graph (e.g. operator fusion, data layout management, parallelization, pipelining and memory management, and graph-level optimizations).
Moreover, support for HPU graphs and DeepSpeed-inference have just recently been introduced in SynapseAI, and these are well-suited for latency-sensitive applications as shown in our benchmark below.
All these features are integrated into the 🤗Optimum Habana library so that deploying your model on Gaudi is very simple. Check out the quick-start page here.
If you would like to get access to Gaudi2, go to the Intel Developer Cloud and follow this guide.
In this section, we are going to provide an early benchmark of BLOOMZ on Gaudi2, first-generation Gaudi and Nvidia A100 80GB. Although these devices have quite a lot of memory, the model is so large that a single device is not enough to contain a single instance of BLOOMZ. To solve this issue, we are going to use DeepSpeed, which is a deep learning optimization library that enables many memory and speed improvements to accelerate the model and make it fit the device. In particular, we rely here on DeepSpeed-inference: it introduces several features such as model (or pipeline) parallelism to make the most of the available devices. For Gaudi2, we use Habana’s DeepSpeed fork that adds support for HPUs.
We measured latencies (batch of one sample) for two different sizes of BLOOMZ, both with multibillion parameters:
- 176 billion parameters
- 7 billion parameters
Runs were performed with DeepSpeed-inference in 16-bit precision with 8 devices and using a key-value cache. Note that while CUDA graphs are not currently compatible with model parallelism in DeepSpeed (DeepSpeed v0.8.2, see here), HPU graphs are supported in Habana’s DeepSpeed fork. All benchmarks are doing greedy generation of 100 token outputs. The input prompt is:
"DeepSpeed is a machine learning framework"Code language: JSON / JSON with Comments (json)
which consists in 7 tokens with BLOOM’s tokenizer.
The results for inference latency are displayed in the table below (the unit is seconds).
|Model||Number of devices||Gaudi2 latency (Seconds)||A100-80GB latency (seconds)||First-gen Gaudi latency (seconds)|
The Habana team recently introduced support for DeepSpeed-inference in SynapseAI 1.8, and thereby quickly enabled inference for 100+ billon parameter models. For the 176-billionparameter checkpoint, Gaudi2 is 1.2x faster than A100 80GB. Smaller checkpoints present interesting results too. Gaudi2 is 3x faster than A100 for BLOOMZ-7B! It is also interesting to note that it manages to benefit from model parallelism whereas A100 is faster on a single device.
We also ran these models on first-gen Gaudi. While it is slower than Gaudi2, it is interesting from a price perspective as a DL1 instance on AWS costs approximately 13$ per hour. Latency for BLOOMZ-7B on first-gen Gaudi is 2.865 seconds. Thus, first-gen Gaudi offers for the 7-billion checkpoint a better price-performance ratio than A100 which costs more than 30$ per hour!
We expect the Habana team will optimize the performance of these models in the upcoming SynapseAI releases. For example, in our last benchmark, we saw that Gaudi2 performs Stable Diffusion inference 2.2x faster than A100 and this has since been improved further to 2.37x with the latest optimizations provided by Habana. With the pre-release version of SynapseAI 1.9, we saw the inference latency for BLOOMZ-176B was improved to 3.5 seconds. We will update these numbers as new versions of SynapseAI are released and integrated within Optimum Habana.
Running inference on a complete dataset
The script we wrote enables using your model to complete sentences over a whole dataset. This is useful to try BLOOMZ inference on Gaudi2 on your own data.
Here is an example with the tldr_news dataset. It contains both the headline and content of several articles (you can visualize it on the Hugging Face Hub). The first few samples look like:
Input: 'Businesses Will Not Be Able to Hide': Spy Satellites May Give Edge From Above ; Output: 'Businesses Will Not Be Able to Hide': Spy Satellites to Track Illegal Logging in Indonesia The Indonesian government has announced that it will use spy satellites to track illegal logging in the country. Input: Alphabet and SoftBank’s solar-powered drone provides first LTE connection ; Output: Alphabet and SoftBank’s solar-powered drone provides first LTE connection ⯑⯑⯑ ⯑⯑⯑The drone, which is about the size of a small car, is equipped with a solar panel and can fly for up Input: SQLFlow (GitHub Repo); Output: SQLFlow (GitHub Repo): https://github.com/davidfowl/SQLFlow The code is written in Java and is available on GitHub. It is a simple, lightweight Input: Secret to keeping ice cream creamy (not crunchy); Output: Secret to keeping ice cream creamy (not crunchy) is to freeze it in a container that is airtight. Input: Tesla's giant battery saved $40 million during its first year, report says ; Output: Tesla's giant battery saved $40 million during its first year, report says Tesla's giant battery saved $40 million during its first year, report says Tesla's giant battery saved $40 million during its first year, Input: Python 3.9: Cool New Features for You to Try (28 minute read); Output: Python 3.9: Cool New Features for You to Try (28 minute read): This is a great article for those Input: A company aims to power the world for millions of years by digging the deepest holes ever ; Output: A company aims to power the world for millions of years by digging the deepest hole ever made in Input: In Nevada desert, a technology firm aims to be a government ; Output: In Nevada desert, a technology firm aims to be a government Introduction The use of the Internet has become a common practice in the daily life of people. The Internet has becomeCode language: PHP (php)
In the next section, we explain how to use the script we wrote to perform this benchmark or to apply it on any dataset you like from the Hugging Face Hub!
How to reproduce these results?
The script used for benchmarking BLOOMZ on Gaudi2 and first-gen Gaudi is available here.
Before running it, please make sure that the latest versions of SynapseAI and the Gaudi drivers are installed following the instructions given by Habana.
Then, run the following:
git clone https://github.com/huggingface/optimum-habana.git cd optimum-habana && pip install . && cd examples/text-generation pip install git+https://github.com/HabanaAI/[email protected]
Finally, you can launch the script as follows:
git clone https://github.com/huggingface/optimum-habana.git cd optimum-habana && pip install . && cd examples/text-generation pip install git+https://github.com/HabanaAI/[email protected]1.8.0
For multi-node inference, you can follow this guide from the documentation of Optimum Habana.
You can also load any dataset from the Hugging Face Hub to get prompts that will be used for generation using the argument
This benchmark was performed with Transformers v4.27.1, SynapseAI v1.8.0 and an install from source of Optimum Habana.
For GPUs, this repo contains the script that led to the results that were previously presented in this blog post. To use CUDA graphs, static shapes are necessary and this is not supported in 🤗 Transformers. You can use this repo written by the Habana team to enable them.
We see in this article that Habana Gaudi2 performs BLOOMZ inference faster than Nvidia A100 80GB. And there is no need to write a complicated script as 🤗Optimum Habana provides easy-to-use tools to run inference with multi-billion-parameter models on HPUs. Future releases of Habana’s SynapseAI SDK are expected to speed up performance, so we will update this benchmark regularly as LLM inference optimizations on SynapseAI continue to advance. We are also looking forward to the performance benefits that will come with FP8 inference on Gaudi2.
We also presented the results achieved with first-generation Gaudi. For smaller models, it can perform on par with or even better than A100 for almost a third of its price. It is a good alternative option to using GPUs for running inference with such a big model like BLOOMZ.
If you are interested in accelerating your Machine Learning training and inference workflows using the latest AI hardware accelerators and software libraries, check out our Expert Acceleration Program. To learn more about Habana solutions, read about our partnership here and contact them. To learn more about Hugging Face efforts to make AI hardware accelerators easy to use, check out our Hardware Partner Program.
1 “Zero-shot” refers to the ability of a model to complete a task on new or unseen input data, i.e. without having been provided any training examples of this kind of data. We provide the model with a prompt and a sequence of text that describes what we want our model to do, in natural language. Zero-shot classification excludes any examples of the desired task being completed. This differs from single or few-shot classification, as these tasks include a single or a few examples of the selected task.