
Optimizing Large Language Model Inference on Gaudi2 with Hugging Face Optimum-Habana

We have optimized additional Large Language Models on Hugging Face using the Optimum Habana library.

Hugging Face and Habana Labs recently published several LLM inference reference models for Gaudi and Gaudi2 in the Optimum Habana library, including GPT-2, GPT-NeoX, GPT-J, OPT, and LLaMA.

Developers can now easily run these models, as well as other LLMs, using the repository hosted by Hugging Face. To access Habana hardware, you can use a Gaudi2 instance on the Intel Developer Cloud or first-gen Gaudi on an Amazon EC2 DL1 instance. In this guide, we will briefly discuss how the Habana team enabled GPT-NeoX to run on Gaudi, then demonstrate how to run text generation with GPT-NeoX on a single Gaudi device as well as on 8 devices using the lightweight DeepSpeed framework.

To run any of these models, simply pass the model's name to the `--model_name_or_path` argument in your Python command; for example, to use GPT-J, you would set `--model_name_or_path EleutherAI/gpt-j-6b`.
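
For instance, a minimal single-device command of the same form as the GPT-NeoX example later in this post might look like the following (the prompt here is just a placeholder; the remaining arguments are explained in the sections below):

python run_generation.py \
--model_name_or_path EleutherAI/gpt-j-6b \
--max_new_tokens 100 \
--use_kv_cache \
--bf16 \
--prompt 'Once upon a time'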

Below, we show the two optimizations used to improve inference performance for these models. The models listed above already include these modifications and are ready to use as part of the Optimum Habana library. For other LLMs, we recommend applying the same changes to get better inference performance.

First, we fix the input shapes to be static during new-token generation, which prevents unnecessary graph recompilations. We accomplish this by padding the input tokens and the self-attention mask to the maximum token length on each call to the generate function, as shown in the code block below. We also introduce a new variable, token_idx, which tracks the index of the current token being generated in the output sequence; because the inputs are padded, this index is what locates the latest real token inside the padded buffers. See the list of current models optimized with static shapes.

if generation_config.static_shapes:
    # token_idx is the current index in the generation process; it is incremented
    # each time a new token is generated
    model_kwargs["token_idx"] = torch.tensor(inputs_tensor.shape[-1], device=inputs_tensor.device)
    # Pad inputs to have static shapes during generation; this gives better
    # performance than dynamic shapes on HPUs
    inputs_tensor = torch.nn.functional.pad(
        inputs_tensor, (0, generation_config.max_new_tokens), value=generation_config.pad_token_id
    )
    if model_kwargs["attention_mask"] is not None:
        model_kwargs["attention_mask"] = torch.nn.functional.pad(
            model_kwargs["attention_mask"], (0, generation_config.max_new_tokens), value=0
        )


Our second optimization is the use of a static key-value cache, which eliminates the recompilations of the self-attention forward pass that would otherwise occur as new tokens are generated. Our implementation is shown below.

if layer_past is not None:
    past_key, past_value = layer_past
    if token_idx is not None:
        # Static KV cache: write the new key/value into the preallocated cache at
        # position token_idx - 1 instead of concatenating, so tensor shapes stay fixed
        past_key.index_copy_(2, token_idx - 1, key)
        past_value.index_copy_(2, token_idx - 1, value)
        key = past_key
        value = past_value
    else:
        # Fallback: grow the cache dynamically by concatenation
        key = torch.cat((past_key, key), dim=-2)
        value = torch.cat((past_value, value), dim=-2)

The model examples with these optimizations include a dedicated Gaudi subclass that inherits from the original upstream model code. We went through the same process, creating causal language modeling subclasses for each of the model classes listed above. The only differences are the two changes related to the static shapes and key-value cache optimizations we have just discussed.
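
As a rough illustration of that pattern only (the class name, method override, and dictionary keys below are assumptions made for this sketch, not the actual optimum-habana source), a Gaudi subclass might look like this:

import torch
from transformers import GPTNeoXForCausalLM

# Hypothetical sketch: inherit the upstream model and override only what is needed
# to thread token_idx through generation for the static-shape / static-cache path.
class GaudiGPTNeoXForCausalLM(GPTNeoXForCausalLM):
    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, token_idx=None, **kwargs):
        if past_key_values is not None and token_idx is not None:
            # With a static cache, feed only the token at position token_idx - 1;
            # the attention layers write it into the preallocated cache at that index.
            input_ids = torch.index_select(input_ids, 1, (token_idx - 1).reshape(1))
        # Only the arguments relevant to this example are shown here.
        return {
            "input_ids": input_ids,
            "past_key_values": past_key_values,
            "attention_mask": attention_mask,
            "use_cache": kwargs.get("use_cache"),
            "token_idx": token_idx,
        }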

We have already enabled support for HPU graphs and DeepSpeed inference in optimum-habana; both are additional methods of optimizing model performance, and both can be enabled via the command-line options shown in the next sections.

Ensure that you set up HPU graphs appropriately for training or inference.

Developers interested in enabling inference on other LLMs on Gaudi platforms can implement these two additional optimization techniques to get better inference performance.

Set up and Run Inference on a Single Gaudi Device

You can run the following commands to clone optimum-habana and install the necessary dependencies. This example uses SynapseAI 1.10.0 with Optimum Habana 1.6.1:

git clone https://github.com/huggingface/optimum-habana.git

cd optimum-habana && pip install . && cd examples/text-generation

pip install -r requirements.txt

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0

Next, run the following command to generate text with the 20-billion-parameter version of GPT-NeoX. Feel free to modify the prompt. Note that you should include the `--use_kv_cache` argument, which enables the static key-value cache optimization discussed earlier.

python run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt 'A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt'

With the above prompt, you should observe output similar to the following on Gaudi2.

Stats:

———————————————————————-

Throughput (including tokenization) = 42.177969335170665 tokens/second
Memory allocated                    = 39.27 GB
Max memory allocated                = 39.68 GB
Total memory available              = 94.65 GB
Graph compilation duration          = 15.923236011061817 seconds

———————————————————————-

Input/outputs:

———————————————————————-

input 1: (‘A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt’,)

output 1: (‘A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the storage industry.\n\nThe company, called Storj, is a peer-to-peer cloud storage service. It is a peer-to-peer network that allows users to store their data on other users’ computers.\n\nThe company is based in San Francisco, and was founded by Shawn Wilkinson, a former Google employee.\n\nThe company is currently in the process of raising a $30 million round of funding.\n\nThe company is currently in the process of raising’,)

Run Inference on Multiple Gaudi Devices Using DeepSpeed

Now we will run on 8 Gaudi2 devices with DeepSpeed enabled. We use the same arguments as above, plus the gaudi_spawn.py script, which invokes mpirun to launch the multi-card run.

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt 'A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt'

Here is the output:

Stats:

——————————————————————-

Throughput (including tokenization) = 85.745799548246 tokens/second
Memory allocated                    = 6.5 GB
Max memory allocated                = 6.54 GB
Total memory available              = 94.65 GB
Graph compilation duration          = 5.841280916007236 seconds

——————————————————————-

Input/outputs:

——————————————————————-

input 1: (‘A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt’,)

output 1: (‘A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the cloud storage market.\n\nThe company, called Storj, is a peer-to-peer cloud storage service that is currently in beta. The company is currently in the process of raising a $1.2 million seed round.\n\nThe company is led by John Quinn, a former executive at Dropbox, and Shawn Wilkinson, a former executive at Box.\n\nThe company is currently in the process of raising a $1.2 million seed round.\n\nThe’,)

Next Steps

Hugging Face and Habana Labs continue to enable reference models and publish them in optimum-habana and Model-References, where anyone can freely access them. We encourage users to try these models, as well as their own LLMs, to enjoy the benefits of Gaudi performance. You can refer to our developer site for helpful articles and forum posts to get up and running.
