Gaudi: Intro to TPC Kernels

A how-to tutorial to write and build Habana custom kernels.

Video Transcript

Hello, my name is Zhongkai Zhang, application engineer at Habana, working on TPC kernels.

This video is entitled "Gaudi: Intro to TPC Kernels." It is aimed at developers with experience in AI models who want to create their own custom kernels for better performance.

Today, I’m going to talk about three topics. First, what are the TPC and the TPC kernel? Then we will introduce the Habana tools, and show you how to install and use them. Last but not least, how to write TPC kernels, using the Habana template project we provide on GitHub. So, let’s get started.

The TPC core was designed to support deep learning training and inference workloads.

We have eight TPC cores integrated into every Habana processor. They are designed for customization: each is a fully programmable VLIW4 SIMD core, for workloads that do not map to the MME. Habana provides an extensive library of 1400+ kernels so that workloads and operators can be implemented.

On Gaudi, the TPC core natively supports data types including FP32, BF16, and signed and unsigned 8-, 16-, and 32-bit integers. The kernel library Habana provides is good enough for excellent performance.

But in some cases, depending on the application, users may want to develop their own kernels, or even their own custom ops. We created this Habana custom kernel template to make kernel development much easier.

So, in the rest of this video, I’m going to show you how to do that. First things first: how to install the Habana tools. We have a package called Habana tools, which is the kernel creation tool. It compiles the TPC kernel code, builds a custom kernel library, and runs it on the simulator to verify the correctness of the kernel.

What I will show you in this video is how to download the latest package by visiting its GitHub page, and how to install the tools on your system. There are instructions on the GitHub page. After you click the Habana vault link, it will bring you to this webpage, which shows you the latest package.

You can either click the download button in the top right corner or use the URL to get the latest package. We release the package regularly, so make sure to check and always choose the latest one.

Here, I just use the Ubuntu package as an example. The tools also work on other supported operating systems. Since I already have the package downloaded, what I will do next is run the sudo dpkg -i command. After that, you can check that the files are now on your system.

Let me run this command. Here, you can see that there’s a TPC compiler, the simulation library, and the header files, including the TPC intrinsics. Now we have the Habana tools installed.

Last but not least, I’m going to show what the TPC kernel and the glue code look like, and how to write a custom TPC kernel using the template project we provide on GitHub. Please visit this GitHub page to clone the repo. Make sure you create an SSH public key and add it to your GitHub profile, and then run git clone.

So, let’s follow the steps to write your own custom kernels. First, write the TPC kernel, which runs on the Habana device. Second, write the glue code, which sets up the kernel and loads the kernel library. I will talk about the glue code in a second. Then write a unit test, which is optional, but is needed to verify the correctness of the kernel. After that, run CMake and make to build the kernel library. Last, add the kernel library path to the GC_KERNEL_PATH environment variable. Let’s go back to the demo and go through the code.

So, you can see here, we have three folders: kernels, src, and tests. Inside the kernels folder, we have the gaudi folder, which contains all the kernels for the Gaudi device. Let’s use the filter kernel as an example here. The TPC kernel is written in TPC-C, which is a C-like language with TPC intrinsics. Every kernel has a main function. It takes the input and output tensors, followed by the scalar parameters. In every kernel, we always start by getting the start and end coordinates of the index space using these two built-in functions.

Then we use the load intrinsic to load the input tensor, and the store intrinsic to store the output tensor. Different data types have different load and store intrinsics. To implement the kernel function, we use the TPC intrinsics to maximize TPC performance.
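To make the kernel structure described above more concrete, here is a minimal host-side sketch in plain Python. It is not TPC-C and it uses none of the real intrinsics. It only imitates the pattern of taking a start and end index, then doing load, compute, and store for each index. The 64-element vector width, the elementwise ReLU operation (used here instead of the filter kernel from the demo), the even split across eight workers, and all function names are assumptions made for illustration.

```python
VECTOR_LANES = 64  # assumption: elements handled by one vector load/store


def relu_kernel_slice(inp, out, index_space_start, index_space_end):
    """Imitate one kernel invocation over its slice of a 1-D index space."""
    for idx in range(index_space_start, index_space_end):
        lo = idx * VECTOR_LANES
        hi = lo + VECTOR_LANES
        vec = inp[lo:hi]                      # stands in for the load intrinsic
        result = [max(x, 0.0) for x in vec]   # stands in for compute intrinsics
        out[lo:hi] = result                   # stands in for the store intrinsic


def split_index_space(index_space_size, num_workers):
    """Return contiguous (start, end) slices that together cover the index space."""
    base, extra = divmod(index_space_size, num_workers)
    slices, start = [], 0
    for worker in range(num_workers):
        end = start + base + (1 if worker < extra else 0)
        slices.append((start, end))
        start = end
    return slices


def run_relu(inp, num_workers=8):
    """Run the imitation kernel once per slice (hypothetical even split)."""
    if len(inp) % VECTOR_LANES != 0:
        raise ValueError("this sketch only handles whole vectors")
    out = [0.0] * len(inp)
    for start, end in split_index_space(len(inp) // VECTOR_LANES, num_workers):
        relu_kernel_slice(inp, out, start, end)
    return out
```

The point of the sketch is that each invocation only touches the part of the tensor that its index-space slice maps to, so the same kernel body can be run on several slices independently.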

For the definition of each TPC intrinsic, you can check the developer website for more detail.

The second step is the glue code. It lives inside this src folder, which has all the glue code. The glue code is a piece of code that sets up the kernel parameters and loads the kernel binary into the graph compiler. The glue code runs on the host, not on the Gaudi device. The two major functions are the one that gets the kernel name and the one that gets the GC definitions. For a Habana kernel, the name doesn’t really matter. The function that gets the kernel name returns the kernel GUID, which is used to identify each kernel uniquely. The function GetGcDefinitions is the main part of the glue code. It checks the number of input and output tensors, verifies the dimensions of the tensors, sets up the index space, and loads the kernel binary into memory. There is also a global entry point for all the glue code.

The last component is the unit test. In each unit test, there are two parts.

The first part is the reference implementation, whose output is used to verify the kernel output data. The other, which is the main part, is the runTest function. It contains the main test body, which initializes the input and output tensor data, calls the kernel entry point, runs the kernel on the simulator, and compares the output with the reference.
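The two parts of a unit test can be sketched as follows. This is a plain Python illustration of the idea, not the test code from the template project. The reference operation (ReLU), the tolerance values, and the names relu_reference, outputs_match, and run_test are all assumptions, and kernel_fn is a hypothetical stand-in for "run the kernel on the simulator."

```python
import math


def relu_reference(values):
    """Part one: a simple reference implementation to compare against."""
    return [v if v > 0.0 else 0.0 for v in values]


def outputs_match(kernel_output, reference_output, rel_tol=1e-5, abs_tol=1e-6):
    """Compare element by element within a tolerance (values are assumptions)."""
    if len(kernel_output) != len(reference_output):
        return False
    return all(
        math.isclose(k, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for k, r in zip(kernel_output, reference_output)
    )


def run_test(kernel_fn, input_data):
    """Part two: initialize data, run the kernel, compare with the reference."""
    expected = relu_reference(input_data)
    actual = kernel_fn(list(input_data))  # hypothetical stand-in for the simulator run
    return outputs_match(actual, expected)
```

For example, run_test(lambda xs: [max(x, 0.0) for x in xs], [-1.0, 0.0, 2.5]) passes, while a kernel that returns its input unchanged fails on any negative element. Comparing with a tolerance rather than exact equality is a common choice when the kernel uses lower-precision types such as BF16.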

Now, let me show you how to build the kernel library, along with the unit tests. We have created the CMakeLists file to simplify the build process. First, let’s go back to this folder and create the build folder. Then cd into the build folder. Run CMake. Then run make. What make does is call the TPC compiler and generate the TPC library. It also creates the unit test executable. We don’t have a lot of kernels in this template project, so the build process will be quick. After it has finished, you can check the src folder. It has the custom TPC library generated.
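The build sequence just described (create a build folder, enter it, run CMake, run make) could be scripted roughly like this. It is a sketch under assumptions: the video does not show the exact CMake arguments, so the ".." argument is a guess based on the usual out-of-source layout, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_steps(repo_dir):
    """Return the build folder and the commands to run inside it."""
    build_dir = Path(repo_dir) / "build"
    # Assumption: CMake is pointed at the parent folder that holds the CMakeLists file.
    commands = [["cmake", ".."], ["make"]]
    return build_dir, commands


def build_kernel_library(repo_dir):
    """Create the build folder and run each command in it, stopping on failure."""
    build_dir, commands = build_steps(repo_dir)
    build_dir.mkdir(exist_ok=True)
    for command in commands:
        subprocess.run(command, cwd=build_dir, check=True)
```

Calling build_kernel_library with the path of your cloned template project would run the two commands in order. It needs the Habana tools installed first, because make invokes the TPC compiler.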

We can also run the TPC test executable. You can see it runs through all the unit tests, and everything passed. For multiple kernel libraries, we need to set up an environment variable, GC_KERNEL_PATH, with each kernel library separated by a colon. The custom kernel library normally goes first. For more detail, you can always visit the Habana developer website, or read the documentation.
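As a small illustration of the colon-separated format, the following Python sketch builds a GC_KERNEL_PATH value with the custom library first. The function names are hypothetical, and any library paths you pass in are your own. The video does not give concrete paths, so none are assumed here.

```python
import os


def kernel_path_value(custom_library, other_libraries):
    """Join library paths with colons, custom library first."""
    return ":".join([custom_library, *other_libraries])


def set_kernel_path(custom_library, other_libraries, env=None):
    """Write GC_KERNEL_PATH into env (defaults to this process's environment)."""
    env = os.environ if env is None else env
    env["GC_KERNEL_PATH"] = kernel_path_value(custom_library, other_libraries)
    return env["GC_KERNEL_PATH"]
```

For example, kernel_path_value("/path/to/custom_kernels.so", ["/path/to/other_kernels.so"]) returns "/path/to/custom_kernels.so:/path/to/other_kernels.so", where both paths are placeholders. Setting the variable from Python only affects the current process and its children; in a shell session you would export it instead.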

So, finally, let’s summarize what we learned today about creating a custom kernel. First, install the Habana tools, using the package you downloaded. Second, develop the kernel using the template project. You will have a shared library created. Third, export the library you just created to GC_KERNEL_PATH, which will be used by the graph compiler. You can also create a TensorFlow custom op using the custom kernel you just created. Please visit the following website for details. Thank you for watching. For more detail, please visit developer.habana.ai. Thanks, bye.
