Welcome, everyone. My name is Omri Almog, and I’m a Deep Learning Software Engineer here at Habana. Today we’ll go through an introduction to getting started with Gaudi system setup, and demo some of the tools available when setting up a system.
This video is meant for anybody with a Gaudi product, to get a better understanding of what the available tools are, and how to use them when setting up a system. In this video, we will go over the material in the following order. We will start with an introduction of what Gaudi is, then go over how to check for the right software packages installed on your system, go through how to review kernel module information, load the kernel module and ensure it has been loaded correctly.
Lastly, we’ll go through some Habana tools that help you understand the system configuration and test the system setup. To start off, as many of you know, the Gaudi platform architecture has been designed from the ground up for deep learning training workloads in data centers.
It comprises a cluster of fully programmable Tensor Processing Cores, also known as TPCs, and a configurable matrix math engine (GEMM), along with its associated development tools, libraries, and compiler. Gaudi also includes four integrated HBMs in a single-chip mezzanine package, our HL-205. For high-efficiency scaling, Gaudi includes 20 pairs of 56 gigabits per second SerDes that can be configured as 10 x 100 gigabits per second standard Ethernet ports, as 20 x 50 gigabits per second or 25 gigabits per second Ethernet ports, or as any combination in between.
These ports are designed to scale out the inter-Gaudi communication by integrating a complete communication engine on-die, with integrated RDMA over Converged Ethernet (RoCE v2), so that Gaudi devices can communicate directly with each other. These technologies provide an easily scalable, efficient, and performant product for deep learning training. If you would like to know more details about what Gaudi has to offer, please visit our Habana developer website.
In the next section of the video, we will go over a technical guide on how to set up and check your system with Gaudi. We will use Ubuntu as our example operating system. Before following the next steps, please follow the Setup and Install GitHub repository or the installation guide on Read the Docs for instructions on proper package installation.
Both are linked on the Habana developer website. Once those steps are completed, you can follow the steps in this section to verify that the package versions currently installed on the system are correct. To do this, you can run the following command: dpkg -l | grep habanalabs. This will show the Habana Labs packages and the revisions installed on your system.
The key point we are looking for here is that we need to make sure that the versions installed are the expected versions that we are looking for. For more information on each one of these packages, please refer to the links on the Habana developer website.
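As a rough sketch of that version check, the helper below reads `dpkg -l` output and reports any habanalabs package whose version does not start with the version you expect. The function name `check_habana_versions` and its exit codes are made up for illustration, and the expected version is something you supply yourself from the release notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any Habana tooling.
# Reads `dpkg -l` output on stdin and prints every habanalabs package whose
# version does not start with the expected prefix given as $1.
# Exit status: 0 = all match, 1 = at least one mismatch, 2 = no packages found.
check_habana_versions() {
    awk -v want="$1" '
        $2 ~ /habanalabs/ {
            found = 1
            if (index($3, want) != 1) {
                print $2, $3
                bad = 1
            }
        }
        END {
            if (!found) {
                print "no habanalabs packages found"
                exit 2
            }
            exit bad + 0
        }
    '
}
```

On a live system this would be used as `dpkg -l | check_habana_versions "<expected-version>"`. No output and an exit status of 0 mean every listed package matches.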
Next, I will show you how to get some information about the driver and how to load the driver. To get more information about the habanalabs driver, you can run the following command: modinfo habanalabs. This will give you all the available information about the kernel module, including extra information for developers who want more advanced settings for their kernel module. Some useful fields to reference are the filename, which shows the path to the kernel module; the version, which shows the version of the kernel module; and all the advanced parameters that you can set for the driver.
Once you have verified that the right driver is installed, you can load the driver into the kernel with the following command: modprobe habanalabs. This will load the driver. Now we will show you how to check that there are no error messages. Once modprobe has run, you can run dmesg to verify that the driver loaded properly. dmesg is a command on most Unix-like operating systems that prints the message buffer of the kernel.
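The driver steps can be sketched with two small filters. Both function names are made up for illustration, and the log message matched by `count_added_devices` is an assumption based on the wording used in this video, so adjust it if your kernel log reads differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; both read command output on stdin so they can be
# tried on saved logs as well as on a live system.

# Print the names of the module parameters listed in `modinfo` output.
list_module_params() {
    awk '/^parm:/ { sub(/^parm:[ \t]*/, ""); split($0, part, ":"); print part[1] }'
}

# Count kernel log lines that report a device being added by the driver.
# The message text is an assumption based on the wording in the video.
count_added_devices() {
    grep -ci 'successfully added device to habanalabs driver' || true
}

# Example usage on a system with the driver installed (loading needs root):
#   modinfo habanalabs | list_module_params
#   sudo modprobe habanalabs
#   sudo dmesg | count_added_devices
```

A count equal to the number of installed cards suggests every device was picked up. For a single field, `modinfo -F version habanalabs` prints just the module version.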
After loading the module, the dmesg output will show whether there were any errors while loading the driver. Mainly, you’re looking for the message "Successfully added device to habanalabs driver".
Next, we’ll go through some explanations of how to run hl_qual to help qualify your Gaudi hardware platform and make sure it’s ready for use. We’ll go through the following tests as examples, starting with the PCI-BW test.
This is a test plugin that measures the PCI bandwidth when moving data between the host and the device HBM memory, in both directions. The power stress test puts the device under a constant and equal level of power load. It can be used for, but is not limited to, thermal stress testing, testing the power limiter and clock relaxation mechanisms, and checking long work periods under typical power workloads. The EDP test verifies the functionality of the Gaudi power supply by generating a fast power usage transient, from low power to high power and vice versa.
The functional test runs all the available functionality of the hardware and all the available components on the Gaudi SoC, to test the functionality of the different units and the interaction between them during parallel execution. Next, we’ll show you how to run the hl_qual application and the tests that we’ve just listed.
To start off, we’ll go to the directory where hl_qual is installed. This directory has all the files hl_qual requires to run. To run our PCIe bandwidth test, we run hl_qual with the following options: -c all, so the test runs on all the devices; -rmod serial, for serial run mode; -t 20, to run for 20 seconds; -p, for the PCIe bandwidth test; and -b, for bidirectional. Here you can see the monitor has started, and you can see a bunch of information about each of the devices listed here.
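Put together, the PCIe bandwidth command might look like the sketch below. The flags are written as they are spoken in the video and may differ between releases. The `run_qual` wrapper, the `DRY_RUN` variable, and the assumption that hl_qual is run as `./hl_qual` from its install directory are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: prints the hl_qual command line when DRY_RUN=1,
# otherwise runs hl_qual from the current directory.
run_qual() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "hl_qual $*"
    else
        ./hl_qual "$@"
    fi
}

# PCIe bandwidth test with the flags as spoken in the video:
# all devices, serial run mode, 20 seconds, PCIe test, bidirectional.
pci_bw_test() {
    run_qual -c all -rmod serial -t 20 -p -b
}
```

Setting `DRY_RUN=1` before calling `pci_bw_test` only prints the command, which is a cheap way to review the flags before starting a run on real hardware.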
As the test runs, this information is updated. You can also see the duration of the test and the current time, as well as the devices it is running on. Here you can see the results of the PCIe bandwidth test. There are min and max values for the monitors, a pass or fail for each of the cards, and an overall pass or fail. Up above, we can see more detailed information for each of the tests run on each of the cards.
Next, I’ll show you how to run the power stress test. We run hl_qual, the same application as before, with -c all for all the cards to be used; -rmod parallel, since we’re going to run in parallel mode this time; -t 20 for the time to run; -s for the power stress test; and -l for the level, where we’re going to use extreme. Just as before, we’ll see some monitors come up. As the test runs, these monitors will update. We’ll also see the duration and the devices that are running. Here you can see the results of the power stress test. Just like the PCIe test, we have min and max values for the statistics and the monitors that are collected. We have the pass or fail criteria for each of the cards, an overall pass or fail, and then more details about each one of the tests.
Next, I will show you how to run the EDP test. This time we’re going to use the same starting options: -c all for all the cards, -rmod parallel to run in parallel mode, -t for the time, -e for the EDP test, and -l for the level, where we’re going to use extreme again. Same as before, we will skip to the end of the test, when we get the results. Here we can see the results for the EDP test. They are very similar for the monitors and the pass or fail criteria, and here we have more details for each of the runs of the EDP test itself.
Lastly, I will show you how to run the functional test. The first few inputs are similar to the previous tests. The differentiator here is -f for the functional test, -serdes for the SerDes type, which we set to all-gather, and -j for the JSON file with the HCL configuration. As before, we will skip to the end of the test, when we get the results. As you can see, we have similar information about the monitors, the pass or fail criteria for each of the cards, and an overall pass or fail, as well as more information for each of the cards that ran the test. That is it for the hl_qual tool section.
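The power stress and EDP runs share most of their options, so they can be sketched with one helper that prints the command line for review. The helper names and the `EDP_SECONDS` variable are made up for illustration, the flags are as spoken in the video, and the EDP run time is not stated there, so the 20-second default below is only a placeholder. The functional test is left out because its exact arguments are not spelled out in the video.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the hl_qual command line for a test that runs on
# all cards in parallel mode at the "extreme" level.
#   $1 = test flag as spoken in the video (-s power stress, -e EDP)
#   $2 = run time in seconds
qual_parallel_cmd() {
    echo "./hl_qual -c all -rmod parallel -t $2 $1 -l extreme"
}

# Print both commands for review before running them by hand.
print_power_tests() {
    # Power stress test, 20 seconds as in the demo.
    qual_parallel_cmd -s 20
    # EDP test; the duration is not stated in the video, so this is a placeholder.
    qual_parallel_cmd -e "${EDP_SECONDS:-20}"
}
```

Calling `print_power_tests` prints the two command lines, which can then be run one at a time from the hl_qual directory.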
For more information, please refer to the links on the Habana developer website, developer.habana.ai.
Lastly, we’ll go over some functionality of the hl-smi monitoring tool. hl-smi returns system-level information about the Habana hardware installed on the server. This is some of the information it can be used to fetch: hardware identification information, as well as the working condition of the hardware. The hardware identification information includes the hardware serial number, PCB revision, PCI bus, AIP UUID, and driver version. The working-condition information includes DRAM usage, clocks, temperature, power, and power limits.
Now, I’ll show you a demo. To query the information from the system, you can use the following command: hl-smi -q. This will list the information for each of the cards. We suggest that you do the following checks to ensure that the cards are ready.
First, check that the firmware versions match the versions that you are expecting. These are listed here. Next, we also want to make sure that the clock speeds listed here, as well as the PCIe link speed and link width, are at the values expected for performance. You can also get a summary of all the information: if you don’t pass the -q option, hl-smi summarizes all the information in a table. This is an example of the output. Thanks for watching. For more information, please visit developer.habana.ai.