Video Transcript
Hi, my name is Milind Pandit and I'm a deep learning manager at Habana. I lead a team of data scientists who help our customers get maximum value out of Habana products. This video, Getting Started with Gaudi: Model Migration, is intended for data scientists. It covers the simple steps you can take with your deep learning model to train it on Gaudi.
The Habana Gaudi processor was custom built for training acceleration. It features heterogeneous compute with a cluster of fully programmable tensor processing cores, or TPCs, and configurable matrix math engines, or MMEs. The ASIC integrates 32 gigabytes of high bandwidth memory and networking ports for RDMA over Converged Ethernet. We also provide a full software stack supporting popular deep learning frameworks.
In this presentation, I'll be assuming that you're running a model using TensorFlow or PyTorch under one of these configurations. For your convenience, we provide Docker containers that match these configurations for you to run your training in. We're constantly validating our software for more and more configurations, so check the release notes for one that matches yours. If you must run on a different configuration, or you run into problems with your model, please let us know.
When you have a deep learning model that you can train on a CPU or a GPU, there are just one or two simple steps you need to take to train it on Gaudi. First, be sure to load the Habana libraries into your framework. Second, explicitly target the Habana device for execution.
For TensorFlow, you import the load_habana_module function and then call it to target the Habana device. Disabling eager execution is recommended for scripts using tf.Session; otherwise, it is optional.
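In code, those steps look roughly like the following minimal sketch. The import path shown here, habana_frameworks.tensorflow, is an assumption that may vary between releases, so check the release notes for your version.

import tensorflow as tf

# Load the Habana libraries into TensorFlow and register the HPU device.
# The package path below is assumed; it may differ per release.
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()

# Recommended for scripts that use tf.Session; otherwise optional.
tf.compat.v1.disable_eager_execution()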
For PyTorch, you have to load a library explicitly. The first statement sets the directory where these libraries are stored. The second statement includes that directory in Python's search path for libraries. The third statement concatenates the name of the library to the directory and calls the load library method on that path to load it. To target the Habana device, call the torch.device method with the string "hpu". Also, call .to(device) on your model; your script may already include this last statement.
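A rough sketch of those statements follows. The library directory and the plugin filename shown here are placeholders based on the demo container, not exact names from a specific release; substitute the ones shipped with your installation.

import os
import sys
import torch
import torch.nn as nn

# Directory where the Habana PyTorch libraries live in the demo container
# (illustrative; substitute the path from your installation).
habana_lib_dir = "/usr/lib/habanalabs"

# Include that directory in Python's search path for libraries.
sys.path.insert(0, habana_lib_dir)

# Concatenate the library name to the directory and load it
# (the plugin filename below is a placeholder; use the one from your release).
torch.ops.load_library(os.path.join(habana_lib_dir, "libhabana_pytorch_plugin.so"))

# Target the Habana device.
device = torch.device("hpu")

# Move the model's parameters to the HPU; your script may already do this.
model = nn.Linear(784, 10)  # stand-in model for illustration
model = model.to(device)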
Now I'll show you a demo running on an actual Gaudi system. I'm running an interactive session with the TensorFlow docker container. In this container, the necessary libraries are installed in /usr/lib/habanalabs. This is a very simple script that trains a TensorFlow model on the industry-standard MNIST dataset. We import TensorFlow here. Here we can turn on and off the code that loads the Habana libraries and targets the Habana device.
For now, I'll turn this off. Here we disable eager execution, which is required in order to train on Gaudi. Here, we load the dataset, normalize it, build a very simple model, define a loss function, define an optimizer, compile the model, fit it, and evaluate it. So let's see how this runs on CPU.
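The structure of that script is roughly the following. The layer sizes, optimizer, and hyperparameters here are illustrative, not the exact values from the demo, and the commented-out Habana lines use the assumed import path shown earlier.

import tensorflow as tf

# Toggle these two lines to train on Gaudi instead of CPU
# (import path assumed; check your release notes).
# from habana_frameworks.tensorflow import load_habana_module
# load_habana_module()

# Disable eager execution.
tf.compat.v1.disable_eager_execution()

# Load and normalize the MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build a very simple model.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Define a loss function and an optimizer, then compile, fit, and evaluate.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)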
All right, the training is complete. Now let's go back in and turn on the lines I showed you in the slides, which load the Habana libraries and target the Habana device. You'll see slightly different output that references Habana, but then the model training proceeds very similarly to how it previously did on CPU. And if you happen to be starting with a model that trains on a GPU, the modifications are very similar.
Now here, I'm running an interactive session with the PyTorch docker container. In this container, the necessary libraries are also installed in /usr/lib/habanalabs.
This is a more complex script that trains a PyTorch model on the industry-standard MNIST dataset. It includes an accuracy function, a definition of the deep neural network, methods to train the model, and a method to test the model. And here is the code that I previously showed you that loads the Habana library and targets the Habana device.
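A rough sketch of that structure is shown below. The network, hyperparameters, and the Habana library directory and plugin filename are illustrative assumptions, not the exact demo code; use the names from your release.

import os
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

USE_HPU = False  # flip to True to target the Habana device

if USE_HPU:
    # Load the Habana PyTorch plugin as shown earlier
    # (directory and filename are placeholders; use your release's names).
    habana_lib_dir = "/usr/lib/habanalabs"
    sys.path.insert(0, habana_lib_dir)
    torch.ops.load_library(os.path.join(habana_lib_dir, "libhabana_pytorch_plugin.so"))
    device = torch.device("hpu")
else:
    device = torch.device("cpu")
print(f"Selected device: {device}")

class Net(nn.Module):
    """A small fully connected network for MNIST."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

def accuracy(logits, labels):
    """Fraction of predictions that match the labels."""
    return (logits.argmax(dim=1) == labels).float().mean().item()

def train(model, loader, optimizer):
    """One pass over the training data."""
    model.train()
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()

def test(model, loader):
    """Evaluate accuracy on the test data."""
    model.eval()
    accs = []
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            accs.append(accuracy(model(data), target))
    print(f"Test accuracy: {sum(accs) / len(accs):.4f}")

transform = transforms.ToTensor()
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transform),
    batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST(".", train=False, download=True, transform=transform),
    batch_size=256)

model = Net().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(2):
    train(model, train_loader, optimizer)
    test(model, test_loader)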
Again, I'm going to turn it off for now to show you how it runs on the CPU. You can see that the selected device is the CPU. And now the model training is complete. Now let's turn on the code to target the Habana device. You can see that the selected device is the Habana device.
And the training of this model completes with very similar results. I hope I’ve shown you how fast and easy it is to take a PyTorch or TensorFlow model and train it on Gaudi. Be sure to check out our other videos on advanced topics like setting up your system with Gaudi and profiling for performance optimization. Thanks for your attention and for more information, visit developer.habana.ai.