Hello. I’m Harshvardhan Chauhan. I’m a Software Engineer at Habana Labs, Intel, working on enabling deep learning models on Gaudi. Today, I’m going to show you how you can set up the Gaudi TensorFlow environment with just a few commands, and then I will walk you through TensorFlow ResNet50 Keras training on the ImageNet dataset.
Let’s get started. In this video, I’ll be using a Gaudi-based server. I’m assuming your system is set up with a Gaudi device, driver, and firmware, and that you are familiar with the TensorFlow framework, the ResNet50 topology, and how to use our Docker images.
If you have not already set up the device or have any problems, please check out the other setup and install resources on our website. The first step is to download the Docker image that I’ll be using from the vault, as shown here in the command. The vault is a public repository for all images and software content.
The next step is to run the Docker container, and here is the command. There are a few flags that a user can set to make it convenient to transfer files between the Docker container and the local system on which the container is running.
One can map a directory using the -v flag to transfer training logs from the container to the system, or to access other data files on the system from inside the container. Also, one can use the --workdir flag to set a default working directory. In my Docker container, there are eight cards available, and all of them will be available for training. That’s it.
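As a rough illustration of the two flags just mentioned, here is a minimal Python sketch that assembles a docker run command line. The image name and both directory paths are hypothetical placeholders, not the values shown in the video; only the -it, -v, and --workdir flags are standard Docker options.

```python
import shlex


def build_docker_run(image, host_dir, container_dir, workdir):
    """Assemble a `docker run` argument list with a mapped directory
    (-v host:container) and a default working directory (--workdir)."""
    return [
        "docker", "run", "-it",
        "-v", f"{host_dir}:{container_dir}",  # share files between host and container
        "--workdir", workdir,                 # directory the container starts in
        image,
    ]


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration only.
    cmd = build_docker_run(
        image="<image-from-vault>",
        host_dir="/home/user/logs",
        container_dir="/root/logs",
        workdir="/root",
    )
    print(shlex.join(cmd))
```

Anything written to the container-side directory then shows up in the mapped host directory, which is how training logs can be kept after the container exits.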
Everything we need to run the TensorFlow model is ready now. You can also check out the Habana TensorFlow user guide to learn more about the Habana TensorFlow integration and to learn how you can port your own models to Gaudi.
Keep watching if you would like to learn more about how to run ResNet50 Keras training from our Model-References repository on GitHub. We’ll start by cloning the Model-References repository from GitHub. You can use the hl-smi command to get the right branch version. So to clone the model references, this is the command. We can see that the repository is cloned and available in the root directory. Here is the repository.
We need to add the path to the model references to our PYTHONPATH environment variable. We’ll also set the PYTHON variable to the right version of Python. For the ResNet Keras model, we’ll go to the path shown here. Let’s take a look inside.
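To make the PYTHONPATH step concrete, here is a small Python sketch that prepends a directory to a PYTHONPATH-style string. The /root/Model-References path is an assumption based on the repository being cloned into the root directory; the helper name is hypothetical.

```python
import os


def prepend_to_pythonpath(new_path, current=""):
    """Return a PYTHONPATH value with new_path first, without duplicates
    or empty entries."""
    existing = [p for p in current.split(os.pathsep) if p and p != new_path]
    return os.pathsep.join([new_path] + existing)


if __name__ == "__main__":
    # Assumed clone location; adjust to wherever the repository actually lives.
    repo = "/root/Model-References"
    os.environ["PYTHONPATH"] = prepend_to_pythonpath(
        repo, os.environ.get("PYTHONPATH", "")
    )
    print(os.environ["PYTHONPATH"])
```

Setting the variable this way only affects the current process and its children; in a shell session the equivalent is an export in the terminal where training is launched.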
The important script is resnet_ctl_imagenet_main.py. The resnet_ctl_imagenet_main.py script executes the ResNet model. Multi-card training can be done either by using Horovod or a tf.distribute strategy. I’ll be using Horovod in this example. The default values set here are the best configuration for the hardware. We need to get and prepare the ImageNet dataset in TFRecord format.
This slide shows the commands to get the ImageNet dataset and process it into the right format. Please refer to the README of ResNet Keras in our model garden repository for more information. The ImageNet dataset is quite large and will take a few hours to download and process. Once the dataset is downloaded and processed into TFRecord format, it is present in our default directory.
The directory can also be changed. Once the ImageNet dataset is ready, we can proceed to training. To launch the training, we’ll be using this command. We train the model using the mpirun command, which takes some parameters of its own, like -np, the number of processes, which is shown right here. It then runs the resnet_ctl_imagenet_main.py script, which has its own parameters, like dtype, the data type, which can be BF16 or FP32, and data_dir, which is not a mandatory parameter.
Provide this only if you have your ImageNet dataset stored in your own location. By default, it picks up the ImageNet dataset from the default location, which is in the /data directory. Use the Horovod flag, as we use Horovod for multi-card training. Train epochs is used for setting the number of epochs to train the model, batch size for changing the batch size, and optimizer for choosing the optimizer, which can be set to LARS or SGD for training.
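The launch command described above can be sketched as follows. This is only an illustration under stated assumptions: the flag spellings (--dtype, --data_dir, --use_horovod, --train_epochs, --batch_size, --optimizer) are inferred from the parameter names as spoken, and the batch size and epoch count are placeholder values rather than the recommended configuration. Check the script's help output and the README for the exact command.

```python
import shlex


def build_training_command(num_processes, script, dtype, batch_size,
                           train_epochs, optimizer, data_dir=None):
    """Assemble an mpirun launch line for multi-card Horovod training.
    Flag spellings are assumptions inferred from the spoken parameter names."""
    cmd = [
        "mpirun", "-np", str(num_processes),   # one process per card
        "python3", script,
        "--dtype", dtype,                      # bf16 or fp32
        "--use_horovod",                       # Horovod for multi-card training
        "--train_epochs", str(train_epochs),
        "--batch_size", str(batch_size),
        "--optimizer", optimizer,              # LARS or SGD
    ]
    if data_dir is not None:
        # Optional: only needed when the dataset is not in the default location.
        cmd += ["--data_dir", data_dir]
    return cmd


if __name__ == "__main__":
    # Batch size and epoch count here are placeholders, not tuned values.
    cmd = build_training_command(8, "resnet_ctl_imagenet_main.py",
                                 "bf16", 256, 40, "LARS")
    print(shlex.join(cmd))
```

Leaving data_dir as None mirrors the behavior described in the talk, where the script falls back to the default dataset location.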
The other parameters, like learning rate, weight decay, and others, are self-explanatory, and the values used here are the best configuration. During the training, the graph gets compiled and runs on all eight workers. It will take several minutes to start seeing the performance numbers on the terminal. Once the graph gets compiled for each card, and if the checkpoint flag is activated in the training command, the checkpoints get created.
Every worker will be printing its own individual performance number. The examples per second is simply the number of images processed per second. It is the final performance measured across all the cards. The calculation is done by taking each worker’s steps per second, then multiplying by the batch size, and then multiplying by the number of cards, which is eight.
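The calculation just described can be written out as a short Python sketch. The function name is hypothetical, and the numbers in the usage example are made up for illustration; they are not measured results.

```python
def examples_per_second(steps_per_second, batch_size, num_cards):
    """Aggregate throughput: steps/sec per worker x batch size x number of cards."""
    return steps_per_second * batch_size * num_cards


if __name__ == "__main__":
    # Hypothetical inputs: 2.5 steps/sec per worker, batch size 256, 8 cards.
    print(examples_per_second(2.5, 256, 8))  # prints 5120.0
```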
This gives the corresponding examples per second. We are now at the end of our training. I wish to summarize the process. We first download the ImageNet dataset and pre-process it into TFRecord format, which is stored in the default location. Then we run the ResNet main script for training, and then finally the evaluation.
We’ll now go ahead and take a look at the results, which are by default in the /tmp/resnet folder. Here in the ResNet folder, you can find the output for each worker. Let’s go to worker zero. Here, the checkpoints should be saved if the checkpoint flag is activated. This is the standard output for worker zero.
You can see the evaluation results stored here for each worker. The evaluation results for worker zero are present over here. The ResNet Keras model can also be trained on larger multi-card Gaudi systems, like 16 cards and 32 cards. This is all for TensorFlow ResNet Keras training. Don’t forget to check out other TensorFlow models from the Model-References repository. You can also find more information on our website about how to port your own models to Gaudi. Thanks for watching. For more information, visit our website developer.habana.ai.