
Gaudi: Training with PyTorch

A how-to tutorial to help you run your PyTorch model on Gaudi.

Video Transcript

Hello. I’m Harshvardhan Chauhan. I work as a Software Engineer at Habana Labs, an Intel company, enabling deep learning models on Gaudi.

Today I’m going to show you how you can set up the Gaudi PyTorch environment with just a few commands, and then I’ll walk you through PyTorch BERT Large fine-tuning on the SQuAD dataset. In this video, I’ll be using a Gaudi-based server. I’m assuming your system is set up with the Gaudi device, driver, and firmware, and that you are familiar with the PyTorch framework, the BERT topology, and how to use Docker.

If you have not already set up the device or run into any problems, please check out the setup and install videos on our website. Let’s get started. The first step is to download the Docker image from the Habana Vault. Here is the command to download the Docker image from the Vault that I’ll be using.
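As a rough sketch, the pull command looks like the following. The exact repository path, OS, and version tag are placeholders here; the correct values for your release are listed on the Habana Vault page.

```shell
# Placeholder image path -- copy the exact pull command for your
# release from the Habana Vault (vault.habana.ai).
docker pull vault.habana.ai/gaudi-docker/<release>/<os>/habanalabs/pytorch-installer:<tag>
```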

The Vault is a public repository for all images and software content. The next step is to run the container, and here is the command. There are a few flags you can set to make it easier to transfer files between the Docker container and the local system it is running on.

You can map a directory using the -v flag to transfer training logs from the container to the host, or to access data files from the host inside the container. You can also use the --workdir flag to set the default working directory. In my Docker container, all eight cards are available, and all of them will be used for training.
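A minimal sketch of such a launch is shown below. The image name and host paths are placeholders, and the extra runtime flags reflect a typical Gaudi container launch; your release notes may list additional required options.

```shell
# Placeholder image name and host paths -- adjust for your setup.
IMAGE=vault.habana.ai/gaudi-docker/<release>/<os>/habanalabs/pytorch-installer:<tag>

docker run -it \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -v $HOME/training-logs:/root/logs \
  --workdir /root \
  $IMAGE
```

The `-v` flag maps a host directory into the container so training logs survive after the container exits, and `--workdir` sets the default working directory inside it.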

That’s it. Everything we need to run a PyTorch model is now ready. You can also check out the Habana PyTorch User Guide to learn more about the Habana PyTorch integration and how you can port your own models to Gaudi. Keep watching if you would like to learn how to run BERT fine-tuning from our Model-References repository on GitHub. We’ll start by cloning the repository. You can use the hl-smi command to check your installed software version so you clone the matching release. So the command is git clone followed by the repository URL. Let’s check.
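A sketch of these two steps, assuming the HabanaAI/Model-References repository on GitHub; the release branch name is a placeholder you should match to the version reported by hl-smi:

```shell
# Report the installed Gaudi driver/software version.
hl-smi

# Clone the release branch that matches the reported version.
git clone -b <release-branch> https://github.com/HabanaAI/Model-References.git
```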

We can see that the repository is now cloned and available for training. We need to add the path to Model-References to our PYTHONPATH environment variable, and this is the command. We’ll also set the PYTHON variable to the right version of the Python interpreter. For the BERT Large fine-tuning model, we’ll go to this particular directory. Let’s take a look inside. The important script is demo_bert.py. It is a wrapper script that facilitates running all versions of the BERT model for pre-training and fine-tuning.
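The two environment variables can be set as follows. This sketch assumes the repository was cloned into the home directory; adjust the path to wherever you cloned it.

```shell
# Make the cloned repository importable by the training scripts.
export PYTHONPATH=$HOME/Model-References:$PYTHONPATH

# Point the PYTHON variable at the interpreter the scripts should use.
export PYTHON=$(command -v python3)
```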

Let’s look inside the demo_bert.py file. It uses the fine-tuning scripts in the bert/transformers directory. The script contains the hyperparameter settings and default values for the different versions of BERT, and it helps to run both fine-tuning and pre-training. It also sets various environment variables and performs the required checks on the command-line arguments. Here you can see the different parameters and their default values.

The default values set here are the best configurations for the hardware. Now we’ll move on to fine-tuning. To launch it, we’ll use this command. It runs the demo_bert.py script with its own set of parameters: the sub-command finetuning, which selects fine-tuning rather than pre-training; the model name or path, set here to large since we’ll be using the BERT Large model; the mode, which can be lazy or eager; the dataset name, squad, since we are training on the SQuAD dataset; the data type, which can be BF16 or FP32; the number of training epochs, which we set to two; the batch size, which is 24; and the maximum sequence length, which we set here to 384.

The other arguments are the learning rate, which is 3e-5, and the do_eval flag; and since we are training on eight cards, we’ll use a world size of eight. As seen earlier, the full argument list and default values are in the argument section of the demo_bert.py script.
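Putting the parameters above together, the launch command looks roughly like the following. This is a sketch only: the exact flag spellings come from demo_bert.py’s argument section and may differ slightly between releases, so check them with the script’s help output before running.

```shell
# Sketch of the fine-tuning launch; verify flag names against
# `$PYTHON demo_bert.py --help` for your release.
$PYTHON demo_bert.py finetuning \
  --model_name_or_path large \
  --mode lazy \
  --dataset_name squad \
  --data_type bf16 \
  --num_train_epochs 2 \
  --batch_size 24 \
  --max_seq_length 384 \
  --learning_rate 3e-5 \
  --do_eval \
  --world_size 8
```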

The SQuAD dataset is downloaded automatically based on the task name and will be available in the current directory. The script also automatically downloads the pre-trained model into the current directory and then starts processing the dataset. This will take several minutes. On subsequent runs with similar parameters, the dataset download and pre-processing steps will be skipped.

During training on eight cards, the dataset is partitioned across all eight workers and saved in the default directory. During training, the graph is compiled and run on all eight workers. It will take several minutes before performance numbers start appearing on the terminal.

Once the graph has been compiled for each card, and provided the checkpoint flag was passed on the training command, checkpoints are created. Each worker prints its own individual performance number. The final performance reported is the sentences per second summed across all the cards.

Now we are at the end of our training, so let me summarize the process. We ran the demo_bert.py script, which downloaded and processed the SQuAD dataset, downloaded the BERT Large model, ran the fine-tuning, and finally performed the evaluation. Let’s go ahead and take a look at the results. By default, they are written to the /tmp/squad directory, where you can find the checkpoints and evaluation results.
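To inspect the output, you can list the results directory; the exact file names inside it depend on the release, so the listing below is only a sketch.

```shell
# List checkpoints and evaluation output written by the run.
ls -l /tmp/squad
```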

That is all for PyTorch BERT fine-tuning. Don’t forget to check out the other PyTorch models in our model repository. You can also find more information on our website about how you can port your own models to Gaudi. Thanks for watching. For more information, visit developer.habana.ai.
