Intel AI Cloud gives developers access to Intel hardware,
including the latest Gaudi2 deep learning accelerator from Habana.
Welcome to the Intel AI Cloud! These instructions will show you how to get started running models on the Gaudi2 HPU. In summary, these are the steps you will follow:
- Log in to the AI Cloud, start a Gaudi2 instance, and SSH into it
- Load a Habana Docker image
- Clone the Habana Model-References repository
- Select and run a model according to the instructions
Let’s get started!
You will first need to register and create an account on the Intel AI Cloud. After logging in, you will have access to the Gaudi2 instances.
Get started by running the Habana PyTorch Docker image. The `--runtime=habana` flag selects the Habana container runtime, and `HABANA_VISIBLE_DEVICES=all` exposes all of the Gaudi accelerators to the container:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
user@10.1.63.124:~$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
Unable to find image 'vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest' locally
latest: Pulling from gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1
Digest: sha256:18d306fb631d4ec793ff2d8025447b14e88b4b0542dc40cbb2086327bbbf73f3
Status: Downloaded newer image for vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
5aef555954d71c719a2d59981865c62aaae55ab828ecc7d9423794c98685db59
* Starting OpenBSD Secure Shell server sshd
root@devcloud:/#
Now that you’re inside the Docker container, go to the $HOME directory and clone Habana’s Model-References repository to access the models:
cd $HOME
git clone https://github.com/habanaai/Model-References
root@devcloud:/# cd $HOME
root@devcloud:~# git clone https://github.com/habanaai/Model-References
Cloning into 'Model-References'...
remote: Enumerating objects: 16166, done.
remote: Counting objects: 100% (1955/1955), done.
remote: Compressing objects: 100% (1037/1037), done.
remote: Total 16166 (delta 834), reused 1865 (delta 801), pack-reused 14211
Receiving objects: 100% (16166/16166), 105.69 MiB | 28.15 MiB/s, done.
Resolving deltas: 100% (8520/8520), done.
root@devcloud:~#
Now set the PYTHONPATH and select the correct Python release so all the models can run from Habana Model-References:
export PYTHONPATH=/root/Model-References:$PYTHONPATH
export PYTHON=/usr/bin/python3.8 # use /usr/bin/python3.10 on Ubuntu 22.04
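Before moving on, you can optionally sanity-check that the Habana PyTorch bridge sees the accelerator. Below is a minimal sketch, assuming you are inside the Habana PyTorch container started above (importing habana_frameworks.torch.core is what registers the hpu device):

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

x = torch.ones(2, 2, device="hpu")  # allocate a tensor on the Gaudi HPU
y = x + x                           # queued for execution in lazy mode
htcore.mark_step()                  # flush the lazy-mode graph to the device
print(y.to("cpu"))                  # expect a 2x2 tensor of 2s

If this prints a tensor of 2s, the device and software stack are working.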
Now you can run the PyTorch examples or other PyTorch models in the Model-References repository. For this example, we’ll start with the simple “hello world” model examples:
cd /root/Model-References/PyTorch/examples/computer_vision/hello_world
First, we’ll run a Convolutional Neural Network (CNN) with no modifications; it uses the small MNIST dataset, which is downloaded at runtime.
mkdir checkpoints
$PYTHON example.py
root@devcloud:~# export PYTHONPATH=/root/Model-References:$PYTHONPATH
root@devcloud:~# export PYTHON=/usr/bin/python3.8
root@devcloud:~# cd /root/Model-References/PyTorch/examples/computer_vision/hello_world
root@devcloud:~/Model-References/PyTorch/examples/computer_vision/hello_world$ mkdir checkpoints
root@devcloud:~/Model-References/PyTorch/examples/computer_vision/hello_world$ $PYTHON example.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
=============================HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
=============================SYSTEM CONFIGURATION =========================================
Num CPU Cores = 160
CPU RAM = 1056427064 KB
============================================================================================
=====================================================================
Epoch : 1
Training loss is 0.8049848758653283 and training accuracy is 77.35333333333332
Testing loss is 0.30580335511248324 and testing accuracy is 91.33
=====================================================================
Epoch : 2
Training loss is 0.2820201388744911 and training accuracy is 91.85166666666666
Testing loss is 0.24607752678515035 and testing accuracy is 92.86999999999999
=====================================================================
=====================================================================
Epoch : 20
Training loss is 0.051834161554191155 and training accuracy is 98.685
Testing loss is 0.07477731682077239 and testing accuracy is 97.63
root@devcloud:~/Model-References/PyTorch/examples/computer_vision/hello_world$
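For reference, the hello-world example boils down to a standard PyTorch training loop with two Gaudi-specific additions: moving the model and data to the hpu device, and calling htcore.mark_step() to trigger execution in lazy mode. Here is a minimal, self-contained sketch of that pattern; it is illustrative only, uses a synthetic batch in place of MNIST, and is not the actual example.py:

import torch
import torch.nn as nn
import torch.nn.functional as F
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3)  # 28x28 -> 26x26
        self.fc = nn.Linear(16 * 26 * 26, 10)        # 10 digit classes

    def forward(self, x):
        return self.fc(F.relu(self.conv(x)).flatten(1))

device = torch.device("hpu")
model = SimpleCNN().to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a synthetic batch (stand-in for an MNIST DataLoader).
data = torch.randn(64, 1, 28, 28, device=device)
target = torch.randint(0, 10, (64,), device=device)

opt.zero_grad()
loss = F.cross_entropy(model(data), target)
loss.backward()
htcore.mark_step()  # lazy mode: flush the backward graph to the HPU
opt.step()
htcore.mark_step()  # flush the optimizer update
print(loss.item())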
Now we can run a similar model and pass in some hyperparameters to control the training configuration:
$PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu # runs an MNIST model with more config options
root@devcloud:~/Model-References/PyTorch/examples/computer_vision/hello_world$ $PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu
Not using distributed mode
=============================HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 2
=============================SYSTEM CONFIGURATION =========================================
Num CPU Cores = 160
CPU RAM = 1056427064 KB
============================================================================================
Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.275952
Train Epoch: 1 [640/60000.0 (1%)] Loss: 1.870048
Train Epoch: 1 [1280/60000.0 (2%)] Loss: 0.649141
Train Epoch: 1 [1920/60000.0 (3%)] Loss: 0.679428
Train Epoch: 1 [58880/60000.0 (98%)] Loss: 0.002423
Train Epoch: 1 [59520/60000.0 (99%)] Loss: 0.012831
Total test set: 10000, number of workers: 1
* Average Acc 98.430 Average loss 0.047
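These flags map onto an ordinary argparse configuration inside the script. As a hypothetical sketch of how such flags are commonly wired up (the actual argument handling in mnist.py may differ):

import argparse
import torch
import habana_frameworks.torch.core  # registers the "hpu" device

parser = argparse.ArgumentParser(description="MNIST training configuration")
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--lr", type=float, default=1.0)
parser.add_argument("--gamma", type=float, default=0.7,
                    help="learning-rate decay factor per epoch")
parser.add_argument("--hpu", action="store_true",
                    help="run training on the Gaudi HPU")
args = parser.parse_args()

device = torch.device("hpu" if args.hpu else "cpu")
# gamma typically feeds a learning-rate scheduler, e.g.:
# scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=args.gamma)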
Finally, we use the `mpirun` command to allow all eight Gaudi2 accelerators to work together to train the model faster. (The captured run below also enables Habana mixed precision through the `--hmp` flags.)
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root $PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu --use_lazy_mode
root@devcloud:~# mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root $PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu --hmp --hmp-bf16=ops_bf16_mnist.txt --hmp-fp32=ops_fp32_mnist.txt --use_lazy_mode
[devcloud:161708] MCW rank 7 bound to socket 1[core 58[hwt 0-1]], socket 1[core 59[hwt 0-1]], socket 1[core 60[hwt 0-1]], socket 1[core 61[hwt 0-1]], socket 1[core 62[hwt 0-1]], socket 1[core 63[hwt 0-1]]:
[devcloud:161708] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]:
[devcloud:161708] MCW rank 1 bound to socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]:
[devcloud:161708] MCW rank 2 bound to socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]]:
[devcloud:161708] MCW rank 3 bound to socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]]:
[devcloud:161708] MCW rank 4 bound to socket 1[core 40[hwt 0-1]], socket 1[core 41[hwt 0-1]], socket 1[core 42[hwt 0-1]], socket 1[core 43[hwt 0-1]], socket 1[core 44[hwt 0-1]], socket 1[core 45[hwt 0-1]]:
[devcloud:161708] MCW rank 5 bound to socket 1[core 46[hwt 0-1]], socket 1[core 47[hwt 0-1]], socket 1[core 48[hwt 0-1]], socket 1[core 49[hwt 0-1]], socket 1[core 50[hwt 0-1]], socket 1[core 51[hwt 0-1]]:
[devcloud:161708] MCW rank 6 bound to socket 1[core 52[hwt 0-1]], socket 1[core 53[hwt 0-1]], socket 1[core 54[hwt 0-1]], socket 1[core 55[hwt 0-1]], socket 1[core 56[hwt 0-1]], socket 1[core 57[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../.
| distributed init (rank 5): env://
| distributed init (rank 6): env://
| distributed init (rank 4): env://
| distributed init (rank 0): env://
| distributed init (rank 1): env://
| distributed init (rank 2): env://
| distributed init (rank 7): env://
| distributed init (rank 3): env://
=============================HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
=============================SYSTEM CONFIGURATION =========================================
Num CPU Cores = 160
CPU RAM = 1056427064 KB
============================================================================================
Train Epoch: 1 [0/7500.0 (0%)] Loss: 2.306436
Train Epoch: 1 [640/7500.0 (9%)] Loss: 1.247670
Train Epoch: 1 [1280/7500.0 (17%)] Loss: 0.496396
Train Epoch: 1 [1920/7500.0 (26%)] Loss: 0.275942
Train Epoch: 1 [2560/7500.0 (34%)] Loss: 0.231989
Train Epoch: 1 [3200/7500.0 (43%)] Loss: 0.174838
Train Epoch: 1 [3840/7500.0 (51%)] Loss: 0.119858
Train Epoch: 1 [4480/7500.0 (60%)] Loss: 0.144650
Train Epoch: 1 [5120/7500.0 (68%)] Loss: 0.154049
Train Epoch: 1 [5760/7500.0 (77%)] Loss: 0.089073
Train Epoch: 1 [6400/7500.0 (85%)] Loss: 0.050248
Train Epoch: 1 [7040/7500.0 (94%)] Loss: 0.077905
Total test set: 10000, number of workers: 8
* Average Acc 97.615 Average loss 0.072
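Notice the per-rank sample count in the log: each of the eight ranks trains on 7,500 of the 60,000 MNIST images. Under mpirun, each process initializes PyTorch distributed training against its own HPU. Here is a minimal sketch of that setup, assuming Habana’s HCCL backend as described in the Gaudi documentation (the actual wiring in mnist.py may differ):

import os
import torch
import torch.distributed as dist
import habana_frameworks.torch.core as htcore    # HPU bridge
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

# mpirun publishes each process's rank and size via OMPI_* environment variables.
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "12355")

dist.init_process_group(backend="hccl", world_size=world_size, rank=rank)

# Each rank moves its model to the HPU and wraps it in DDP so gradients
# are averaged across all eight accelerators.
device = torch.device("hpu")
model = torch.nn.Linear(784, 10).to(device)  # placeholder model
model = torch.nn.parallel.DistributedDataParallel(model)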
Next Steps
Congratulations! You have just started your journey into deep learning with Habana. You should now explore additional models in the Model-References repository. We have models for training and inference using Natural Language Processing, Generative AI, computer vision, and more. You’ll be able to follow the step-by-step instructions for each model.
You can also run models from Hugging Face on Gaudi2 by following the instructions here to use all the examples and other models from Hugging Face. If you’d like to migrate your own models to Gaudi, you can start with our Model Porting guide to move your model from other architectures to Gaudi2.
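As a taste of the Hugging Face path, the optimum-habana library provides Gaudi-aware drop-in replacements for the standard Trainer API. The sketch below is illustrative only, assuming optimum-habana is installed (pip install optimum[habana]) and that you supply your own tokenized train_dataset; check the current optimum-habana documentation for exact usage:

from optimum.habana import GaudiTrainer, GaudiTrainingArguments
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,     # run on the Gaudi HPU
    use_lazy_mode=True,  # use the HPU lazy execution mode
    gaudi_config_name="Habana/bert-base-uncased",  # published Gaudi config
)

trainer = GaudiTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: a tokenized dataset you provide
)
trainer.train()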