This tutorial adapts the TensorFlow "Use a GPU" tutorial for Habana Gaudi AI processors.
TensorFlow code and tf.keras models run on a single HPU (Gaudi) with only a few lines of code changes.
Note: Use tf.config.list_physical_devices('HPU') to confirm that TensorFlow is using the HPU.
The simplest way to run on multiple HPUs, on one or many machines, is using Distribution Strategies.
This tutorial is for users who have tried these approaches and found that they need fine-grained control of how TensorFlow uses the HPU. To learn how to debug performance issues for single and multi-HPU scenarios, see the Model Performance Optimization tutorial.
Setup
Ensure you have the latest SynapseAI release and a supported TensorFlow release installed.
import tensorflow as tf
Enable Habana
Let’s enable the Gaudi device by loading the Habana module:
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
2022-04-06 21:51:10.658540: W /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/helpers/op_registry_backdoor.cpp:92] Couldn't find definition of RemoteCall:GPU: to register on HPU
print("Num HPUs Available: ", len(tf.config.list_physical_devices('HPU')))
Num HPUs Available: 1
Overview
Habana TensorFlow supports running computations on CPU and HPU. Devices are represented by string identifiers, for example:
"/device:CPU:0": The CPU of your machine.
"/HPU:0": Short-hand notation for the first HPU of your machine that is visible to TensorFlow.
"/job:localhost/replica:0/task:0/device:HPU:0": Fully qualified name of the first HPU of your machine that is visible to TensorFlow.
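To make the relationship between the short-hand and fully qualified names concrete, the following is a minimal sketch that splits a device string into its parts. The helper parse_device_name is hypothetical and written only for illustration; it is not part of TensorFlow or the Habana package. For real code, TensorFlow's own tf.DeviceSpec.from_string is the better choice.

```python
def parse_device_name(name):
    """Split a TensorFlow-style device string into named parts.

    Hypothetical helper for illustration only; it handles just the
    three forms listed above.
    """
    parsed = {}
    for field in name.strip("/").split("/"):
        pieces = field.split(":")
        if pieces[0] == "device" and len(pieces) == 3:
            # Long form, e.g. "device:HPU:0".
            parsed["device_type"] = pieces[1]
            parsed["device_index"] = int(pieces[2])
        elif pieces[0] in ("job", "replica", "task") and len(pieces) == 2:
            # Job is a name; replica and task are integer indices.
            key, value = pieces
            parsed[key] = value if key == "job" else int(value)
        elif len(pieces) == 2:
            # Short-hand form, e.g. "HPU:0".
            parsed["device_type"] = pieces[0]
            parsed["device_index"] = int(pieces[1])
        else:
            raise ValueError(f"Unrecognized device field: {field!r}")
    return parsed


short = parse_device_name("/HPU:0")
full = parse_device_name("/job:localhost/replica:0/task:0/device:HPU:0")
print(short)
print(full)
```

Both strings yield the same device type and index; the fully qualified name only adds the job, replica, and task that locate the device within a cluster.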
If a TensorFlow operation has both CPU and HPU implementations, by default, the HPU device is prioritized when the operation is assigned. For example, tf.matmul has both CPU and HPU kernels, and on a system with devices CPU:0 and HPU:0, the HPU:0 device is selected to run tf.matmul unless you explicitly request to run it on another device.
If a TensorFlow operation has no corresponding HPU implementation, then the operation falls back to the CPU device.
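The placement rule described above can be summarized as a small decision function. This is a toy model under stated assumptions, not TensorFlow's actual placer: the function choose_device and its arguments are hypothetical, and it only considers the device types CPU and HPU. What really happens when an explicitly requested device has no kernel depends on TensorFlow's soft device placement setting; this sketch simply raises an error in that case.

```python
def choose_device(kernels, requested=None):
    """Toy model of the default placement rule (illustration only).

    kernels:   set of device types that implement the operation,
               e.g. {"CPU", "HPU"}.
    requested: device type explicitly requested by the user, or None.
    """
    if requested is not None:
        # An explicit request overrides the default priority.
        if requested not in kernels:
            # Simplification: the real outcome depends on runtime settings.
            raise ValueError(f"No {requested} kernel for this operation")
        return requested
    # Default: prefer the HPU when a kernel exists, else fall back to CPU.
    return "HPU" if "HPU" in kernels else "CPU"


print(choose_device({"CPU", "HPU"}))                   # op with both kernels
print(choose_device({"CPU"}))                          # op with no HPU kernel
print(choose_device({"CPU", "HPU"}, requested="CPU"))  # explicit request
```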
Logging device placement
To find out which devices your operations and tensors are assigned to, put tf.debugging.set_log_device_placement(True) as the first statement of your program. Enabling device placement logging causes any Tensor allocations or operations to be printed.
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
2022-04-06 21:51:10.730441: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:HPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:HPU:0
Executing op _function_MatMul_2918744514792547455 in device /job:localhost/replica:0/task:0/device:HPU:0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
2022-04-06 21:51:13.237326: W /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/habana_device.cpp:201] HPU initialization done for library version 1.3.0_c61303b7_tf2.8.0
2022-04-06 21:51:13.259458: I tensorflow/core/common_runtime/placer.cc:114] a_0: (_Arg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259483: I tensorflow/core/common_runtime/placer.cc:114] b_1: (_Arg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259496: I tensorflow/core/common_runtime/placer.cc:114] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259502: I tensorflow/core/common_runtime/placer.cc:114] HabanaEagerMsg: (HabanaEagerMsg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259507: I tensorflow/core/common_runtime/placer.cc:114] product_0_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:HPU:0
The code above prints an indication that the MatMul op was executed on HPU:0.
Manual device placement
If you would like a particular operation to run on a device of your choice instead of what’s automatically selected for you, you can use with tf.device to create a device context, and all the operations within that context will run on the same designated device.
tf.debugging.set_log_device_placement(True)
# Place tensors on the CPU
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Run on the HPU
c = tf.matmul(a, b)
print(c)
Executing op _function_MatMul_7595306614168636033 in device /job:localhost/replica:0/task:0/device:HPU:0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
2022-04-06 21:51:13.285843: I tensorflow/core/common_runtime/placer.cc:114] a_0: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
2022-04-06 21:51:13.285867: I tensorflow/core/common_runtime/placer.cc:114] b_1: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
2022-04-06 21:51:13.285875: I tensorflow/core/common_runtime/placer.cc:114] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.285881: I tensorflow/core/common_runtime/placer.cc:114] HabanaEagerMsg: (HabanaEagerMsg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.285887: I tensorflow/core/common_runtime/placer.cc:114] product_0_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.347018: W /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/habana_device_binding_iface.cpp:59] Found TensorFlow library with SHA256: 1f4e3d3c8f90c158c442f60b6b1fafd64cfb678fd7c4f954804e0ba91497c2a0
You will see that a and b are now assigned to CPU:0. Since a device was not explicitly specified for the MatMul operation, the TensorFlow runtime chooses one based on the operation and the available devices (HPU:0 in this example) and automatically copies tensors between devices if required.
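To keep the multiplication itself on the CPU as well, the tf.matmul call can be moved inside the tf.device context. The following is a minimal sketch that assumes only a standard TensorFlow 2 installation; it does not load the Habana module, so on a Gaudi machine you would first call load_habana_module() as shown in the Setup section. It reads the placement back from the result tensor's device attribute rather than relying on the placement log.

```python
import tensorflow as tf

# Pin the inputs and the matrix multiplication to the CPU.
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)

# In eager mode, a tensor reports the fully qualified name of the
# device that holds it.
print(c.device)
print(c.numpy())
```

Because every operation in the block shares one device context, no tensors need to be copied between devices for this computation.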
Copyright (c) 2021 Habana Labs, Ltd. an Intel Company.
Licensed under the Apache License, Version 2.0 (the “License”);
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.