
Use an HPU

An adaptation of the Use a GPU tutorial using Habana Gaudi AI processors.

TensorFlow code and tf.keras models run on a single HPU (Gaudi) with only a few lines of code changed.

Note: Use tf.config.list_physical_devices('HPU') to confirm that TensorFlow is using the HPU.

The simplest way to run on multiple HPUs, on one or many machines, is using Distribution Strategies.
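As a quick illustration (not part of the original tutorial), the sketch below creates a Keras model under an HPUStrategy scope. It assumes the HPUStrategy class exposed by habana_frameworks.tensorflow.distribute is available in your SynapseAI release; the model itself is just a placeholder.

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()

# HPUStrategy follows the tf.distribute API; in multi-worker runs each worker
# process drives a single HPU. Variables created inside the scope are managed
# by the strategy.
strategy = HPUStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
  model.compile(optimizer="sgd", loss="mse")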

This tutorial is for users who have tried these approaches and found that they need fine-grained control of how TensorFlow uses the HPU. To learn how to debug performance issues for single and multi-HPU scenarios, see the Model Performance Optimization tutorial.

Setup

Ensure you have the latest SynapseAI release and a supported TensorFlow version installed.

import tensorflow as tf
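
A quick check (not part of the original tutorial) to confirm which TensorFlow version is installed:

# Print the installed TensorFlow version.
print(tf.__version__)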

Enable Habana

Let’s enable the Gaudi device by loading the Habana module:

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
2022-04-06 21:51:10.658540: W /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/helpers/op_registry_backdoor.cpp:92] Couldn't find definition of RemoteCall:GPU: to register on HPU
print("Num HPUs Available: ", len(tf.config.list_physical_devices('HPU')))
Num HPUs Available:  1

Overview

Habana TensorFlow supports running computations on both the CPU and the HPU. They are represented with string identifiers, for example (a quick way to list them is shown after this list):

  • "/device:CPU:0": The CPU of your machine.
  • "/HPU:0": Short-hand notation for the first HPU of your machine that is visible to TensorFlow.
  • "/job:localhost/replica:0/task:0/device:HPU:0": Fully qualified name of the first HPU of your machine that is visible to TensorFlow.

If a TensorFlow operation has both CPU and HPU implementations, by default, the HPU device is prioritized when the operation is assigned. For example, tf.matmul has both CPU and HPU kernels and on a system with devices CPU:0 and HPU:0, the HPU:0 device is selected to run tf.matmul unless you explicitly request to run it on another device.

If a TensorFlow operation has no corresponding HPU implementation, then the operation falls back to the CPU device.
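
One way to see where an individual operation actually ran is to inspect the .device attribute of its eager result (a quick check, not part of the original tutorial):

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)
# Prints the fully qualified name of the device the result was placed on;
# for an op without an HPU kernel this would be a CPU device.
print(y.device)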

Logging device placement

To find out which devices your operations and tensors are assigned to, put tf.debugging.set_log_device_placement(True) as the first statement of your program. Enabling device placement logging causes any Tensor allocations or operations to be printed.

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)
2022-04-06 21:51:10.730441: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:HPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:HPU:0
Executing op _function_MatMul_2918744514792547455 in device /job:localhost/replica:0/task:0/device:HPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
2022-04-06 21:51:13.237326: W /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/habana_device.cpp:201] HPU initialization done for library version 1.3.0_c61303b7_tf2.8.0
2022-04-06 21:51:13.259458: I tensorflow/core/common_runtime/placer.cc:114] a_0: (_Arg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259483: I tensorflow/core/common_runtime/placer.cc:114] b_1: (_Arg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259496: I tensorflow/core/common_runtime/placer.cc:114] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259502: I tensorflow/core/common_runtime/placer.cc:114] HabanaEagerMsg: (HabanaEagerMsg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.259507: I tensorflow/core/common_runtime/placer.cc:114] product_0_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:HPU:0

The above code prints an indication that the MatMul op was executed on HPU:0.

Manual device placement

If you would like a particular operation to run on a device of your choice instead of what’s automatically selected for you, you can use with tf.device to create a device context, and all the operations within that context will run on the same designated device.

tf.debugging.set_log_device_placement(True)

# Place tensors on the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run on the HPU
c = tf.matmul(a, b)
print(c)
Executing op _function_MatMul_7595306614168636033 in device /job:localhost/replica:0/task:0/device:HPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
2022-04-06 21:51:13.285843: I tensorflow/core/common_runtime/placer.cc:114] a_0: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
2022-04-06 21:51:13.285867: I tensorflow/core/common_runtime/placer.cc:114] b_1: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
2022-04-06 21:51:13.285875: I tensorflow/core/common_runtime/placer.cc:114] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.285881: I tensorflow/core/common_runtime/placer.cc:114] HabanaEagerMsg: (HabanaEagerMsg): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.285887: I tensorflow/core/common_runtime/placer.cc:114] product_0_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:HPU:0
2022-04-06 21:51:13.347018: W /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/habana_device_binding_iface.cpp:59] Found TensorFlow library with SHA256: 1f4e3d3c8f90c158c442f60b6b1fafd64cfb678fd7c4f954804e0ba91497c2a0

You will see that a and b are now assigned to CPU:0. Since a device was not explicitly specified for the MatMul operation, the TensorFlow runtime will choose one based on the operation and available devices (HPU:0 in this example) and automatically copy tensors between devices if required.
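
For comparison, if the MatMul is placed inside the same device context, the whole computation stays on the CPU (a small variation on the example above, not part of the original tutorial):

# Pin the inputs and the MatMul itself to the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
  c = tf.matmul(a, b)

print(c)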

Licensed under the Apache License, Version 2.0 (the “License”);

you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
