See the latest performance data below for Gaudi2 training, Gaudi2 inference, Gaudi training, and Gaudi inference. For information on models and containers that are currently integrated with Habana’s SynapseAI software suite, visit the Habana catalog. For more information on future model support, please refer to our SynapseAI roadmap page.
Gaudi2 MLPerf™ 2.1 Training Performance
These performance numbers have been generated with the latest version of SynapseAI and are improvements over the officially submitted numbers posted on the MLCommons website.
Framework Version | Model | # HPU | Precision | Time To Train |
---|---|---|---|---|
PyTorch 1.13.1 | MLPerf 2.1 - ResNet | 8 | bf16 | 17.23 min |
PyTorch 1.13.1 | MLPerf 2.1 - BERT | 8 | bf16 | 15.45 min |
TensorFlow 2.8.4 | MLPerf 2.1 - ResNet | 8 | bf16 | 16.40 min |
TensorFlow 2.8.4 | MLPerf 2.1 - BERT | 8 | bf16 | 14.90 min |
Gaudi2 Reference Models Training Performance
Framework Version | Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size
---|---|---|---|---|---|---|---
PyTorch 1.13.1 | ResNet50 LARS | 32 | bf16 | 172086.7 img/sec | 76.25 | 8.06 min | 256 |
PyTorch 1.13.1 | ResNet50 LARS | 16 | bf16 | 87880.53 img/sec | 76.4 | 12.35 min | 256 |
PyTorch 1.13.1 | ResNet50 LARS | 8 | bf16 | 44192.3 img/sec | 76.01 | 18.9 min | 256 |
TensorFlow 2.8.4 | ResNet50 Keras LARS | 32 | bf16 | 169206.97 img/sec | 76.38 | 7.3 min | 256 |
TensorFlow 2.11.0 | ResNet50 Keras LARS | 16 | bf16 | 84633.8 img/sec | 76.04 | 12 min | 256 |
TensorFlow 2.8.4 | ResNet50 Keras LARS | 8 | bf16 | 42921.54 img/sec | 76.42 | 25.1 min | 256 |
TensorFlow 2.8.4 | ResNet50 Keras LARS | 1 | bf16 | 5918.78 img/sec | | | 256
PyTorch 1.13.1 | BERT-Large Pre Training Phase 1 | 32 | bf16 | 30583.55 sent/sec | Loss: 1.51 | 280.1 min | 64 |
PyTorch 1.13.1 | BERT-Large Pre Training Phase 1 | 8 | bf16 | 8331.28 sent/sec | Loss: 1.49 | 944.5 min | 64 |
PyTorch 1.13.1 | BERT-Large Pre Training Phase 1 | 1 | bf16 | 1054.07 sent/sec | | | 64
PyTorch 1.13.1 | BERT-Large Pre Training Phase 2 | 32 | bf16 | 10074.11 sent/sec | Loss: 1.34 | 92.9 min | 16 |
PyTorch 1.13.1 | BERT-Large Pre Training Phase 2 | 8 | bf16 | 2658.16 sent/sec | Loss: 1.33 | 329.62 min | 16 |
PyTorch 1.13.1 | BERT-Large Pre Training Phase 2 | 1 | bf16 | 335.33 sent/sec | | | 16
PyTorch 1.13.1 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 1536 sent/sec | 93.17 | 4.75 min | 24 |
PyTorch 1.13.1 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 257.4 sent/sec | | | 24
TensorFlow 2.11.0 | BERT-Large Pre Training phase 1 | 32 | bf16 | 29565.87 sent/sec | 1.43 | 263.76 min | 64 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 1 | 16 | bf16 | 15124.9 sent/sec | 1.43 | 517.67 min | 64 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 1 | 8 | bf16 | 7731 sent/sec | 1.43 | 1009.7 min | 64 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 1 | 1 | bf16 | 983.06 sent/sec | | | 64
TensorFlow 2.11.0 | BERT-Large Pre Training phase 2 | 32 | bf16 | 9989.23 sent/sec | 1.28 | 88.65 min | 16 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 2 | 16 | bf16 | 5130.4 sent/sec | 1.29 | 169.6 min | 16 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 2 | 8 | bf16 | 2612.8 sent/sec | 1.23 | 329.3 min | 16 |
TensorFlow 2.11.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | 333.93 sent/sec | | | 16
TensorFlow 2.11.0 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 1800.58 sent/sec | 93.44 | 4.78 min | 24 |
TensorFlow 2.11.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 253.1 sent/sec | | | 24
DeepSpeed 0.7.0 | Megatron-DeepSpeed BLOOM 13B | 64 | bf16 | 18.2 samples/sec | | | 48
PyTorch 1.13.1 | Wav2Vec 2.0 | 8 | bf16 | 948.14 sent/sec | WER: 6.45 | 80 h | 384 |
PyTorch 1.13.1 | Wav2Vec 2.0 | 1 | bf16 | 128.85 sent/sec | | | 384
PyTorch 1.13.1 | ResNext101 | 8 | bf16 | 20983.6 img/sec | 77.87 | 112.36 min | 256 |
PyTorch 1.13.1 | ResNext101 | 1 | bf16 | 2683.43 img/sec | | | 256
TensorFlow 2.11.0 | ResNext101 | 8 | bf16 | 18718.73 img/sec | 79.1 | 109.55 min | 256 |
TensorFlow 2.11.0 | ResNext101 | 1 | bf16 | 2469.5 img/sec | | | 256
PyTorch 1.13.1 | SSD | 8 | bf16 | 15389.53 img/sec | 22.9 | 9.58 min | 128 |
PyTorch 1.13.1 | SSD | 1 | bf16 | 2023.75 img/sec | | | 128
TensorFlow 2.11.0 | SSD | 8 | bf16 | 8617.16 img/sec | | | 120
TensorFlow 2.11.0 | SSD | 1 | bf16 | 2311.04 img/sec | 23.56 | 15.65 min | 240 |
PyTorch 1.13.1 | Transformer | 8 | bf16 | 856260 tokens/sec | 28.2 | 287.5 min | 8192 |
PyTorch 1.13.1 | Transformer | 1 | bf16 | 121269.4 tokens/sec | | | 8192
TensorFlow 2.11.0 | Transformer | 8 | bf16 | 993835.77 tokens/sec | 26.8 | 166.56 min | 16384 |
TensorFlow 2.11.0 | Transformer | 1 | bf16 | 128746.48 tokens/sec | | | 16384
TensorFlow 2.11.0 | MaskRCNN | 8 | bf16 | 289.64 img/sec | 34.04 | 99.2 min | 12 |
TensorFlow 2.11.0 | MaskRCNN | 1 | bf16 | 87.57 img/sec | | | 12
Lightning 1.8.6 | Unet2D | 8 | bf16 | 14605.54 img/sec | 72.56 | 15.45 min | 64 |
Lightning 1.8.6 | Unet2D | 1 | bf16 | 1928.38 img/sec | | | 64
Lightning 1.8.6 | Unet3D | 8 | bf16 | 195.98 img/sec | 74.26 | 23.73 min | 2 |
Lightning 1.8.6 | Unet3D | 1 | bf16 | 25.07 img/sec | | | 2
TensorFlow 2.11.0 | UNet2D | 8 | bf16 | 1324.36 img/sec | 88.86 | 1.73 min | 8 |
TensorFlow 2.11.0 | UNet2D | 1 | bf16 | 190.37 img/sec | | | 8
TensorFlow 2.11.0 | UNet3D | 8 | bf16 | 164.91 img/sec | 88.46 | 4.78 min | 2 |
TensorFlow 2.11.0 | UNet3D | 1 | bf16 | 22.22 img/sec | | | 2
Gaudi2 Reference Models Inference Performance
Framework Version | Model | # HPU | Precision | Throughput | Latency | Batch Size |
---|---|---|---|---|---|---|
PyTorch 1.13.1 | Stable Diffusion v1.5 | 1 | bf16 | 0.728 img/sec | 4.12 sec | 3 |
DeepSpeed 0.7.0 | HuggingFace Bloom 176B | 8 | bf16 | 27.667 tokens/sec | 0.036 sec | 1 |
PyTorch 1.13.1 | Wav2Vec2ForCTC | 1 | bf16 | 5.8M tokens/sec | | 1
PyTorch 1.13.1 | BERT Large | 1 | bf16 | 392.36 tokens/sec | 0.063 sec | 24 |
PyTorch 1.13.1 | ResNet50 | 1 | bf16 | 16056.89 img/sec | 0.016 sec | 256 |
PyTorch 1.13.1 | ResNext101 | 1 | bf16 | 8724.81 img/sec | 0.029 sec | 256 |
Gaudi Reference Models Training Performance
Framework Version | Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size
---|---|---|---|---|---|---|---
PyTorch 1.13.1 | ResNet50 SGD | 16 | bf16 | 22715.04 img/sec | 76.04 | 98.71 min | 256 |
PyTorch 1.13.1 | ResNet50 SGD | 8 | bf16 | 12660.96 img/sec | 76.13 | 183.66 min | 256 |
PyTorch 1.13.1 | ResNet50 SGD | 1 | bf16 | 1671.73 img/sec | | | 256
TensorFlow 2.11.0 | ResNet50 Keras LARS | 32 | bf16 | 50043.81 img/sec | 76.03 | 21.85 min | 256 |
TensorFlow 2.11.0 | ResNet50 Keras LARS | 16 | bf16 | 25050.74 img/sec | 75.53 | 39.38 min | 256 |
TensorFlow 2.11.0 | ResNet50 Keras LARS | 8 | bf16 | 12571.16 img/sec | 75.92 | 70.21 min | 256 |
TensorFlow 2.11.0 | ResNet50 Keras LARS | 1 | bf16 | 1629.12 img/sec | | | 256
PyTorch 1.13.1 | BERT-Large Pre Training combine | 32 | bf16 | 4849.73 sent/sec | | 1809.29 min | 64
PyTorch 1.13.1 | BERT-Large Pre Training combine | 8 | bf16 | 1239.87 sent/sec | | | 64
PyTorch 1.13.1 | BERT-Large Pre Training combine | 1 | bf16 | 155.58 sent/sec | | | 64
PyTorch 1.13.1 | BERT-L Pre Training Phase 1 | 32 | bf16 | 5823.45 sent/sec | Loss: 1.488 | 1353.26 min | 64 |
PyTorch 1.13.1 | BERT-L Pre Training Phase 1 | 8 | bf16 | 1491.68 sent/sec | | | 64
PyTorch 1.13.1 | BERT-L Pre Training Phase 1 | 1 | bf16 | 187.31 sent/sec | | | 64
PyTorch 1.13.1 | BERT-L Pre Training Phase 2 | 32 | bf16 | 1916.94 sent/sec | Loss: 1.32 | 456.03 min | 8 |
PyTorch 1.13.1 | BERT-L Pre Training Phase 2 | 8 | bf16 | 487.26 sent/sec | | | 8
PyTorch 1.13.1 | BERT-L Pre Training Phase 2 | 1 | bf16 | 61.01 sent/sec | | | 8
PyTorch 1.13.1 | BERT-L SQUAD Fine Tuning (SQUAD) | 8 | bf16 | 365.71 sent/sec | 93.16 | 11.43 min | 24 |
PyTorch 1.13.1 | BERT-L SQUAD Fine Tuning (SQUAD) | 1 | bf16 | 50.4 sent/sec | | | 24
DeepSpeed 0.7.0 * | DeepSpeed BERT 1.5B LANS | 128 | bf16 | 3821.69 sent/sec | | | 16
DeepSpeed 0.7.0 * | DeepSpeed BERT 1.5B LANS | 64 | bf16 | 2187.52 sent/sec | | | 16
DeepSpeed 0.7.0 * | DeepSpeed BERT 1.5B LANS | 32 | bf16 | 1164.8 sent/sec | | | 16
DeepSpeed 0.7.0 * | DeepSpeed BERT 1.5B LANS | 16 | bf16 | 603.2 sent/sec | | | 16
DeepSpeed 0.7.0 * | DeepSpeed BERT 1.5B LANS | 8 | bf16 | 306.4 sent/sec | | | 16
DeepSpeed 0.7.0 * | DeepSpeed BERT 5B LANS | 128 | bf16 | 646 sent/sec | | | 24
TensorFlow 2.8.4 | BERT-Large Pre Training combine | 32 | bf16 | 5380.62 sent/sec | | 1608.75 min | 64
TensorFlow 2.8.4 | BERT-Large Pre Training combine | 8 | bf16 | 1367.62 sent/sec | | | 64
TensorFlow 2.8.4 | BERT-Large Pre Training combine | 1 | bf16 | 172.53 sent/sec | | | 64
TensorFlow 2.8.4 | BERT-Large Pre Training phase 1 | 32 | bf16 | 6468.36 sent/sec | Loss: 1.448 | 1200.83 min | 64 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 1 | 8 | bf16 | 1647.51 sent/sec | | | 64
TensorFlow 2.8.4 | BERT-Large Pre Training phase 1 | 1 | bf16 | 207.29 sent/sec | | | 64
TensorFlow 2.8.4 | BERT-Large Pre Training phase 2 | 32 | bf16 | 2119.45 sent/sec | Loss: 1.288 | 407.91 min | 8 |
TensorFlow 2.8.4 | BERT-Large Pre Training phase 2 | 8 | bf16 | 535.4 sent/sec | | | 8
TensorFlow 2.8.4 | BERT-Large Pre Training phase 2 | 1 | bf16 | 68.08 sent/sec | | | 8
TensorFlow 2.8.4 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 412.34 sent/sec | 93.21 | 12.53 min | 24 |
TensorFlow 2.8.4 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 54.61 sent/sec | | | 24
PyTorch 1.13.1 | ALBERT-Large Fine Tuning ** | 8 | bf16 | 431.15 sent/sec | 92.47 | 9.2 min | 32 |
PyTorch 1.13.1 | ALBERT-XXL Fine Tuning ** | 8 | bf16 | 73.14 sent/sec | 95.06 | 42.95 min | 12 |
PyTorch 1.13.1 | BART Fine Tuning | 8 | bf16 | 1195.01 sent/sec | | 8.61 min | 32
TensorFlow 2.11.0 | Densenet 121 TFD | 8 | bf16 | 6080.1 img/sec | 74.78 | 333.4 min | 128 |
PyTorch 1.13.1 | DINO | 8 | bf16 | 1275.82 exmpl/sec | | | 64
PyTorch 1.13.1 | DistilBERT ** | 8 | bf16 | 1312 sent/sec | 85.53 | 4.08 min | 8 |
TensorFlow 2.8.4 | EfficientDet | 8 | fp32 | 264.03 img/sec | 33.53 | 3918.08 min | 8 |
PyTorch 1.13.1 | GoogLeNet | 8 | bf16 | 15463.92 img/sec | 72.65 | 175.38 min | 256 |
TensorFlow 2.8.4 | MaskRCNN | 8 | bf16 | 133.17 img/sec | 34.06 | 186.95 min | 4 |
PyTorch 1.13.1 | MobileNetV2 | 8 | bf16 | 12081.66 img/sec | 71.23 | 502.28 min | 256 |
PyTorch 1.13.1 | ResNet152 | 8 | bf16 | 4849.91 img/sec | 78.29 | 398.7 min | 128 |
PyTorch 1.13.1 | ResNext101 | 8 | bf16 | 5701.93 img/sec | 78.08 | 414.5 min | 128 |
TensorFlow 2.8.4 | Resnext-101 | 8 | bf16 | 5117.36 img/sec | 79.14 | 382.6 min | 128 |
PyTorch 1.13.1 | RoBERTa Base ** | 8 | bf16 | 853.33 sent/sec | 91.85 | 5.1 min | 12 |
PyTorch 1.13.1 | RoBERTa Large ** | 8 | bf16 | 301.17 sent/sec | 94.62 | 12.9 min | 12 |
PyTorch 1.13.1 | SSD | 8 | bf16 | 3802.88 img/sec | 23.1 | 40.08 min | 128 |
TensorFlow 2.8.4 | SSD | 8 | bf16 | 4032.67 img/sec | 23.46 | 32.21 min | 128 |
PyTorch 1.13.1 | Swin Transformer ** | 8 | bf16 | 558.16 img/sec | 98.81 | 11.18 min | 32 |
PyTorch 1.13.1 | Transformer | 8 | bf16 | 185553.6 tokens/sec | 28.1 | 1013.15 min | 4096 |
TensorFlow 2.8.4 | Transformer | 8 | bf16 | 187617.78 tokens/sec | 26.4 | 879.15 min | 4096 |
Lightning 1.8.6 | Unet2D | 8 | bf16 | 4606.16 img/sec | 73.47 | 53.3 min | 64 |
TensorFlow 2.8.4 | UNet2D | 8 | bf16 | 382.96 img/sec | 88.13 | 3.6 min | 8 |
Lightning 1.8.6 | Unet3D | 8 | bf16 | 59.22 img/sec | 74.41 | 58.7 min | 2 |
TensorFlow 2.8.4 | UNet3D | 8 | bf16 | 51.59 img/sec | 86.92 | 11.96 min | 2 |
PyTorch 1.13.1 | Vision Transformer ** | 8 | bf16 | 1270.51 img/sec | 97.41 | 5.65 min | 64 |
TensorFlow 2.8.4 | Vision Transformer | 8 | bf16 | 604.7 img/sec | 84.34 | 360.9 min | 32 |
PyTorch 1.13.1 | Wav2Vec 2.0 | 8 | bf16 | 256 sent/sec | | | 384
PyTorch 1.13.1 | YOLOX | 8 | bf16 | 285.68 img/sec | 39.47 | 1992.21 min | 16 |
TensorFlow 2.11.0 | ResNet50 Keras LARS tf.distribute | 8 | bf16 | 12306.91 img/sec | 76.08 | 70.81 min | 256 |
TensorFlow 2.8.4 | ResNet50 Keras LARS Host NIC (HVD and Libfabric) | 16 | bf16 | 24628.69 img/sec | | | 256
TensorFlow 2.8.4 | ResNet50 Keras LARS Host NIC (tf.distribute and Libfabric) | 16 | bf16 | 24580.75 img/sec | | | 256
PyTorch 1.13.1 | ResNet50 Host NIC (libfabric) | 16 | bf16 | 23354.4 img/sec | | | 256
PyTorch 1.13.1 | Stable Diffusion | 8 | bf16 | 72.63 img/sec | | | 8
Gaudi Reference Models Inference Performance
Framework Version | Model | # HPU | Precision | Throughput | Latency | Batch Size |
---|---|---|---|---|---|---|
PyTorch 1.13.1 | Stable Diffusion V1-5 | 1 | bf16 | 0.26 img/sec | 11.52 sec | 3 |
PyTorch 1.13.1 | Wav2Vec-B | 1 | bf16 | 4M tokens/sec | | 3
PyTorch 1.13.1 | Bloom 7B | 1 | bf16 | 38.18 sample/sec | 0.026 sec | 1 |
PyTorch 1.13.1 | BERT-Large | 1 | bf16 | 118.41 tokens/sec | 0.21 sec | 24 |
* Performance measurements taken on an Amazon EC2 DL1 instance
** These models are available on the Hugging Face Habana hub. Performance measurements are based on optimum-habana version 1.3
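For orientation, here is a hedged sketch of how one of the Hugging Face models marked ** might be fine-tuned on a Gaudi HPU with optimum-habana. The GaudiConfig, GaudiTrainer, and GaudiTrainingArguments classes are optimum-habana's public API; the model, dataset, and hyperparameters below are illustrative placeholders, not the exact benchmark recipe used for the numbers above.

```python
# Hedged sketch: Hugging Face fine-tuning on Gaudi via optimum-habana.
# Model, dataset, and hyperparameters are placeholders for illustration only.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

model_name = "distilbert-base-uncased"  # stand-in for one of the ** rows
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Mixed-precision (bf16) and fused-op settings come from the Gaudi config
# published on the Hugging Face Habana hub.
gaudi_config = GaudiConfig.from_pretrained("Habana/distilbert-base-uncased")

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = GaudiTrainingArguments(
    output_dir="./distilbert-sst2-hpu",
    use_habana=True,        # run on the HPU device
    use_lazy_mode=True,     # SynapseAI lazy-mode graph execution
    per_device_train_batch_size=8,
    num_train_epochs=1,
    evaluation_strategy="epoch",
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```

Multi-card runs such as the 8-HPU rows above are typically launched through optimum-habana's distributed launcher (gaudi_spawn.py) rather than by running the script directly.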
System Configuration:
Gaudi® Platform
System: HLS-1 with eight Habana Gaudi HL-205 Mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of System Memory
Gaudi®2 Platform
System: HLS-Gaudi2 with eight Habana Gaudi2 HL-225H Mezzanine cards, two Intel® Xeon® Platinum 8380 CPUs @ 2.30GHz, and 1TB of System Memory
Amazon EC2 DL1 Instance
System: Custom Server with eight Habana Gaudi HL-205 Mezzanine cards, two Intel® Xeon® Platinum 8275CL CPUs @ 3.00GHz, and 756GB of System Memory
Common Software
Ubuntu 20.04, SynapseAI Software version 1.8.0-690
TensorFlow: Models run with TensorFlow v2.11.0 use this Docker image; models run with v2.8.4 use this Docker image
PyTorch: Models run with PyTorch v1.13.1 use this Docker image
Environment: These workloads run in the Docker images listed above, directly on the Host OS
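As a minimal sanity check before launching any of the workloads above, the following sketch (assuming the SynapseAI 1.8.0 PyTorch Docker image) verifies that the container can see the Gaudi device through PyTorch's "hpu" backend:

```python
# Minimal sketch: confirm the container exposes the Gaudi ("hpu") device.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch

device = torch.device("hpu")
x = torch.randn(64, 64, dtype=torch.bfloat16).to(device)
y = (x @ x).sum()
htcore.mark_step()  # flush the lazy-mode graph so the matmul actually executes on the HPU
print("HPU result:", y.to("cpu").item())
```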
Performance varies by use, configuration and other factors. Please refer to the Model-References GitHub page for each model’s support and validation coverage. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.