See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on models and containers that are currently integrated with Habana's SynapseAI software suite. For more information on future model support, please refer to our SynapseAI roadmap page.
Gaudi2 Reference Models Performance
Framework Version | Model | # HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size
---|---|---|---|---|---|---|---
PyTorch 1.10.2 | ResNet50 SGD | 1 | bf16 | 351 min | 76.03 | 5838 img/sec | 256 |
TensorFlow 2.8.0 | ResNet50 Keras LARS | 1 | bf16 | 161 min | 76.07 | 5506 img/sec | 256 |
TensorFlow 2.8.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 20.2 min | 93.3 | 216 sent/sec | 24 |
TensorFlow 2.8.0 | BERT-Large Pre Training phase 1 | 1 | bf16 | | | 840 sent/sec | 64
TensorFlow 2.8.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | | | 269 sent/sec | 16
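For orientation, the following is a minimal sketch of how a PyTorch workload such as the ResNet50 row above targets the HPU device. It is not the benchmark script itself, and the exact import sequence varies across SynapseAI releases; the `habana_frameworks.torch.core` import and `mark_step()` call are assumptions based on Habana's published PyTorch bridge.

```python
# Minimal sketch: one training step on a Gaudi HPU from PyTorch.
# Assumes the habana_frameworks PyTorch bridge shipped in the SynapseAI
# Docker image; illustrative only, not the actual benchmark code.
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge (assumed)

device = torch.device("hpu")

model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(256, 128, device=device)         # batch size 256, as in the table
y = torch.randint(0, 10, (256,), device=device)

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
htcore.mark_step()  # in lazy mode, flushes the accumulated graph for execution
```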
Gaudi TensorFlow Reference Models Performance
Framework Version | Model | # HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size
---|---|---|---|---|---|---|---
2.8.0 | ResNet50 Keras LARS | 32 | bf16 | 22.91 min | 75.94 | 48565.27 img/sec | 256 |
2.8.0 | ResNet50 Keras LARS | 16 | bf16 | 41.46 min | 75.63 | 24356.33 img/sec | 256 |
2.8.0 | ResNet50 Keras LARS | 8 | bf16 | 69.7 min | 76.14 | 12864.72 img/sec | 256 |
2.8.0 | ResNet50 Keras LARS | 1 | bf16 | 513.15 min | 76.01 | 1697.55 img/sec | 256 |
2.8.0 | BERT-Large Pre Training combine | 32 | bf16 | 1601.16 min | | 5462 sent/sec | 64
2.8.0 | BERT-Large Pre Training combine | 8 | bf16 | | | 1374 sent/sec | 64
2.8.0 | BERT-Large Pre Training combine | 1 | bf16 | | | 173 sent/sec | 64
2.8.0 | BERT-Large Pre Training phase 1 | 32 | bf16 | 1189.75 min | Loss: 1.41 | 6595.46 sent/sec | 64 |
2.8.0 | BERT-Large Pre Training phase 1 | 8 | bf16 | | | 1660.21 sent/sec | 64
2.8.0 | BERT-Large Pre Training phase 1 | 1 | bf16 | | | 208.75 sent/sec | 64
2.8.0 | BERT-Large Pre Training phase 2 | 32 | bf16 | 411.41 min | Loss: 1.273 | 2123.97 sent/sec | 8 |
2.8.0 | BERT-Large Pre Training phase 2 | 8 | bf16 | | | 532.51 sent/sec | 8
2.8.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | | | 67.45 sent/sec | 8
2.8.0 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 15.1 min | 93.29 | 402.94 sent/sec | 24 |
2.8.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 67.05 min | 93.48 | 53.74 sent/sec | 24 |
2.8.0 | SSD | 8 | bf16 | 38.28 min | 23.27 | 3886.09 img/sec | 128 |
2.8.0 | SSD | 1 | bf16 | 249.2 min | 23.59 | 513.1 img/sec | 128 |
2.8.0 | ResNeXt-101 | 8 | bf16 | 382.13 min | 79.15 | 4956.17 img/sec | 128
2.8.0 | ResNeXt-101 | 1 | bf16 | 2700.1 min | 79.1 | 712.4 img/sec | 128
2.8.0 | UNet2D | 8 | bf16 | 4.08 min | 88.28 | 400.37 img/sec | 8 |
2.8.0 | UNet2D | 1 | bf16 | 17.76 min | 88.58 | 51.6 img/sec | 8 |
2.8.0 | UNet3D | 8 | bf16 | 13.2 min | 89.32 | 47.34 img/sec | 2 |
2.8.0 | UNet3D | 1 | bf16 | 78.7 min | 89.65 | 6.89 img/sec | 2 |
2.8.0 | Transformer | 8 | bf16 | 1087.55 min | 26.6 | 158587.7 sent/sec | 4096
2.8.0 | Transformer | 1 | bf16 | 1005.8 min | 18.5 | 22757.98 sent/sec | 4096
2.8.0 | MaskRCNN | 8 | bf16 | 205.06 min | 34.06 | 122.43 img/sec | 4 |
2.8.0 | MaskRCNN | 1 | bf16 | 1371.68 min | 33.97 | 17.6 img/sec | 4 |
2.8.0 | Vision Transformer | 8 | bf16 | 421.21 min | 84.34 | 509.9 img/sec | 32 |
2.8.0 | RetinaNet | 8 | bf16 | 426.00 min | 27.77 | 98.7 img/sec | 64 |
2.8.0 | DenseNet-121 TFD | 8 | bf16 | 380.36 min | 74.24 | 5361 img/sec | 1024
2.8.0 | T5 Base | 1 | bf16 | 18.3 min | 93.86 | 94.62 img/sec | 16 |
2.8.0 | VGG SegNet | 1 | bf16 | 7.45 min | 89.31 | 112.76 img/sec | 16 |
2.8.0 | EfficientDet | 8 | fp32 | 5334.05 min | 33.39 | 180.03 img/sec | 8 |
2.8.0 | CycleGAN | 1 | bf16 | 242.5 min | | 15.48 img/sec | 2
2.8.0 | ALBERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 95.68 min | 91.02 | 53.08 sent/sec | 32
2.8.0 | ALBERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 19.51 min | 90.65 | 420.26 sent/sec | 32
2.8.0 | ResNet50 Keras LARS tf.distribute | 8 | bf16 | 71.3 min | 75.9 | 12767.98 img/sec | 256 |
2.7.1 | ResNet50 Keras LARS Host NIC (HVD and Libfabric) | 16 | bf16 | | | 24021.51 img/sec | 256
2.7.1 | ResNet50 Keras LARS Host NIC (tf.distribute and Libfabric) | 16 | bf16 | | | 23507.43 img/sec | 256
2.8.0 | WideAndDeep | 1 | bf16 | 32.73 min | 65.59 | 736669.78 smpl/sec | 131072
2.8.0 | ELECTRA Fine Tuning | 1 | bf16 | | | 124.52 img/sec | 16
2.8.0 | DistilBERT | 8 | bf16 | 2.05 min | 85.64 | 2387.37 sent/sec | 32 |
2.8.0 | UNet Industrial | 8 | bf16 | 2.18 min | 96.37 | 639.34 img/sec | 2
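As a quick arithmetic check of the multi-card rows, dividing the N-card throughput by N times the single-card throughput gives the scaling efficiency implied by the ResNet50 Keras LARS entries above:

```python
# Scaling efficiency implied by the ResNet50 Keras LARS rows above:
# efficiency = throughput(N HPUs) / (N * throughput(1 HPU))
single_card = 1697.55  # img/sec on 1 HPU

for hpus, throughput in [(8, 12864.72), (16, 24356.33), (32, 48565.27)]:
    efficiency = throughput / (hpus * single_card)
    print(f"{hpus:>2} HPUs: {efficiency:.1%}")  # ~94.7%, ~89.7%, ~89.4%
```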
Gaudi PyTorch Reference Models Performance
System Configuration:
Gaudi® Platform
HPU: Habana Gaudi HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70 GHz, and 756 GB of system memory
Gaudi®2 Platform
HPU: Habana Gaudi2 HL-225 Mezzanine cards
System: HLS-2 with eight HL-225 HPUs, two Intel® Xeon® Platinum 8380 CPUs @ 2.30 GHz, and 1 TB of system memory
Common Software
Ubuntu 20.04, SynapseAI software version 1.4.1-11
TensorFlow: Models run with TensorFlow v2.8.0 use this Docker image; those run with v2.7.1 use this Docker image
PyTorch: Models run with PyTorch v1.10.2 use this Docker image
Environment: These workloads are run in the Docker images listed above, executing directly on the host OS; a minimal sketch of enabling the HPU inside one of the TensorFlow containers follows below
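Inside the TensorFlow containers, the HPU is registered by loading Habana's TensorFlow module before building the model. The sketch below assumes the `habana_frameworks` package shipped in the image and Habana's documented `load_habana_module` entry point; it is illustrative, not the benchmark scripts themselves.

```python
# Minimal sketch: registering the Gaudi HPU with TensorFlow inside the
# SynapseAI container, then running ordinary Keras code on it.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module  # assumed from the image

load_habana_module()  # registers the HPU device with the TensorFlow runtime

# From here, standard Keras code is placed on the HPU automatically.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```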
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.