See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on models and containers that are currently integrated with Habana's SynapseAI software suite. For information on future model support, please refer to our SynapseAI roadmap page.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.7.0 | ResNet50 Keras LARS | 32 | bf16 | 0h 21m | 75.92 | 49060 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS | 16 | bf16 | 0h 40m | 75.69 | 24598 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS | 8 | bf16 | 1h 9m | 76.06 | 12880 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS | 1 | bf16 | 8h 36m | 76.03 | 1695.58 images/sec | 256 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 1 | 32 | bf16 | | | 6586.64 sentences/sec | 64 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 1 | 8 | bf16 | | | 1669.71 sentences/sec | 64 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 1 | 1 | bf16 | | | 210.36 sentences/sec | 64 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 2 | 32 | bf16 | 26h 39m | | 2124.07 sentences/sec | 8 | Time to Train of 26h 39m covers Phase 1 and Phase 2 combined |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 2 | 8 | bf16 | | | 538.94 sentences/sec | 8 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | | | 67.98 sentences/sec | 8 | |
TensorFlow 2.7.0 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 0h 14m | 93.03 | 392.76 sentences/sec | 24 | |
TensorFlow 2.7.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 1h 7m | 93.47 | 53.34 sentences/sec | 24 | |
TensorFlow 2.7.0 | SSD ResNet34 | 8 | bf16 | 0h 32m | 22.19 | 3620.58 images/sec | 128 | |
TensorFlow 2.7.0 | SSD ResNet34 | 1 | bf16 | 4h 47m | 23.68 | 502.29 images/sec | 128 | |
TensorFlow 2.7.0 | ResNext101 | 8 | bf16 | 6h 41m | 79.21 | 5002 images/sec | 128 | |
TensorFlow 2.7.0 | ResNext101 | 1 | bf16 | 46h 38m | 79.26 | 689.77 images/sec | 128 | |
TensorFlow 2.7.0 | Unet2D | 8 | bf16 | 0h 3m | 88.2 | 392.07 images/sec | 8 | |
TensorFlow 2.7.0 | Unet2D | 1 | bf16 | 0h 18m | 88.83 | 51.72 images/sec | 8 | |
TensorFlow 2.7.0 | Unet3D | 8 | bf16 | 0h 14m | 88.22 | 42.75 images/sec | 2 | |
TensorFlow 2.7.0 | Unet3D | 1 | bf16 | 1h 20m | 89.7 | 6.7 images/sec | 2 | |
TensorFlow 2.7.0 | Transformer | 8 | bf16 | 19h 31m | 26.6 | 155888 sentences/sec | 4096 | |
TensorFlow 2.7.0 | Transformer | 1 | bf16 | 17h 48m | 23.7 | 22245 sentences/sec | 4096 | |
TensorFlow 2.7.0 | Mask R-CNN | 8 | bf16 | 4h 21m | 34.14 | 107.6 images/sec | 4 | |
TensorFlow 2.7.0 | Mask R-CNN | 1 | bf16 | 25h 28m | 33.98 | 15.72 images/sec | 4 | |
TensorFlow 2.7.0 | VisionTransformer | 8 | bf16 | 7h 37m | 84.44 | 442.47 images/sec | 32 | |
TensorFlow 2.7.0 | RetinaNet | 8 | bf16 | 7h 43m | 38.71 | 79.94 images/sec | 64 | |
TensorFlow 2.6.2 | Densenet 121 tf.distribute | 8 | bf16 | 6h 33m | 74.69 | 6575 images/sec | 1024 | |
TensorFlow 2.7.0 | T5 Base | 1 | bf16 | 0h 20m | 94.32 | 96.74 sentences/sec | 16 | |
TensorFlow 2.7.0 | VGG SegNet | 1 | bf16 | 0h 10m | 88.63 | 109.4 images/sec | 16 | |
TensorFlow 2.7.0 | MobileNet V2 | 1 | bf16 | | | 1119.96 images/sec | 96 | |
TensorFlow 2.7.0 | EfficientDet | 8 | fp32 | | | 152.46 images/sec | 8 | |
TensorFlow 2.7.0 | CycleGAN | 1 | bf16 | 4h 9m | 15.78 | | 2 | |
TensorFlow 2.7.0 | Albert-Large Fine Tuning (SQUAD) | 8 | bf16 | 0h 25m | 90.68 | 438.36 sentences/sec | 32 | |
TensorFlow 2.7.0 | Albert-Large Fine Tuning (SQUAD) | 1 | bf16 | 1h 28m | 91.03 | 55.06 sentences/sec | 32 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS tf.distribute | 8 | bf16 | 1h 10m | 76.05 | 12799 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras SGD | 8 | bf16 | 2h 39m | 76.19 | 12612 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS Host NIC | 16 | bf16 | | | 21474 images/sec | 256 | Using Horovod and libfabric with HCCL_OVER_OFI=1 |
TensorFlow 2.7.0 | ResNet50 Keras LARS Host NIC | 16 | bf16 | | | 19841 images/sec | 256 | Using tf.distribute and libfabric with HCCL_OVER_OFI=1 |
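The Host NIC rows above pair Horovod or tf.distribute with libfabric by setting HCCL_OVER_OFI=1. As a rough illustration of how that fits into a training script, here is a minimal Python sketch; it assumes the habana_frameworks and habana-horovod packages from the listed TensorFlow Docker image, and import paths may differ across SynapseAI releases.

```python
# Minimal sketch of a Horovod-based scale-out setup on Gaudi, matching the
# "ResNet50 Keras LARS Host NIC" rows above. Assumes the habana_frameworks
# and habana-horovod packages from the listed Docker image.
import os

# Route HCCL collectives over libfabric/OFI so scale-out traffic uses the
# host NICs, as noted in the table comments. In practice this is usually
# exported by the launcher (e.g. mpirun -x HCCL_OVER_OFI=1) rather than
# set inside the script.
os.environ.setdefault("HCCL_OVER_OFI", "1")

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()  # registers the HPU device with TensorFlow

import horovod.tensorflow.keras as hvd

hvd.init()
print(f"worker {hvd.rank()} of {hvd.size()}")
# The per-worker model is then built and trained as usual, with the optimizer
# wrapped in hvd.DistributedOptimizer(...) so gradients are all-reduced.
```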
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.10.0 | ResNet50 | 32 | bf16 | 0h 52m | 74.66 | 40633 images/sec | 256 | |
PyTorch 1.10.0 | ResNet50 | 16 | bf16 | 1h 28m | 75.91 | 25022 images/sec | 256 | |
PyTorch 1.10.0 | ResNet50 | 8 | bf16 | 2h 47m | 76.1 | 12833 images/sec | 256 | |
PyTorch 1.10.0 | ResNet50 | 1 | bf16 | 21h 56m | 75.88 | 1722 images/sec | 256 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 1 | 32 | bf16 | 27h 46m | | 4920 sentences/sec | 64 | ph1 final_loss: 1.494, ph2 final_loss: 1.349 |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 1 | 8 | bf16 | | | 1282 sentences/sec | 64 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 1 | 1 | bf16 | | | 154 sentences/sec | 64 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 2 | 32 | bf16 | 14h 18m | | 970 sentences/sec | 8 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 2 | 8 | bf16 | | | 256 sentences/sec | 8 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 2 | 1 | bf16 | | | 32 sentences/sec | 8 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Fine Tuning | 8 | bf16 | 0h 10m | 93.17 | 341 sentences/sec | 24 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Fine Tuning | 1 | bf16 | 1h 12m | 92.94 | 49 sentences/sec | 24 | |
PyTorch 1.10.0 | ResNext101 | 8 | bf16 | 6h 39m | 78.04 | 5768 images/sec | 128 | |
PyTorch 1.10.0 | ResNext101 | 1 | bf16 | 48h 47m | 78.13 | 777.53 images/sec | 128 | |
PyTorch 1.10.0 | ResNet152 | 8 | bf16 | 7h 53m | 78.03 | 5191 images/sec | 128 | |
PyTorch 1.10.0 | ResNet152 | 1 | bf16 | 46h 50m | 77.61 | 729 images/sec | 128 | |
PyTorch 1.10.0 | Unet2D | 8 | bf16 | 1h 8m | 72.82 | 4624.22 images/sec | 64 | |
PyTorch 1.10.0 | Unet2D | 1 | bf16 | 9h 24m | 72.84 | 609.44 images/sec | 64 | |
PyTorch 1.10.0 | Unet3D | 8 | bf16 | 1h 27m | 74.13 | 59.77 images/sec | 2 | |
PyTorch 1.10.0 | Unet3D | 1 | bf16 | 13h 34m | 74.3 | 7.56 images/sec | 2 | |
PyTorch 1.10.0 | SSD | 8 | bf16 | 1h 25m | 22.93 | 1664 images/sec | 32 | |
PyTorch 1.10.0 | SSD | 1 | bf16 | 4h 12m | 23.07 | 449 images/sec | 32 | |
PyTorch 1.10.0 | Transformer | 8 | bf16 | 20h 49m | 28.1 | 150407 sentences/sec | 4096 | |
PyTorch 1.10.0 | Transformer | 1 | bf16 | 22h 22m | | 21525.8 sentences/sec | 4096 | |
PyTorch 1.10.0 | GoogLeNet | 8 | bf16 | 4h 18m | 72.44 | 15056 images/sec | 256 | |
PyTorch 1.10.0 | GoogLeNet | 1 | bf16 | 19h 9m | 72.31 | 1851 images/sec | 256 | |
PyTorch 1.10.0 | DistilBERT | 8 | bf16 | 0h 10m | 85.49 | 770 sentences/sec | 8 | |
PyTorch 1.10.0 | DistilBERT | 1 | bf16 | 0h 41m | 85.47 | 149 sentences/sec | 8 | |
PyTorch 1.10.0 | RoBERTa Large | 8 | bf16 | 0h 11m | 94.53 | 284 sentences/sec | 12 | |
PyTorch 1.10.0 | RoBERTa Large | 1 | bf16 | 1h 30m | 94.27 | 42.6 sentences/sec | 12 | |
PyTorch 1.10.0 | RoBERTa Base | 8 | bf16 | 0h 4m | 91.85 | 731 sentences/sec | 12 | |
PyTorch 1.10.0 | RoBERTa Base | 1 | bf16 | 0h 34m | 92.39 | 128 sentences/sec | 12 | |
PyTorch 1.10.0 | ALBERT-XXL Fine Tuning | 8 | bf16 | 0h 43m | 94.91 | 74 sentences/sec | 12 | |
PyTorch 1.10.0 | ALBERT-XXL Fine Tuning | 1 | bf16 | 5h 29m | 94.79 | 9 sentences/sec | 12 | |
PyTorch 1.10.0 | ALBERT-Large Fine Tuning | 8 | bf16 | 0h 10m | 91.9 | 362 sentences/sec | 32 | |
PyTorch 1.10.0 | ALBERT-Large Fine Tuning | 1 | bf16 | 1h 7m | 93.25 | 44 sentences/sec | 32 | |
PyTorch 1.10.0 | MobileNetV2 | 1 | bf16 | | | 1515 images/sec | 256 | |
PyTorch 1.10.0 | BART Fine Tuning | 8 | bf16 | 0h 8m | | 1364 sentences/sec | 32 | |
PyTorch 1.10.0 | BART Fine Tuning | 1 | bf16 | 0h 50m | | 193 sentences/sec | 32 | |
PyTorch 1.10.0 | ResNet50 Host NIC | 16 | bf16 | 2h 5m | 75.96 | 16311 images/sec | 256 | |
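Several PyTorch rows are labeled Lazy Mode: ops are accumulated into a graph that is flushed to the HPU at explicit mark_step() boundaries rather than executed eagerly. Below is a minimal sketch of one training step in that mode, with a small linear layer standing in for the real model; it assumes the habana_frameworks package from the listed PyTorch Docker image, whose import paths vary by release.

```python
# Hedged sketch of one training step in Habana's Lazy Mode, as named in the
# BERT-L rows above. A tiny linear layer stands in for the real model.
import torch
import habana_frameworks.torch.core as htcore  # older releases may instead
# require an explicit load_habana_module() call before "hpu" is available

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 1024).to(device)
loss = model(inputs).pow(2).mean()

optimizer.zero_grad()
loss.backward()
htcore.mark_step()   # flush the accumulated forward/backward graph to the HPU

optimizer.step()
htcore.mark_step()   # flush the optimizer update as well
print(loss.item())
```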
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs and two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756 GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.2.0-585
TensorFlow: Models run with TensorFlow v2.7.0 use this Docker image; models run with v2.6.2 use this Docker image
PyTorch: Models run with PyTorch v1.10.0 use this Docker image
Environment: These workloads run in the Docker images listed above, directly on the host OS
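As a worked example of reading the throughput columns above, scaling efficiency can be estimated as the multi-card throughput divided by the card count times the single-card throughput. The sketch below applies this to two ResNet50 entries from the tables:

```python
# Scaling efficiency computed from the throughput columns above:
# efficiency = multi-card throughput / (cards * single-card throughput)
def scaling_efficiency(multi_tput: float, cards: int, single_tput: float) -> float:
    return multi_tput / (cards * single_tput)

# TensorFlow ResNet50 Keras LARS: 49060 img/s on 32 HPUs vs 1695.58 img/s on 1
print(f"TF ResNet50, 32 HPUs: {scaling_efficiency(49060, 32, 1695.58):.1%}")  # ~90.4%

# PyTorch ResNet50: 12833 img/s on 8 HPUs vs 1722 img/s on 1
print(f"PT ResNet50, 8 HPUs: {scaling_efficiency(12833, 8, 1722):.1%}")       # ~93.2%
```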
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.