See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on the models and containers currently integrated with Habana's SynapseAI software suite. For more information on future model support, please refer to our SynapseAI roadmap page.
TensorFlow Reference Models Performance
Framework | Model | # HPU | Time to Train | Accuracy | Throughput | Batch Size |
---|---|---|---|---|---|---|
TensorFlow 2.5.0 | ResNet50 Keras LARS | 1 | 8h45m | 75.91 | 1690 images/sec | 256 |
TensorFlow 2.5.0 | ResNet50 Keras LARS | 8 | 1h15m | 75.93 | 12900 images/sec | 256 |
TensorFlow 2.4.1 | ResNet50 Keras LARS | 16 | N/A | 75.94 | 23231 images/sec | 256 |
TensorFlow 2.4.1 | ResNet50 Keras LARS | 32 | N/A | 75.14 | 46000 images/sec | 256 |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 1 | 19h 20m | 76.06 | 1690 images/sec | 256 |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 8 | 3h | 76.18 | 12400 images/sec | 256 |
TensorFlow 2.4.1 | ResNet50 Keras LARS (tf.distribute) | 8 | N/A | 75.99 | 10600 images/sec | 256 |
TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 1 | 1h 20m | 92.95 | 54 sentences/sec | 24 |
TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 8 | 20m | 93.2 | 300 sentences/sec | 24 |
TensorFlow 2.5.0* | BERT-Large Pre Training | 1 | N/A | N/A | Phase 1 166 sentences/sec Phase 2 30 sentences/sec | Phase 1 64 Phase 2 8 |
TensorFlow 2.5.0* | BERT-Large Pre Training | 8 | N/A | N/A | Phase 1 1310 sentences/sec Phase 2 246 sentences/sec | Phase 1 64 Phase 2 8 |
TensorFlow 2.5.0* | BERT-Large Pre Training | 32 | 39h | N/A | Phase 1 5152 sentences/sec Phase 2 980 sentences/sec | Phase 1 64 Phase 2 8 |
TensorFlow 2.4.1 | Mask R-CNN | 1 | 36h | 34.14 | 12 images/sec | 4 |
TensorFlow 2.4.1 | Mask R-CNN | 8 | 7h | 34.17 | 76 images/sec | 4 |
TensorFlow 2.5.0 | Unet2D | 1 | 1h 50m | 88.74 | 49 images/sec | 8 |
TensorFlow 2.5.0 | Unet2D | 8 | 41m | 88.4 | 373 images/sec | 8 |
TensorFlow 2.5.0 | ResNext101 | 1 | 48h | 79.19 | 650 images/sec | 128 |
TensorFlow 2.5.0 | ResNext101 | 8 | 6h45m | 79.19 | 4510 images/sec | 128 |
TensorFlow 2.5.0 | SSD ResNet34 | 1 | 3h55m | 22.98 | 475 images/sec | 128 |
TensorFlow 2.5.0 | SSD ResNet34 | 8 | 45m | 22.24 | 3455 images/sec | 128 |
TensorFlow 2.5.0** | Transformer | 1 | 17h | 21.4 | 18760 | 4096 |
TensorFlow 2.5.0** | Transformer | 8 | 22h30m | 26 | 138550 | 4096 |
TensorFlow 2.4.1 | DenseNet | 1 | N/A | 0.712 | 836 images/sec | 128 |
TensorFlow 2.5.0 | ALBERT-Large Fine Tuning (SQUAD) | 1 | N/A | N/A | 45 sentences/sec | 32 |
TensorFlow 2.5.0 | ALBERT-Large Fine Tuning (SQUAD) | 8 | N/A | F1 90.9 EM 84.1 | 358 sentences/sec | 32 |
TensorFlow 2.5.0 | ALBERT-Large Pre Training | 1 | N/A | N/A | 142 sentences/sec | 64 |
* With accumulation steps
** The evaluation graph in Transformer runs on the CPU, which may impact time-to-train (TTT).
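The throughput and scaling figures in the table above can be related with the usual back-of-the-envelope formulas. The sketch below is illustrative only and is not Habana's measurement code; the function names and the hypothetical timing window are assumptions, and it simply checks the ResNet50 Keras LARS rows (note that the 16- and 32-card rows were measured with TensorFlow 2.4.1 rather than 2.5.0).

```python
# Illustrative only: not Habana's measurement code. Shows how the throughput
# and scaling numbers in the table above relate, using conventional formulas.

def throughput(global_batch: int, steps: int, elapsed_sec: float) -> float:
    """Samples processed per second over a timed window of training steps."""
    return global_batch * steps / elapsed_sec

def scaling_efficiency(single_card: float, multi_card: float, cards: int) -> float:
    """Fraction of ideal linear scaling achieved on `cards` devices."""
    return multi_card / (single_card * cards)

if __name__ == "__main__":
    # Hypothetical timing: 8 cards x batch 256, 100 steps in 15.9 s gives
    # roughly 12,900 images/sec, in line with the 8-card ResNet50 LARS row.
    print(f"throughput: {throughput(8 * 256, 100, 15.9):.0f} images/sec")

    # Scaling efficiency computed from the ResNet50 Keras LARS rows above.
    print(f"8-card efficiency:  {scaling_efficiency(1690, 12900, 8):.2f}")   # ~0.95
    print(f"32-card efficiency: {scaling_efficiency(1690, 46000, 32):.2f}")  # ~0.85
```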
PyTorch Reference Models Performance
Framework | Model | # HPU | Time to Train | Accuracy | Throughput | Batch Size |
---|---|---|---|---|---|---|
PyTorch 1.7.1 | ResNext101 | 1 | N/A | 77.95 | 730 images/sec | 128 |
PyTorch 1.7.1 | ResNext101 | 8 | 15h 1min | 78.3 | 2860 images/sec | 128 |
PyTorch 1.7.1 | ResNet50 | 1 | N/A | 76.08 | 1330 images/sec | 256 |
PyTorch 1.7.1* | ResNet50 | 8 | 9h 30 min | 76.22 | 5700 images/sec | 256 |
PyTorch 1.7.1* | ResNet50 | 16 | 6h | 76.13 | 6657 images/sec | 256 |
PyTorch 1.7.1 | BERT-Large Fine Tuning (SQUAD) Lazy Mode | 1 | 1h 12min | 93.09 | 45 sentences/sec | 24 |
PyTorch 1.7.1 | BERT-Large Fine Tuning (SQUAD) Lazy Mode | 8 | 20 min | 93.05 | 303 sentences/sec | 24 |
PyTorch 1.7.1 | BERT-Large Pre Training Lazy Mode | 1 | N/A | N/A | Phase 1 123 sentences/sec Phase 2 23 sentences/sec | 64 |
PyTorch 1.7.1 | BERT-Large Pre Training Lazy Mode | 8 | N/A | N/A | Phase 1 950 sentences/sec Phase 2 176 sentences/sec | 64 |
PyTorch 1.7.1 | BERT-Large Pre Training Graph Mode | 1 | N/A | N/A | Phase 1 128 sentences/sec Phase 2 24 sentences/sec | 64 |
PyTorch 1.7.1 | BERT-Large Pre Training Graph Mode | 8 | N/A | N/A | Phase 1 1016 sentences/sec Phase 2 190 sentences/sec | 64 |
* The PyTorch dataloader consumes a significant portion of the training time, which limits overall model performance.
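A common way to reduce dataloader overhead is to move decoding and augmentation onto multiple worker processes and to keep those workers busy across epochs. The snippet below is a generic PyTorch input-pipeline sketch, not the configuration used for the numbers above; the dataset path and parameter values are placeholders.

```python
# Generic PyTorch input-pipeline tuning, illustrative only; the path and
# parameter values are placeholders, not the settings used for the table above.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Hypothetical ImageNet location inside the training container.
train_dataset = datasets.ImageFolder("/data/imagenet/train", transform=train_transform)

train_loader = DataLoader(
    train_dataset,
    batch_size=256,           # per-card batch size, matching the ResNet50 rows
    shuffle=True,
    num_workers=8,            # extra worker processes hide CPU-side decode/augment time
    pin_memory=True,          # page-locked buffers speed up host-to-device copies
    drop_last=True,
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)
```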
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.0.0-532
TensorFlow: Models run with TensorFlow v2.5.0 use this Docker image
PyTorch: Models run with PyTorch v1.8.1 use this Docker image
Environment: These workloads are run using the above Docker images, running directly on the host OS
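Inside those containers, each framework has to be pointed at the Gaudi device before the models above will run on HPU. The snippet below is a minimal device-setup sketch, assuming the habana_frameworks Python packages that ship in the SynapseAI Docker images; exact import paths can differ between SynapseAI releases, so treat it as illustrative rather than as the commands used for these measurements.

```python
# Minimal HPU device-setup sketch, assuming the habana_frameworks packages
# shipped in the SynapseAI Docker images; import paths may vary by release.
# Run each half inside its corresponding (TensorFlow or PyTorch) container.

# --- In the TensorFlow container: register the HPU device, then verify it ---
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module  # assumed SynapseAI TF package
load_habana_module()
print(tf.config.list_physical_devices("HPU"))

# --- In the PyTorch container: load the Habana bridge, place tensors on "hpu" ---
import torch
import habana_frameworks.torch.core  # assumed SynapseAI PyTorch bridge
device = torch.device("hpu")
x = torch.ones(8, 8).to(device)
print(x.device)
```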
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.