See the latest TensorFlow and PyTorch model performance data below. Visit the Habana catalog for information on models and containers currently integrated with Habana’s SynapseAI software suite. For information on future model support, refer to our SynapseAI roadmap page.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.5.1 | ResNet50 Keras LARS | 1 | Mixed | 8h 36min | 76.09 | 1700 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras LARS (with Horovod) | 8 | Mixed | 1h 11min | 76.13 | 12200 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras LARS (with tf.distribute) | 8 | Mixed | 1h 9min | 75.96 | 12900 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras SGD | 1 | Mixed | 19h 31min | 76.2 | 1700 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras SGD | 8 | Mixed | 2h 39min | 76.3 | 12580 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras SGD | 16 | Mixed | 43min | 75.55 | 23900 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras SGD | 32 | Mixed | 24min | 75.97 | 46700 images/sec | 256 | |
TensorFlow 2.6.0 | ResNext101 | 1 | Mixed | | 79.07 | 663 images/sec | 128 | |
TensorFlow 2.6.0 | ResNext101 | 8 | Mixed | 6h 56min | 79.15 | 4780 images/sec | 128 | |
TensorFlow 2.5.1 | SSD ResNet34 | 1 | Mixed | 3h 35min | 22.97 | 470 images/sec | 128 | |
TensorFlow 2.5.1 | SSD ResNet34 | 8 | Mixed | 35min | 22.04 | 3406 images/sec | 128 | |
TensorFlow 2.6.0 | Mask R-CNN | 1 | Mixed | 25h 18min | 33.99 | 15 images/sec | 4 | |
TensorFlow 2.6.0 | Mask R-CNN | 8 | Mixed | 4h 31min | 34.23 | 99 images/sec | 4 | |
TensorFlow 2.5.1 | Unet2D | 1 | Mixed | 20min | 88.79 | 48 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.5.1 | Unet2D | 8 | Mixed | 7min | 88.09 | 360 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.5.1 | Unet3D | 1 | Mixed | 1h 47min | 88.96 | 5.2 images/sec | 2 | |
TensorFlow 2.5.1 | Unet3D | 8 | Mixed | 19min | 89.06 | 35 images/sec | 2 | |
TensorFlow 2.5.1 | DenseNet (with tf.distribute) | 8 | Mixed | 5h 15min | 73.44 | 5423 images/sec | 128 | |
TensorFlow 2.5.1 | RetinaNet | 1 | fp32 | 8h 53min | 27.35 | 12 images/sec | 8 | |
TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 8min | 92.91 | 52 sentences/sec | 24 | |
TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 22min | 93.26 | 391 sentences/sec | 24 | |
TensorFlow 2.6.0 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1 – 165 sps, Phase 2 – 31 sps | Phase 1 – 64, Phase 2 – 8 | With accumulation steps |
TensorFlow 2.6.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1 – 1310 sps, Phase 2 – 249 sps | Phase 1 – 64, Phase 2 – 8 | With accumulation steps |
TensorFlow 2.6.0 | BERT-Large Pre Training | 32 | Mixed | 39h | Phase 1 loss 1.12, Phase 2 loss 0.86 | Phase 1 – 5400 sps, Phase 2 – 1030 sps | Phase 1 – 64, Phase 2 – 8 | With accumulation steps |
TensorFlow 2.6.0 | Transformer | 8 | Mixed | 17h 43min | 26.5 | 154020 | 4096 | |
TensorFlow 2.5.1 | T5-base Fine Tuning | 1 | Mixed | 16min | 94.1 | 115 | 16 | |
TensorFlow 2.5.1 | ALBERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 14min 42s | F1 90.9, EM 84.18 | 436 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.1 | ALBERT-Large Pre Training | 1 | Mixed | | | Phase 1 – 177 sps, Phase 2 – 36 sps | Phase 1 – 64, Phase 2 – 8 | |
TensorFlow 2.5.1 | EfficientDet | 8 | fp32 | 4 days 22h | 33.8 | 91.4 images/sec | 8 | |
TensorFlow 2.5.1 | CycleGAN | 1 | Mixed | 9h 25min | | 5.9 | 2 | |
TensorFlow 2.5.1 | SegNet | 1 | Mixed | 8.5min | 89.57 | 303 images/sec | 16 | |
TensorFlow 2.5.1 | SegNet | 4 | Mixed | 3.9min | 90.6 | 104 images/sec | 16 |
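As a rough sanity check on multi-card scaling, the ResNet50 Keras SGD rows above can be converted into a scaling-efficiency figure: measured throughput divided by ideal linear scaling from the single-HPU number. A minimal sketch using only the throughput values from the table (images/sec at a per-HPU batch size of 256):

```python
# Throughput (images/sec) for ResNet50 Keras SGD, taken from the table above.
throughput = {1: 1700, 8: 12580, 16: 23900, 32: 46700}

def scaling_efficiency(n_hpu: int) -> float:
    """Measured throughput divided by ideal linear scaling from 1 HPU."""
    return throughput[n_hpu] / (n_hpu * throughput[1])

for n in (8, 16, 32):
    print(f"{n:>2} HPU: {scaling_efficiency(n):.1%}")
```

This shows efficiency staying above 85% out to 32 cards, which is consistent with the near-linear drop in time to train across the 8-, 16-, and 32-HPU rows.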
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.8.1 | ResNet50 | 1 | Mixed | | 76.04 | 1583 images/sec | 256 | |
PyTorch 1.8.2 | ResNet50 | 8 | Mixed | 5h 37min | 75.95 | 7350 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
PyTorch 1.8.2 | ResNet50 | 16 | Mixed | 3h 54min | 75.86 | 12600 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
PyTorch 1.8.2 | ResNext101 | 1 | Mixed | N/A | | 725 images/sec | 128 | |
PyTorch 1.8.2 | ResNext101 | 8 | Mixed | 10h 50min | 78.01 | 3780 images/sec | 128 | |
PyTorch 1.8.2 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11min | 93.3 | 46 sentences/sec | 24 | |
PyTorch 1.8.2 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 30min | 92.8 | 330 sentences/sec | 24 | Graph compilation time has an overall impact on time to train; this will be fixed in a subsequent release |
PyTorch 1.8.2 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1 – 155 sentences/sec, Phase 2 – 31 sentences/sec | 64 | |
PyTorch 1.8.2 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1 – 1230 sentences/sec, Phase 2 – 245 sentences/sec | 64 | |
PyTorch 1.8.2 | DLRM | 1 | Mixed | | | 47086 queries/sec | 512 | Using random data as input |
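The throughput column can also be turned into a rough per-epoch time estimate. A minimal sketch, assuming the standard ImageNet-1k training split (~1,281,167 images) and ignoring warm-up effects such as the graph-compilation and dataloader overheads noted in the comments above:

```python
# Rough epoch-time estimate from steady-state throughput figures.
# Assumes the standard ImageNet-1k training split; warm-up overheads
# (graph compilation, dataloader) are not modeled.
IMAGENET_TRAIN_IMAGES = 1_281_167

def epoch_minutes(images_per_sec: float) -> float:
    """Minutes to process one epoch at a given sustained throughput."""
    return IMAGENET_TRAIN_IMAGES / images_per_sec / 60

print(f"ResNet50, 1 HPU: {epoch_minutes(1583):.1f} min/epoch")
print(f"ResNet50, 8 HPU: {epoch_minutes(7350):.1f} min/epoch")
```

Comparing such an estimate against the measured time to train gives a quick sense of how much of the wall-clock time is spent outside steady-state throughput.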
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.0.1-81
TensorFlow: Models run with TensorFlow v2.5.1 use this Docker image; models run with v2.6.0 use this Docker image
PyTorch: Models run with PyTorch v1.8.2 use this Docker image
Environment: These workloads were run using the Docker images above, directly on the host OS
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.