See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on models and containers that are currently integrated with Habana’s SynapseAI software suite. For more information on future model support, please refer to our SynapseAI roadmap page.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.6.0 | ResNet50 Keras SGD | 8 | Mixed | 2h 37m | 76.14 | 12810 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras SGD | 8 | Mixed | 2h 35m | 75.98 | 12943 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS tf.distribute | 8 | Mixed | 1h 10m | 76.04 | 13175 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras LARS tf.distribute | 8 | Mixed | 1h 10m | 76.04 | 13205 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 32 | Mixed | 0h 23m | 75.77 | 48710 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 16 | Mixed | 0h 40m | 75.46 | 24567 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 8 | Mixed | 1h 8m | 76.09 | 13183 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 1 | Mixed | 8h 24m | 76.12 | 1731 images/sec | 256 | |
TensorFlow 2.6.0 | ResNext101 | 8 | Mixed | 6h 33m | 79.27 | 5066 images/sec | 128 | |
TensorFlow 2.6.0 | ResNext101 | 1 | Mixed | 45h 37m | 79.2 | 697 images/sec | 128 | |
TensorFlow 2.6.0 | SSD ResNet34 | 8 | Mixed | 0h 29m | 22.45 | 3637 images/sec | 128 | |
TensorFlow 2.6.0 | SSD ResNet34 | 1 | Mixed | 3h 16m | 22.77 | 506 images/sec | 128 | |
TensorFlow 2.6.0 | Mask R-CNN | 8 | Mixed | 3h 45m | 34.09 | 104 images/sec | 4 | |
TensorFlow 2.6.0 | Mask R-CNN | 1 | Mixed | 25h 33m | 34.1 | 15 images/sec | 4 | |
TensorFlow 2.6.0 | Unet2D | 8 | Mixed | 0h 3m | 88.05 | 371 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.6.0 | Unet2D | 1 | Mixed | 0h 18m | 88.89 | 50 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.6.0 | Unet3D | 8 | Mixed | 0h 15m | 89.13 | 39 images/sec | 2 | |
TensorFlow 2.6.0 | Unet3D | 1 | Mixed | 1h 26m | 89.93 | 6 images/sec | 2 | |
TensorFlow 2.5.1 | Densenet 121 tf.distribute | 8 | Mixed | 5h 13m | 73.96 | 6575 images/sec | 2048 | |
TensorFlow 2.5.1 | VGG SegNet | 1 | Mixed | 0h 9m | 89.61 | 102 images/sec | 16 | |
TensorFlow 2.6.0 | RetinaNet | 1 | fp32 | 7h 11m | 27.57 | 12 images/sec | 8 | |
TensorFlow 2.6.0 | MobileNet V2 | 1 | Mixed | | | 1135 images/sec | 96 | |
TensorFlow 2.6.0 | EfficientDet | 8 | fp32 | 90h 19m | 33.51 | 157 images/sec | 8 | |
TensorFlow 2.5.1 | CycleGAN | 1 | Mixed | 5h 13m | | 12 | 2 | |
TensorFlow 2.5.1 | Transformer | 8 | Mixed | 17h 24m | 26.4 | 157465 | 4096 | |
TensorFlow 2.5.1 | Transformer | 1 | Mixed | | 23.7 | 22145 | 4096 | |
TensorFlow 2.5.1 | T5 Base | 1 | Mixed | 0h 16m | 94.58 | 109 sentences/sec | 16 | |
TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 0h 14m | 93.44 | 391 sentences/sec | 24 | |
TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 7m | 93.56 | 53 sentences/sec | 24 | |
TensorFlow 2.6.0 | BERT-Large Pre Training | 32 | Mixed | 36h 55m | Phase 1: Loss 1.3, Phase 2: Loss 0.86 | Phase 1: 5527 sentences/sec, Phase 2: 1066 sentences/sec | Phase 1: 64, Phase 2: 8 | With accumulation steps |
TensorFlow 2.6.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1404 sentences/sec, Phase 2: 271 sentences/sec | Phase 1: 64, Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.1 | Albert-Large Fine Tuning (SQUAD) | 8 | Mixed | 0h 23m | F1: 91, EM: 84 | 442 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.1 | Albert-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11m | | 54 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.1 | Albert-Large Pre Training | 1 | Mixed | | | Phase 1: 176 sentences/sec, Phase 2: 37 sentences/sec | Phase 1: 64, Phase 2: 8 | |
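The "Mixed" precision rows above use bf16 on Gaudi through Habana's TensorFlow integration. As a rough illustration only, here is a minimal sketch of placing a Keras model on an HPU using the documented `load_habana_module` entry point; the ResNet50 model, synthetic data, and the Keras `mixed_bfloat16` policy are illustrative assumptions, not the reference-model code that produced the numbers above (those models use their own launch scripts and bf16 configuration from Habana's Model References repository).

```python
# Minimal sketch (not the measured reference-model code): placing a Keras
# model on a Gaudi HPU via Habana's documented TensorFlow integration.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device with TensorFlow

# One way to approximate the "Mixed" rows; the reference models configure
# bf16 through their own launch scripts, so treat this as an assumption.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Synthetic stand-in for ImageNet; per-chip batch size 256 matches the table.
images = tf.random.uniform([256, 224, 224, 3])
labels = tf.random.uniform([256], maxval=1000, dtype=tf.int32)
model.fit(images, labels, batch_size=256, epochs=1)
```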
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.9.1 | ResNet50 | 16 | Mixed | 2h 56m | 75.77 | 22236 images/sec | 256 | |
PyTorch 1.9.1 | ResNet50 Host NIC | 16 | Mixed | 4h 0m | 75.82 | 9634 images/sec | 256 | |
PyTorch 1.9.1 | ResNet50 | 8 | Mixed | 2h 43m | 75.96 | 12752 images/sec | 256 | |
PyTorch 1.9.1 | ResNet152 | 8 | Mixed | 7h 51m | 78.07 | 4927 images/sec | 128 | |
PyTorch 1.9.1 | ResNext101 | 8 | Mixed | 6h 54m | 78.14 | 6053 images/sec | 128 | |
PyTorch 1.9.1 | Unet2D | 8 | Mixed | 1h 24m | 72.74 | 4531 images/sec | 64 | |
PyTorch 1.9.1 | DLRM | 1 | Mixed | | | 48312 queries/sec | 512 | Uses Random Input Distribution |
PyTorch 1.9.1 | Transformer | 8 | Mixed | 25h 1m | 27.6 | 127631 | 4096 | |
PyTorch 1.9.1 | RoBERTa Large | 8 | Mixed | 0h 12m | 94.7 | 259 sentences/sec | 12 | |
PyTorch 1.9.1 | RoBERTa Large | 1 | Mixed | 1h 17m | 94.61 | 38 sentences/sec | 12 | |
PyTorch 1.9.1 | RoBERTa Base | 8 | Mixed | 0h 5m | 91.67 | 640 sentences/sec | 12 | |
PyTorch 1.9.1 | RoBERTa Base | 1 | Mixed | 0h 30m | 92.54 | 102 sentences/sec | 12 | |
PyTorch 1.9.1 | DistilBERT | 8 | Mixed | 0h 13m | 85.24 | 503 sentences/sec | 8 | |
PyTorch 1.9.1 | DistilBERT | 1 | Mixed | 0h 42m | 85.56 | 136 sentences/sec | 8 | |
PyTorch 1.9.1 | BERT-Large Fine Tuning Lazy Mode (SQUAD) | 8 | Mixed | 0h 10m | 93.13, F1 Score: 91.35% | 318 sentences/sec | 24 | |
PyTorch 1.9.1 | BERT-Large Fine Tuning Lazy Mode (SQUAD) | 1 | Mixed | 1h 8m | 93.44 | 43 sentences/sec | 24 | |
PyTorch 1.9.1 | BERT-Large Pre Training Lazy Mode | 32 | Mixed | 40h 13m | Phase 1: Loss 1.3, Phase 2: Loss 1.34 | Phase 1: 4124 sentences/sec, Phase 2: 635 sentences/sec | Phase 1: 64 | |
PyTorch 1.9.1 | BERT-Large Pre Training Lazy Mode | 8 | Mixed | | | Phase 1: 1290 sentences/sec, Phase 2: 259 sentences/sec | Phase 1: 64 | |
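Several of the PyTorch rows above run in lazy mode through Habana's PyTorch bridge, where ops are accumulated into a graph and executed when the graph is flushed. The sketch below shows the general shape of one lazy-mode training step using the documented `habana_frameworks.torch.core.mark_step()` call; the toy model and synthetic data are placeholders, not the reference-model code behind the measurements.

```python
# Minimal sketch (not the measured reference-model code): one lazy-mode
# training step on a Gaudi HPU via Habana's PyTorch bridge. In the
# SynapseAI 1.x docs, lazy mode is selected with PT_HPU_LAZY_MODE=1.
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Toy model and synthetic data; the tabled numbers come from the full
# reference models, which also configure mixed (bf16) precision.
model = torch.nn.Linear(224 * 224 * 3, 1000).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(256, 224 * 224 * 3, device=device)
targets = torch.randint(0, 1000, (256,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
htcore.mark_step()  # in lazy mode, flushes the accumulated graph for execution
optimizer.step()
htcore.mark_step()
```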
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.1.0-614
TensorFlow: Models run with TensorFlow v2.5.1 use this Docker image; models run with v2.6.0 use this Docker image
PyTorch: Models run with PyTorch v1.9.1 use this Docker image
Environment: These workloads are run using the Docker images above, directly on the host OS
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.