Get access to Habana’s popular frameworks and optimized models that enable you to quickly and easily build, train, and deploy your models on Gaudi. For more information on future model support, please refer to our SynapseAI roadmap page.
See the latest TensorFlow and PyTorch model performance.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.5.0 | ResNet50 Keras LARS | 1 | Mixed | 8h 39min | 76.06 | 1690 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras LARS (with Horovod) | 8 | Mixed | 1h 13min | 76.11 | 12950 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras LARS (with tf.distribute) | 8 | Mixed | 1h 9min | 75.96 | 12900 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 1 | Mixed | 19h 31min | 76.2 | 1700 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 8 | Mixed | 2h 39min | 76.3 | 12580 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 16 | Mixed | 1h 28min | 75.99 | 23400 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 32 | Mixed | 46min | 75.94 | 46400 images/sec | 256 | |
TensorFlow 2.5.0 | ResNext101 | 1 | Mixed | | 79.2 | 650 images/sec | 128 | |
TensorFlow 2.5.0 | ResNext101 | 8 | Mixed | 6h 56min | 79.02 | 4656 images/sec | 128 | |
TensorFlow 2.5.0 | SSD ResNet34 | 1 | Mixed | 3h 35min | 22.97 | 470 images/sec | 128 | |
TensorFlow 2.5.0 | SSD ResNet34 | 8 | Mixed | 35min | 22.04 | 3406 images/sec | 128 | |
TensorFlow 2.5.0 | Mask R-CNN | 1 | Mixed | 25h 18min | 34.04 | 15 images/sec | 4 | |
TensorFlow 2.5.0 | Mask R-CNN | 8 | Mixed | 4h 26min | 34.1 | 83 images/sec | 4 | |
TensorFlow 2.5.0 | Unet2D | 1 | Mixed | 20min | 88.79 | 48 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.5.0 | Unet2D | 8 | Mixed | 7min | 88.09 | 360 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.5.0 | Unet3D | 1 | Mixed | 1h 47min | 88.96 | 5.2 images/sec | 2 | |
TensorFlow 2.5.0 | Unet3D | 8 | Mixed | 19min | 89.06 | 35 images/sec | 2 | |
TensorFlow 2.5.0 | DenseNet (with tf.distribute) | 8 | Mixed | 5h 15min | 73.44 | 5423 images/sec | 128 | |
TensorFlow 2.5.0 | RetinaNet | 1 | fp32 | 8h 53min | 27.35 | 12 images/sec | 8 | |
TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 17min | 93.4 | 52 sentences/sec | 24 | |
TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 22min | 93.3 | 391 sentences/sec | 24 | |
TensorFlow 2.5.0 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 165 sps; Phase 2: 31 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1310 sps; Phase 2: 249 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.0 | BERT-Large Pre Training | 32 | Mixed | 39h | | Phase 1: 5233 sps; Phase 2: 1000 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.0 | Transformer | 8 | Mixed | 17h 43min | 26.6 | 154550 | 4096 | |
TensorFlow 2.5.0 | T5-base Fine Tuning | 1 | Mixed | 16min | 94.1 | 115 | 16 | |
TensorFlow 2.5.0 | ALBERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 14min 42s | F1: 90.9; EM: 84.18 | 436 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.0 | ALBERT-Large Pre Training | 1 | Mixed | | | Phase 1: 177 sps; Phase 2: 36 sps | Phase 1: 64; Phase 2: 8 | |
TensorFlow 2.5.0 | EfficientDet | 8 | fp32 | 4 days 22h | 33.8 | 91.4 images/sec | 8 | |
TensorFlow 2.5.0 | CycleGAN | 1 | Mixed | 9h 25min | | 5.9 | 2 | 8-card training takes longer than 1-card training because the dataset is not sharded between workers for this topology. There is no accuracy metric for this topology; to verify training correctness, we visually inspect generated images and check the final loss. |
TensorFlow 2.5.0 | CycleGAN | 8 | Mixed | 9h 40min | | 44 | 2 | 8-card training takes longer than 1-card training because the dataset is not sharded between workers for this topology. There is no accuracy metric for this topology; to verify training correctness, we visually inspect generated images and check the final loss. |
TensorFlow 2.5.0 | SegNet | 1 | Mixed | 8.5min | 89.57 | 303 images/sec | 16 | |
TensorFlow 2.5.0 | SegNet | 4 | Mixed | 3.9min | 90.6 | 104 images/sec | 16 |
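Several of the ResNet50 entries above scale to 8 cards with Horovod or tf.distribute. As a rough illustration of the Horovod pattern, here is a minimal sketch, not the actual reference script; it assumes the SynapseAI TensorFlow bridge (habana_frameworks.tensorflow) and an HCCL-enabled Horovod build from the TensorFlow Docker image listed under System Configuration:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device with TensorFlow
hvd.init()            # one worker process per Gaudi card

model = tf.keras.applications.ResNet50(weights=None)

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across all cards on every step.
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Broadcast rank 0's initial weights so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_dataset, epochs=..., callbacks=callbacks)
```

Each worker is started by an MPI-style launcher, one process per card across the eight HPUs of an HLS-1; the complete training scripts, including the LARS variants, are published in Habana's Model-References repository.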
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.8.1 | ResNet50 | 1 | Mixed | | 76.04 | 1587 images/sec | 256 | |
PyTorch 1.8.1 | ResNet50 | 8 | Mixed | 9h 37min | 75.95 | 4545 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
PyTorch 1.8.1 | ResNet50 | 16 | Mixed | 4h 58min | 76 | 8904 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
PyTorch 1.8.1 | ResNext101 | 1 | Mixed | | N/A | 732 images/sec | 128 | |
PyTorch 1.8.1 | ResNext101 | 8 | Mixed | 13h 37min | 78.01 | 3041 images/sec | 128 | |
PyTorch 1.8.1 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11min | 93.3 | 46 sentences/sec | 24 | |
PyTorch 1.8.1 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 30min | 92.8 | 334 sentences/sec | 24 | Graph compilation time has an overall impact on time to train. This will be fixed in a subsequent release. |
PyTorch 1.8.1 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 155 sentences/sec; Phase 2: 31 sentences/sec | 64 | |
PyTorch 1.8.1 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1230 sentences/sec; Phase 2: 245 sentences/sec | 64 | |
PyTorch 1.8.1 | DLRM | 1 | Mixed | | | 47086 queries/sec | 512 | Using random data as input |
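The PyTorch models above run through Habana's PyTorch bridge, which executes operations lazily and flushes the accumulated graph to the device at explicit synchronization points. A minimal single-card sketch of that pattern follows; it uses a toy model rather than one of the reference topologies and assumes habana_frameworks.torch from the PyTorch Docker image listed under System Configuration:

```python
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Toy model and random data, standing in for a real topology.
model = torch.nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(256, 1024).to(device)
labels = torch.randint(0, 10, (256,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
htcore.mark_step()  # flush the forward/backward graph to the HPU
optimizer.step()
htcore.mark_step()  # flush the optimizer update
print(loss.item())
```

Note the dataloader comment on the ResNet50 rows: host-side data loading runs on the CPU, so it can bound end-to-end throughput even when the device-side graph is fast.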
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI Software version 1.0.0-532
TensorFlow: Models run with TensorFlow v2.5.0 use this Docker image.
PyTorch: Models run with PyTorch v1.8.1 use this Docker image.
Environment: Workloads are run using the Docker images above, directly on the host OS.
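Before launching a workload inside the container, a quick sanity check (a hypothetical snippet, not taken from the images' documentation) can confirm that TensorFlow sees the Gaudi devices:

```python
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device type
print(tf.config.list_logical_devices("HPU"))  # expect one entry per visible card
```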
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.