Habana Model Performance Data – 1.0.1

See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on models and containers that are currently integrated with Habana’s SynapseAI software suite. For more information on future model support, please refer to our SynapseAI roadmap page.

TensorFlow Reference Models Performance

| Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow 2.5.1 | ResNet50 Keras LARS | 1 | Mixed | 8h 36min | 76.09 | 1700 images/sec | 256 | |
| TensorFlow 2.5.1 | ResNet50 Keras LARS (with Horovod) | 8 | Mixed | 1h 11min | 76.13 | 12200 images/sec | 256 | |
| TensorFlow 2.5.1 | ResNet50 Keras LARS (with tf.distribute) | 8 | Mixed | 1h 9min | 75.96 | 12900 images/sec | 256 | |
| TensorFlow 2.5.1 | ResNet50 Keras SGD | 1 | Mixed | 19h 31min | 76.2 | 1700 images/sec | 256 | |
| TensorFlow 2.5.1 | ResNet50 Keras SGD | 8 | Mixed | 2h 39min | 76.3 | 12580 images/sec | 256 | |
| TensorFlow 2.5.1 | ResNet50 Keras SGD | 16 | Mixed | 43min | 75.55 | 23900 images/sec | 256 | |
| TensorFlow 2.5.1 | ResNet50 Keras SGD | 32 | Mixed | 24min | 75.97 | 46700 images/sec | 256 | |
| TensorFlow 2.6.0 | ResNext101 | 1 | Mixed | | 79.07 | 663 images/sec | 128 | |
| TensorFlow 2.6.0 | ResNext101 | 8 | Mixed | 6h 56min | 79.15 | 4780 images/sec | 128 | |
| TensorFlow 2.5.1 | SSD ResNet34 | 1 | Mixed | 3h 35min | 22.97 | 470 images/sec | 128 | |
| TensorFlow 2.5.1 | SSD ResNet34 | 8 | Mixed | 35min | 22.04 | 3406 images/sec | 128 | |
| TensorFlow 2.6.0 | Mask R-CNN | 1 | Mixed | 25h 18min | 33.99 | 15 images/sec | 4 | |
| TensorFlow 2.6.0 | Mask R-CNN | 8 | Mixed | 4h 31min | 34.23 | 99 images/sec | 4 | |
| TensorFlow 2.5.1 | Unet2D | 1 | Mixed | 20min | 88.79 | 48 images/sec | 8 | Results reported for single fold training time |
| TensorFlow 2.5.1 | Unet2D | 8 | Mixed | 7min | 88.09 | 360 images/sec | 8 | Results reported for single fold training time |
| TensorFlow 2.5.1 | Unet3D | 1 | Mixed | 1h 47min | 88.96 | 5.2 images/sec | 2 | |
| TensorFlow 2.5.1 | Unet3D | 8 | Mixed | 19min | 89.06 | 35 images/sec | 2 | |
| TensorFlow 2.5.1 | DenseNet (with tf.distribute) | 8 | Mixed | 5h 15min | 73.44 | 5423 images/sec | 128 | |
| TensorFlow 2.5.1 | RetinaNet | 1 | fp32 | 8h 53min | 27.35 | 12 images/sec | 8 | |
| TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 8min | 92.91 | 52 sentences/sec | 24 | |
| TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 22min | 93.26 | 391 sentences/sec | 24 | |
| TensorFlow 2.6.0 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 165 sps; Phase 2: 31 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
| TensorFlow 2.6.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1310 sps; Phase 2: 249 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
| TensorFlow 2.6.0 | BERT-Large Pre Training | 32 | Mixed | 39h | Phase 1 loss 1.12; Phase 2 loss 0.86 | Phase 1: 5400 sps; Phase 2: 1030 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
| TensorFlow 2.6.0 | Transformer | 8 | Mixed | 17h 43min | 26.5 | 154020 | 4096 | |
| TensorFlow 2.5.1 | T5-base Fine Tuning | 1 | Mixed | 16min | 94.1 | 115 | 16 | |
| TensorFlow 2.5.1 | Albert-Large Fine Tuning (SQUAD) | 8 | Mixed | 14min 42s | F1 90.9; EM 84.18 | 436 sentences/sec | 32 | Time to train does not include tokenization |
| TensorFlow 2.5.1 | Albert-Large Pre Training | 1 | Mixed | | | Phase 1: 177 sps; Phase 2: 36 sps | Phase 1: 64; Phase 2: 8 | |
| TensorFlow 2.5.1 | EfficientDet | 8 | fp32 | 4 days 22h | 33.8 | 91.4 images/sec | 8 | |
| TensorFlow 2.5.1 | CycleGan | 1 | Mixed | 9h 25min | | 5.9 | 2 | |
| TensorFlow 2.5.1 | SegNet | 1 | Mixed | 8.5min | 89.57 | 303 images/sec | 16 | |
| TensorFlow 2.5.1 | SegNet | 4 | Mixed | 3.9min | 90.6 | 104 images/sec | 16 | |
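
One way to read the multi-HPU rows is to compare their throughput against ideal linear scaling from the single-HPU row. The sketch below is illustrative only and is not part of Habana's tooling: the function name `scaling_efficiency` is hypothetical, and the inputs are the ResNet50 Keras LARS throughput figures reported above (1700 images/sec on 1 HPU, 12200 and 12900 images/sec on 8 HPUs).

```python
def scaling_efficiency(single_throughput, multi_throughput, num_devices):
    """Return multi-device throughput as a fraction of ideal linear scaling.

    A value of 1.0 means throughput grew exactly in proportion to the
    number of devices; lower values mean some scaling overhead.
    """
    if single_throughput <= 0 or num_devices <= 0:
        raise ValueError("throughput and device count must be positive")
    return multi_throughput / (single_throughput * num_devices)


# Throughput values (images/sec) taken from the ResNet50 Keras LARS rows.
SINGLE_HPU = 1700
EIGHT_HPU_RUNS = {
    "with Horovod": 12200,
    "with tf.distribute": 12900,
}

for label, throughput in EIGHT_HPU_RUNS.items():
    efficiency = scaling_efficiency(SINGLE_HPU, throughput, 8)
    print(f"ResNet50 Keras LARS ({label}): {efficiency:.1%} of linear scaling")
```

With these inputs, both 8-HPU runs land at roughly 90% to 95% of linear scaling. The same calculation applies to any model in the table that reports both a 1-HPU and a multi-HPU throughput.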

PyTorch Reference Models Performance

| Framework | Model | #HPU | Precision | 1.0 Time to Train | 1.0 Accuracy | 1.0 Throughput | Batch Size | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch 1.8.1 | ResNet50 | 1 | Mixed | | 76.04 | 1583 images/sec | 256 | |
| PyTorch 1.8.2 | ResNet50 | 8 | Mixed | 5h 37min | 75.95 | 7350 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
| PyTorch 1.8.2 | ResNet50 | 16 | Mixed | 3h 54min | 75.86 | 12600 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
| PyTorch 1.8.2 | ResNext101 | 1 | Mixed | N/A | | 725 images/sec | 128 | |
| PyTorch 1.8.2 | ResNext101 | 8 | Mixed | 10h 50min | 78.01 | 3780 images/sec | 128 | |
| PyTorch 1.8.2 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11min | 93.3 | 46 sentences/sec | 24 | |
| PyTorch 1.8.2 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 30min | 92.8 | 330 sentences/sec | 24 | Graph compilation time has an overall impact on time to train. This will be fixed in a subsequent release. |
| PyTorch 1.8.2 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 155 sentences/sec; Phase 2: 31 sentences/sec | | |
| PyTorch 1.8.2 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1230 sentences/sec; Phase 2: 245 sentences/sec | | |
| PyTorch 1.8.2 | DLRM | 1 | Mixed | | | 47086 queries/sec | 512 | Using random data as input |
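
The time-to-train values on this page use mixed notations such as "5h 37min", "14min 42s", and "4days 22h". The following sketch, which assumes only those notations and uses a hypothetical helper named `parse_minutes`, converts them to minutes so that runs can be compared numerically. The example compares the two PyTorch BERT-Large Fine Tuning (SQUAD) rows reported above.

```python
import re

# Minutes per unit for the notations that appear in the tables on this page.
_MINUTES_PER_UNIT = {
    "days": 1440.0, "day": 1440.0, "d": 1440.0,
    "h": 60.0,
    "min": 1.0, "m": 1.0,
    "s": 1.0 / 60.0,
}

# Longer unit names come first so "min" is not read as "m" and "days" as "d".
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)\s*(days|day|d|h|min|m|s)")


def parse_minutes(text):
    """Convert a duration such as '5h 37min' or '4days 22h' to minutes."""
    parts = _DURATION_PART.findall(text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(float(value) * _MINUTES_PER_UNIT[unit] for value, unit in parts)


# Time to train for PyTorch BERT-Large Fine Tuning (SQUAD), from the table.
one_hpu = parse_minutes("1h 11min")
eight_hpu = parse_minutes("30min")
print(f"1 HPU: {one_hpu:.0f} min, 8 HPUs: {eight_hpu:.0f} min, "
      f"speedup: {one_hpu / eight_hpu:.2f}x")
```

Time-to-train speedup is typically lower than throughput speedup because it also includes fixed costs; the comments column points to graph compilation and data loading as examples.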

System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70 GHz, and 756 GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.0.1-81
TensorFlow: Models run with TensorFlow v2.5.1 use this Docker image; those run with v2.6.0 use this Docker image
PyTorch: Models run with PyTorch v1.8.2 use this Docker image
Environment: These workloads are run using the Docker images directly on the host OS

Performance varies by use, configuration, and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.
