
Habana Training Models and Performance – 1.0.0

Get access to Habana's popular frameworks and optimized models that enable you to quickly and easily build, train, and deploy your models on Gaudi. For more information on future model support, refer to our SynapseAI roadmap page.

See the latest TensorFlow and PyTorch model performance.

TensorFlow Reference Models Performance

| Framework | Model | # HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
|---|---|---|---|---|---|---|---|---|
| TensorFlow 2.5.0 | ResNet50 Keras LARS | 1 | Mixed | 8h 39min | 76.06 | 1690 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNet50 Keras LARS (with horovod) | 8 | Mixed | 1h 13min | 76.11 | 12950 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNet50 Keras LARS (with tf.distribute) | 8 | Mixed | 1h 9min | 75.96 | 12900 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNet50 Keras SGD | 1 | Mixed | 19h 31min | 76.2 | 1700 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNet50 Keras SGD | 8 | Mixed | 2h 39min | 76.3 | 12580 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNet50 Keras SGD | 16 | Mixed | 1h 28min | 75.99 | 23400 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNet50 Keras SGD | 32 | Mixed | 46min | 75.94 | 46400 images/sec | 256 | |
| TensorFlow 2.5.0 | ResNext101 | 1 | Mixed | | 79.2 | 650 images/sec | 128 | |
| TensorFlow 2.5.0 | ResNext101 | 8 | Mixed | 6h 56min | 79.02 | 4656 images/sec | 128 | |
| TensorFlow 2.5.0 | SSD ResNet34 | 1 | Mixed | 3h 35min | 22.97 | 470 images/sec | 128 | |
| TensorFlow 2.5.0 | SSD ResNet34 | 8 | Mixed | 35min | 22.04 | 3406 images/sec | 128 | |
| TensorFlow 2.5.0 | Mask R-CNN | 1 | Mixed | 25h 18min | 34.04 | 15 images/sec | 4 | |
| TensorFlow 2.5.0 | Mask R-CNN | 8 | Mixed | 4h 26min | 34.1 | 83 images/sec | 4 | |
| TensorFlow 2.5.0 | Unet2D | 1 | Mixed | 20min | 88.79 | 48 images/sec | 8 | Results reported for single-fold training time |
| TensorFlow 2.5.0 | Unet2D | 8 | Mixed | 7min | 88.09 | 360 images/sec | 8 | Results reported for single-fold training time |
| TensorFlow 2.5.0 | Unet3D | 1 | Mixed | 1h 47min | 88.96 | 5.2 images/sec | 2 | |
| TensorFlow 2.5.0 | Unet3D | 8 | Mixed | 19min | 89.06 | 35 images/sec | 2 | |
| TensorFlow 2.5.0 | DenseNet (with tf.distribute) | 8 | Mixed | 5h 15min | 73.44 | 5423 images/sec | 128 | |
| TensorFlow 2.5.0 | RetinaNet | 1 | fp32 | 8h 53min | 27.35 | 12 images/sec | 8 | |
| TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 17min | 93.4 | 52 sentences/sec | 24 | |
| TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 22min | 93.3 | 391 sentences/sec | 24 | |
| TensorFlow 2.5.0 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 165 sentences/sec; Phase 2: 31 sentences/sec | Phase 1: 64; Phase 2: 8 | With accumulation steps |
| TensorFlow 2.5.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1310 sentences/sec; Phase 2: 249 sentences/sec | Phase 1: 64; Phase 2: 8 | With accumulation steps |
| TensorFlow 2.5.0 | BERT-Large Pre Training | 32 | Mixed | 39h | | Phase 1: 5233 sentences/sec; Phase 2: 1000 sentences/sec | Phase 1: 64; Phase 2: 8 | With accumulation steps |
| TensorFlow 2.5.0 | Transformer | 8 | Mixed | 17h 43min | 26.6 | 154550 | 4096 | |
| TensorFlow 2.5.0 | T5-base Fine Tuning | 1 | Mixed | 16min | 94.1 | 115 | 16 | |
| TensorFlow 2.5.0 | Albert-Large Fine Tuning (SQUAD) | 8 | Mixed | 14min 42s | F1 90.9; EM 84.18 | 436 sentences/sec | 32 | Time to train does not include tokenization |
| TensorFlow 2.5.0 | Albert-Large Pre Training | 1 | Mixed | | | Phase 1: 177 sentences/sec; Phase 2: 36 sentences/sec | Phase 1: 64; Phase 2: 8 | |
| TensorFlow 2.5.0 | EfficientDet | 8 | fp32 | 4 days 22h | 33.8 | 91.4 images/sec | 8 | |
| TensorFlow 2.5.0 | CycleGan | 1 | Mixed | 9h 25min | | 5.9 | 2 | 8-card training takes longer than 1-card training because the dataset is not sharded between workers for this topology. There is no accuracy metric for this topology; training correctness is verified by visually inspecting generated images and checking the final loss. |
| TensorFlow 2.5.0 | CycleGan | 8 | Mixed | 9h 40min | | 44 | 2 | 8-card training takes longer than 1-card training because the dataset is not sharded between workers for this topology. There is no accuracy metric for this topology; training correctness is verified by visually inspecting generated images and checking the final loss. |
| TensorFlow 2.5.0 | SegNet | 1 | Mixed | 8.5min | 89.57 | 303 images/sec | 16 | |
| TensorFlow 2.5.0 | SegNet | 4 | Mixed | 3.9min | 90.6 | 104 images/sec | 16 | |
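
The multi-card TensorFlow rows scale close to linearly: ResNet50 Keras LARS with Horovod reaches 12950 images/sec on 8 HPUs versus 1690 images/sec on one, roughly 96% scaling efficiency (12950 / (8 × 1690) ≈ 0.96). The published numbers come from Habana's reference-model scripts run inside the TensorFlow Docker image listed under System Configuration below; the snippet that follows is only a minimal sketch of how a stock Keras model is placed on a Gaudi HPU, assuming a SynapseAI 1.0.0 TensorFlow 2.5.0 environment. The ResNet50 here is the generic Keras application, not the tuned reference implementation.

```python
# Minimal sketch (assumption: SynapseAI 1.0.0 TF 2.5.0 Docker image with the
# habana_frameworks package preinstalled). Loading the Habana module registers
# the HPU device; standard Keras code then runs on Gaudi without other changes.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device with TensorFlow

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_dataset, epochs=90)  # supply an ImageNet tf.data pipeline
```

The multi-card runs in the table layer either Horovod or tf.distribute on top of this same enablement step, as noted in the Model column.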

PyTorch Reference Models Performance

| Framework | Model | # HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
|---|---|---|---|---|---|---|---|---|
| PyTorch 1.8.1 | ResNet50 | 1 | Mixed | | 76.04 | 1587 images/sec | 256 | |
| PyTorch 1.8.1 | ResNet50 | 8 | Mixed | 9h 37min | 75.95 | 4545 images/sec | 256 | The PyTorch dataloader consumes a significant portion of training time, impacting overall model performance. |
| PyTorch 1.8.1 | ResNet50 | 16 | Mixed | 4h 58min | 76 | 8904 images/sec | 256 | The PyTorch dataloader consumes a significant portion of training time, impacting overall model performance. |
| PyTorch 1.8.1 | ResNext101 | 1 | Mixed | N/A | | 732 images/sec | 128 | |
| PyTorch 1.8.1 | ResNext101 | 8 | Mixed | 13h 37min | 78.01 | 3041 images/sec | 128 | |
| PyTorch 1.8.1 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11min | 93.3 | 46 sentences/sec | 24 | |
| PyTorch 1.8.1 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 30min | 92.8 | 334 sentences/sec | 24 | Graph compilation time has an overall impact on time to train; this will be fixed in a subsequent release. |
| PyTorch 1.8.1 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 155 sentences/sec; Phase 2: 31 sentences/sec | 64 | |
| PyTorch 1.8.1 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1230 sentences/sec; Phase 2: 245 sentences/sec | 64 | |
| PyTorch 1.8.1 | DLRM | 1 | Mixed | | | 47086 queries/sec | 512 | Using random data as input |
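
The PyTorch rows show weaker 8-card scaling than TensorFlow (ResNet50 goes from 1587 to 4545 images/sec, about 2.9x on 8 HPUs), consistent with the dataloader bottleneck called out in the Comments column. As with TensorFlow, the published numbers come from Habana's reference-model scripts; the sketch below only illustrates basic Gaudi device enablement in a SynapseAI 1.0.0 PyTorch 1.8.1 environment. The exact device string and lazy-mode details varied across early releases, so treat this as an assumption-laden outline rather than the reference recipe.

```python
# Minimal sketch (assumptions: SynapseAI 1.0.0 PyTorch 1.8.1 Docker image;
# "hpu" device string -- some early releases used "habana" instead).
import torch
import habana_frameworks.torch.core as htcore  # registers the Gaudi device

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(256, 1024).to(device)  # stand-in batch; real runs use ImageNet/SQuAD data
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
htcore.mark_step()  # lazy mode: flush the accumulated graph for compile/execute
optimizer.step()
htcore.mark_step()
```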

System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs and two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.0.0-532
TensorFlow: Models run with TensorFlow v2.5.0 use this Docker image
PyTorch: Models run with PyTorch v1.8.1 use this Docker image
Environment: These workloads are run using the above Docker images directly on the host OS


Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.

Stay Informed: Register for the latest Intel Gaudi AI Accelerator developer news, events, training, and updates.