Get access to Habana’s popular frameworks and optimized models that enable you to quickly and easily build, train, and deploy your models on Gaudi. For more information on future model support, please refer to our SynapseAI roadmap page.
See the latest TensorFlow and PyTorch model performance.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.5.0 | ResNet50 Keras LARS | 1 | Mixed | 8h 39min | 76.06 | 1690 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras LARS (with Horovod) | 8 | Mixed | 1h 13min | 76.11 | 12950 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras LARS (with tf.distribute) | 8 | Mixed | 1h 9min | 75.96 | 12900 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 1 | Mixed | 19h 31min | 76.2 | 1700 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 8 | Mixed | 2h 39min | 76.3 | 12580 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 16 | Mixed | 1h 28min | 75.99 | 23400 images/sec | 256 | |
TensorFlow 2.5.0 | ResNet50 Keras SGD | 32 | Mixed | 46min | 75.94 | 46400 images/sec | 256 | |
TensorFlow 2.5.0 | ResNext101 | 1 | Mixed | | 79.2 | 650 images/sec | 128 | |
TensorFlow 2.5.0 | ResNext101 | 8 | Mixed | 6h 56min | 79.02 | 4656 images/sec | 128 | |
TensorFlow 2.5.0 | SSD ResNet34 | 1 | Mixed | 3h 35min | 22.97 | 470 images/sec | 128 | |
TensorFlow 2.5.0 | SSD ResNet34 | 8 | Mixed | 35min | 22.04 | 3406 images/sec | 128 | |
TensorFlow 2.5.0 | Mask R-CNN | 1 | Mixed | 25h 18min | 34.04 | 15 images/sec | 4 | |
TensorFlow 2.5.0 | Mask R-CNN | 8 | Mixed | 4h 26min | 34.1 | 83 images/sec | 4 | |
TensorFlow 2.5.0 | Unet2D | 1 | Mixed | 20min | 88.79 | 48 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.5.0 | Unet2D | 8 | Mixed | 7min | 88.09 | 360 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.5.0 | Unet3D | 1 | Mixed | 1h 47min | 88.96 | 5.2 images/sec | 2 | |
TensorFlow 2.5.0 | Unet3D | 8 | Mixed | 19min | 89.06 | 35 images/sec | 2 | |
TensorFlow 2.5.0 | DenseNet (with tf.distribute) | 8 | Mixed | 5h 15min | 73.44 | 5423 images/sec | 128 | |
TensorFlow 2.5.0 | RetinaNet | 1 | fp32 | 8h 53min | 27.35 | 12 images/sec | 8 | |
TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 17min | 93.4 | 52 sentences/sec | 24 | |
TensorFlow 2.5.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 22min | 93.3 | 391 sentences/sec | 24 | |
TensorFlow 2.5.0 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 165 sps; Phase 2: 31 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1310 sps; Phase 2: 249 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.0 | BERT-Large Pre Training | 32 | Mixed | 39h | | Phase 1: 5233 sps; Phase 2: 1000 sps | Phase 1: 64; Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.0 | Transformer | 8 | Mixed | 17h 43min | 26.6 | 154550 | 4096 | |
TensorFlow 2.5.0 | T5-base Fine Tuning | 1 | Mixed | 16min | 94.1 | 115 | 16 | |
TensorFlow 2.5.0 | ALBERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 14min 42s | F1: 90.9; EM: 84.18 | 436 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.0 | ALBERT-Large Pre Training | 1 | Mixed | | | Phase 1: 177 sps; Phase 2: 36 sps | Phase 1: 64; Phase 2: 8 | |
TensorFlow 2.5.0 | EfficientDet | 8 | fp32 | 4 days 22h | 33.8 | 91.4 images/sec | 8 | |
TensorFlow 2.5.0 | CycleGAN | 1 | Mixed | 9h 25min | | 5.9 | 2 | 8-card training takes longer than 1-card training because the dataset is not sharded between workers for this topology. There is no accuracy metric for this topology; to verify training correctness, we visually inspect generated images and check the final loss. |
TensorFlow 2.5.0 | CycleGAN | 8 | Mixed | 9h 40min | | 44 | 2 | 8-card training takes longer than 1-card training because the dataset is not sharded between workers for this topology. There is no accuracy metric for this topology; to verify training correctness, we visually inspect generated images and check the final loss. |
TensorFlow 2.5.0 | SegNet | 1 | Mixed | 8.5min | 89.57 | 303 images/sec | 16 | |
TensorFlow 2.5.0 | SegNet | 4 | Mixed | 3.9min | 90.6 | 104 images/sec | 16 |
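Several of the ResNet50 entries above scale to 8 cards with Horovod or tf.distribute. As a rough illustration of the Horovod pattern, here is a minimal sketch, not the actual reference script; it assumes the SynapseAI TensorFlow bridge (habana_frameworks.tensorflow) and an HCCL-enabled Horovod build from the TensorFlow Docker image listed under System Configuration:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device with TensorFlow
hvd.init()            # one worker process per Gaudi card

model = tf.keras.applications.ResNet50(weights=None)

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across all cards on every step.
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Broadcast rank 0's initial weights so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_dataset, epochs=..., callbacks=callbacks)
```

Each worker is started by an MPI-style launcher, one process per card across the eight HPUs of an HLS-1; the complete training scripts, including the LARS variants, are published in Habana's Model-References repository.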
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.8.1 | ResNet50 | 1 | Mixed | | 76.04 | 1587 images/sec | 256 | |
PyTorch 1.8.1 | ResNet50 | 8 | Mixed | 9h 37min | 75.95 | 4545 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
PyTorch 1.8.1 | ResNet50 | 16 | Mixed | 4h 58min | 76 | 8904 images/sec | 256 | PyTorch dataloader consumes a significant portion of the training time, impacting overall model performance. |
PyTorch 1.8.1 | ResNext101 | 1 | Mixed | | N/A | 732 images/sec | 128 | |
PyTorch 1.8.1 | ResNext101 | 8 | Mixed | 13h 37min | 78.01 | 3041 images/sec | 128 | |
PyTorch 1.8.1 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11min | 93.3 | 46 sentences/sec | 24 | |
PyTorch 1.8.1 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 30min | 92.8 | 334 sentences/sec | 24 | Graph compilation time has an overall impact on time to train. This will be fixed in a subsequent release. |
PyTorch 1.8.1 | BERT-Large Pre Training | 1 | Mixed | | | Phase 1: 155 sentences/sec; Phase 2: 31 sentences/sec | 64 | |
PyTorch 1.8.1 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1230 sentences/sec; Phase 2: 245 sentences/sec | 64 | |
PyTorch 1.8.1 | DLRM | 1 | Mixed | | | 47086 queries/sec | 512 | Using random data as input |
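The PyTorch models above run through Habana's PyTorch bridge, which executes operations lazily and flushes the accumulated graph to the device at explicit synchronization points. A minimal single-card sketch of that pattern follows; it uses a toy model rather than one of the reference topologies and assumes habana_frameworks.torch from the PyTorch Docker image listed under System Configuration:

```python
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Toy model and random data, standing in for a real topology.
model = torch.nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(256, 1024).to(device)
labels = torch.randint(0, 10, (256,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
htcore.mark_step()  # flush the forward/backward graph to the HPU
optimizer.step()
htcore.mark_step()  # flush the optimizer update
print(loss.item())
```

Note the dataloader comment on the ResNet50 rows: host-side data loading runs on the CPU, so it can bound end-to-end throughput even when the device-side graph is fast.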
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI Software version 1.0.0-532
TensorFlow: Models run with TensorFlow v2.5.0 use this Docker image.
PyTorch: Models run with PyTorch v1.8.1 use this Docker image.
Environment: Workloads are run using the Docker images above, directly on the host OS.
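Before launching a workload inside the container, a quick sanity check (a hypothetical snippet, not taken from the images' documentation) can confirm that TensorFlow sees the Gaudi devices:

```python
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device type
print(tf.config.list_logical_devices("HPU"))  # expect one entry per visible card
```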
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.