See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on models and containers that are currently integrated with Habana’s SynapseAI software suite. For more information on future model support, please refer to our SynapseAI roadmap page.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.6.0 | ResNet50 Keras SGD | 8 | Mixed | 2h 37m | 76.14 | 12810 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras SGD | 8 | Mixed | 2h 35m | 75.98 | 12943 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS tf.distribute | 8 | Mixed | 1h 10m | 76.04 | 13175 images/sec | 256 | |
TensorFlow 2.5.1 | ResNet50 Keras LARS tf.distribute | 8 | Mixed | 1h 10m | 76.04 | 13205 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 32 | Mixed | 0h 23m | 75.77 | 48710 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 16 | Mixed | 0h 40m | 75.46 | 24567 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 8 | Mixed | 1h 8m | 76.09 | 13183 images/sec | 256 | |
TensorFlow 2.6.0 | ResNet50 Keras LARS | 1 | Mixed | 8h 24m | 76.12 | 1731 images/sec | 256 | |
TensorFlow 2.6.0 | ResNext101 | 8 | Mixed | 6h 33m | 79.27 | 5066 images/sec | 128 | |
TensorFlow 2.6.0 | ResNext101 | 1 | Mixed | 45h 37m | 79.2 | 697 images/sec | 128 | |
TensorFlow 2.6.0 | SSD ResNet34 | 8 | Mixed | 0h 29m | 22.45 | 3637 images/sec | 128 | |
TensorFlow 2.6.0 | SSD ResNet34 | 1 | Mixed | 3h 16m | 22.77 | 506 images/sec | 128 | |
TensorFlow 2.6.0 | Mask R-CNN | 8 | Mixed | 3h 45m | 34.09 | 104 images/sec | 4 | |
TensorFlow 2.6.0 | Mask R-CNN | 1 | Mixed | 25h 33m | 34.1 | 15 images/sec | 4 | |
TensorFlow 2.6.0 | Unet2D | 8 | Mixed | 0h 3m | 88.05 | 371 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.6.0 | Unet2D | 1 | Mixed | 0h 18m | 88.89 | 50 images/sec | 8 | Results reported for single fold training time |
TensorFlow 2.6.0 | Unet3D | 8 | Mixed | 0h 15m | 89.13 | 39 images/sec | 2 | |
TensorFlow 2.6.0 | Unet3D | 1 | Mixed | 1h 26m | 89.93 | 6 images/sec | 2 | |
TensorFlow 2.5.1 | Densenet 121 tf.distribute | 8 | Mixed | 5h 13m | 73.96 | 6575 images/sec | 2048 | |
TensorFlow 2.5.1 | VGG SegNet | 1 | Mixed | 0h 9m | 89.61 | 102 images/sec | 16 | |
TensorFlow 2.6.0 | RetinaNet | 1 | fp32 | 7h 11m | 27.57 | 12 images/sec | 8 | |
TensorFlow 2.6.0 | MobileNet V2 | 1 | Mixed | | | 1135 images/sec | 96 | |
TensorFlow 2.6.0 | EfficientDet | 8 | fp32 | 90h 19m | 33.51 | 157 images/sec | 8 | |
TensorFlow 2.5.1 | CycleGAN | 1 | Mixed | 5h 13m | | 12 | 2 | |
TensorFlow 2.5.1 | Transformer | 8 | Mixed | 17h 24m | 26.4 | 157465 | 4096 | |
TensorFlow 2.5.1 | Transformer | 1 | Mixed | | 23.7 | 22145 | 4096 | |
TensorFlow 2.5.1 | T5 Base | 1 | Mixed | 0h 16m | 94.58 | 109 sentences/sec | 16 | |
TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 8 | Mixed | 0h 14m | 93.44 | 391 sentences/sec | 24 | |
TensorFlow 2.6.0 | BERT-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 7m | 93.56 | 53 sentences/sec | 24 | |
TensorFlow 2.6.0 | BERT-Large Pre Training | 32 | Mixed | 36h 55m | Phase 1: Loss 1.3, Phase 2: Loss 0.86 | Phase 1: 5527 sentences/sec, Phase 2: 1066 sentences/sec | Phase 1: 64, Phase 2: 8 | With accumulation steps |
TensorFlow 2.6.0 | BERT-Large Pre Training | 8 | Mixed | | | Phase 1: 1404 sentences/sec, Phase 2: 271 sentences/sec | Phase 1: 64, Phase 2: 8 | With accumulation steps |
TensorFlow 2.5.1 | Albert-Large Fine Tuning (SQUAD) | 8 | Mixed | 0h 23m | F1: 91, EM: 84 | 442 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.1 | Albert-Large Fine Tuning (SQUAD) | 1 | Mixed | 1h 11m | | 54 sentences/sec | 32 | Time to train does not include tokenization |
TensorFlow 2.5.1 | Albert-Large Pre Training | 1 | Mixed | | | Phase 1: 176 sentences/sec, Phase 2: 37 sentences/sec | Phase 1: 64, Phase 2: 8 | |
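The "Mixed" precision rows above use bf16 on Gaudi through Habana's TensorFlow integration. As a rough illustration only, here is a minimal sketch of placing a Keras model on an HPU using the documented `load_habana_module` entry point; the ResNet50 model, synthetic data, and the Keras `mixed_bfloat16` policy are illustrative assumptions, not the reference-model code that produced the numbers above (those models use their own launch scripts and bf16 configuration from Habana's Model References repository).

```python
# Minimal sketch (not the measured reference-model code): placing a Keras
# model on a Gaudi HPU via Habana's documented TensorFlow integration.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device with TensorFlow

# One way to approximate the "Mixed" rows; the reference models configure
# bf16 through their own launch scripts, so treat this as an assumption.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Synthetic stand-in for ImageNet; per-chip batch size 256 matches the table.
images = tf.random.uniform([256, 224, 224, 3])
labels = tf.random.uniform([256], maxval=1000, dtype=tf.int32)
model.fit(images, labels, batch_size=256, epochs=1)
```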
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.9.1 | ResNet50 | 16 | Mixed | 2h 56m | 75.77 | 22236 images/sec | 256 | |
PyTorch 1.9.1 | ResNet50 Host NIC | 16 | Mixed | 4h 0m | 75.82 | 9634 images/sec | 256 | |
PyTorch 1.9.1 | ResNet50 | 8 | Mixed | 2h 43m | 75.96 | 12752 images/sec | 256 | |
PyTorch 1.9.1 | ResNet152 | 8 | Mixed | 7h 51m | 78.07 | 4927 images/sec | 128 | |
PyTorch 1.9.1 | ResNext101 | 8 | Mixed | 6h 54m | 78.14 | 6053 images/sec | 128 | |
PyTorch 1.9.1 | Unet2D | 8 | Mixed | 1h 24m | 72.74 | 4531 images/sec | 64 | |
PyTorch 1.9.1 | DLRM | 1 | Mixed | | | 48312 queries/sec | 512 | Uses Random Input Distribution |
PyTorch 1.9.1 | Transformer | 8 | Mixed | 25h 1m | 27.6 | 127631 | 4096 | |
PyTorch 1.9.1 | RoBERTa Large | 8 | Mixed | 0h 12m | 94.7 | 259 sentences/sec | 12 | |
PyTorch 1.9.1 | RoBERTa Large | 1 | Mixed | 1h 17m | 94.61 | 38 sentences/sec | 12 | |
PyTorch 1.9.1 | RoBERTa Base | 8 | Mixed | 0h 5m | 91.67 | 640 sentences/sec | 12 | |
PyTorch 1.9.1 | RoBERTa Base | 1 | Mixed | 0h 30m | 92.54 | 102 sentences/sec | 12 | |
PyTorch 1.9.1 | DistilBERT | 8 | Mixed | 0h 13m | 85.24 | 503 sentences/sec | 8 | |
PyTorch 1.9.1 | DistilBERT | 1 | Mixed | 0h 42m | 85.56 | 136 sentences/sec | 8 | |
PyTorch 1.9.1 | BERT-Large Fine Tuning Lazy Mode (SQUAD) | 8 | Mixed | 0h 10m | 93.13, F1 Score: 91.35% | 318 sentences/sec | 24 | |
PyTorch 1.9.1 | BERT-Large Fine Tuning Lazy Mode (SQUAD) | 1 | Mixed | 1h 8m | 93.44 | 43 sentences/sec | 24 | |
PyTorch 1.9.1 | BERT-Large Pre Training Lazy Mode | 32 | Mixed | 40h 13m | Phase 1: Loss 1.3, Phase 2: Loss 1.34 | Phase 1: 4124 sentences/sec, Phase 2: 635 sentences/sec | Phase 1: 64 | |
PyTorch 1.9.1 | BERT-Large Pre Training Lazy Mode | 8 | Mixed | | | Phase 1: 1290 sentences/sec, Phase 2: 259 sentences/sec | Phase 1: 64 | |
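Several of the PyTorch rows above run in lazy mode through Habana's PyTorch bridge, where ops are accumulated into a graph and executed when the graph is flushed. The sketch below shows the general shape of one lazy-mode training step using the documented `habana_frameworks.torch.core.mark_step()` call; the toy model and synthetic data are placeholders, not the reference-model code behind the measurements.

```python
# Minimal sketch (not the measured reference-model code): one lazy-mode
# training step on a Gaudi HPU via Habana's PyTorch bridge. In the
# SynapseAI 1.x docs, lazy mode is selected with PT_HPU_LAZY_MODE=1.
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Toy model and synthetic data; the tabled numbers come from the full
# reference models, which also configure mixed (bf16) precision.
model = torch.nn.Linear(224 * 224 * 3, 1000).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(256, 224 * 224 * 3, device=device)
targets = torch.randint(0, 1000, (256,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
htcore.mark_step()  # in lazy mode, flushes the accumulated graph for execution
optimizer.step()
htcore.mark_step()
```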
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs, two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.1.0-614
TensorFlow: Models run with TensorFlow v2.5.1 use this Docker image; models run with v2.6.0 use this Docker image
PyTorch: Models run with PyTorch v1.9.1 use this Docker image
Environment: These workloads are run using the Docker images above, directly on the host OS
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.