See the latest TensorFlow and PyTorch model performance data. Visit the Habana catalog for information on models and containers that are currently integrated with Habana's SynapseAI software suite. For information on future model support, please refer to our SynapseAI roadmap page.
TensorFlow Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
TensorFlow 2.7.0 | ResNet50 Keras LARS | 32 | bf16 | 0h 21m | 75.92 | 49060 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS | 16 | bf16 | 0h 40m | 75.69 | 24598 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS | 8 | bf16 | 1h 9m | 76.06 | 12880 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS | 1 | bf16 | 8h 36m | 76.03 | 1695.58 images/sec | 256 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 1 | 32 | bf16 | | | 6586.64 sentences/sec | 64 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 1 | 8 | bf16 | | | 1669.71 sentences/sec | 64 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 1 | 1 | bf16 | | | 210.36 sentences/sec | 64 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 2 | 32 | bf16 | 26h 39m | | 2124.07 sentences/sec | 8 | Time to Train of 26h 39m covers Phase 1 and Phase 2 combined |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 2 | 8 | bf16 | | | 538.94 sentences/sec | 8 | |
TensorFlow 2.7.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | | | 67.98 sentences/sec | 8 | |
TensorFlow 2.7.0 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 0h 14m | 93.03 | 392.76 sentences/sec | 24 | |
TensorFlow 2.7.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 1h 7m | 93.47 | 53.34 sentences/sec | 24 | |
TensorFlow 2.7.0 | SSD ResNet34 | 8 | bf16 | 0h 32m | 22.19 | 3620.58 images/sec | 128 | |
TensorFlow 2.7.0 | SSD ResNet34 | 1 | bf16 | 4h 47m | 23.68 | 502.29 images/sec | 128 | |
TensorFlow 2.7.0 | ResNext101 | 8 | bf16 | 6h 41m | 79.21 | 5002 images/sec | 128 | |
TensorFlow 2.7.0 | ResNext101 | 1 | bf16 | 46h 38m | 79.26 | 689.77 images/sec | 128 | |
TensorFlow 2.7.0 | Unet2D | 8 | bf16 | 0h 3m | 88.2 | 392.07 images/sec | 8 | |
TensorFlow 2.7.0 | Unet2D | 1 | bf16 | 0h 18m | 88.83 | 51.72 images/sec | 8 | |
TensorFlow 2.7.0 | Unet3D | 8 | bf16 | 0h 14m | 88.22 | 42.75 images/sec | 2 | |
TensorFlow 2.7.0 | Unet3D | 1 | bf16 | 1h 20m | 89.7 | 6.7 images/sec | 2 | |
TensorFlow 2.7.0 | Transformer | 8 | bf16 | 19h 31m | 26.6 | 155888 sentences/sec | 4096 | |
TensorFlow 2.7.0 | Transformer | 1 | bf16 | 17h 48m | 23.7 | 22245 sentences/sec | 4096 | |
TensorFlow 2.7.0 | Mask R-CNN | 8 | bf16 | 4h 21m | 34.14 | 107.6 images/sec | 4 | |
TensorFlow 2.7.0 | Mask R-CNN | 1 | bf16 | 25h 28m | 33.98 | 15.72 images/sec | 4 | |
TensorFlow 2.7.0 | VisionTransformer | 8 | bf16 | 7h 37m | 84.44 | 442.47 images/sec | 32 | |
TensorFlow 2.7.0 | RetinaNet | 8 | bf16 | 7h 43m | 38.71 | 79.94 images/sec | 64 | |
TensorFlow 2.6.2 | Densenet 121 tf.distribute | 8 | bf16 | 6h 33m | 74.69 | 6575 images/sec | 1024 | |
TensorFlow 2.7.0 | T5 Base | 1 | bf16 | 0h 20m | 94.32 | 96.74 sentences/sec | 16 | |
TensorFlow 2.7.0 | VGG SegNet | 1 | bf16 | 0h 10m | 88.63 | 109.4 images/sec | 16 | |
TensorFlow 2.7.0 | MobileNet V2 | 1 | bf16 | | | 1119.96 images/sec | 96 | |
TensorFlow 2.7.0 | EfficientDet | 8 | fp32 | | | 152.46 images/sec | 8 | |
TensorFlow 2.7.0 | CycleGAN | 1 | bf16 | 4h 9m | 15.78 | | 2 | |
TensorFlow 2.7.0 | Albert-Large Fine Tuning (SQUAD) | 8 | bf16 | 0h 25m | 90.68 | 438.36 sentences/sec | 32 | |
TensorFlow 2.7.0 | Albert-Large Fine Tuning (SQUAD) | 1 | bf16 | 1h 28m | 91.03 | 55.06 sentences/sec | 32 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS tf.distribute | 8 | bf16 | 1h 10m | 76.05 | 12799 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras SGD | 8 | bf16 | 2h 39m | 76.19 | 12612 images/sec | 256 | |
TensorFlow 2.7.0 | ResNet50 Keras LARS Host NIC | 16 | bf16 | | | 21474 images/sec | 256 | Using Horovod and libfabric with HCCL_OVER_OFI=1 |
TensorFlow 2.7.0 | ResNet50 Keras LARS Host NIC | 16 | bf16 | | | 19841 images/sec | 256 | Using tf.distribute and libfabric with HCCL_OVER_OFI=1 |
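The Host NIC rows above pair Horovod or tf.distribute with libfabric by setting HCCL_OVER_OFI=1. As a rough illustration of how that fits into a training script, here is a minimal Python sketch; it assumes the habana_frameworks and habana-horovod packages from the listed TensorFlow Docker image, and import paths may differ across SynapseAI releases.

```python
# Minimal sketch of a Horovod-based scale-out setup on Gaudi, matching the
# "ResNet50 Keras LARS Host NIC" rows above. Assumes the habana_frameworks
# and habana-horovod packages from the listed Docker image.
import os

# Route HCCL collectives over libfabric/OFI so scale-out traffic uses the
# host NICs, as noted in the table comments. In practice this is usually
# exported by the launcher (e.g. mpirun -x HCCL_OVER_OFI=1) rather than
# set inside the script.
os.environ.setdefault("HCCL_OVER_OFI", "1")

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()  # registers the HPU device with TensorFlow

import horovod.tensorflow.keras as hvd

hvd.init()
print(f"worker {hvd.rank()} of {hvd.size()}")
# The per-worker model is then built and trained as usual, with the optimizer
# wrapped in hvd.DistributedOptimizer(...) so gradients are all-reduced.
```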
PyTorch Reference Models Performance
Framework | Model | #HPU | Precision | Time to Train | Accuracy | Throughput | Batch Size | Comments |
---|---|---|---|---|---|---|---|---|
PyTorch 1.10.0 | ResNet50 | 32 | bf16 | 0h 52m | 74.66 | 40633 images/sec | 256 | |
PyTorch 1.10.0 | ResNet50 | 16 | bf16 | 1h 28m | 75.91 | 25022 images/sec | 256 | |
PyTorch 1.10.0 | ResNet50 | 8 | bf16 | 2h 47m | 76.1 | 12833 images/sec | 256 | |
PyTorch 1.10.0 | ResNet50 | 1 | bf16 | 21h 56m | 75.88 | 1722 images/sec | 256 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 1 | 32 | bf16 | 27h 46m | | 4920 sentences/sec | 64 | ph1 final_loss: 1.494, ph2 final_loss: 1.349 |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 1 | 8 | bf16 | | | 1282 sentences/sec | 64 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 1 | 1 | bf16 | | | 154 sentences/sec | 64 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 2 | 32 | bf16 | 14h 18m | | 970 sentences/sec | 8 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 2 | 8 | bf16 | | | 256 sentences/sec | 8 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Pre Training Phase 2 | 1 | bf16 | | | 32 sentences/sec | 8 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Fine Tuning | 8 | bf16 | 0h 10m | 93.17 | 341 sentences/sec | 24 | |
PyTorch 1.10.0 | BERT-L Lazy Mode Fine Tuning | 1 | bf16 | 1h 12m | 92.94 | 49 sentences/sec | 24 | |
PyTorch 1.10.0 | ResNext101 | 8 | bf16 | 6h 39m | 78.04 | 5768 images/sec | 128 | |
PyTorch 1.10.0 | ResNext101 | 1 | bf16 | 48h 47m | 78.13 | 777.53 images/sec | 128 | |
PyTorch 1.10.0 | ResNet152 | 8 | bf16 | 7h 53m | 78.03 | 5191 images/sec | 128 | |
PyTorch 1.10.0 | ResNet152 | 1 | bf16 | 46h 50m | 77.61 | 729 images/sec | 128 | |
PyTorch 1.10.0 | Unet2D | 8 | bf16 | 1h 8m | 72.82 | 4624.22 images/sec | 64 | |
PyTorch 1.10.0 | Unet2D | 1 | bf16 | 9h 24m | 72.84 | 609.44 images/sec | 64 | |
PyTorch 1.10.0 | Unet3D | 8 | bf16 | 1h 27m | 74.13 | 59.77 images/sec | 2 | |
PyTorch 1.10.0 | Unet3D | 1 | bf16 | 13h 34m | 74.3 | 7.56 images/sec | 2 | |
PyTorch 1.10.0 | SSD | 8 | bf16 | 1h 25m | 22.93 | 1664 images/sec | 32 | |
PyTorch 1.10.0 | SSD | 1 | bf16 | 4h 12m | 23.07 | 449 images/sec | 32 | |
PyTorch 1.10.0 | Transformer | 8 | bf16 | 20h 49m | 28.1 | 150407 sentences/sec | 4096 | |
PyTorch 1.10.0 | Transformer | 1 | bf16 | 22h 22m | | 21525.8 sentences/sec | 4096 | |
PyTorch 1.10.0 | GoogLeNet | 8 | bf16 | 4h 18m | 72.44 | 15056 images/sec | 256 | |
PyTorch 1.10.0 | GoogLeNet | 1 | bf16 | 19h 9m | 72.31 | 1851 images/sec | 256 | |
PyTorch 1.10.0 | DistilBERT | 8 | bf16 | 0h 10m | 85.49 | 770 sentences/sec | 8 | |
PyTorch 1.10.0 | DistilBERT | 1 | bf16 | 0h 41m | 85.47 | 149 sentences/sec | 8 | |
PyTorch 1.10.0 | RoBERTa Large | 8 | bf16 | 0h 11m | 94.53 | 284 sentences/sec | 12 | |
PyTorch 1.10.0 | RoBERTa Large | 1 | bf16 | 1h 30m | 94.27 | 42.6 sentences/sec | 12 | |
PyTorch 1.10.0 | RoBERTa Base | 8 | bf16 | 0h 4m | 91.85 | 731 sentences/sec | 12 | |
PyTorch 1.10.0 | RoBERTa Base | 1 | bf16 | 0h 34m | 92.39 | 128 sentences/sec | 12 | |
PyTorch 1.10.0 | ALBERT-XXL Fine Tuning | 8 | bf16 | 0h 43m | 94.91 | 74 sentences/sec | 12 | |
PyTorch 1.10.0 | ALBERT-XXL Fine Tuning | 1 | bf16 | 5h 29m | 94.79 | 9 sentences/sec | 12 | |
PyTorch 1.10.0 | ALBERT-Large Fine Tuning | 8 | bf16 | 0h 10m | 91.9 | 362 sentences/sec | 32 | |
PyTorch 1.10.0 | ALBERT-Large Fine Tuning | 1 | bf16 | 1h 7m | 93.25 | 44 sentences/sec | 32 | |
PyTorch 1.10.0 | MobileNetV2 | 1 | bf16 | | | 1515 images/sec | 256 | |
PyTorch 1.10.0 | BART Fine Tuning | 8 | bf16 | 0h 8m | | 1364 sentences/sec | 32 | |
PyTorch 1.10.0 | BART Fine Tuning | 1 | bf16 | 0h 50m | | 193 sentences/sec | 32 | |
PyTorch 1.10.0 | ResNet50 Host NIC | 16 | bf16 | 2h 5m | 75.96 | 16311 images/sec | 256 | |
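Several PyTorch rows are labeled Lazy Mode: ops are accumulated into a graph that is flushed to the HPU at explicit mark_step() boundaries rather than executed eagerly. Below is a minimal sketch of one training step in that mode, with a small linear layer standing in for the real model; it assumes the habana_frameworks package from the listed PyTorch Docker image, whose import paths vary by release.

```python
# Hedged sketch of one training step in Habana's Lazy Mode, as named in the
# BERT-L rows above. A tiny linear layer stands in for the real model.
import torch
import habana_frameworks.torch.core as htcore  # older releases may instead
# require an explicit load_habana_module() call before "hpu" is available

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 1024).to(device)
loss = model(inputs).pow(2).mean()

optimizer.zero_grad()
loss.backward()
htcore.mark_step()   # flush the accumulated forward/backward graph to the HPU

optimizer.step()
htcore.mark_step()   # flush the optimizer update as well
print(loss.item())
```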
System Configuration:
HPU: Habana Gaudi® HL-205 Mezzanine cards
System: HLS-1 with eight HL-205 HPUs and two Intel® Xeon® Platinum 8280 CPUs @ 2.70GHz, and 756 GB of system memory
Software: Ubuntu 20.04, SynapseAI software version 1.2.0-585
TensorFlow: Models run with TensorFlow v2.7.0 use this Docker image; models run with v2.6.2 use this Docker image
PyTorch: Models run with PyTorch v1.10.0 use this Docker image
Environment: These workloads run in the Docker images listed above, directly on the host OS
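As a worked example of reading the throughput columns above, scaling efficiency can be estimated as the multi-card throughput divided by the card count times the single-card throughput. The sketch below applies this to two ResNet50 entries from the tables:

```python
# Scaling efficiency computed from the throughput columns above:
# efficiency = multi-card throughput / (cards * single-card throughput)
def scaling_efficiency(multi_tput: float, cards: int, single_tput: float) -> float:
    return multi_tput / (cards * single_tput)

# TensorFlow ResNet50 Keras LARS: 49060 img/s on 32 HPUs vs 1695.58 img/s on 1
print(f"TF ResNet50, 32 HPUs: {scaling_efficiency(49060, 32, 1695.58):.1%}")  # ~90.4%

# PyTorch ResNet50: 12833 img/s on 8 HPUs vs 1722 img/s on 1
print(f"PT ResNet50, 8 HPUs: {scaling_efficiency(12833, 8, 1722):.1%}")       # ~93.2%
```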
Performance varies by use, configuration and other factors. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.