Intel® Gaudi® AI Accelerators Model Performance Data
These performance numbers were measured using the latest Intel Gaudi software release, version 1.16.0-526, unless otherwise noted.
All models, for both training and inference, use the PyTorch 2.2.2 framework; any other framework used for training or inference is noted for each model.
The MLPerf numbers below were generated with previous versions of the Intel Gaudi software and are planned to be updated with new MLPerf Training results in the next Intel Gaudi software release.
| Model | # HPU | Precision | Time To Train | Framework Version |
|---|---|---|---|---|
| MLPerf 3.1 - GPT3 | 384 | fp8 | 153.58 min* | |
| MLPerf 3.1 - GPT3 | 256 | fp8 | 223.75 min** | |
| MLPerf 3.1 - Stable Diffusion v2 | 64 | bf16 | 19.4 min** | Lightning 2.1.2 |
| MLPerf 3.1 - ResNet | 8 | bf16 | 16.4 min*** | |
| MLPerf 3.1 - BERT | 8 | bf16 | 15.01 min*** | |
* The GPT3 measurement with 384 cards was taken using a pre-launch version of the Intel Gaudi 1.13.0 software stack.
** The GPT3 measurement with 256 cards and the Stable Diffusion measurement were taken using the Intel Gaudi 1.13.0 software stack.
*** The ResNet and BERT measurements were taken using the Intel Gaudi 1.15.0 software stack.
Intel Gaudi 2 Large Language Models Training Performance
| Model | # HPU | Precision | Throughput | Sequence Length | TP,PP,DP | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 8 | FP8 | 68439 tokens/sec | 4,096 | 1,1,8 | 1,024 | Megatron DeepSpeed PR #372 |
| LLaMA 2 13B | 16 | FP8 | 52428 tokens/sec | 4,096 | 2,2,4 | 256 | Megatron DeepSpeed PR #372 |
| LLaMA 2 70B | 64 | FP8 | 52838 tokens/sec | 4,096 | 8,2,4 | 1,024 | Megatron DeepSpeed PR #372 |
TP, PP, DP are the tensor-parallel, pipeline-parallel, and data-parallel dimensions of the Megatron-DeepSpeed training run; the three factors multiply out to the total HPU count, as the sketch below illustrates.
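A minimal sanity-check sketch of that relationship (the helper function name is ours, for illustration only; the asserted values come straight from the table above):

```python
# Sketch: how the table's TP, PP, DP factors relate to the HPU count in
# Megatron-DeepSpeed-style 3D parallelism, where world_size = TP * PP * DP.

def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Derive the data-parallel dimension from the total device count."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide the device count"
    return world_size // (tp * pp)

assert data_parallel_size(8, tp=1, pp=1) == 8    # LLaMA 2 7B row: 1,1,8
assert data_parallel_size(16, tp=2, pp=2) == 4   # LLaMA 2 13B row: 2,2,4
assert data_parallel_size(64, tp=8, pp=2) == 4   # LLaMA 2 70B row: 8,2,4
```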
Intel Gaudi 2 Reference Models Training Performance
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| Llama 2 13B | 16 | bf16 | 10 samples/sec | | | 256 | DeepSpeed 0.14.0 |
| Llama 2 70B | 64 | bf16 | 8.88 samples/sec | | | 1,024 | DeepSpeed 0.14.0 |
| Llama 2 70B | 64 | FP8 | 12.9 samples/sec | | | 1,024 | DeepSpeed 0.14.0 |
| Stable Diffusion | 64 | bf16 | 11145.8 img/sec | | | 32 | Lightning 2.2.4 |
| Stable Diffusion Fine Tuning* | 1 | bf16 | 71 img/sec | | | 7 | Lightning 2.2.4 |
| Stable Diffusion Fine Tuning Textual Inversion* | 1 | bf16 | 20.9 img/sec | | | 7 | Lightning 2.2.4 |
| ResNet50 LARS | 32 | bf16 | 18399 img/sec | 76.12 | 7.81 min | 256 | |
| ResNet50 LARS | 8 | bf16 | 47070 img/sec | 76.14 | 18.98 min | 256 | |
| ResNet50 LARS | 1 | bf16 | 6233 img/sec | | | 256 | |
| BERT Pre Training Phase 1 | 32 | bf16 | 32450 sent/sec | | 254 min | 64 | |
| BERT Pre Training Phase 1 | 8 | bf16 | 9218 sent/sec | 0 | | 64 | |
| BERT Pre Training Phase 1 | 1 | bf16 | 1178 sent/sec | | | 64 | |
| BERT Pre Training Phase 2 | 32 | bf16 | 10861 sent/sec | 0 | 80.21 min | 16 | |
| BERT Pre Training Phase 2 | 8 | bf16 | 2777.5 sent/sec | 0 | | 16 | |
| BERT Pre Training Phase 2 | 1 | bf16 | 351 sent/sec | | | 16 | |
| BERT SQUAD Fine Tuning | 8 | bf16 | 2075 sent/sec | 90.64 | 4.68 min | 24 | |
| BERT SQUAD Fine Tuning | 1 | bf16 | 285 sent/sec | | | 24 | |
| ResNext101 | 8 | bf16 | 22184 img/sec | 77.93 | 100 min | 256 | |
| ResNext101 | 1 | bf16 | 2853 img/sec | | | 256 | |
| SSD | 8 | bf16 | 14651 img/sec | 23.02 | 10.3 min | 128 | |
| SSD | 1 | bf16 | 2140 img/sec | | | 128 | |
| Transformer | 8 | bf16 | 1110435 token/sec | 27.8 | 241.73 min | 8,192 | |
| Transformer | 1 | bf16 | 138173.66 token/sec | | | 8,192 | |
| Unet2D (torch.compile) | 8 | bf16 | 19938.29 img/sec | 72.66 | 12.55 min | 64 | Lightning 2.2.4 |
| Unet2D (torch.compile) | 1 | bf16 | 2626 img/sec | | | 64 | Lightning 2.2.4 |
| Unet3D | 8 | bf16 | 252 img/sec; 254.7 img/sec | 74.26 | | 2 | Lightning 2.2.4 |
| Unet3D | 1 | bf16 | 32.42 img/sec | | | 2 | Lightning 2.2.4 |
Hugging Face Optimum Habana for Intel Gaudi 2 Training Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
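For orientation, the runs in the table below follow the Optimum Habana trainer pattern, in which GaudiTrainer and GaudiTrainingArguments act as drop-in replacements for the transformers Trainer and TrainingArguments. The following is a minimal sketch of that pattern only; the model, dataset, and hyperparameters are illustrative placeholders, not the exact configuration behind any row:

```python
# Minimal sketch of the Optimum Habana fine-tuning pattern (illustrative
# model/dataset; see the Examples page for the actual runs in this table).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

model_name = "bert-base-uncased"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Small slice of SST-2 purely to keep the sketch self-contained and quick.
dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,        # run on HPU
    use_lazy_mode=True,     # Gaudi lazy-mode graph execution
    per_device_train_batch_size=24,
    # Mixed-precision (bf16) and fused-op settings come from a GaudiConfig
    # published on the Hugging Face Hub for this model family.
    gaudi_config_name="Habana/bert-base-uncased",
)

trainer = GaudiTrainer(model=model, args=args,
                       train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```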
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|---|
| Llama2-70B Fine Tuning FSDP (LoRA with torch.compile) | 8 | bf16 | 1.3 sentences/sec | 2.13 | 81.75 min | 10 | language-modeling | Optimum Habana 1.11.1 |
| Llama2-70B Fine Tuning (LoRA) | 8 | bf16 | 2.6 sentences/sec | 2.13 | 39.43 min | 10 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Llama1-7B Fine Tuning (LoRA) | 8 | bf16 | 150 sentences/sec | 2.35 | 5.08 min | 64 | language-modeling | Optimum Habana 1.11.1 |
| Falcon-180B Fine Tuning (LoRA) | 8 | bf16 | 2.67 sentences/sec | 3.71 | 149.41 min | 1 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Falcon-40B Fine Tuning (LoRA) | 8 | bf16 | 27.99 sentences/sec | 4.06 | 15.85 min | 1 | language-modeling | Optimum Habana 1.11.1 |
| GPTJ-CLM | 8 | bf16 | 22.24 sentences/sec | 0.53 | 17.18 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| GPTNEOX-20B-CLM | 16 | bf16 | 294 sentences/sec | 0.53 | 27.21 min | 2 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| BridgeTower | 8 | bf16 | 726 sentences/sec | | 20.63 min | 40 | contrastive-image-text | Optimum Habana 1.11.1 |
| GPT2 | 8 | bf16 | 651 sentences/sec | 0.40 | 1.61 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| GPT2-XL | 8 | bf16 | 94.24 sentences/sec | 0.47 | 6.55 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| ALBERT-Large | 8 | bf16 | 2479 sentences/sec | 91.70 | 1.86 min | 32 | question-answering | Optimum Habana 1.11.1 |
| ALBERT-XXL | 8 | bf16 | 456 sentences/sec | 94.80 | 6.73 min | 16 | question-answering | Optimum Habana 1.11.1 |
| BERT Base (torch.compile) | 8 | bf16 | 4172 sentences/sec | 85.35 | 1.16 min | 24 | question-answering | Optimum Habana 1.11.1 |
| BERT-Large Fine Tuning (torch.compile) | 8 | bf16 | 2117 sentences/sec | 93.40 | 1.98 min | 32 | question-answering | Optimum Habana 1.11.1 |
| ClipRoBERTa | 8 | bf16 | 16366 images/sec | | 9.35 min | 64 | contrastive-image-text | Optimum Habana 1.11.1 |
| DistilBERT | 8 | bf16 | 9992 sentences/sec | 82.43 | 0.56 min | 64 | question-answering | Optimum Habana 1.11.1 |
| Flan-T5 XXL | 8 | bf16 | 26.99 sentences/sec | 37.06 | 369.91 min | 22 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Base | 8 | bf16 | 6640 sentences/sec | 92.14 | 0.73 min | 64 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Large (torch.compile) | 8 | bf16 | 2122 sentences/sec | 94.43 | 2.06 min | 32 | question-answering | Optimum Habana 1.11.1 |
| Swin Transformer | 8 | bf16 | 5841 images/sec | 99.09 | 1.8 min | 160 | image-classification | Optimum Habana 1.11.1 |
| T5-LARGE | 8 | bf16 | 87.57 sentences/sec | 44.34 | 246.95 min | 4 | summarization | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| T5-Small | 8 | bf16 | 553 sentences/sec | 26.19 | 106.61 min | 4 | translation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Vision Transformer | 8 | bf16 | 6496 images/sec | 98.91 | 1 min | 128 | image-classification | Optimum Habana 1.11.1 |
| Wav2Vec2.0 AC | 8 | bf16 | 1960 sentences/sec | 80.94 | 2.45 min | 16 | speech-recognition | Optimum Habana 1.11.1 |
| Wav2Vec2.0 ASR | 8 | bf16 | 76 sentences/sec | 3.96 | 20.65 min | 4 | speech-recognition | Optimum Habana 1.11.1 |
MosaicML on Intel Gaudi 2 Training Performance
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| MosaicML MPT-1B | 8 | bf16 | 24145.17 samples/sec | 7.35 | 13.41 min | 512 | PyTorch 2.2.2 |
| MosaicML MPT-70B | 32 | bf16 | 17937.17 samples/sec | 6.95 | 106.43 min | 512 | PyTorch 2.2.2 |
First Gen Intel Gaudi Reference Models Training Performance
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| ResNet50 Keras LARS (torch.compile) | 32 | bf16 | 45063 img/sec | 76.34 | 24.5 min | 256 | |
| ResNet50 Keras LARS (torch.compile) | 8 | bf16 | 11633 img/sec | 76.55 | 69.76 min | 256 | |
| ResNet50 Keras LARS (torch.compile) | 1 | bf16 | 1621 img/sec | | | 256 | |
| BERT Pre Training combine | 32 | bf16 | 4792.62 sent/sec | | 1751 min | 64 | |
| BERT Pre Training combine | 8 | bf16 | 1234 sent/sec | | | 64 | |
| BERT Pre Training combine | 1 | bf16 | 155 sent/sec | | | 64 | |
| BERT Pre Training Phase 1 | 32 | bf16 | 5732.07 sent/sec | Loss: | 1315 min | 64 | |
| BERT Pre Training Phase 1 | 8 | bf16 | 1481.31 sent/sec | | | 64 | |
| BERT Pre Training Phase 1 | 1 | bf16 | 186.2 sent/sec | | | 64 | |
| BERT Pre Training Phase 2 | 32 | bf16 | 1917.35 sent/sec | Loss: | 436 min | 16 | |
| BERT Pre Training Phase 2 | 8 | bf16 | 487.99 sent/sec | | | 16 | |
| BERT Pre Training Phase 2 | 1 | bf16 | 61.25 sent/sec | | | 16 | |
| BERT SQUAD Fine Tuning | 8 | bf16 | 404.52 sent/sec | 90.68 | 12.96 min | 24 | |
| BERT SQUAD Fine Tuning | 1 | bf16 | 53.58 sent/sec | | | 24 | |
| BART Fine Tuning | 8 | bf16 | | | | 32 | |
| DINO | 8 | bf16 | 947 examples/sec | 77 | 2315 min | 64 | |
| MobileNetV2 | 8 | bf16 | 12632 img/sec | 71.49 | 505 min | 256 | |
| ResNet152 | 8 | bf16 | 4967 img/sec | 78.63 | 399 min | 128 | |
| SSD** | 8 | bf16 | 3439 img/sec | | | 128 | |
| Transformer | 8 | bf16 | 187860.33 tokens/sec | 28.1 | 1023 min | 4096 | |
| Unet2D (torch.compile) | 8 | bf16 | 4773 img/sec | 72.86 | 63 min | 64 | Lightning 2.2.4 |
| Unet3D | 8 | bf16 | 62 img/sec | 74.33 | 73 min | 2 | Lightning 2.2.4 |
| YOLOX | 8 | bf16 | 313.37 img/sec | 39.75 | 2326.8 min | 16 | |
Hugging Face Optimum Habana for First Gen Intel Gaudi Training Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|---|
| GPT2-XL | 8 | bf16 | 19.37 sentences/sec | 0.47 | 74 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| GPT2 | 8 | bf16 | 167.41 sentences/sec | 0.41 | 4.2 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| T5-LARGE | 8 | bf16 | 50 sentences/sec | 44.34 | 365 min | 4 | summarization | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| T5-Small | 8 | bf16 | 192 sentences/sec | 26.12 | 116.8 min | 4 | translation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| ALBERT-L | 8 | bf16 | 490.11 sentences/sec | 92.57 | 7.9 min | 32 | question-answering | Optimum Habana 1.11.1 |
| ALBERT-XXL | 8 | bf16 | 75.34 sentences/sec | 94.88 | 41.4 min | 12 | question-answering | Optimum Habana 1.11.1 |
| BERT-BASE FT (torch.compile) | 8 | bf16 | 1178 sentences/sec | 85.53 | 3 min | 24 | question-answering | Optimum Habana 1.11.1 |
| BERT-Large FT (torch.compile) | 8 | bf16 | 413 sentences/sec | 93.29 | 8.6 min | 24 | question-answering | Optimum Habana 1.11.1 |
| Clip-RoBERTa | 8 | bf16 | 895 images/sec | | 45.2 min | 64 | contrastive-image-text | Optimum Habana 1.11.1 |
| DistilBERT | 8 | bf16 | 1524 sentences/sec | 85.72 | 3 min | 8 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Base | 8 | bf16 | 1066 sentences/sec | 91.81 | 3.13 min | 12 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Large (torch.compile) | 8 | bf16 | 410 sentences/sec | 94.76 | 8.6 min | 12 | question-answering | Optimum Habana 1.11.1 |
| Swin Transformer | 8 | bf16 | 1573 images/sec | 98.68 | 4.8 min | 64 | image-classification | Optimum Habana 1.11.1 |
| Vision Transformer | 8 | bf16 | 2461 images/sec | 97.19 | 2.81 min | 64 | image-classification | Optimum Habana 1.11.1 |
| Wav2Vec2-AC | 8 | bf16 | 667 sentences/sec | 81.84 | 6.3 min | 16 | speech-recognition | Optimum Habana 1.11.1 |
| Wav2Vec2-ASR | 8 | bf16 | 41.83 sentences/sec | 4.2 | 36.73 min | 4 | speech-recognition | Optimum Habana 1.11.1 |
Intel® Gaudi® 2 with MLPerf* v4.0
| Model | # HPU | Precision | Performance | Framework Version |
|---|---|---|---|---|
| MLPerf 4.0 Llama 2 70B Server | 8 | fp8 | 6222.9 token/sec | PyTorch 2.2.2 |
| MLPerf 4.0 Llama 2 70B Offline | 8 | fp8 | 7808 token/sec | PyTorch 2.2.2 |
| MLPerf 4.0 Stable Diffusion XL Server | 8 | fp8 | 6.25 queries/sec | |
| MLPerf 4.0 Stable Diffusion XL Offline | 8 | fp8 | 6.45 samples/sec | |
Intel Gaudi 2 Large Language Models for Throughput
| Model | # HPU | Precision | Input Length | Output Length | Throughput | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 13163 tokens/sec | 1,230 | Optimum Habana 1.11.1 |
| LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4777 tokens/sec | 163 | Optimum Habana 1.11.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1291 tokens/sec | 94 | Optimum Habana 1.11.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1943 tokens/sec | 81 | Optimum Habana 1.11.1 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2727 tokens/sec | 1,750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| LLaMA 2 70B | 4 | fp8 | 128 | 2048 | 7422 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 276 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 958 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.11.1 |
Intel Gaudi 2 Large Language Models for Low Latency
Hugging Face Optimum Habana for Intel Gaudi 2 Inference Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
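As orientation for the text-generation rows below, here is a minimal single-HPU generation sketch using the Optimum Habana path. The model choice is an illustrative assumption, and the tuned example scripts (HPU graphs, static shapes, batching, quantization) are what produce the numbers in the table, so this plain sketch will not reproduce them:

```python
# Minimal single-HPU text-generation sketch (illustrative; not the tuned
# benchmark configuration behind the table below).
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch transformers with Gaudi-optimized paths

name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")

inputs = tokenizer("Deep learning is", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```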
| Model | # HPU | Precision | Input Length | Output Length | Throughput | Latency | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|---|---|
| Llama 2-7B (torch.compile) | 1 | bf16 | 128 | 128 | 5820 token/sec | 51.54 ms | 300 | text-generation | Optimum Habana 1.11.1 |
| Falcon 180B | 8 | bf16 | 128 | 2,048 | 700 token/sec | 57.14 ms | 40 | text-generation | Optimum Habana 1.11.1 |
| Falcon-40B 2048 Tokens | 8 | bf16 | 128 | 2,048 | 92.34 token/sec | 10.82 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| Falcon-7B 8192 Tokens | 1 | bf16 | 128 | 8,192 | 118.19 token/sec | 8.46 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| GPT-J | 8 | bf16 | 128 | 100 | 628.74 token/sec | 6.36 ms | 4 | text-generation | Optimum Habana 1.11.1 |
| StableLM-3B | 1 | bf16 | 128 | 2,048 | 250 token/sec | 4 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| StableLM-7B | 1 | bf16 | 128 | 2,048 | 128 token/sec | 7.81 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| MPT-7B | 1 | bf16 | 128 | 1,932 | 121 token/sec | 8.26 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| Bloomz | 8 | bf16 | 128 | 100 | 36.78 token/sec | 27.18 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| StarCoder | 1 | bf16 | 100 | 100 | 65 token/sec | 15.38 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| OPT | 1 | bf16 | 100 | 100 | 1120 token/sec | 0.89 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| T5-3B Summarization 1024-128 Beam4 | 1 | bf16 | 1,024 | 128 | 0.94 token/sec | 1063.82 ms | 1 | summarization | Optimum Habana 1.11.1 |
| Bert (Text Classification) | 1 | bf16 | 128 | | 2125 token/sec | 3.76 ms | 8 | text-classification | Optimum Habana 1.11.1 |
| Bert (Language Modeling) | 1 | bf16 | | | 66.64 token/sec | 60.02 ms | 4 | language-modeling | Optimum Habana 1.11.1 |
| Bert (Question Answering) | 1 | bf16 | 384 | | 613 token/sec | 13.05 ms | 8 | question-answering | Optimum Habana 1.11.1 |
| StableDiffusion v2.1 (512x512) | 1 | bf16 | | | 1.33 images/sec | 3007.51 ms | 4 | stable-diffusion | PyTorch Lightning 2.2.4 |
| Bart | 1 | bf16 | | | 6.79 token/sec | 294.55 ms | 2 | summarization | Optimum Habana 1.11.1 |
| BridgeTower | 1 | bf16 | | | 321 token/sec | 49.84 ms | 16 | contrastive-image-text | Optimum Habana 1.11.1 |
| ESMFold | 1 | bf16 | | | 2.97 token/sec | 336.7 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
| T5-3B Summarization Greedy | 1 | bf16 | | | 2.46 token/sec | 406.5 ms | 1 | summarization | Optimum Habana 1.11.1 |
| HF-T5-Small-Translation-Greedy | 1 | bf16 | | | 30.85 token/sec | 129.65 ms | 4 | translation | Optimum Habana 1.11.1 |
| Wav2vec (Audio Classification) | 1 | bf16 | | | 1002 token/sec | 3.99 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
| Wav2vec (Speech Recognition) | 1 | bf16 | | | 16.62 token/sec | 240.67 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
Intel Gaudi First Gen Reference Models Inference Performance
| Model | # HPU | Precision | Throughput | Latency | Batch Size | Framework Version |
|---|---|---|---|---|---|---|
| Bert | 1 | bf16 | 154.1 token/sec | 155.74 ms | 24 | |
| Unet2D | 1 | bf16 | 3730 img/sec | 17.15 ms | 64 | Lightning 2.2.4 |
| Unet3D | 1 | bf16 | 64.1 img/sec | 31.2 ms | 2 | Lightning 2.2.4 |
Hugging Face Optimum Habana on Intel Gaudi First Gen Inference Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
| Model | # HPU | Precision | Throughput | Latency | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|
| HF Bert (Language Modeling) | 1 | bf16 | | | 4 | language-modeling | Optimum Habana 1.11.1 |
| HF Bert (Question Answering) | 1 | bf16 | 127.7 token/sec | 62.64 ms | 8 | question-answering | Optimum Habana 1.11.1 |
| HF Bert (Text Classification) | 1 | bf16 | 434.4 token/sec | 18.41 ms | 8 | text-classification | Optimum Habana 1.11.1 |
| HF Bart-Greedy | 1 | bf16 | 3.1 token/sec | 645.16 ms | 2 | summarization | Optimum Habana 1.11.1 |
| HF ESMFold | 1 | bf16 | 13.9 token/sec | 71.94 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
| HF StableDiffusion V2-1 (512x512) | 1 | bf16 | 0.4 images/sec | 10000 ms | 4 | text-to-image generation | Optimum Habana 1.11.1 |
| HF-T5-Small-Translation-Greedy | 1 | bf16 | 16.8 token/sec | 238.09 ms | 4 | translation | Optimum Habana 1.11.1 |
| HF Wav2vec (Audio Classification) | 1 | bf16 | 494.6 token/sec | 8.08 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
| HF Wav2vec (Speech Recognition) | 1 | bf16 | 9.5 token/sec | 421.05 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
* These models used the previous 1.15.0 software release.
*** For the large language model inference results, the latency shown is the average next-token latency.
System Configuration:
Gaudi® Platform: HLS-1 with eight Habana Gaudi HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs @ 2.70 GHz, and 756 GB of system memory
Gaudi® 2 Platform: HLS-Gaudi2 with eight Habana Gaudi 2 HL-225H mezzanine cards, two Intel® Xeon® Platinum 8380 CPUs @ 2.30 GHz, and 1 TB of system memory
Common Software: Ubuntu 22.04, SynapseAI software version 1.16.0-526. Models run with PyTorch v2.2.2 use this Docker image. Environment: these workloads are run using the Docker images directly on the host OS.
Performance varies by use, configuration and other factors. Please refer to the Model-References GitHub page for each model’s support and validation coverage. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.