Intel® Gaudi® AI Accelerators Model Performance Data
These performance numbers were measured using the latest Intel Gaudi software release, version 1.16.0-526, unless otherwise noted.
All models, for both training and inference, use the PyTorch 2.2.2 framework; any other framework used for training or inference is noted for each model.
The MLPerf numbers below were generated with previous versions of the Intel Gaudi software and are planned to be updated with new MLPerf Training results in the next Intel Gaudi software release.
| Model | # HPU | Precision | Time To Train | Framework Version |
|---|---|---|---|---|
| MLPerf 3.1 - GPT3 | 384 | fp8 | 153.58 min* | |
| MLPerf 3.1 - GPT3 | 256 | fp8 | 223.75 min** | |
| MLPerf 3.1 - Stable Diffusion v2 | 64 | bf16 | 19.4 min** | Lightning 2.1.2 |
| MLPerf 3.1 - ResNet | 8 | bf16 | 16.4 min*** | |
| MLPerf 3.1 - BERT | 8 | bf16 | 15.01 min*** | |
* The GPT3 measurement with 384 cards was taken using a pre-launch version of the Intel Gaudi 1.13.0 software stack.
** The GPT3 measurement with 256 cards and the Stable Diffusion measurement were taken using the Intel Gaudi 1.13.0 software stack.
*** The ResNet and BERT measurements were taken using the Intel Gaudi 1.15.0 software stack.
Intel Gaudi 2 Large Language Models Training Performance
| Model | # HPU | Precision | Throughput | Sequence Length | TP,PP,DP | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 8 | FP8 | 68439 tokens/sec | 4,096 | 1,1,8 | 1,024 | Megatron DeepSpeed PR #372 |
| LLaMA 2 13B | 16 | FP8 | 52428 tokens/sec | 4,096 | 2,2,4 | 256 | Megatron DeepSpeed PR #372 |
| LLaMA 2 70B | 64 | FP8 | 52838 tokens/sec | 4,096 | 8,2,4 | 1,024 | Megatron DeepSpeed PR #372 |
TP, PP, DP are the tensor-parallel, pipeline-parallel, and data-parallel dimensions of the Megatron-DeepSpeed training run; the three factors multiply out to the total HPU count, as the sketch below illustrates.
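A minimal sanity-check sketch of that relationship (the helper function name is ours, for illustration only; the asserted values come straight from the table above):

```python
# Sketch: how the table's TP, PP, DP factors relate to the HPU count in
# Megatron-DeepSpeed-style 3D parallelism, where world_size = TP * PP * DP.

def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Derive the data-parallel dimension from the total device count."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide the device count"
    return world_size // (tp * pp)

assert data_parallel_size(8, tp=1, pp=1) == 8    # LLaMA 2 7B row: 1,1,8
assert data_parallel_size(16, tp=2, pp=2) == 4   # LLaMA 2 13B row: 2,2,4
assert data_parallel_size(64, tp=8, pp=2) == 4   # LLaMA 2 70B row: 8,2,4
```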
Intel Gaudi 2 Reference Models Training Performance
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| Llama 2 13B | 16 | bf16 | 10 samples/sec | | | 256 | DeepSpeed 0.14.0 |
| Llama 2 70B | 64 | bf16 | 8.88 samples/sec | | | 1,024 | DeepSpeed 0.14.0 |
| Llama 2 70B | 64 | FP8 | 12.9 samples/sec | | | 1,024 | DeepSpeed 0.14.0 |
| Stable Diffusion | 64 | bf16 | 11145.8 img/sec | | | 32 | Lightning 2.2.4 |
| Stable Diffusion Fine Tuning* | 1 | bf16 | 71 img/sec | | | 7 | Lightning 2.2.4 |
| Stable Diffusion Fine Tuning Textual Inversion* | 1 | bf16 | 20.9 img/sec | | | 7 | Lightning 2.2.4 |
| ResNet50 LARS | 32 | bf16 | 18399 img/sec | 76.12 | 7.81 min | 256 | |
| ResNet50 LARS | 8 | bf16 | 47070 img/sec | 76.14 | 18.98 min | 256 | |
| ResNet50 LARS | 1 | bf16 | 6233 img/sec | | | 256 | |
| BERT Pre Training Phase 1 | 32 | bf16 | 32450 sent/sec | | 254 min | 64 | |
| BERT Pre Training Phase 1 | 8 | bf16 | 9218 sent/sec | 0 | | 64 | |
| BERT Pre Training Phase 1 | 1 | bf16 | 1178 sent/sec | | | 64 | |
| BERT Pre Training Phase 2 | 32 | bf16 | 10861 sent/sec | 0 | 80.21 min | 16 | |
| BERT Pre Training Phase 2 | 8 | bf16 | 2777.5 sent/sec | 0 | | 16 | |
| BERT Pre Training Phase 2 | 1 | bf16 | 351 sent/sec | | | 16 | |
| BERT SQUAD Fine Tuning | 8 | bf16 | 2075 sent/sec | 90.64 | 4.68 min | 24 | |
| BERT SQUAD Fine Tuning | 1 | bf16 | 285 sent/sec | | | 24 | |
| ResNext101 | 8 | bf16 | 22184 img/sec | 77.93 | 100 min | 256 | |
| ResNext101 | 1 | bf16 | 2853 img/sec | | | 256 | |
| SSD | 8 | bf16 | 14651 img/sec | 23.02 | 10.3 min | 128 | |
| SSD | 1 | bf16 | 2140 img/sec | | | 128 | |
| Transformer | 8 | bf16 | 1110435 token/sec | 27.8 | 241.73 min | 8,192 | |
| Transformer | 1 | bf16 | 138173.66 token/sec | | | 8,192 | |
| Unet2D (torch.compile) | 8 | bf16 | 19938.29 img/sec | 72.66 | 12.55 min | 64 | Lightning 2.2.4 |
| Unet2D (torch.compile) | 1 | bf16 | 2626 img/sec | | | 64 | Lightning 2.2.4 |
| Unet3D | 8 | bf16 | 252 img/sec; 254.7 img/sec | 74.26 | | 2 | Lightning 2.2.4 |
| Unet3D | 1 | bf16 | 32.42 img/sec | | | 2 | Lightning 2.2.4 |
Hugging Face Optimum Habana for Intel Gaudi 2 Training Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
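For orientation, the runs in the table below follow the Optimum Habana trainer pattern, in which GaudiTrainer and GaudiTrainingArguments act as drop-in replacements for the transformers Trainer and TrainingArguments. The following is a minimal sketch of that pattern only; the model, dataset, and hyperparameters are illustrative placeholders, not the exact configuration behind any row:

```python
# Minimal sketch of the Optimum Habana fine-tuning pattern (illustrative
# model/dataset; see the Examples page for the actual runs in this table).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

model_name = "bert-base-uncased"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Small slice of SST-2 purely to keep the sketch self-contained and quick.
dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,        # run on HPU
    use_lazy_mode=True,     # Gaudi lazy-mode graph execution
    per_device_train_batch_size=24,
    # Mixed-precision (bf16) and fused-op settings come from a GaudiConfig
    # published on the Hugging Face Hub for this model family.
    gaudi_config_name="Habana/bert-base-uncased",
)

trainer = GaudiTrainer(model=model, args=args,
                       train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```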
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|---|
| Llama2-70B Fine Tuning FSDP (LoRA with torch.compile) | 8 | bf16 | 1.3 sentences/sec | 2.13 | 81.75 min | 10 | language-modeling | Optimum Habana 1.11.1 |
| Llama2-70B Fine Tuning (LoRA) | 8 | bf16 | 2.6 sentences/sec | 2.13 | 39.43 min | 10 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Llama1-7B Fine Tuning (LoRA) | 8 | bf16 | 150 sentences/sec | 2.35 | 5.08 min | 64 | language-modeling | Optimum Habana 1.11.1 |
| Falcon-180B Fine Tuning (LoRA) | 8 | bf16 | 2.67 sentences/sec | 3.71 | 149.41 min | 1 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Falcon-40B Fine Tuning (LoRA) | 8 | bf16 | 27.99 sentences/sec | 4.06 | 15.85 min | 1 | language-modeling | Optimum Habana 1.11.1 |
| GPTJ-CLM | 8 | bf16 | 22.24 sentences/sec | 0.53 | 17.18 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| GPTNEOX-20B-CLM | 16 | bf16 | 294 sentences/sec | 0.53 | 27.21 min | 2 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| BridgeTower | 8 | bf16 | 726 sentences/sec | | 20.63 min | 40 | contrastive-image-text | Optimum Habana 1.11.1 |
| GPT2 | 8 | bf16 | 651 sentences/sec | 0.40 | 1.61 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| GPT2-XL | 8 | bf16 | 94.24 sentences/sec | 0.47 | 6.55 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| ALBERT-Large | 8 | bf16 | 2479 sentences/sec | 91.70 | 1.86 min | 32 | question-answering | Optimum Habana 1.11.1 |
| ALBERT-XXL | 8 | bf16 | 456 sentences/sec | 94.80 | 6.73 min | 16 | question-answering | Optimum Habana 1.11.1 |
| BERT Base (torch.compile) | 8 | bf16 | 4172 sentences/sec | 85.35 | 1.16 min | 24 | question-answering | Optimum Habana 1.11.1 |
| BERT-Large Fine Tuning (torch.compile) | 8 | bf16 | 2117 sentences/sec | 93.40 | 1.98 min | 32 | question-answering | Optimum Habana 1.11.1 |
| ClipRoBERTa | 8 | bf16 | 16366 images/sec | | 9.35 min | 64 | contrastive-image-text | Optimum Habana 1.11.1 |
| DistilBERT | 8 | bf16 | 9992 sentences/sec | 82.43 | 0.56 min | 64 | question-answering | Optimum Habana 1.11.1 |
| Flan-T5 XXL | 8 | bf16 | 26.99 sentences/sec | 37.06 | 369.91 min | 22 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Base | 8 | bf16 | 6640 sentences/sec | 92.14 | 0.73 min | 64 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Large (torch.compile) | 8 | bf16 | 2122 sentences/sec | 94.43 | 2.06 min | 32 | question-answering | Optimum Habana 1.11.1 |
| Swin Transformer | 8 | bf16 | 5841 images/sec | 99.09 | 1.8 min | 160 | image-classification | Optimum Habana 1.11.1 |
| T5-LARGE | 8 | bf16 | 87.57 sentences/sec | 44.34 | 246.95 min | 4 | summarization | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| T5-Small | 8 | bf16 | 553 sentences/sec | 26.19 | 106.61 min | 4 | translation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Vision Transformer | 8 | bf16 | 6496 images/sec | 98.91 | 1 min | 128 | image-classification | Optimum Habana 1.11.1 |
| Wav2Vec2.0 AC | 8 | bf16 | 1960 sentences/sec | 80.94 | 2.45 min | 16 | speech-recognition | Optimum Habana 1.11.1 |
| Wav2Vec2.0 ASR | 8 | bf16 | 76 sentences/sec | 3.96 | 20.65 min | 4 | speech-recognition | Optimum Habana 1.11.1 |
MosaicML on Intel Gaudi 2 Training Performance
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| MosaicML MPT-1B | 8 | bf16 | 24145.17 samples/sec | 7.35 | 13.41 min | 512 | PyTorch 2.2.2 |
| MosaicML MPT-70B | 32 | bf16 | 17937.17 samples/sec | 6.95 | 106.43 min | 512 | PyTorch 2.2.2 |
First Gen Intel Gaudi Reference Models Training Performance
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| ResNet50 Keras LARS (torch.compile) | 32 | bf16 | 45063 img/sec | 76.34 | 24.5 min | 256 | |
| ResNet50 Keras LARS (torch.compile) | 8 | bf16 | 11633 img/sec | 76.55 | 69.76 min | 256 | |
| ResNet50 Keras LARS (torch.compile) | 1 | bf16 | 1621 img/sec | | | 256 | |
| BERT Pre Training combine | 32 | bf16 | 4792.62 sent/sec | | 1751 min | 64 | |
| BERT Pre Training combine | 8 | bf16 | 1234 sent/sec | | | 64 | |
| BERT Pre Training combine | 1 | bf16 | 155 sent/sec | | | 64 | |
| BERT Pre Training Phase 1 | 32 | bf16 | 5732.07 sent/sec | Loss: | 1315 min | 64 | |
| BERT Pre Training Phase 1 | 8 | bf16 | 1481.31 sent/sec | | | 64 | |
| BERT Pre Training Phase 1 | 1 | bf16 | 186.2 sent/sec | | | 64 | |
| BERT Pre Training Phase 2 | 32 | bf16 | 1917.35 sent/sec | Loss: | 436 min | 16 | |
| BERT Pre Training Phase 2 | 8 | bf16 | 487.99 sent/sec | | | 16 | |
| BERT Pre Training Phase 2 | 1 | bf16 | 61.25 sent/sec | | | 16 | |
| BERT SQUAD Fine Tuning | 8 | bf16 | 404.52 sent/sec | 90.68 | 12.96 min | 24 | |
| BERT SQUAD Fine Tuning | 1 | bf16 | 53.58 sent/sec | | | 24 | |
| BART Fine Tuning | 8 | bf16 | | | | 32 | |
| DINO | 8 | bf16 | 947 examples/sec | 77 | 2315 min | 64 | |
| MobileNetV2 | 8 | bf16 | 12632 img/sec | 71.49 | 505 min | 256 | |
| ResNet152 | 8 | bf16 | 4967 img/sec | 78.63 | 399 min | 128 | |
| SSD** | 8 | bf16 | 3439 img/sec | | | 128 | |
| Transformer | 8 | bf16 | 187860.33 tokens/sec | 28.1 | 1023 min | 4096 | |
| Unet2D (torch.compile) | 8 | bf16 | 4773 img/sec | 72.86 | 63 min | 64 | Lightning 2.2.4 |
| Unet3D | 8 | bf16 | 62 img/sec | 74.33 | 73 min | 2 | Lightning 2.2.4 |
| YOLOX | 8 | bf16 | 313.37 img/sec | 39.75 | 2326.8 min | 16 | |
Hugging Face Optimum Habana for First Gen Intel Gaudi Training Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
| Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|---|
| GPT2-XL | 8 | bf16 | 19.37 sentences/sec | 0.47 | 74 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| GPT2 | 8 | bf16 | 167.41 sentences/sec | 0.41 | 4.2 min | 4 | language-modeling | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| T5-LARGE | 8 | bf16 | 50 sentences/sec | 44.34 | 365 min | 4 | summarization | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| T5-Small | 8 | bf16 | 192 sentences/sec | 26.12 | 116.8 min | 4 | translation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| ALBERT-L | 8 | bf16 | 490.11 sentences/sec | 92.57 | 7.9 min | 32 | question-answering | Optimum Habana 1.11.1 |
| ALBERT-XXL | 8 | bf16 | 75.34 sentences/sec | 94.88 | 41.4 min | 12 | question-answering | Optimum Habana 1.11.1 |
| BERT-BASE FT (torch.compile) | 8 | bf16 | 1178 sentences/sec | 85.53 | 3 min | 24 | question-answering | Optimum Habana 1.11.1 |
| BERT-Large FT (torch.compile) | 8 | bf16 | 413 sentences/sec | 93.29 | 8.6 min | 24 | question-answering | Optimum Habana 1.11.1 |
| Clip-RoBERTa | 8 | bf16 | 895 images/sec | | 45.2 min | 64 | contrastive-image-text | Optimum Habana 1.11.1 |
| DistilBERT | 8 | bf16 | 1524 sentences/sec | 85.72 | 3 min | 8 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Base | 8 | bf16 | 1066 sentences/sec | 91.81 | 3.13 min | 12 | question-answering | Optimum Habana 1.11.1 |
| RoBERTa Large (torch.compile) | 8 | bf16 | 410 sentences/sec | 94.76 | 8.6 min | 12 | question-answering | Optimum Habana 1.11.1 |
| Swin Transformer | 8 | bf16 | 1573 images/sec | 98.68 | 4.8 min | 64 | image-classification | Optimum Habana 1.11.1 |
| Vision Transformer | 8 | bf16 | 2461 images/sec | 97.19 | 2.81 min | 64 | image-classification | Optimum Habana 1.11.1 |
| Wav2Vec2-AC | 8 | bf16 | 667 sentences/sec | 81.84 | 6.3 min | 16 | speech-recognition | Optimum Habana 1.11.1 |
| Wav2Vec2-ASR | 8 | bf16 | 41.83 sentences/sec | 4.2 | 36.73 min | 4 | speech-recognition | Optimum Habana 1.11.1 |
Intel® Gaudi® 2 with MLPerf* v4.0
| Model | # HPU | Precision | Performance | Framework Version |
|---|---|---|---|---|
| MLPerf 4.0 Llama 2 70B Server | 8 | fp8 | 6222.9 token/sec | PyTorch 2.2.2 |
| MLPerf 4.0 Llama 2 70B Offline | 8 | fp8 | 7808 token/sec | PyTorch 2.2.2 |
| MLPerf 4.0 Stable Diffusion XL Server | 8 | fp8 | 6.25 queries/sec | |
| MLPerf 4.0 Stable Diffusion XL Offline | 8 | fp8 | 6.45 samples/sec | |
Intel Gaudi 2 Large Language Models for Throughput
| Model | # HPU | Precision | Input Length | Output Length | Throughput | Batch Size | Framework Version |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 1 | fp8 | 128 | 128 | 13163 tokens/sec | 1,230 | Optimum Habana 1.11.1 |
| LLaMA 2 7B | 1 | fp8 | 128 | 2048 | 4777 tokens/sec | 163 | Optimum Habana 1.11.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 128 | 1291 tokens/sec | 94 | Optimum Habana 1.11.1 |
| LLaMA 2 7B | 1 | fp8 | 2048 | 2048 | 1943 tokens/sec | 81 | Optimum Habana 1.11.1 |
| LLaMA 2 70B | 2 | fp8 | 128 | 128 | 2727 tokens/sec | 1,750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| LLaMA 2 70B | 4 | fp8 | 128 | 2048 | 7422 tokens/sec | 750 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 128 | 276 tokens/sec | 95 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| LLaMA 2 70B | 2 | fp8 | 2048 | 2048 | 958 tokens/sec | 78 | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 128 | 128 | 13112 tokens/sec | 896 | Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 128 | 2048 | 7947 tokens/sec | 120 | Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 2048 | 128 | 1360 tokens/sec | 120 | Optimum Habana 1.11.1 |
| Mistral 7B Instruct | 1 | fp8 | 2048 | 2048 | 3143 tokens/sec | 44 | Optimum Habana 1.11.1 |
Intel Gaudi 2 Large Language Models for Low Latency
Hugging Face Optimum Habana for Intel Gaudi 2 Inference Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
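As orientation for the text-generation rows below, here is a minimal single-HPU generation sketch using the Optimum Habana path. The model choice is an illustrative assumption, and the tuned example scripts (HPU graphs, static shapes, batching, quantization) are what produce the numbers in the table, so this plain sketch will not reproduce them:

```python
# Minimal single-HPU text-generation sketch (illustrative; not the tuned
# benchmark configuration behind the table below).
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch transformers with Gaudi-optimized paths

name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")

inputs = tokenizer("Deep learning is", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```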
| Model | # HPU | Precision | Input Length | Output Length | Throughput | Latency | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|---|---|
| Llama 2-7B (torch.compile) | 1 | bf16 | 128 | 128 | 5820 token/sec | 51.54 ms | 300 | text-generation | Optimum Habana 1.11.1 |
| Falcon 180B | 8 | bf16 | 128 | 2,048 | 700 token/sec | 57.14 ms | 40 | text-generation | Optimum Habana 1.11.1 |
| Falcon-40B 2048 Tokens | 8 | bf16 | 128 | 2,048 | 92.34 token/sec | 10.82 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| Falcon-7B 8192 Tokens | 1 | bf16 | 128 | 8,192 | 118.19 token/sec | 8.46 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| GPT-J | 8 | bf16 | 128 | 100 | 628.74 token/sec | 6.36 ms | 4 | text-generation | Optimum Habana 1.11.1 |
| StableLM-3B | 1 | bf16 | 128 | 2,048 | 250 token/sec | 4 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| StableLM-7B | 1 | bf16 | 128 | 2,048 | 128 token/sec | 7.81 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| MPT-7B | 1 | bf16 | 128 | 1,932 | 121 token/sec | 8.26 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| Bloomz | 8 | bf16 | 128 | 100 | 36.78 token/sec | 27.18 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| StarCoder | 1 | bf16 | 100 | 100 | 65 token/sec | 15.38 ms | 1 | text-generation | DeepSpeed 0.14.0, Optimum Habana 1.11.1 |
| OPT | 1 | bf16 | 100 | 100 | 1120 token/sec | 0.89 ms | 1 | text-generation | Optimum Habana 1.11.1 |
| T5-3B Summarization 1024-128 Beam4 | 1 | bf16 | 1,024 | 128 | 0.94 token/sec | 1063.82 ms | 1 | summarization | Optimum Habana 1.11.1 |
| Bert (Text Classification) | 1 | bf16 | 128 | | 2125 token/sec | 3.76 ms | 8 | text-classification | Optimum Habana 1.11.1 |
| Bert (Language Modeling) | 1 | bf16 | | | 66.64 token/sec | 60.02 ms | 4 | language-modeling | Optimum Habana 1.11.1 |
| Bert (Question Answering) | 1 | bf16 | 384 | | 613 token/sec | 13.05 ms | 8 | question-answering | Optimum Habana 1.11.1 |
| StableDiffusion v2.1 (512x512) | 1 | bf16 | | | 1.33 images/sec | 3007.51 ms | 4 | stable-diffusion | PyTorch Lightning 2.2.4 |
| Bart | 1 | bf16 | | | 6.79 token/sec | 294.55 ms | 2 | summarization | Optimum Habana 1.11.1 |
| BridgeTower | 1 | bf16 | | | 321 token/sec | 49.84 ms | 16 | contrastive-image-text | Optimum Habana 1.11.1 |
| ESMFold | 1 | bf16 | | | 2.97 token/sec | 336.7 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
| T5-3B Summarization Greedy | 1 | bf16 | | | 2.46 token/sec | 406.5 ms | 1 | summarization | Optimum Habana 1.11.1 |
| HF-T5-Small-Translation-Greedy | 1 | bf16 | | | 30.85 token/sec | 129.65 ms | 4 | translation | Optimum Habana 1.11.1 |
| Wav2vec (Audio Classification) | 1 | bf16 | | | 1002 token/sec | 3.99 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
| Wav2vec (Speech Recognition) | 1 | bf16 | | | 16.62 token/sec | 240.67 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
Intel Gaudi First Gen Reference Models Inference Performance
| Model | # HPU | Precision | Throughput | Latency | Batch Size | Framework Version |
|---|---|---|---|---|---|---|
| Bert | 1 | bf16 | 154.1 token/sec | 155.74 ms | 24 | |
| Unet2D | 1 | bf16 | 3730 img/sec | 17.15 ms | 64 | Lightning 2.2.4 |
| Unet3D | 1 | bf16 | 64.1 img/sec | 31.2 ms | 2 | Lightning 2.2.4 |
Hugging Face Optimum Habana on Intel Gaudi First Gen Inference Performance
See the Examples page for information on how to run each of the Tasks, including model naming and hyperparameter usage.
| Model | # HPU | Precision | Throughput | Latency | Batch Size | Task | Framework Version |
|---|---|---|---|---|---|---|---|
| HF Bert (Language Modeling) | 1 | bf16 | | | 4 | language-modeling | Optimum Habana 1.11.1 |
| HF Bert (Question Answering) | 1 | bf16 | 127.7 token/sec | 62.64 ms | 8 | question-answering | Optimum Habana 1.11.1 |
| HF Bert (Text Classification) | 1 | bf16 | 434.4 token/sec | 18.41 ms | 8 | text-classification | Optimum Habana 1.11.1 |
| HF Bart-Greedy | 1 | bf16 | 3.1 token/sec | 645.16 ms | 2 | summarization | Optimum Habana 1.11.1 |
| HF ESMFold | 1 | bf16 | 13.9 token/sec | 71.94 ms | 1 | protein-folding | Optimum Habana 1.11.1 |
| HF StableDiffusion V2-1 (512x512) | 1 | bf16 | 0.4 images/sec | 10000 ms | 4 | text-to-image generation | Optimum Habana 1.11.1 |
| HF-T5-Small-Translation-Greedy | 1 | bf16 | 16.8 token/sec | 238.09 ms | 4 | translation | Optimum Habana 1.11.1 |
| HF Wav2vec (Audio Classification) | 1 | bf16 | 494.6 token/sec | 8.08 ms | 4 | audio-classification | Optimum Habana 1.11.1 |
| HF Wav2vec (Speech Recognition) | 1 | bf16 | 9.5 token/sec | 421.05 ms | 4 | speech-recognition | Optimum Habana 1.11.1 |
* These models used the previous 1.15.0 software release.
*** For the large language model inference results, the latency shown is the average next-token latency.
System Configuration:
Gaudi® Platform: HLS-1 with eight Habana Gaudi HL-205 mezzanine cards, two Intel® Xeon® Platinum 8280 CPUs @ 2.70 GHz, and 756 GB of system memory
Gaudi® 2 Platform: HLS-Gaudi2 with eight Habana Gaudi 2 HL-225H mezzanine cards, two Intel® Xeon® Platinum 8380 CPUs @ 2.30 GHz, and 1 TB of system memory
Common Software: Ubuntu 22.04, SynapseAI software version 1.16.0-526. Models run with PyTorch v2.2.2 use this Docker image. Environment: these workloads are run using the Docker images directly on the host OS.
Performance varies by use, configuration and other factors. Please refer to the Model-References GitHub page for each model’s support and validation coverage. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.