See the latest performance data for Gaudi2 training, Gaudi2 inference, Gaudi training, and Gaudi inference. For information on models and containers currently integrated with Habana's SynapseAI software suite, visit the Habana catalog.
Gaudi2 MLPerf™ 2.1 Training Performance
These performance numbers were generated with the latest version of SynapseAI and improve on the officially submitted numbers posted on the MLCommons website.
Framework Version | Model | # HPU | Precision | Time To Train
---|---|---|---|---
PyTorch 2.0.1 | MLPerf 2.1 - ResNet | 8 | bf16 | 16.48 min |
PyTorch 2.0.1 | MLPerf 2.1 - BERT | 8 | bf16 | 15.9 min |
TensorFlow 2.12.0 | MLPerf 2.1 - ResNet | 8 | bf16 | 15.878 min |
TensorFlow 2.12.0 | MLPerf 2.1 - BERT | 8 | bf16 | 13.03 min |
Gaudi2 Reference Models Training Performance
Framework Version | Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size
---|---|---|---|---|---|---|---
DeepSpeed 0.7.7 | Megatron-DeepSpeed BLOOM 13B | 64 | bf16 | 55.48 sent/sec | | | 1024
Lightning 2.0.0 | Stable Diffusion | 64 | bf16 | 8993.1 image/sec | | | 32
Lightning 2.0.0 | Stable Diffusion | 8 | bf16 | 1140.1 image/sec | | | 32
Lightning 2.0.0 | Stable Diffusion | 1 | bf16 | 146.7 image/sec | | | 32
PyTorch 2.0.1 | Stable Diffusion Fine Tuning | 1 | bf16 | 59.6 image/sec | | | 7
PyTorch 2.0.1 | Stable Diffusion Fine Tuning Textual Inversion | 1 | bf16 | 19.5 image/sec | | | 7
PyTorch 2.0.1 | ResNet50 LARS | 32 | bf16 | 176679.7 image/sec | 76.28 | 7.5 min | 256 |
PyTorch 2.0.1 | ResNet50 LARS | 16 | bf16 | 84824.2 image/sec | 76.43 | 12.6 min | 256 |
PyTorch 2.0.1 | ResNet50 LARS | 8 | bf16 | 46526.3 image/sec | 76.27 | 18.2 min | 256 |
PyTorch 2.0.1 | ResNet50 LARS | 1 | bf16 | 5858.6 image/sec | | | 256
TensorFlow 2.12.0 | ResNet50 Keras LARS | 32 | bf16 | 174082.9 image/sec | 76.47 | 6.7 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS | 16 | bf16 | 87277.4 image/sec | 76.52 | 11.3 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS | 8 | bf16 | 44140.6 image/sec | 76.64 | 20.5 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS | 1 | bf16 | 6101.5 image/sec | | | 256
PyTorch 2.0.1 | BERT-Large Pre Training Phase 1 | 32 | bf16 | 30950.4 sent/sec | Loss: 1.50 | 277.7 min | 64 |
PyTorch 2.0.1 | BERT-Large Pre Training Phase 1 | 16 | bf16 | 16298.8 sent/sec | Loss: 1.50 | 509.4 min | 64 |
PyTorch 2.0.1 | BERT-Large Pre Training Phase 1 | 8 | bf16 | 8354.6 sent/sec | Loss: 1.50 | 941.7 min | 64 |
PyTorch 2.0.1 | BERT-Large Pre Training Phase 1 | 1 | bf16 | 1057.1 sent/sec | | | 64
PyTorch 2.0.1 | BERT-Large Pre Training Phase 2 | 32 | bf16 | 9920.8 sent/sec | Loss: 1.34 | 95.1 min | 16 |
PyTorch 2.0.1 | BERT-Large Pre Training Phase 2 | 16 | bf16 | 5075.2 sent/sec | Loss: 1.34 | 181.4 min | 16 |
PyTorch 2.0.1 | BERT-Large Pre Training Phase 2 | 8 | bf16 | 2578.1 sent/sec | Loss: 1.33 | 338.3 min | 16 |
PyTorch 2.0.1 | BERT-Large Pre Training Phase 2 | 1 | bf16 | 323.7 sent/sec | | | 16
PyTorch 2.0.1 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 1516 sent/sec | 93.19 | 2.61 min | 24
PyTorch 2.0.1 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 267.9 sent/sec | | | 24
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 32 | bf16 | 29857.6 sent/sec | Loss: 1.431 | 260.2 min | 64 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 16 | bf16 | 15117.8 sent/sec | Loss: 1.464 | 511.5 min | 64 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 8 | bf16 | 7760.2 sent/sec | Loss: 1.5 | 1001 min | 64 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 1 | bf16 | 989.5 sent/sec | | | 64
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 32 | bf16 | 10051.1 sent/sec | Loss: 1.460 | 87.5 min | 16 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 16 | bf16 | 5053.7 sent/sec | Loss: 1.285 | 172.0 min | 16 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 8 | bf16 | 2620.5 sent/sec | Loss: 1.285 | 328.7 min | 16 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | 332.7 sent/sec | | | 16
TensorFlow 2.12.0 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 1858.3 sent/sec | 93.49 | 4.1 min | 24 |
TensorFlow 2.12.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 255.4 sent/sec | | | 24
PyTorch 2.0.1 | Wav2Vec 2.0 | 8 | bf16 | 1252.1 sent/sec | WER: 6.45 | 57h | 384 |
PyTorch 2.0.1 | Wav2Vec 2.0 | 1 | bf16 | 185.0 sent/sec | | | 384
PyTorch 2.0.1 | ResNext101 | 8 | bf16 | 22068.9 image/sec | 78.03 | 101 min | 256 |
PyTorch 2.0.1 | ResNext101 | 1 | bf16 | 2813.1 image/sec | | | 256
TensorFlow 2.12.0 | ResNext101 | 8 | bf16 | 19569.6 image/sec | 79.4 | 101.7 min | 256 |
TensorFlow 2.12.0 | ResNext101 | 1 | bf16 | 2620.0 image/sec | | | 256
PyTorch 2.0.1 | SSD | 8 | bf16 | 16506.2 image/sec | 23.1 | 9.0 min | 128 |
PyTorch 2.0.1 | SSD | 1 | bf16 | 2128.1 image/sec | | | 128
TensorFlow 2.12.0 | SSD | 8 | bf16 | 8744.5 image/sec | 23.43 | 14.9 min | 120 |
TensorFlow 2.12.0 | SSD | 1 | bf16 | 2341.8 image/sec | | | 240
PyTorch 2.0.1 | Transformer | 8 | bf16 | 1034728 tokens/sec | 28.2 | 248.6 min | 8192 |
PyTorch 2.0.1 | Transformer | 1 | bf16 | 129036 tokens/sec | | | 8192
TensorFlow 2.12.0 | Transformer | 8 | bf16 | 1040012 tokens/sec | 27.1 | 158.8 min | 16384 |
TensorFlow 2.12.0 | Transformer | 1 | bf16 | 135420 tokens/sec | | | 16384
TensorFlow 2.12.0 | MaskRCNN | 8 | bf16 | 294.3 image/sec | 34.02 | 96.73 min | 12 |
TensorFlow 2.12.0 | MaskRCNN | 1 | bf16 | 92.7 image/sec | | | 12
Lightning 2.0.0 | Unet2D | 8 | bf16 | 13666.7 image/sec | 72.65 | 14.06 min | 64 |
Lightning 2.0.0 | Unet2D | 1 | bf16 | 2300.9 image/sec | | | 64
Lightning 2.0.0 | Unet3D | 8 | bf16 | 237.8 image/sec | 74.3 | 19.6 min | 2 |
Lightning 2.0.0 | Unet3D | 1 | bf16 | 30.6 image/sec | | | 2
TensorFlow 2.12.0 | UNet2D | 8 | bf16 | 1429.4 image/sec | 89.37 | 1.53 min | 8 |
TensorFlow 2.12.0 | UNet2D | 1 | bf16 | 207.0 image/sec | | | 8
TensorFlow 2.12.0 | UNet3D | 8 | bf16 | 192.8 image/sec | 90.13 | 3.95 min | 2 |
TensorFlow 2.12.0 | UNet3D | 1 | bf16 | 27.9 image/sec | | | 2
Gaudi Reference Models Training Performance
Framework Version | Model | # HPU | Precision | Throughput | Accuracy | Time To Train | Batch Size
---|---|---|---|---|---|---|---
Lightning 2.0.0 | Stable Diffusion | 64 | bf16 | 1207.9 seq/sec | | | 8
PyTorch 2.0.1 | Stable Diffusion Fine Tuning | 1 | bf16 | 12.2 image/sec | | | 2
PyTorch 2.0.1 | Stable Diffusion Fine Tuning Textual Inversion | 1 | bf16 | 4.37 image/sec | | | 2
PyTorch 2.0.1 | ResNet50 SGD | 16 | bf16 | 22687 image/sec | 76.09 | 98.8 min | 256 |
PyTorch 2.0.1 | ResNet50 SGD | 8 | bf16 | 12669.2 image/sec | 76.46 | 167.6 min | 256 |
PyTorch 2.0.1 | ResNet50 SGD | 1 | bf16 | 1663.03 image/sec | | | 256
TensorFlow 2.12.0 | ResNet50 Keras LARS | 32 | bf16 | 49797.6 image/sec | 76.2 | 19.7 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS | 16 | bf16 | 25006.2 image/sec | 76.02 | 35.7 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS | 8 | bf16 | 12569.7 image/sec | 76.16 | 70.3 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS | 1 | bf16 | 1626.4 image/sec | | | 256
PyTorch 2.0.1 | BERT-Large Pre Training combine | 32 | bf16 | 4732.2 sent/sec | | 1859.66 min | 64
PyTorch 2.0.1 | BERT-Large Pre Training combine | 8 | bf16 | 1205.6 sent/sec | | | 64
PyTorch 2.0.1 | BERT-Large Pre Training combine | 1 | bf16 | 151.3 sent/sec | | | 64
PyTorch 2.0.1 | BERT-L Pre Training Phase 1 | 32 | bf16 | 5704.5 sent/sec | Loss: 1.49 | 1386.1 min | 64 |
PyTorch 2.0.1 | BERT-L Pre Training Phase 1 | 8 | bf16 | 1454.4 sent/sec | | | 64
PyTorch 2.0.1 | BERT-L Pre Training Phase 1 | 1 | bf16 | 182.7 sent/sec | | | 64
PyTorch 2.0.1 | BERT-L Pre Training Phase 2 | 32 | bf16 | 1848.8 sent/sec | Loss: 1.33 | 473.5 min | 8 |
PyTorch 2.0.1 | BERT-L Pre Training Phase 2 | 8 | bf16 | 470.1 sent/sec | | | 8
PyTorch 2.0.1 | BERT-L Pre Training Phase 2 | 1 | bf16 | 58.9 sent/sec | | | 8
PyTorch 2.0.1 | BERT-L SQUAD Fine Tuning (SQUAD) | 8 | bf16 | 369.9 sent/sec | 93.18 | 9.7 min | 24 |
PyTorch 2.0.1 | BERT-L SQUAD Fine Tuning (SQUAD) | 1 | bf16 | 57.81 sent/sec | | | 24
TensorFlow 2.12.0 | BERT-Large Pre Training combine | 32 | bf16 | 5382.2 sent/sec | | 1599.3 min | 64
TensorFlow 2.12.0 | BERT-Large Pre Training combine | 8 | bf16 | 1369.8 sent/sec | | | 64
TensorFlow 2.12.0 | BERT-Large Pre Training combine | 1 | bf16 | 172.6 sent/sec | | | 64
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 32 | bf16 | 6474.5 sent/sec | Loss: 1.452 | 1191.4 min | 64 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 8 | bf16 | 1650.9 sent/sec | | | 64
TensorFlow 2.12.0 | BERT-Large Pre Training phase 1 | 1 | bf16 | 207.5 sent/sec | | | 64
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 32 | bf16 | 2115.8 sent/sec | Loss: 1.287 | 407.9 min | 8 |
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 8 | bf16 | 535.5 sent/sec | | | 8
TensorFlow 2.12.0 | BERT-Large Pre Training phase 2 | 1 | bf16 | 68.0 sent/sec | | | 8
TensorFlow 2.12.0 | BERT-Large Fine Tuning (SQUAD) | 8 | bf16 | 411.7 sent/sec | 93.31 | 12.1 min | 24 |
TensorFlow 2.12.0 | BERT-Large Fine Tuning (SQUAD) | 1 | bf16 | 54.3 sent/sec | | | 24
PyTorch 2.0.1 | BART Fine Tuning | 8 | bf16 | 1379.2 sent/sec | | 7.51 min | 32
TensorFlow 2.12.0 | Densenet 121 TFD | 8 | bf16 | 6030.1 image/sec | 75.12 | 333.4 min | 128 |
PyTorch 2.0.1 | DINO | 8 | bf16 | 1269.2 exmpl/sec | 77.1 | 2779.8 min | 64 |
PyTorch 2.0.1 | GoogleNet | 8 | bf16 | 15835.0 image/sec | 72.29 | 140.2 min | 256 |
TensorFlow 2.12.0 | MaskRCNN | 8 | bf16 | 131.6 image/sec | 34.23 | 190.6 min | 4 |
PyTorch 2.0.1 | MobileNetV2 | 8 | bf16 | 12328 image/sec | 71.28 | 522.5 min | 256 |
PyTorch 2.0.1 | ResNet152 | 8 | bf16 | 5033.9 image/sec | 78.24 | 416.8 min | 128 |
PyTorch 2.0.1 | ResNext101 | 8 | bf16 | 5820.9 image/sec | 77.94 | 412.9 min | 128 |
TensorFlow 2.12.0 | Resnext-101 | 8 | bf16 | 4883 image/sec | 79.25 | 399.4 min | 128 |
PyTorch 2.0.1 | SSD | 8 | bf16 | 3566.0 image/sec | 23.1 | 33.7 min | 128 |
TensorFlow 2.12.0 | SSD | 8 | bf16 | 4029.2 image/sec | 23.43 | 32.2 min | 128 |
PyTorch 2.0.1 | Transformer | 8 | bf16 | 185676.4 tokens/sec | 27.7 | 1057.9 min | 4096 |
TensorFlow 2.12.0 | Transformer | 8 | bf16 | 186976.4 tokens/sec | 26.4 | 873.2 min | 4096 |
Lightning 2.0.0 | Unet2D | 8 | bf16 | 5081.0 image/sec | 73.95 | 47.6 min | 64 |
TensorFlow 2.12.0 | UNet2D | 8 | bf16 | 379.2 image/sec | 88.78 | 3.56 min | 8 |
Lightning 2.0.0 | Unet3D | 8 | bf16 | 65.2 image/sec | 75.47 | 60.4 min | 2 |
TensorFlow 2.12.0 | UNet3D | 8 | bf16 | 50.9 image/sec | 89.41 | 11.9 min | 2 |
TensorFlow 2.12.0 | Vision Transformer | 8 | bf16 | 601.2 image/sec | 84.66 | 363.8 min | 32 |
PyTorch 2.0.1 | YOLOX | 8 | bf16 | 529.6 image/sec | 39.87 | 2650.6 min | 16 |
TensorFlow 2.12.0 | RN50 Keras LARS TFD | 8 | bf16 | 12338.9 image/sec | 76.25 | 70.3 min | 256 |
TensorFlow 2.12.0 | ResNet50 Keras LARS Host NIC (HVD and Libfabric) | 16 | bf16 | 24414.0 image/sec | | | 256
TensorFlow 2.12.0 | ResNet50 Keras LARS Host NIC (tf.distribute and Libfabric) | 16 | bf16 | 24524.8 image/sec | | | 256
PyTorch 2.0.1 | ResNet50 Host NIC (libfabric) | 16 | bf16 | 22284.6 image/sec | | | 256
Hugging Face Optimum Habana Gaudi2 Training Performance
Framework Version | Model | # HPU | Precision | Throughput | Batch Size | Task
---|---|---|---|---|---|---
PyTorch 2.0.1 | ALBERT-Large | 8 | bf16 | 1644.0 tokens/sec | 32 | question-answering |
PyTorch 2.0.1 | ALBERT-Large | 1 | bf16 | 289.6 tokens/sec | 32 | question-answering |
PyTorch 2.0.1 | ALBERT-XXL | 8 | bf16 | 383.2 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | ALBERT-XXL | 1 | bf16 | 53.4 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | BERT-Large Fine Tuning | 8 | bf16 | 1527.5 tokens/sec | 24 | question-answering |
PyTorch 2.0.1 | BERT-Large Fine Tuning | 1 | bf16 | 265.9 tokens/sec | 24 | question-answering |
PyTorch 2.0.1 | Clip-RoBERTa | 8 | bf16 | 1089.8 tokens/sec | 64 | contrastive-image-text |
PyTorch 2.0.1 | Clip-RoBERTa | 1 | bf16 | 120.4 tokens/sec | 64 | contrastive-image-text |
PyTorch 2.0.1 | DistilBERT | 8 | bf16 | 2184.4 tokens/sec | 8 | question-answering |
PyTorch 2.0.1 | DistilBERT | 1 | bf16 | 397.2 tokens/sec | 8 | question-answering |
PyTorch 2.0.1 | Flan-T5 XXL | 8 | bf16 | 142.1 tokens/sec | 22 | question-answering |
PyTorch 2.0.1 | GPT2 | 8 | bf16 | 285.1 tokens/sec | 4 | language-modeling |
PyTorch 2.0.1 | GPT2 | 1 | bf16 | 66.0 tokens/sec | 4 | language-modeling |
PyTorch 2.0.1 | GPT2 XL | 8 | bf16 | 53.5 tokens/sec | 4 | language-modeling |
PyTorch 2.0.1 | RoBERTa Base | 8 | bf16 | 1740.3 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | RoBERTa Base | 1 | bf16 | 279.0 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | RoBERTa Large | 8 | bf16 | 924.7 tokens/sec | 12 | language-modeling |
PyTorch 2.0.1 | RoBERTa Large | 1 | bf16 | 170.7 tokens/sec | 12 | language-modeling |
PyTorch 2.0.1 | Swin Transformer | 8 | bf16 | 2776.1 tokens/sec | 64 | image-classification |
PyTorch 2.0.1 | Swin Transformer | 1 | bf16 | 415.6 tokens/sec | 64 | image-classification |
PyTorch 2.0.1 | T5-Large | 8 | bf16 | 578.8 tokens/sec | 4 | summarization |
PyTorch 2.0.1 | T5-Small | 8 | bf16 | 2202.9 tokens/sec | 4 | translation |
PyTorch 2.0.1 | T5-Small | 1 | bf16 | 73.6 tokens/sec | 4 | translation |
PyTorch 2.0.1 | Vision Transformer | 8 | bf16 | 6483.2 image/sec | 128 | image-classification |
PyTorch 2.0.1 | Vision Transformer | 1 | bf16 | 1162.8 image/sec | 128 | image-classification |
PyTorch 2.0.1 | Wav2Vec2.0 AC | 1 | bf16 | 1580.9 sent/sec | 256 | speech-recognition |
PyTorch 2.0.1 | Wav2Vec2.0 AC | 8 | bf16 | 899.6 sent/sec | 16 | speech-recognition |
PyTorch 2.0.1 | Wav2Vec2.0 ASR | 8 | bf16 | 56.4 sent/sec | 4 | speech-recognition |
PyTorch 2.0.1 | Wav2Vec2.0 ASR | 1 | bf16 | 7.0 sent/sec | 4 | speech-recognition |
Hugging Face Optimum Habana Gaudi Training Performance
Framework Version | Model | # HPU | Precision | Throughput | Batch Size | Task
---|---|---|---|---|---|---
PyTorch 2.0.1 | GPT2-XL | 8 | bf16 | 16.5 tokens/sec | 4 | language-modeling |
PyTorch 2.0.1 | GPT2 | 8 | bf16 | 121.6 tokens/sec | 4 | language-modeling |
PyTorch 2.0.1 | T5-Large | 8 | bf16 | 282.9 tokens/sec | 4 | summarization |
PyTorch 2.0.1 | T5-Small | 8 | bf16 | 1249.9 tokens/sec | 4 | translation |
PyTorch 2.0.1 | Wav2Vec 2.0 AC | 1 | bf16 | 389.2 sent/sec | 256 | speech-recognition |
PyTorch 2.0.1 | Wav2Vec 2.0 AC | 8 | bf16 | 333.9 sent/sec | 16 | speech-recognition |
PyTorch 2.0.1 | Wav2Vec 2.0 ASR | 8 | bf16 | 27.8 sent/sec | 4 | speech-recognition |
PyTorch 2.0.1 | ALBERT-Large | 8 | bf16 | 427.3 tokens/sec | 32 | question-answering |
PyTorch 2.0.1 | ALBERT-XXL | 8 | bf16 | 73.8 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | BERT-Large Fine Tuning | 8 | bf16 | 364.6 tokens/sec | 24 | question-answering |
PyTorch 2.0.1 | BERT-Large Fine Tuning | 1 | bf16 | 55.9 tokens/sec | 24 | question-answering |
PyTorch 2.0.1 | Clip-RoBERTa | 8 | bf16 | 889.8 tokens/sec | 64 | contrastive-image-text |
PyTorch 2.0.1 | Clip-RoBERTa | 1 | bf16 | 125.6 tokens/sec | 64 | contrastive-image-text |
PyTorch 2.0.1 | DistilBERT | 8 | bf16 | 1339.4 tokens/sec | 8 | question-answering |
PyTorch 2.0.1 | DistilBERT | 1 | bf16 | 251.2 tokens/sec | 8 | question-answering |
PyTorch 2.0.1 | RoBERTa Base | 8 | bf16 | 834.0 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | RoBERTa Large | 8 | bf16 | 306.7 tokens/sec | 12 | question-answering |
PyTorch 2.0.1 | Swin Transformer | 8 | bf16 | 1214.7 tokens/sec | 64 | image-classification |
PyTorch 2.0.1 | Vision Transformer | 8 | bf16 | 2376.8 tokens/sec | 128 | image-classification |
Gaudi2 Reference Models Inference Performance
Framework Version | Model | # HPU | Precision | Throughput | Latency | Batch Size |
---|---|---|---|---|---|---|
DeepSpeed 0.7.7 | Bloom-176B Greedy | 8 | bf16 | 31.8 token/sec | 31.4 ms/token | 1 |
PyTorch 2.0.1 | Stable Diffusion v2.1 image size 512x512 | 1 | bf16 | 1.1 image/sec | 890.4 ms/image | 1 |
PyTorch 2.0.1 | Stable Diffusion v2.1 image size 768x768 | 1 | bf16 | 0.42 image/sec | 2347.4 ms/image | 1 |
PyTorch 2.0.1 | Bloom-7B | 1 | bf16 | 129.7 token/sec | 7.7 ms/token | 1 |
PyTorch 2.0.1 | BERT Large | 1 | bf16 | 748.6 token/sec | 32.0 ms/token | 24 |
PyTorch 2.0.1 | Wav2Vec-L | 1 | bf16 | 12.3M token/sec | | 1
PyTorch 2.0.1 | ResNet50 | 1 | bf16 | 17261.9 image/sec | 14.8 ms/image | 256 |
PyTorch 2.0.1 | ResNext101 | 1 | bf16 | 10542.5 image/sec | 24.2 ms/image | 256 |
PyTorch 2.0.1 | UNet2D | 1 | bf16 | 8437.7 image/sec | 7.5 ms/image | 64 |
PyTorch 2.0.1 | UNet3D | 1 | bf16 | 112.2 image/sec | 17.82 ms/image | 2 |
Gaudi Reference Models Inference Performance
Framework Version | Model | # HPU | Precision | Throughput | Latency | Batch Size |
---|---|---|---|---|---|---|
DeepSpeed 0.7.7 | Bloom-176B Greedy | 16 | bf16 | 10.47 tokens/sec | 100 ms/token | 1 |
PyTorch 2.0.1 | Stable Diffusion v2.1 image size 512x512 | 1 | bf16 | 0.35 image/sec | 2818.2 ms/image | 1
PyTorch 2.0.1 | Stable Diffusion v2.1 image size 768x768 | 1 | bf16 | 0.12 image/sec | 8072.7 ms/image | 1
PyTorch 2.0.1 | Bloom 7B | 1 | bf16 | 42.6 tokens/sec | 23.5 ms/token | 1 |
PyTorch 2.0.1 | Wav2Vec-Base | 1 | bf16 | 7.07M tokens/sec | | 1
PyTorch 2.0.1 | BERT-Large | 1 | bf16 | 151.8 tokens/sec | 158 ms/token | 24 |
PyTorch 2.0.1 | UNet2D | 1 | bf16 | 1522.3 image/sec | 44.8 ms/image | 64 |
PyTorch 2.0.1 | UNet3D | 1 | bf16 | 34.4 image/sec | 127.5 ms/image | 2 |
Hugging Face Optimum Habana Gaudi Inference Performance
Framework Version | Model | # HPU | Precision | Throughput | Latency | Batch Size | Task
---|---|---|---|---|---|---|---
PyTorch 2.0.1 | Stable Diffusion v2.1 image size 512x512 | 1 | bf16 | 0.33 image/sec | 12 s/image | 4 | text to image generation |
PyTorch 2.0.1 | T5-3B BeamSearch-4 | 1 | bf16 | 8.5 tokens/sec | 466 ms/token | 4 | translation |
PyTorch 2.0.1 | Wav2Vec 2.0 ASR | 1 | bf16 | 8.79M sent/sec | | 4 | speech-recognition
*Performance measurements taken on an Amazon EC2 DL1 instance
System Configuration:
Gaudi® Platform
System: HLS-1 with eight Habana Gaudi HL-205 Mezzanine cards and two Intel® Xeon® Platinum 8280 CPU @ 2.70GHz, and 756GB of System Memory
Gaudi®2 Platform
System: HLS-Gaudi2 with eight Habana Gaudi2 HL-225H Mezzanine cards and two Intel® Xeon® Platinum 8380 CPU @ 2.30GHz, and 1TB of System Memory
Amazon EC2 DL1 Instance
System: Custom Server with eight Habana Gaudi HL-205 Mezzanine cards and two Intel® Xeon® Platinum 8275CL CPU @ 3.00GHz, and 756GB of System Memory
Common Software
Ubuntu 20.04, SynapseAI software version 1.10.0-494
TensorFlow: Models run with TensorFlow v2.12.0 use this Docker image.
PyTorch: Models run with PyTorch v2.0.1 use this Docker image.
Environment: Workloads are run from these Docker images directly on the host OS.
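For reference, launching one of these containers with Habana's container runtime follows the pattern below. This is a minimal sketch, assuming the `habana` Docker runtime is installed on the host; the exact image tag is an assumption based on the 1.10.0 / PyTorch 2.0.1 versions listed above, so check the Habana vault listing for the current name.

```shell
# Minimal sketch of a Gaudi container launch (image tag is an assumption --
# verify against the Habana vault before use).
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  --cap-add=sys_nice \
  --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
```

`HABANA_VISIBLE_DEVICES=all` exposes all eight HPUs of an HLS system to the container; `--ipc=host` is needed for PyTorch data-loader workers that use shared memory.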
Hugging Face models are available on the Hugging Face Habana hub. Performance measurements are based on version 1.5 of optimum-habana.
Performance varies by use, configuration and other factors. Please refer to the Model-References GitHub page for each model’s support and validation coverage. All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.