Running the larger Google Gemma 7B 35GB LLM on a single GPU for over 7x inference performance gains

Michael O'Brien
7 min read · Apr 22, 2024


TLDR; Run Gemma 7B on a single GPU with over 40 GB of VRAM, preferably an 80 GB H100, a 40 GB A100 or a 48 GB RTX-A6000. Performance will be at least 7x that of a multi-GPU deployment because the PCIe bus is avoided.

Single-GPU performance of 14 tokens/sec vs multi-GPU performance of around 2 tokens/sec

In a previous article I described how Google open-sourced its larger Gemma 7B model (actually closer to 8B parameters) on huggingface.co: https://obrienlabs.medium.com/google-gemma-7b-and-2b-llm-models-are-now-available-to-developers-as-oss-on-hugging-face-737f65688f0d
This article describes how to run the Gemma 7B model on large-VRAM GPUs such as the A6000 (similar in class to the A100 and H100) that can handle the roughly 35GB size of the 7B model. An 8 to 10x speedup occurs if we avoid the PCIe bus transfers (16 to 32 GB/s) that a model split across GPUs incurs.

For example, Google Gemma 7B on the 48GB RTX-A6000 (GA102) GPU runs at around 14 tokens/sec vs 0.9 tokens/sec on dual L4s (AD104), primarily because the 300-1000 GB/s VRAM bandwidth is throttled by the 16-32 GB/s PCIe bus. If you were to split the 35G model across even two A6000s with a combined VRAM size of 96G, the model would run slower than on a single 48G card.

LLM model inference/serving is best done on a single GPU. While model training benefits from multiple GPUs running distributed training jobs, each inference request is best served by a single GPU. If an inference run is split between multiple GPUs, performance can drop to CPU levels.
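A rough roofline sketch of that argument: during decoding, every new token requires reading the model weights once, so tokens/sec is capped at roughly (memory bandwidth) / (bytes of weights). The bandwidth figures below are approximate, and the sketch ignores compute, KV-cache traffic and kernel overheads; it simply treats the slowest link the weights sit behind as the limit.

# Back-of-envelope tokens/sec ceilings = bandwidth / bytes of weights read per token.
# Approximate bandwidth figures; ignores compute, KV-cache traffic and kernel overheads.
model_gb = 34.0  # ~34 GB of fp32 Gemma 7B weights

for name, bw_gbps in [("RTX A6000 VRAM", 768),          # single-GPU case
                      ("DDR5-6400 dual channel", 102),  # CPU-only case
                      ("PCIe 4.0 x16", 16)]:            # traffic crossing the bus in a split setup
    print(f"{name:<24} ceiling ~{bw_gbps / model_gb:.1f} tokens/sec")

These ceilings of roughly 23, 3 and 0.5 tokens/sec line up with the measured 14, 1.9 and 0.85 tokens/sec in the Performance section below.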

Note: Results will be very different with high-bandwidth interconnects such as NVLink 4 or 5, NVSwitch, or InfiniBand between nodes.

Note: the Gemma 7B model used here is the default checkpoint, not fine-tuned or quantized. See https://ai.google.dev/gemma/docs?hl=en

Software

We will be using the Hugging Face transformers Python library. Try to use a recent Python, ideally 3.11.

gemma-gpu.py

import os
# Default dual-GPU run (PCIe or NVLink between cards) is slower and needs at least ~20G of VRAM on each GPU
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# Pin to a specific GPU - the model must fit entirely in its memory:
# RTX 3500 Ada = 12G, A4000 = 16G, A4500 = 20G, A6000 = 48G, 4000 Ada = 20G, 5000 Ada = 32G, 6000 Ada = 48G
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from transformers import AutoTokenizer, AutoModelForCausalLM
from datetime import datetime

access_token = 'hf_cfTP…XCQqH'
model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
# gpu
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", token=access_token)
# cpu
# model = AutoModelForCausalLM.from_pretrained(model_id, token=access_token)

input_text = ("how is gold made in collapsing neutron stars"
              " - specifically what is the ratio created during the beta and r process.")
time_start = datetime.now().strftime("%H:%M:%S")
print("generate start: ", time_start)

# gpu
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# cpu
# input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids, max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))

print("end", datetime.now().strftime("%H:%M:%S"))

Hardware

The Gemma 7B model will not run in limited-VRAM environments such as a single RTX 3500 Ada (12G), A4000 (16G), A4500 (20G) or L4 (24G), because the unquantized 7B model needs a minimum of roughly 35G of VRAM.

While the Gemma 2B model can run on lower-VRAM GPUs of 12GB or larger, the roughly 35GB Gemma 7B model runs best on a single GPU of 40GB or more, such as the RTX-A6000 48GB, or failing that on a pair of 20GB+ GPUs such as the L4 (2 x 24G), the RTX-A4500 (2 x 20G) or the consumer 4090 (2 x 24G). If the model is spread across two GPUs you will see a slowdown across the PCIe bus, even with NVLink between the cards. A single GPU is optimal.
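A quick way to see where the ~35GB requirement comes from is parameters times bytes per parameter (weights only, ignoring activations and the KV cache). A sketch assuming roughly 8.5B parameters for the "7B" checkpoint:

# Rough weight-memory estimate: parameter count x bytes per parameter
params = 8.5e9  # Gemma "7B" is closer to 8.5B parameters
for dtype, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{params * nbytes / 1e9:.0f} GB of weights")
# float32 -> ~34 GB, which matches the 34-35 GB reported by nvidia-smi in the runs below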

Deployment

In a Docker container or directly on the system:

# latest Hugging Face libraries
pip install -U transformers
pip install -U torch
pip install accelerate
# check the local CUDA toolkit, then install the matching CUDA 12.1 build of PyTorch
# (this supersedes the generic torch install above)
nvcc --version
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
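Before downloading the 35GB of model weights it is worth confirming that the CUDA build of PyTorch is the one that got installed and that it can see the GPUs. A minimal check:

import torch

# Verify the CUDA-enabled PyTorch build and list the visible GPUs
print(torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))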

Performance

The LLM output for the neutron-star question is around 170 tokens.
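The tokens/sec figures below are simply generated tokens divided by the wall-clock time of the generate() call. A minimal timing sketch that can be dropped into gemma-gpu.py (variable names match the script above):

import time

start = time.time()
outputs = model.generate(**input_ids, max_new_tokens=10000)
elapsed = time.time() - start

# new tokens = total output length minus the prompt length
new_tokens = outputs.shape[-1] - input_ids["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.0f}s = {new_tokens / elapsed:.1f} tokens/sec")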

Single NVIDIA A6000 — Ampere GA102 (see L40s equivalent on GCP)

12 seconds for 170 tokens = 14 tokens/sec
98% GPU utilization of 10k cores and 34GB/48GB VRAM @ 85% TDP 250W of 300W 768 GB/s (384 bit)

RTX-A6000 48GB GPU
$ nvidia-smi
Fri Apr 19 21:23:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.86 Driver Version: 551.86 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 WDDM | 00000000:01:00.0 Off | Off |
| 38% 71C P2 263W / 300W | 34075MiB / 49140MiB | 98% Default |

CPU — 14900K — 6400MHz RAM — overclocked

89 seconds (7.4x A6000) = 1.9 tokens/sec
90% CPU utilization of 32 vCores (24+8) and 33GB/64GB RAM
6400 MHz DDR5 to CPU bandwidth is 102 GB/s
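The 102 GB/s figure is just the theoretical peak of dual-channel DDR5-6400 (8 bytes per channel per transfer):

# DDR5-6400: 6400 MT/s x 8 bytes per channel x 2 channels
print(6400e6 * 8 * 2 / 1e9, "GB/s")  # 102.4 GB/s theoretical peak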

i9–14900K CPU

GPU Dual NVIDIA 4090 — Ada AD102

102 seconds (8.5x A6000) = 1.7 tokens/sec
70% GPU utilization of 2x 16k cores and 34GB/48GB VRAM @ 22% TDP 220W of 900W 1008 GB/s (384 bit)
60% PCIe bus interface load
No NVLink

Two RTX-4090 Ada 24G GPUs

GPU Dual NVIDIA A4500 — Ampere GA102

119 seconds (10x A6000) = 1.4 tokens/sec
75% GPU utilization of 2x 7k cores and 34GB/40GB VRAM @ 40% TDP 160W of 400W 640 GB/s (320 bit)
75% PCIe bus interface load
NVLink between the GPUs provides 112 GB/s

Two RTX-A4500 20GB GPUs with NVLink

CPU — 13900KS — 4800MHz RAM

147 seconds (12.3x A6000) = 1.2 tokens/sec
96% CPU utilization of 32 vCores (24+8) and 34GB/64GB RAM

i9–13900KS CPU

CPU — 13900K — 4800MHz RAM

152 seconds (13x A6000) = 1.1 tokens/sec
98% CPU utilization of 32 vCores (24+8) and 35GB/192GB RAM

GPU Dual L4 on GCP — Ada AD104

202 seconds (17x A6000) = 0.85 tokens/sec
65% GPU utilization of 2x 7k cores and 35GB/46GB VRAM @ 50% TDP 70W of 150W 300GB/s (192 bit)
?% PCIe bus interface load

Remember the L4 is essentially a desktop 4070 (the same AD104 silicon) but with 24G of VRAM instead of 12

G2 with dual L4 VM installation

 --machine-type=g2-standard-24 
--accelerator=count=2,type=nvidia-l4-vws
--image=projects/nvidia-vgpu-public/global/images/nv-windows-server-2022-vws-536-25-v202306270722

Install Python

PS C:\Users\michael> Invoke-WebRequest -Uri "https://www.python.org/ftp/python/3.10.2/python-3.10.2-amd64.exe" -OutFile "python-3.10.2-amd64.exe"
PS C:\Users\michael> .\python-3.10.2-amd64.exe /quiet InstallAllUsers=1 PrependPath=1 Include_test=0

Download inference code

Extract the zip and get a Hugging Face access token

Run the Gemma 7B model. The 3.5G model files download first at 1 Gbps.

Modify for 2 GPUs:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
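With both L4s visible, device_map="auto" shards the layers across the two cards. If one card runs out of memory during generation, you can optionally cap what accelerate places on each GPU. A hedged sketch using the max_memory argument, reusing model_id and access_token from gemma-gpu.py:

# Optional: cap each 24G L4 at 20GiB so activations and the KV cache still fit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},
    token=access_token,
)
print(model.hf_device_map)  # shows which layers landed on GPU 0 vs GPU 1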

The model is cached at:

C:\Users\michael\.cache\huggingface\hub\models--google--gemma-7b\blobs

Dual L4 GPU saturation at roughly 62% and 70%

PS C:\Users\michael> nvidia-smi
Sun Feb 25 22:09:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25 Driver Version: 536.25 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 WDDM | 00000000:00:03.0 Off | 0 |
| N/A 64C P0 38W / 72W | 19706MiB / 23034MiB | 62% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 WDDM | 00000000:00:04.0 Off | 0 |
| N/A 62C P0 37W / 72W | 16312MiB / 23034MiB | 70% Default |
| | | N/A |

Results

When a model is split across GPUs, LLM inference becomes bound by the PCIe x16 link at around 16 GB/s (and even NVLink at 112 GB/s is far slower than VRAM). VRAM bandwidth of up to 1 TB/s is effectively throttled by the 16-32 GB/s PCIe bus. It is therefore critical that inference runs completely within a single GPU.

If possible, run Gemma 7B (or any LLM) on a single GPU: you avoid the memory transfers of a multi-GPU setup, where each GPU sits underutilized because its VRAM bandwidth is never saturated. In these tests the slowdown was about 8x (12 sec vs 102 sec) on a single A6000 (Ampere 48G, 10k cores at 100% utilization) versus dual 4090s (Ada 2x 24G, 2x 16k cores at 30% utilization).

What is interesting is that under certain circumstances a fast CPU with fast DDR5 RAM can outperform an older-generation pair of GPUs: nothing matches VRAM speed, but system RAM bandwidth still beats the PCIe bus.

Links

Gemma on Vertex AI Model Garden — https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335

Gemma on Hugging Face — https://huggingface.co/google/gemma-7b

Gemma on Kaggle — https://www.kaggle.com/models/google/gemma

For Google Developers
By Michael O’Brien
