Google/gemma-4-E2B-it
Google's compact Gemma 4 multimodal model (effective 2B) with native text, image, and audio input, plus a thinking mode and a tool-use protocol.
Compact unified multimodal model with audio, thinking, and function calling — runs on a single 24 GB+ GPU
Overview
Gemma 4 E2B is the smallest member of Google's Gemma 4 family — an effective-2B unified multimodal model that natively processes text, images, and audio, with structured thinking/reasoning, function calling, and dynamic vision resolution. It runs comfortably on a single 24 GB+ GPU.
Key Features
- Multimodal: Text + images + audio natively (video via custom frame-extraction pipeline).
- Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
- Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Prerequisites
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)
Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
--extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade
Docker
docker pull vllm/vllm-openai:gemma4-0505-cu129 # NVIDIA Hopper (H100/H200, CUDA 12.9)
docker pull vllm/vllm-openai:gemma4-0505-cu130 # NVIDIA Blackwell (B200/B300, CUDA 13.0)
docker pull vllm/vllm-openai-rocm:latest # AMD
Deployment Configurations
Quick Start (Single GPU)
vllm serve google/gemma-4-E2B-it \
--max-model-len 32768
With Audio Support
vllm serve google/gemma-4-E2B-it \
--max-model-len 8192 \
--limit-mm-per-prompt image=4,audio=1
Full-Featured Server Launch
Enables text, image, audio, thinking, and tool calling:
vllm serve google/gemma-4-E2B-it \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt image=4,audio=1 \
--async-scheduling \
--host 0.0.0.0 \
--port 8000
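With --enable-auto-tool-choice and the gemma4 tool-call parser, the server speaks the standard OpenAI function-calling API. A minimal sketch of a tool-calling request payload, using a hypothetical get_weather tool (the tool name and schema are illustrative, not part of the model):

```python
# Hypothetical tool schema in OpenAI function-calling format;
# "get_weather" and its parameters are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo right now?"}]

# Pass both to the OpenAI client shown in the Client Usage examples:
#   client.chat.completions.create(model="google/gemma-4-E2B-it",
#                                  messages=messages, tools=tools,
#                                  tool_choice="auto")
# With --enable-auto-tool-choice the server decides when to emit
# response.choices[0].message.tool_calls.
```

When the model elects to call the tool, tool_calls carries the function name and JSON arguments; execute the call yourself and append a role="tool" message with the result to continue the conversation.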
Docker (NVIDIA)
docker run -itd --name gemma4-e2b \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:gemma4-0505-cu129 \
--model google/gemma-4-E2B-it \
--max-model-len 32768 \
--host 0.0.0.0 --port 8000
Swap vllm/vllm-openai:gemma4-0505-cu129 for vllm/vllm-openai:gemma4-0505-cu130 on Blackwell (B200/B300).
Docker (AMD MI300X/MI325X/MI350X/MI355X)
docker run -itd --name gemma4-rocm \
--ipc=host --network=host --privileged \
--cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
--group-add=video --cap-add=SYS_PTRACE \
--security-opt=seccomp=unconfined --shm-size 16G \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:latest \
--model google/gemma-4-E2B-it \
--host 0.0.0.0 --port 8000
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
docker run -itd --name gemma4-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model google/gemma-4-E2B-it \
--max-model-len 16384 \
--disable_chunked_mm_input \
--host 0.0.0.0 --port 8000
Client Usage
Audio Transcription
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="google/gemma-4-E2B-it",
messages=[{"role": "user", "content": [
{"type": "audio_url", "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/2/22/Beatbox_by_Wikipedia_user_Wikipedia_Brown.ogg"}},
{"type": "text", "text": "Provide a verbatim, word-for-word transcription of the audio."},
]}],
max_tokens=512,
)
print(response.choices[0].message.content)
Image Understanding
response = client.chat.completions.create(
model="google/gemma-4-E2B-it",
messages=[{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."},
]}],
max_tokens=1024,
)
print(response.choices[0].message.content)
Thinking Mode
Launch with reasoning parser, then enable per-request:
vllm serve google/gemma-4-E2B-it \
--max-model-len 16384 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--chat-template examples/tool_chat_template_gemma4.jinja
Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
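As a sketch, the per-request toggle travels in extra_body; the reasoning_content field in the comment assumes vLLM's reasoning-parser output convention:

```python
# Per-request thinking toggle, passed through chat_template_kwargs.
extra_body = {"chat_template_kwargs": {"enable_thinking": True}}

messages = [{
    "role": "user",
    "content": "A train leaves at 9:40 and arrives at 11:05. How long is the trip?",
}]

# With the server launched as above (--reasoning-parser gemma4):
#   response = client.chat.completions.create(
#       model="google/gemma-4-E2B-it",
#       messages=messages,
#       extra_body=extra_body,
#   )
# The reasoning parser splits the output: response.choices[0].message.reasoning_content
# holds the thought trace, while .content holds the final answer.
```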
Configuration Tips
- Set --max-model-len to match your workload (max 131072).
- Image-only workloads: --limit-mm-per-prompt audio=0.
- Text-only workloads: --limit-mm-per-prompt image=0,audio=0 to skip multimodal memory profiling.
- --async-scheduling improves throughput.
- FP8 KV cache (--kv-cache-dtype fp8) saves ~50% of KV-cache memory.
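Putting those tips together, a text-only, throughput-oriented launch might look like this (a sketch; tune the values for your hardware):

```shell
vllm serve google/gemma-4-E2B-it \
  --max-model-len 32768 \
  --limit-mm-per-prompt image=0,audio=0 \
  --kv-cache-dtype fp8 \
  --async-scheduling
```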
Speculative Decoding (MTP)
To use MTP drafting with the assistant model, add --speculative-config at launch. The recommended num_speculative_tokens for this model is 2. The E2B assistant uses centroids masking for efficient sparse logit computation. See the Gemma 4 usage guide for details and benchmarks.
Note: MTP speculative decoding for Gemma 4 is only available on the vLLM nightly build; it has not yet landed in a stable release. Install via the nightly wheel (uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly/cu129 …) or use the vllm/vllm-openai:gemma4-0505-cu129 / vllm/vllm-openai:gemma4-0505-cu130 images above; the standard :latest stable tag does not include this feature.
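The --speculative-config flag takes a JSON object; a sketch of a launch with MTP drafting, assuming the "mtp" method name follows vLLM's speculative-config convention:

```shell
vllm serve google/gemma-4-E2B-it \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```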