Cloud Run GPUs Are Generally Available: Run AI Workloads with Pay-per-Second and Scale to Zero

NVIDIA GPU support on Cloud Run is GA. Pay per second, scale to zero, sub-5s startup, and L4 GPUs without quota requests — official Google Cloud guide.

4 min read By Jatinder (Jay) Bhola Featured

Cloud Run Just Got a GPU Runtime

Cloud Run is Google Cloud’s serverless runtime for containers. As of June 2025, NVIDIA GPU support on Cloud Run is generally available (GA). You can run inference, batch jobs, and other GPU workloads with the same operational model: no cluster to manage, pay for what you use, and scale to zero when idle.

This post summarizes the official announcement and docs: Cloud Run GPUs, now GA (Google Cloud Blog, Serverless). All links are to Google sources only.

Cloud Run GPU benefits: pay-per-second, scale to zero, under 5s startup, L4 no quota


What You Get with Cloud Run + GPUs

Benefit What it means
Pay-per-second billing Charged only for GPU (and CPU) time consumed, down to the second.
Scale to zero Instances scale down to zero when there are no requests; no idle GPU cost.
Fast startup From zero to an instance with GPU and drivers in under 5 seconds.
Streaming HTTP and WebSocket streaming supported — e.g. stream LLM tokens as they’re generated.
L4 GPUs, no quota request NVIDIA L4 GPUs are available without a separate quota request.
SLA Same Cloud Run SLA as the rest of the product.

Quick Deploy: Ollama on Cloud Run with One GPU

From the official blog, you can run Ollama (open models) on Cloud Run with one GPU:

gcloud run deploy my-global-service \
  --image ollama/ollama --port 11434 \
  --gpu 1 \
  --regions us-central1,europe-west1,asia-southeast1

This deploys a single service with one GPU across three regions. Use the Cloud Run locations guide for GPU-enabled regions.


GPU Regions (Official List)

At GA, Cloud Run GPUs were available in:

  • us-central1 (Iowa, USA)
  • europe-west1 (Belgium)
  • europe-west4 (Netherlands)
  • asia-southeast1 (Singapore)
  • asia-south1 (Mumbai, India)

Check the latest: Cloud Run locations — GPU.


Multi-Region and High Availability

  • Zonal redundancy is on by default — helps withstand a single-zone failure.
  • You can turn it off for lower cost and best-effort failover; see Cloud Run pricing.
  • The same blog shows deploying one service to multiple regions with a single command for lower latency and higher availability.

Cloud Run Jobs with GPUs

GPUs aren’t only for request-driven services. Cloud Run jobs can use GPUs for:

  • Model fine-tuning — Run a job, process data, scale to zero when done.
  • Batch inference — Images, text, recommendations at scale.
  • Batch media processing — Transcode, thumbnails, image pipelines.

At the time of the GA post, GPUs on Cloud Run jobs were in private preview; sign-up was via a form linked in the blog. Check the Cloud Run jobs docs for current status.


See It in Action (Official Demo)

Google Cloud Next ’25 featured a live demo of scaling from 0 to 100 NVIDIA GPUs in about four minutes for a Stable Diffusion–style workload:


How to Enable GPUs in Your Service

  • Console: When creating or editing a Cloud Run service, enable the GPU option and choose the type (e.g. L4).
  • gcloud: Add --gpu 1 (or the desired count) to your gcloud run deploy command.
  • Docs: Configuring GPUs for Cloud Run.

Best Practices and Next Steps


References (Google sources only)