You ship a fine-tuned Llama model that runs at ~45 tokens/sec on a single A100 in staging, then production traffic hits and throughput collapses to ~18 tokens/sec with p95 latency doubling. The model didn’t change; your pipeline did: cold starts, container image bloat, noisy neighbors, and a data path that quietly moved tensors over the network one extra time.
This matters right now because AI workloads are getting more “systems-y” than “model-y.” Frameworks like PyTorch 2.x (with torch.compile), vLLM, and Ray push more performance decisions into runtime scheduling, kernel selection, and memory layout. At the same time, GPU supply is tight, and teams are being asked to squeeze more tokens/sec per dollar while keeping deployment friction low.
CoreWeave is interesting here because it’s built around GPU-first infrastructure and Kubernetes-native primitives. For developers, that means you can treat “GPU fleet + fast storage + container scheduling” as code, then iterate on performance like any other part of your stack: measure, change one variable, measure again.
1) Perspective: Infrastructure-as-Code for GPU scheduling (Kubernetes + CoreWeave)
If you can’t reliably land on the right GPU class with predictable placement, every other optimization is noise. I’ve seen teams lose 20–35% throughput simply because pods drifted onto smaller GPUs during autoscaling, or because requests/limits weren’t set correctly and the scheduler packed workloads too tightly.
On CoreWeave, you typically express GPU needs via Kubernetes resource requests, plus node affinity/tolerations to target the right node pools. Keep your requests explicit; “best effort” GPU scheduling is how you get surprise latency spikes.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      # Target GPU nodes (label names vary by cluster; adjust to your CoreWeave setup)
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["gpu-a100-80gb"]
      containers:
        - name: server
          image: ghcr.io/vllm-project/vllm-openai:v0.5.4
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
            - "--gpu-memory-utilization"
            - "0.90"
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "16Gi"
            limits:
              nvidia.com/gpu: "1"
              cpu: "8"
              memory: "24Gi"
```
Tips and gotchas
- Set both requests and limits for CPU and memory. Starving the CPU can drop tokens/sec noticeably because tokenization, HTTP, and sampling logic still run on CPU.
- Pin model cache storage near compute. If your model weights or KV cache spill to network storage under pressure, you’ll see p95 latency cliffs.
- Measure cold start separately from steady-state. A 2–5 minute image pull + model load can dominate user experience if you scale to zero.
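Measuring cold start separately is easiest if startup is instrumented per phase from the beginning. A minimal sketch (the `phase` context manager and `PHASES` dict are hypothetical names, not part of any framework), so model load and first-token time get reported as distinct numbers instead of one blended figure:

```python
import time
from contextlib import contextmanager

# Hypothetical phase timer: records wall-clock seconds per startup phase so
# image pull, model load, and first inference can be reported separately.
PHASES: dict[str, float] = {}

@contextmanager
def phase(name: str):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        PHASES[name] = time.perf_counter() - t0

if __name__ == "__main__":
    with phase("model_load"):
        time.sleep(0.1)   # stand-in for loading weights from cache
    with phase("first_token"):
        time.sleep(0.05)  # stand-in for the first forward pass
    for name, secs in PHASES.items():
        print(f"{name}: {secs * 1000:.0f} ms")
```

Emit these phases as metrics and you can tell at a glance whether a p95 regression is a cold-start problem or a steady-state one.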
2) Perspective: Container and build hygiene (Python wheels, CUDA libs, and image size)
Most “cloud AI is slow” stories I’ve debugged ended up being “my container is huge and my startup path is doing extra work.” A 12–18 GB image isn’t rare when people copy full CUDA toolkits, build deps, and caches into production. On a busy cluster, that can add minutes to rollouts and autoscaling events.
A cleaner approach is multi-stage builds, caching wheels, and installing only runtime CUDA libraries. You won’t get a 10x speedup in steady-state tokens/sec, but you can cut cold start time by 30–70% depending on your baseline.
```dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.12-slim AS builder
WORKDIR /w
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential git && rm -rf /var/lib/apt/lists/*
COPY pyproject.toml uv.lock ./
# uv is fast and reproducible; pip works too if you prefer
RUN pip install --no-cache-dir uv==0.4.20
# Resolve the locked environment once (validates the lockfile, warms the cache)
RUN uv sync --frozen --no-install-project --python-preference=only-system
# Build the app wheel (keeps the runtime stage free of build tooling)
COPY . .
RUN uv build

FROM python:3.12-slim AS runtime
WORKDIR /app
# Install the wheel built above; pip resolves its runtime deps from the wheel metadata
COPY --from=builder /w/dist/*.whl /tmp/
RUN pip install --no-cache-dir /tmp/*.whl && rm -rf /tmp/*.whl
# Avoid Python writing .pyc files in read-only containers
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["python", "-m", "my_service.api"]
```
Practical considerations
- Don’t ship compilers in the runtime image. If you need to compile CUDA extensions, do it in the builder stage and copy artifacts.
- Pin CUDA/PyTorch compatibility. Mismatched wheels can silently fall back to CPU ops or trigger slow kernels. Validate with a startup self-check.
- Track image size and pull time as first-class metrics. I’ve seen teams cut deploy rollback time from ~9 minutes to ~3 minutes just by shrinking images.
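The startup self-check mentioned above can be a few asserts run before the server binds its port. A minimal sketch, assuming CUDA 12-era wheels; `self_check` and `cuda_version_ok` are hypothetical names, and the expected major version is something you pin to match your image:

```python
import importlib

# Hypothetical startup self-check: fail fast if PyTorch silently fell back
# to CPU or was built against an unexpected CUDA version.

def cuda_version_ok(built: str, expected_major: int) -> bool:
    """Compare torch's compiled CUDA version string (e.g. '12.1')
    against the major version the image was built for."""
    return int(built.split(".")[0]) == expected_major

def self_check(expected_cuda_major: int = 12) -> None:
    # Imported lazily so the helper above stays importable without torch
    torch = importlib.import_module("torch")
    assert torch.cuda.is_available(), "CUDA not available: check driver/runtime libs"
    assert torch.version.cuda is not None, "torch was built without CUDA support"
    assert cuda_version_ok(torch.version.cuda, expected_cuda_major), (
        f"torch built for CUDA {torch.version.cuda}, expected major {expected_cuda_major}"
    )
    # Run one tiny kernel so a broken install fails here, not under load
    x = torch.randn(8, 8, device="cuda")
    assert (x @ x).shape == (8, 8)

if __name__ == "__main__":
    self_check()
    print("startup self-check passed")
```

Wire this into your container entrypoint (or a Kubernetes startup probe) so a bad wheel combination crashes a rollout instead of serving slow kernels.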
3) Perspective: Runtime throughput via batching and streaming (vLLM + OpenAI-compatible API)
For LLM inference, the biggest wins usually come from smarter batching and KV cache management. vLLM’s paged attention and continuous batching can increase effective throughput dramatically under concurrent load, especially compared to naïve “one request per GPU” servers.
CoreWeave gives you the infrastructure, but you still need to drive the server correctly. The client should stream tokens to users while the server batches across requests; that’s how you keep p95 latency reasonable without leaving GPU cycles idle.
```python
import os

from openai import OpenAI

# vLLM OpenAI-compatible server endpoint
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed-for-local"),
    base_url=os.environ.get("VLLM_BASE_URL", "http://vllm-inference:8000/v1"),
)

def stream_chat(prompt: str) -> None:
    # Streaming reduces perceived latency and keeps connections active
    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": "Be concise and accurate."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta
        if delta and delta.content:
            print(delta.content, end="", flush=True)
    print()

if __name__ == "__main__":
    stream_chat("Write a SQL query that finds duplicate emails in a users table.")
```
Performance notes
- Batching trade-off: higher throughput can increase time-to-first-token if you over-batch. Watch both TTFT and tokens/sec.
- Set sane max context. Over-allocating context length increases KV cache pressure and can force eviction or lower GPU memory utilization targets.
- Benchmark with concurrency. Single-request benchmarks lie; run 8–64 concurrent clients to see if continuous batching is paying off.
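Watching TTFT and tokens/sec together is mostly bookkeeping: record the arrival time of each streamed chunk, then reduce. A minimal sketch; `stream_stats` is a hypothetical helper, and in a real client you would append `time.perf_counter()` inside the streaming loop:

```python
# Hypothetical helper: given the wall-clock times at which streamed tokens
# arrived, compute time-to-first-token (TTFT) and steady-state tokens/sec.

def stream_stats(t_start: float, token_times: list[float]) -> dict[str, float]:
    ttft = token_times[0] - t_start
    duration = token_times[-1] - token_times[0]
    # Rate over the inter-token interval, so TTFT doesn't skew throughput
    tps = (len(token_times) - 1) / duration if duration > 0 else 0.0
    return {"ttft_s": ttft, "tokens_per_s": tps}

if __name__ == "__main__":
    # In a streaming loop: token_times.append(time.perf_counter()) per chunk
    print(stream_stats(0.0, [0.12, 0.15, 0.18, 0.21]))
```

Tracking the two numbers separately is what reveals over-batching: tokens/sec climbs while TTFT quietly degrades.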
4) Perspective: Distributed training and data locality (Ray + PyTorch DDP)
Training workloads fail in production when data pipelines can’t keep GPUs fed. I’ve seen multi-GPU jobs stuck at 40–55% utilization because dataloaders were bottlenecked on network reads, small file I/O, or CPU transforms. The fix wasn’t “more GPUs,” it was “stop starving the ones you have.”
Ray is a practical middle layer: you can coordinate workers, push preprocessing closer to compute, and keep the job definition as code. For CoreWeave-style Kubernetes clusters, Ray’s operator pattern fits well with GPU node pools.
```python
# train_ray_ddp.py
import os

import ray
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank: int, world_size: int) -> None:
    # Single-node default; for multi-node, MASTER_ADDR must be reachable by all workers
    os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "127.0.0.1")
    os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Ray sets CUDA_VISIBLE_DEVICES per worker, so each worker's GPU is device 0
    torch.cuda.set_device(0)

def cleanup_ddp() -> None:
    dist.destroy_process_group()

@ray.remote(num_gpus=1)
def train_worker(rank: int, world_size: int) -> float:
    setup_ddp(rank, world_size)
    model = torch.nn.Linear(4096, 4096, bias=False).cuda()
    ddp = DDP(model, device_ids=[0])
    opt = torch.optim.AdamW(ddp.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    # Synthetic batch; replace with a real DataLoader pinned to local/cache storage
    loss = torch.tensor(0.0)
    for _ in range(50):
        x = torch.randn(32, 4096, device="cuda")
        y = torch.randn(32, 4096, device="cuda")
        opt.zero_grad(set_to_none=True)
        loss = loss_fn(ddp(x), y)
        loss.backward()
        opt.step()
    cleanup_ddp()
    return float(loss.detach().cpu().item())

if __name__ == "__main__":
    ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))
    world_size = int(os.environ.get("WORLD_SIZE", "2"))
    losses = ray.get([train_worker.remote(r, world_size) for r in range(world_size)])
    print({"final_losses": losses})
```
Data-path tips
- Prefer fewer, larger files (e.g., WebDataset shards) over millions of tiny objects. Tiny-file overhead can dominate and cut GPU utilization by 10–30%.
- Pin dataloader workers and use persistent_workers to avoid fork/spawn overhead each epoch.
- Watch NCCL time. If all-reduce dominates, you’re network-bound; scale up GPUs per node before scaling out across nodes.
5) Perspective: Cost and performance guardrails (load tests + autoscaling signals)
Optimizing AI workloads isn’t only about peak throughput; it’s about keeping performance stable as traffic changes. I’ve seen autoscalers thrash because they scaled on CPU while the real constraint was GPU memory, causing oscillations and intermittent OOMs.
Put guardrails in code: run a repeatable load test, capture latency and throughput, and feed autoscaling with the right signals (GPU utilization, queue depth, or in-flight requests). If you don’t measure tokens/sec per dollar, you’ll end up optimizing the wrong thing.
```js
// loadtest.js (Node.js 20+; ESM — set "type": "module" in package.json or rename to .mjs)
import http from "node:http";
import { setTimeout as sleep } from "node:timers/promises";

const BASE = process.env.VLLM_BASE_URL ?? "http://localhost:8000/v1";
const CONCURRENCY = Number(process.env.CONCURRENCY ?? 16);
const REQUESTS_PER_WORKER = Number(process.env.REQS ?? 20);

function postJson(path, body) {
  return new Promise((resolve, reject) => {
    const data = JSON.stringify(body);
    const req = http.request(
      `${BASE}${path}`,
      {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Content-Length": Buffer.byteLength(data),
          // vLLM may ignore auth; keep header for compatibility
          Authorization: `Bearer ${process.env.OPENAI_API_KEY ?? "x"}`,
        },
      },
      (res) => {
        let buf = "";
        res.setEncoding("utf8");
        res.on("data", (c) => (buf += c));
        res.on("end", () => resolve({ status: res.statusCode, body: buf }));
      }
    );
    req.on("error", reject);
    req.write(data);
    req.end();
  });
}

async function worker(id) {
  const latencies = [];
  for (let i = 0; i < REQUESTS_PER_WORKER; i++) {
    const t0 = performance.now();
    const resp = await postJson("/chat/completions", {
      model: "meta-llama/Meta-Llama-3-8B-Instruct",
      messages: [{ role: "user", content: "Give 3 tips to speed up PyTorch inference." }],
      temperature: 0.2,
      max_tokens: 128,
      stream: false,
    });
    const t1 = performance.now();
    if (resp.status !== 200) throw new Error(`HTTP ${resp.status}: ${resp.body}`);
    latencies.push(t1 - t0);
    // Small jitter prevents lockstep batching artifacts in tests
    await sleep(10 + Math.random() * 30);
  }
  return latencies;
}

function percentile(xs, p) {
  const s = [...xs].sort((a, b) => a - b);
  const idx = Math.floor((p / 100) * (s.length - 1));
  return s[idx];
}

const all = (await Promise.all([...Array(CONCURRENCY)].map((_, i) => worker(i)))).flat();
console.log({
  n: all.length,
  p50_ms: percentile(all, 50).toFixed(1),
  p95_ms: percentile(all, 95).toFixed(1),
  p99_ms: percentile(all, 99).toFixed(1),
});
```
What to scale on
- Inference: scale on queue depth / in-flight requests per replica, not CPU. CPU is often low while GPU is saturated.
- Training: scale based on step time and GPU utilization. If utilization is <70%, fix input pipeline before adding nodes.
- Cost control: enforce max replicas and use canary rollouts. A bad config can burn through a GPU budget in hours.
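Tokens/sec per dollar is simple arithmetic, but writing it down as code keeps everyone optimizing the same number. A minimal sketch; `tokens_per_dollar` is a hypothetical helper and the figures in the example are illustrative, not real CoreWeave pricing:

```python
# Hypothetical guardrail metric: tokens per dollar of GPU spend, so capacity
# changes can be compared on cost efficiency rather than raw throughput.

def tokens_per_dollar(fleet_tokens_per_sec: float,
                      gpu_hourly_usd: float,
                      num_gpus: int) -> float:
    """Total tokens generated per dollar across the whole fleet."""
    dollars_per_sec = gpu_hourly_usd * num_gpus / 3600.0
    return fleet_tokens_per_sec / dollars_per_sec

if __name__ == "__main__":
    # Illustrative numbers: 1000 tok/s fleet-wide on 2 GPUs at $2/hr each
    print(f"{tokens_per_dollar(1000.0, 2.0, 2):,.0f} tokens per dollar")
```

Record this per deployment revision and a config that doubles replicas for a 20% throughput gain shows up immediately as a regression.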
Use CoreWeave when you want Kubernetes-native control over GPU placement and you’re ready to treat performance as a code problem: explicit scheduling, slim images, a batching-aware inference server (vLLM), and distributed training that keeps data close to compute. Start by locking down GPU requests/affinity and shrinking cold starts, then benchmark concurrency and tune batching; that sequence tends to produce the biggest, most repeatable gains in tokens/sec per dollar without destabilizing latency.