You just trained a massive multimodal model, and your inference costs are 40% higher than projected because your GPU instances are idle half the time. The bill arrives, and you realize your carefully orchestrated Kubernetes cluster is spending more time managing spot instance preemptions than running batches. This isn't a hypothetical; it's the weekly stand-up for teams trying to run serious AI workloads on general-purpose cloud platforms.
The economics of modern AI are brutal. Model sizes are growing exponentially, but the underlying cloud infrastructure hasn't kept pace with the specific demands of GPU-heavy, bursty, and fault-tolerant compute. General-purpose clouds treat GPUs as just another instance type, leading to poor utilization, complex orchestration, and unpredictable costs. This mismatch is why specialized GPU clouds like CoreWeave have gained traction, offering a hardware stack and software layer built explicitly for AI.
For developers, the shift isn't just about renting different VMs. It's about adopting new architectural patterns that align with a hardware-first approach. The wrong code patterns can leave thousands of dollars of GPU time on the table, while the right ones can slash inference latency and training time. Let's examine five concrete coding perspectives, from infrastructure-as-code to the kernel level, that make the difference.
1. Infrastructure as Code: Treating GPUs as Ephemeral, Not Precious
On traditional clouds, you provision a GPU instance and cling to it. On CoreWeave, the model is different. With direct access to NVIDIA's latest hardware (H100, A100, L40S) and high-availability guarantees, you can design for failure and transience. Your IaC should reflect that.
Using Terraform or Pulumi, you define workloads that can be torn down and recreated without data loss. The key is separating stateful storage (like model weights or training datasets) from the compute layer. CoreWeave's storage solutions, like Hot Pools (NVMe-backed) or Object Storage, are mounted via CSI drivers, making this separation clean.
Here's a Pulumi example (TypeScript) that deploys an inference service designed for this ephemerality. It uses CoreWeave's Kubernetes-native APIs via their custom resource definitions (CRDs).
import * as k8s from "@pulumi/kubernetes";
import * as pulumi from "@pulumi/pulumi";
const appName = "llm-inference";
// 1. Define a VirtualServer (CoreWeave CRD) for the GPU workload
const inferenceServer = new k8s.apiextensions.CustomResource(appName, {
apiVersion: "virtualservers.coreweave.com/v1alpha1",
kind: "VirtualServer",
metadata: {
name: appName,
namespace: "tenant-myteam", // Your CoreWeave namespace
},
spec: {
region: "ORD1", // Chicago data center
os: {
type: "linux",
},
resources: {
gpu: {
type: "A100_PCIE_80GB", // Explicit hardware selection
count: 2,
},
cpu: {
cores: 16,
},
memory: "120Gi",
},
storage: {
// Ephemeral root disk. Nothing persistent lives here.
root: {
size: "100Gi",
storageClassName: "block-nvme-ord1", // High-performance local NVMe
},
// Persistent volume for models, mounted from object storage
additionalDisks: [{
name: "model-store",
size: "500Gi",
storageClassName: "object-storage", // S3-compatible, slower but persistent
mountPath: "/models",
}],
},
// User data script runs on boot
userData: `#!/bin/bash
# Pull the latest model weights from the persistent store
aws s3 sync s3://my-bucket/models/llama-3-70b /models/llama-3 --endpoint-url=$S3_ENDPOINT
# Start the inference server, loading from /models
python -m vllm.entrypoints.api_server --model /models/llama-3 --tensor-parallel-size=2 &
wait
`,
network: {
public: true, // Gets a public IP
tcp: {
ports: [8000], // Expose vLLM's API port
},
},
initializeRunning: true,
},
});
// 2. Output the public IP for immediate use
export const publicIP = inferenceServer.status.apply(s => s?.network?.publicIP);
The critical mindset shift here is in the `storage` block. The root disk is fast NVMe but ephemeral. The model weights live on a separate, persistent object storage mount. The `userData` script syncs weights on boot. If the instance is preempted or fails, the next one pulls the latest weights and starts serving. You're coding for resilience, not permanence.
Performance Consideration
Syncing 100GB+ models on every boot adds cold-start latency. Mitigate this by using CoreWeave's Inference Cache or by maintaining a warm pool of pre-loaded instances behind a load balancer. For training, use the `block-nvme` class for checkpointing to avoid network storage I/O bottlenecks.
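To make the checkpointing side of this concrete, here is a minimal sketch of the pattern (the directory layout and helper names are ours, not CoreWeave APIs): write each checkpoint atomically to the local NVMe volume, then push it to persistent storage in a background thread so the training loop never blocks on network I/O.

```python
import os
import shutil
import tempfile
import threading

def save_checkpoint_atomic(state: bytes, nvme_dir: str, step: int) -> str:
    """Write to a temp file on the same NVMe volume, then rename.
    A rename within one filesystem is atomic, so a preempted pod
    never resumes from a half-written checkpoint."""
    os.makedirs(nvme_dir, exist_ok=True)
    final_path = os.path.join(nvme_dir, f"ckpt_{step:06d}.pt")
    fd, tmp_path = tempfile.mkstemp(dir=nvme_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(state)  # in real training: torch.save(model.state_dict(), f)
    os.replace(tmp_path, final_path)
    return final_path

def upload_async(local_path: str, dest_dir: str) -> threading.Thread:
    """Copy the finished checkpoint to 'remote' storage off the hot path.
    In production this would be an S3 upload to object storage; a local
    copy stands in for it here."""
    def _upload():
        os.makedirs(dest_dir, exist_ok=True)
        shutil.copy(local_path, dest_dir)
    t = threading.Thread(target=_upload, daemon=True)
    t.start()
    return t  # join() on the next checkpoint, or before exiting

# Demo with a throwaway directory standing in for the NVMe mount
scratch = tempfile.mkdtemp()
path = save_checkpoint_atomic(b"weights", scratch, step=100)
upload_async(path, os.path.join(scratch, "remote")).join()
print(os.path.basename(path))  # ckpt_000100.pt
```

The atomic rename is the important detail: if preemption hits mid-write, the resume script only ever sees complete `ckpt_*.pt` files.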
2. Container Orchestration: Kubernetes Jobs for Embarrassingly Parallel Training
Hyperparameter tuning, large-scale dataset preprocessing, and model evaluation are embarrassingly parallel tasks. On CoreWeave, you don't request a giant multi-node cluster and manage it. You define a Kubernetes `Job` or `PyTorchJob` (using the Kubeflow operator) that requests *N* independent GPU pods. The scheduler finds the available hardware, even if it's spread across different physical nodes.
This is more efficient and cost-effective than managing a static cluster. You pay only for the GPU time during the actual parallel execution, not for idle interconnect overhead. The following example uses an Indexed `Job` (`completionMode: Indexed`) to run 50 parallel hyperparameter trials.
apiVersion: batch/v1
kind: Job
metadata:
name: hyperparam-sweep
namespace: tenant-myteam
spec:
completions: 50 # Run 50 pods total
parallelism: 10 # Run 10 pods concurrently
completionMode: Indexed # Each pod gets a unique index (0-49)
template:
spec:
nodeSelector:
# Target specific GPU types for consistency
gpu.nvidia.com/class: A100_PCIE_80GB
containers:
- name: trial-runner
image: myregistry.com/training:py3.12-torch2.3
resources:
limits:
# Request exactly one GPU per pod
nvidia.com/gpu: 1
cpu: 8
memory: 60Gi
env:
- name: TRIAL_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        command: ["python", "/app/run_trial.py"]
        args:
          # The script selects its parameter pair from the mounted
          # ConfigMap using the pod's completion index.
          - "--trial-index=$(TRIAL_INDEX)"
          - "--params-file=/config/parameters.json"
        # Mount fast scratch space and the parameter file for each trial
        volumeMounts:
        - name: scratch-nvme
          mountPath: /scratch
        - name: trial-parameters
          mountPath: /config
          readOnly: true
      volumes:
      - name: scratch-nvme
        ephemeral:
          volumeClaimTemplate:
            spec:
              storageClassName: block-nvme-ord1
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 200Gi
      - name: trial-parameters
        configMap:
          name: trial-parameters
restartPolicy: Never
backoffLimit: 1
---
# A ConfigMap to define parameters for each index
apiVersion: v1
kind: ConfigMap
metadata:
name: trial-parameters
data:
# Generate 50 unique parameter pairs. In practice, you'd generate this programmatically.
parameters.json: |
[
{"LR": "0.001", "BATCH": "32"},
{"LR": "0.001", "BATCH": "64"},
{"LR": "0.0005", "BATCH": "128"},
# ... 47 more entries
]
The pod uses the `TRIAL_INDEX` environment variable to select its unique parameter set from the ConfigMap. Each pod gets its own dedicated, fast NVMe scratch disk (`block-nvme-ord1`) for intermediate data. The `parallelism: 10` controls the concurrency, preventing you from overwhelming the scheduler or your data source.
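The selection logic inside the pod is small. Here is a sketch of what the hypothetical `/app/run_trial.py` entry point might do with the mounted parameter file (path and flag names are illustrative):

```python
import json
import os
import tempfile

def load_trial_params(params_file: str, index: int) -> dict:
    """Pick this pod's hyperparameters by its job completion index
    (the index arrives via the TRIAL_INDEX env var in the Job spec)."""
    with open(params_file) as f:
        trials = json.load(f)
    if not 0 <= index < len(trials):
        raise IndexError(f"index {index} out of range for {len(trials)} trials")
    return trials[index]

# Demo: a temp file stands in for the ConfigMap mounted at /config/parameters.json
params_file = os.path.join(tempfile.mkdtemp(), "parameters.json")
with open(params_file, "w") as f:
    json.dump([{"LR": "0.001", "BATCH": "32"},
               {"LR": "0.001", "BATCH": "64"}], f)

index = int(os.environ.get("TRIAL_INDEX", "1"))
params = load_trial_params(params_file, index)
print(params["LR"], params["BATCH"])
# ...hand params to the actual training loop here
```

The bounds check matters: a misconfigured `completions` count should fail the pod loudly rather than silently reuse another trial's parameters.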
Gotcha: Be mindful of egress costs. If each pod downloads a 1TB dataset from the public internet, you'll have a nasty surprise. Always pre-stage large datasets in CoreWeave's Object Storage or use their Dataset Caching feature.
3. Low-Level GPU Control: Maximizing Utilization with CUDA Graphs
When you're paying by the second for H100s, 100% GPU utilization is the goal. A common bottleneck in inference servers is the Python launch overhead for tiny GPU kernels. CUDA Graphs solve this by capturing a sequence of kernels (e.g., a single model forward pass) into a single, replayable unit. This eliminates launch latency and can improve throughput by 2-10x for small-batch inference.
You don't need to write raw CUDA C++ to benefit. Frameworks like PyTorch and NVIDIA's TensorRT-LLM have built-in support. Here's how you explicitly enable and benchmark CUDA Graphs in a vLLM inference setup, which is a common deployment on CoreWeave.
# inference_benchmark.py
import argparse
import torch
from vllm import LLM, SamplingParams
import time
def run_without_graph(model_id: str, prompt: str, num_iters: int = 100):
"""Baseline: Standard vLLM engine without CUDA Graph."""
    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,   # Uses 2 GPUs
        enforce_eager=True,       # Run eagerly; no CUDA Graph capture
        gpu_memory_utilization=0.9,
    )
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
start = time.perf_counter()
for _ in range(num_iters):
# Each call incurs Python->CUDA launch overhead
llm.generate([prompt], sampling_params)
torch.cuda.synchronize() # Wait for all GPU work
elapsed = time.perf_counter() - start
    print(f"No CUDA Graph: {num_iters/elapsed:.2f} iters/sec, {elapsed:.2f}s total")
def run_with_graph(model_id: str, prompt: str, num_iters: int = 100):
"""Optimized: Use CUDA Graph for kernel sequence replay."""
    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        enforce_eager=False,      # Default: let vLLM capture CUDA Graphs
        gpu_memory_utilization=0.9,
    )
    sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
    # vLLM captures its CUDA Graphs during engine startup; this warm-up
    # call primes caches before the timed loop begins.
    print("Warming up...")
    llm.generate([prompt], sampling_params)
start = time.perf_counter()
for _ in range(num_iters):
# These iterations replay the pre-captured graph, minimizing overhead.
llm.generate([prompt], sampling_params)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"With CUDA Graph: {num_iters/elapsed:.2f} iters/sec, {elapsed:.2f}s total")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="/models/llama-3-8b")
args = parser.parse_args()
test_prompt = "Explain quantum computing in one sentence."
iterations = 500
run_without_graph(args.model_path, test_prompt, iterations)
run_with_graph(args.model_path, test_prompt, iterations)
With CUDA Graph capture enabled, vLLM records the decode kernel sequence for a fixed set of batch sizes. The capture itself is expensive, but every subsequent step replays a pre-built graph with a single launch and is dramatically faster. On a CoreWeave A100, I've seen this reduce per-token latency by ~40% for a consistent stream of single requests.
Trade-off: CUDA Graphs are inflexible. A graph is recorded for an exact batch size, sequence-length bucket, and model configuration. If your workload is highly dynamic (batch sizes swinging from 1 to 32), requests get padded up to the nearest captured size and the memory reserved for captures grows. Use them for predictable, high-throughput inference endpoints.
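The same mechanism is available directly in PyTorch via `torch.cuda.CUDAGraph` if you need replay outside vLLM. A minimal sketch (the helper name is ours, and it falls back to eager execution so it runs on CPU-only machines):

```python
import torch

def make_replayable_step(model, example_input):
    """Return a callable that replays a captured CUDA Graph of one
    forward pass; falls back to eager execution without a GPU."""
    model.eval()
    if not torch.cuda.is_available():
        return lambda x: model(x)

    model = model.cuda()
    static_in = example_input.cuda().clone()
    # Warm up on a side stream so capture sees stable memory allocations
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_out = model(static_in)

    def replay(x):
        static_in.copy_(x.cuda())  # copy into the captured input buffer
        graph.replay()             # one launch replays the whole kernel sequence
        return static_out
    return replay

step = make_replayable_step(torch.nn.Linear(16, 16), torch.randn(1, 16))
with torch.no_grad():
    out = step(torch.randn(1, 16))
print(tuple(out.shape))  # (1, 16)
```

Note that on the GPU path the returned tensor is the graph's static output buffer; clone it if you need to keep results across calls.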
4. Network-Aware Data Loading: Saturating 800 Gb/s InfiniBand
Training a 70B parameter model requires efficient multi-node communication. CoreWeave's bare-metal servers are connected with NVIDIA Quantum-2 InfiniBand, offering 400-800 Gb/s bandwidth. Naive data loading from networked storage can become the bottleneck, leaving these expensive links underutilized.
The solution is to overlap data preprocessing, CPU-to-GPU transfer (H2D), and GPU computation. PyTorch's DataLoader with multiple workers and pin_memory helps, but for maximum throughput on distributed setups, you need a pipeline. Here's a pattern using PyTorch's `DistributedDataParallel` (DDP) with a prefetching dataloader.
# distributed_train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from datasets import load_dataset
from transformers import AutoTokenizer
from functools import partial
def collate_fn(batch, tokenizer, max_length=2048):
"""Optimized collation on CPU."""
texts = [item["text"] for item in batch]
# Tokenize on CPU in parallel
encodings = tokenizer(
texts,
truncation=True,
padding="max_length",
max_length=max_length,
return_tensors="pt",
)
return encodings["input_ids"], encodings["attention_mask"]
def create_high_throughput_loader(dataset, tokenizer, batch_size, world_size, rank):
"""Creates a DataLoader designed to keep GPUs fed."""
sampler = DistributedSampler(
dataset,
num_replicas=world_size,
rank=rank,
shuffle=True,
seed=42,
)
# Key parameters:
loader = DataLoader(
dataset,
batch_size=batch_size,
sampler=sampler,
        num_workers=8,            # Rule of thumb: 4-8 workers per GPU on the node
pin_memory=True, # Enables fast H2D async copies
prefetch_factor=4, # Each worker prefetches 4 batches
persistent_workers=True, # Avoids restarting workers each epoch
collate_fn=partial(collate_fn, tokenizer=tokenizer),
)
return loader
def main():
# Initialize distributed process group (NCCL over InfiniBand)
dist.init_process_group("nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")
    # Model and optimizer
    model = MyLargeModel().to(device)  # Placeholder for your model class
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Dataset - Assume it's pre-downloaded to NVMe storage
dataset = load_dataset("parquet", data_files="/nvme_data/train/*.parquet", split="train")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
loader = create_high_throughput_loader(dataset, tokenizer, 32, world_size, rank)
    for epoch in range(10):
        loader.sampler.set_epoch(epoch)  # Crucial for shuffling across epochs
for batch_idx, (input_ids, attention_mask) in enumerate(loader):
# Data is already on pinned memory. Non-blocking transfer to GPU.
input_ids = input_ids.to(device, non_blocking=True)
attention_mask = attention_mask.to(device, non_blocking=True)
            # Computation on GPU (causal LM loss: labels are the inputs)
            outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
            loss = outputs.loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
if batch_idx % 100 == 0 and rank == 0:
print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
if __name__ == "__main__":
main()
The magic is in the `DataLoader` configuration. `num_workers=8` parallelizes data loading and tokenization across CPU cores. `pin_memory=True` allocates page-locked host memory, allowing the `non_blocking=True` transfer to GPU to overlap with kernel execution. `prefetch_factor=4` ensures there's always a buffer of ready batches.
Tip: Profile with `NVIDIA Nsight Systems` or PyTorch's profiler. If you see GPU idle time (`CUDA Kernel` gaps), increase `num_workers` or `prefetch_factor`. If your CPU is saturated, you might need to pre-tokenize your dataset and save it as binary files for direct memory mapping.
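The pre-tokenization escape hatch mentioned above can look like this: tokenize once offline, dump the token IDs to a flat binary file, then memory-map it at training time so workers do no per-batch tokenization (paths, dtype, and class name here are illustrative):

```python
import os
import tempfile
import numpy as np

def write_token_file(token_ids, path):
    """One-time offline step: flatten all token IDs into a binary file."""
    arr = np.asarray(token_ids, dtype=np.uint32)
    arr.tofile(path)
    return arr.size

class MemmapBlockDataset:
    """Serves fixed-length token blocks straight from the OS page cache.
    __getitem__ is a cheap slice; no tokenizer runs at train time."""
    def __init__(self, path, block_size):
        self.data = np.memmap(path, dtype=np.uint32, mode="r")
        self.block_size = block_size

    def __len__(self):
        return len(self.data) // self.block_size

    def __getitem__(self, i):
        start = i * self.block_size
        # Copy out of the memmap so the batch is a regular array
        return np.array(self.data[start:start + self.block_size])

# Demo with a throwaway file standing in for /nvme_data/train.bin
path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_token_file(list(range(100)), path)
ds = MemmapBlockDataset(path, block_size=32)
print(len(ds), ds[1][:3].tolist())  # 3 [32, 33, 34]
```

A dataset like this plugs straight into the `DataLoader` above, and because the file lives on local NVMe, reads hit the page cache instead of network storage.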
5. Observability and Cost Attribution: Instrumenting Every GPU-Second
With great GPU power comes great billing responsibility. You need to know which team, experiment, or API endpoint is consuming resources. CoreWeave provides metrics, but you need to instrument your application code to attribute costs correctly. Structured logging and Prometheus metrics
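As a sketch of what that instrumentation can look like, here is a stdlib-only GPU-second meter; in production you would export these totals as a Prometheus counter (for example via the `prometheus_client` library) rather than hold them in a dict, and the label names here are assumptions:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated GPU-seconds per (team, endpoint) label pair.
# gpu_count scales wall time: 2 GPUs busy for 3s = 6 GPU-seconds.
gpu_seconds = defaultdict(float)

@contextmanager
def metered(team: str, endpoint: str, gpu_count: int = 1):
    """Attribute the GPU time of one request to a team and endpoint."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        gpu_seconds[(team, endpoint)] += elapsed * gpu_count

# Usage: wrap every inference call (or training step)
with metered("search-team", "/v1/completions", gpu_count=2):
    time.sleep(0.01)  # stand-in for llm.generate(...)

print(dict(gpu_seconds))
```

Because the meter runs in a `finally` block, failed requests are billed too, which is exactly what you want: a crashing experiment still consumed the GPU.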