A deep-dive into every layer of a production-grade, fully open-source stack for self-hosting large language models — from the API gateway to the GPU compute plane.
Why self-host?
Cloud-hosted LLM APIs are convenient, but they come with trade-offs that matter at enterprise scale: data leaves your network on every inference call, costs scale linearly with volume (and context windows keep getting longer), and you have no control over model versioning, rate limits, or uptime SLAs. Self-hosting on Kubernetes gives you full control over the stack — at the cost of having to build and operate that stack yourself.
This guide covers every layer of a production LLM serving platform using exclusively open-source tools. We’ll go from a raw HTTP request all the way down to GPU silicon, explaining why each component exists and how they fit together.
Architecture overview
The full stack has seven layers, each solving a distinct set of problems:
```
┌─────────────────────────────────────────────────────────────────┐
│                             CLIENTS                             │
│         Web apps · Mobile · CLI agents · Internal APIs          │
└───────────────────────────┬─────────────────────────────────────┘
                            │ HTTPS
┌───────────────────────────▼─────────────────────────────────────┐
│                        API GATEWAY LAYER                        │
│      cert-manager (TLS) · Keycloak (OIDC) · Rate limiting       │
│     WAF · LLM Guard · Envoy / Kong / Traefik + Gateway API      │
└───────────────────────────┬─────────────────────────────────────┘
                            │ cache hit? return early
┌───────────────────────────▼─────────────────────────────────────┐
│                         LLM CACHE LAYER                         │
│       Exact-match (Redis) · Semantic (Qdrant) · KV prefix       │
└───────────────────────────┬─────────────────────────────────────┘
                            │ cache miss → forward to inference
┌───────────────────────────▼─────────────────────────────────────┐
│                     INFERENCE LAYER (vLLM)                      │
│       Router · Tensor parallelism · Pipeline parallelism        │
│  Continuous batching · Paged attention · Speculative decoding   │
└───────────────────────────┬─────────────────────────────────────┘
                            │ kernel calls
┌───────────────────────────▼─────────────────────────────────────┐
│                        GPU COMPUTE LAYER                        │
│    NVIDIA device plugin · MIG · NVLink/RoCE · KEDA autoscale    │
└───────────────────────────┬─────────────────────────────────────┘
                            │ model weights loaded from
┌───────────────────────────▼─────────────────────────────────────┐
│                    MODEL STORAGE & REGISTRY                     │
│        MinIO (S3) · MLflow · Rook-Ceph · Init containers        │
└─────────────────────────────────────────────────────────────────┘

Cross-cutting concerns (all layers)
├── Observability: Prometheus · Grafana · Jaeger · Loki
└── Control plane: ArgoCD · Vault · Istio / Cilium · OPA Gatekeeper
```
Let’s walk through each layer in detail.
Layer 1: The API gateway
The gateway is the single entry point for all LLM traffic. It does a lot of work before a single token gets generated.
Why Kubernetes Gateway API (not legacy Ingress)?
The older Kubernetes Ingress API conflates infrastructure and application concerns. The Kubernetes Gateway API (GA and mature in 2026) separates them cleanly:
- GatewayClass — the infrastructure team owns the controller (Envoy Gateway, Kong, Traefik, etc.).
- Gateway — defines listeners, TLS, and ports.
- HTTPRoute — application/ML teams define routing per model endpoint.
This separation is essential in enterprises where platform and ML teams are distinct.
TLS termination
Use cert-manager for automated certificates (Let’s Encrypt for public endpoints or internal CA for private clusters). The gateway terminates TLS; internal east-west traffic uses mTLS via the service mesh.
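As a concrete sketch, a cert-manager Certificate for the gateway's public listener could look like the following (the issuer name, namespace, and hostname are illustrative assumptions):

```yaml
# Illustrative cert-manager Certificate; assumes a ClusterIssuer named
# "letsencrypt-prod" already exists in the cluster.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: llm-gateway-tls
  namespace: gateway
spec:
  secretName: llm-gateway-tls   # referenced by the Gateway's HTTPS listener
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - llm.example.com
```

The Gateway's HTTPS listener then references the llm-gateway-tls Secret via certificateRefs, and renewal is fully automated.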
Authentication and authorization
Use OIDC/OAuth2 with Keycloak (or equivalent). Map scopes to models for cost governance — for example, scope:large-model unlocks higher-tier inference and can gate access to expensive 70B+ parameter models.
Rate limiting
LLMs require dual limits:
- Requests per minute (protect against abuse).
- Tokens per hour/day (prevent cost overruns).
Implement via the Envoy rate-limit service (backed by Redis) or native plugins in Kong/Traefik. Deduct token usage asynchronously after generation, since the completion token count isn't known until the response finishes.
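A sketch of the corresponding Envoy rate-limit service configuration expressing both limits (the domain, descriptor keys, and numbers are illustrative):

```yaml
# Illustrative config for the envoyproxy/ratelimit service (Redis-backed).
domain: llm-gateway
descriptors:
  # Limit 1: requests per minute, keyed by authenticated user.
  - key: user_id
    rate_limit:
      unit: minute
      requests_per_unit: 60
  # Limit 2: hourly token budget; the gateway debits this counter
  # asynchronously once the completion's token count is known.
  - key: user_id
    descriptors:
      - key: resource
        value: tokens
        rate_limit:
          unit: hour
          requests_per_unit: 500000
```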
WAF, DDoS protection, and prompt guarding
Apply OWASP Core Rule Set (via Envoy Lua/WASM or ModSecurity) and a lightweight LLM Guard sidecar for prompt injection, PII scrubbing, and toxicity scoring.
Audit logging and distributed tracing
Emit OpenTelemetry spans for every request (user identity, model, token counts, cache status, latency breakdown). Ship to Jaeger + Prometheus + Loki.
Gateway implementation choices
| Option | Best for | Notes |
|---|---|---|
| Envoy Gateway | Performance & control | Strong Gateway API conformance; excellent for custom AI logic |
| Kong Gateway OSS | Plugin ecosystem | Rich out-of-box plugins for auth, rate limiting, AI features |
| Traefik | Simplicity & GitOps | Excellent Kubernetes-native experience |
All three support the Gateway API spec. Choose based on team expertise. Emerging options like Agentgateway add AI-specific routing (agent-to-agent traffic, tool call routing) that may matter for multi-agent workloads.
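As an illustration of the ownership split described above, here is a per-model HTTPRoute an ML team might own (the gateway, service, and path names are assumptions):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-70b-route
  namespace: inference
spec:
  parentRefs:
    - name: llm-gateway           # Gateway owned by the platform team
      namespace: gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/models/llama-3-70b
      backendRefs:
        - name: vllm-llama-3-70b  # inference Service owned by the ML team
          port: 8000
```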
Layer 2: The LLM cache
Cache hits eliminate GPU work entirely — this is the highest-leverage optimization in the stack.
Three tiers of caching
1. Exact-match cache (Redis/Valkey)
SHA256 of (model + messages + parameters). Great for FAQs, repeatable batch jobs, and any deterministic prompt patterns. Typical hit rate: 5–20%.
2. Semantic cache (GPTCache + Qdrant, or Redis with vector search)
Embedding-based cosine similarity lookup. Tune the threshold carefully — 0.92–0.95 for factual work, disabled for creative tasks. Typical hit rate: 15–40%. The tradeoff is a small added latency for the embedding lookup and risk of incorrect hits if your threshold is too loose.
3. vLLM prefix KV cache (GPU-resident)
vLLM hashes KV blocks for shared system prompts or RAG contexts and reuses them across requests. This is the most powerful tier: for chatbot or RAG workloads where most requests share the same system prompt, prefix cache hit rates of 60–90% are common, dramatically cutting time-to-first-token (TTFT).
Combine all three tiers. The exact-match check is microseconds; the semantic check adds a few milliseconds; the prefix cache operates within vLLM transparently.
Layer 3: vLLM inference
vLLM remains the highest-throughput open-source inference engine for LLMs. Its core innovations are worth understanding because they directly shape how you size and operate it.
Paged attention
Traditional inference pre-allocates a contiguous KV cache block per sequence. vLLM uses paged attention — non-contiguous physical blocks allocated on demand — which eliminates memory fragmentation and enables much higher concurrent sequence counts on the same GPU memory.
Continuous batching
Rather than processing a fixed batch then starting a new one, vLLM uses continuous batching (also called iteration-level scheduling): new requests join the batch mid-iteration as soon as a slot frees. This eliminates GPU idle time between requests and is a primary driver of throughput improvement over naive serving.
Prefix caching
vLLM computes and caches KV blocks for common prefixes (system prompts, RAG contexts). Subsequent requests that share the prefix skip the prefill computation for those tokens entirely. At scale, this is often the difference between a cluster that fits in budget and one that doesn’t.
Parallelism strategies
For models exceeding single-GPU VRAM:
- Tensor parallelism: Splits weight matrices across GPUs within a node. Uses NVLink/NVSwitch for fast all-reduce. Scales to 8 GPUs per node efficiently.
- Pipeline parallelism: Splits model layers across nodes. Uses RoCE or InfiniBand for inter-node communication. Introduces pipeline bubbles but enables serving models that won’t fit on a single node.
Combine both for the largest models.
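A sketch of what combining both looks like in a vLLM container spec (image tag, model path, and sizes are illustrative; the flags are vLLM's standard parallelism options):

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest   # pin a specific tag in production
    args:
      - --model=/models/llama-3-70b
      - --tensor-parallel-size=8     # split weight matrices across the node's 8 GPUs
      - --pipeline-parallel-size=2   # split layers across 2 such nodes
    resources:
      limits:
        nvidia.com/gpu: 8
```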
Quantization
Production recommendations in approximate priority order:
- FP8 (H100/H200): Near-lossless quality, ~2× memory reduction, supported natively in hardware.
- AWQ: Excellent quality/size tradeoff for A100 and older hardware.
- GPTQ: Widely supported, slightly lower quality than AWQ at equivalent bit-width.
Leave 10–15% GPU memory headroom above your model’s requirements for KV cache and kernel overhead.
Context length as an operational lever
max_model_len — the maximum sequence length vLLM will accept — directly controls KV cache memory consumption per slot. Longer contexts consume proportionally more KV cache memory, which reduces the number of concurrent sequences the engine can hold and increases TTFT.
Set max_model_len deliberately rather than leaving it at the model’s architectural maximum. For most production workloads, a limit of 8k–32k tokens is sufficient and meaningfully improves throughput. Monitor vllm:gpu_cache_usage_perc — if it runs consistently above 85%, reducing max_model_len is often the fastest way to recover headroom before adding hardware.
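One way to operationalize that guidance is an alert on sustained KV cache pressure. A sketch assuming the Prometheus Operator CRDs are installed (note vLLM exports this metric as a 0–1 fraction):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-kv-cache-pressure
  namespace: monitoring
spec:
  groups:
    - name: vllm
      rules:
        - alert: VLLMKVCachePressure
          expr: avg(vllm:gpu_cache_usage_perc) > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "KV cache above 85%: reduce max_model_len or add replicas"
```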
Speculative decoding
A draft model generates candidate tokens at low cost; the target model verifies them in parallel. Effective for latency-sensitive workloads where output is predictable (code, structured data). Adds complexity — evaluate whether the latency gain justifies the operational overhead.
Advanced: disaggregated inference
For very large-scale deployments, consider disaggregated inference: separate prefill pods (compute-intensive, large batches) from decode pods (memory-bandwidth-intensive, streaming). The llm-d project implements this pattern and integrates with vLLM. It adds significant operational complexity but can substantially improve hardware utilization at scale.
Layer 4: GPU compute
NVIDIA GPU Operator
Install via Helm. It automates driver installation, device plugin deployment, DCGM exporter setup, and container toolkit configuration. Without it, GPU node management becomes a manual nightmare across OS upgrades and Kubernetes versions.
GPU sharing
- MIG (Multi-Instance GPU): Available on A100/H100+. Creates hardware-isolated partitions with dedicated memory slices and compute engines. The right choice when you need strict isolation between workloads (multi-tenant or mixed-criticality).
- Time-slicing: Software-level sharing configured via GPU Operator. Lower overhead than MIG, no memory isolation. Suitable for dev/test environments or homogeneous workloads.
High-speed interconnects
NVLink/NVSwitch for intra-node GPU communication (tensor parallelism collectives). RoCE v2 or InfiniBand for inter-node communication (pipeline parallelism and distributed training). For inference-only clusters, RoCE with RDMA is typically sufficient and significantly cheaper than InfiniBand.
Autoscaling with KEDA
CPU/GPU utilization is a poor signal for LLM autoscaling — a GPU can be 80% utilized but handling requests efficiently, or 20% utilized but with a growing queue. Use KEDA with a Prometheus trigger on vllm:num_requests_waiting.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm:num_requests_waiting{model="llama-3-70b"})
        threshold: "8"
```
Target approximately 5–10 pending requests per replica as your threshold. Combine with Cluster Autoscaler or Karpenter for node-level scaling (GPU nodes take 3–5 minutes to join; plan for that latency in your scale-out strategy).
Readiness probes and rolling deploys
vLLM takes several minutes to load weights before it can serve requests — 5–10 minutes for a 70B model depending on storage throughput. Without a correctly configured readiness probe, Kubernetes will route traffic to pods that are still loading and immediately return errors.
Configure the readiness probe against vLLM’s /health endpoint with a sufficiently long initialDelaySeconds (or use startupProbe with a high failureThreshold to avoid timing fights). Set terminationGracePeriodSeconds long enough to drain in-flight streaming responses — 120–300 seconds is typical. During a rolling deploy, the old replica keeps serving until the new one passes its readiness check.
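A sketch of that probe configuration (port and timings are illustrative; /health is vLLM's standard health endpoint):

```yaml
spec:
  terminationGracePeriodSeconds: 300   # drain in-flight streaming responses
  containers:
    - name: vllm
      ports:
        - containerPort: 8000
      startupProbe:
        httpGet:
          path: /health
          port: 8000
        periodSeconds: 10
        failureThreshold: 90           # tolerates up to ~15 min of weight loading
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        periodSeconds: 5
```

The startupProbe suppresses the readinessProbe until the first success, which avoids tuning a fragile initialDelaySeconds against variable weight-load times.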
Graceful scale-down
KEDA’s default scale-down is aggressive. An in-progress streaming response can be 60+ seconds; a pod termination that fires mid-stream silently drops the connection. Set scaleDown.stabilizationWindowSeconds in the ScaledObject (300–600 seconds is reasonable) and configure a preStop lifecycle hook that waits for active connections to drain before the pod accepts SIGTERM. Pair this with minReplicaCount: 1 to prevent complete scale-to-zero for latency-sensitive endpoints.
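Both knobs sketched together (window length and sleep duration are illustrative):

```yaml
# ScaledObject fragment: slow down the scale-down decision itself.
spec:
  minReplicaCount: 1
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600
---
# Pod spec fragment: delay SIGTERM so active streams can finish.
# A fixed sleep is a crude stand-in for a real connection-drain check.
containers:
  - name: vllm
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 120"]
```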
Layer 5: Model storage and registry
Getting weights into MinIO
Most teams source weights from Hugging Face Hub. The practical pipeline: download once to a secure jump host or CI runner (huggingface-cli download or hf_transfer for speed), verify the SHA256 checksum against the Hub’s published value, apply your quantization step if needed (AWQ/FP8 conversion offline before serving), then push to MinIO with versioned paths (/models/llama-3-70b/awq-4bit/v1.2/). Never pull directly from Hugging Face into production inference pods — it bypasses your checksum verification, creates an external dependency at pod startup, and is slow.
For air-gapped environments, mirror the weights to an internal registry during the ingestion pipeline and gate on checksum + license validation before the weights are considered promotion-eligible in MLflow.
Object storage (MinIO)
Store model weights in MinIO (S3-compatible). Version with prefixed paths (/models/llama-3-70b/v2.1/). Use init containers to pull weights into a shared volume before the inference pod starts, or mount directly via a CSI driver.
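An init-container sketch using the MinIO client (the endpoint, bucket path, and secret name are illustrative):

```yaml
initContainers:
  - name: fetch-weights
    image: minio/mc:latest
    command: ["sh", "-c"]
    args:
      - |
        mc alias set store http://minio.storage.svc:9000 "$ACCESS_KEY" "$SECRET_KEY" &&
        mc mirror store/models/llama-3-70b/v2.1/ /models/
    envFrom:
      - secretRef:
          name: minio-credentials   # provides ACCESS_KEY / SECRET_KEY
    volumeMounts:
      - name: model-volume          # emptyDir or node-local PV shared with the vLLM container
        mountPath: /models
```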
Model registry (MLflow)
Track model lineage, quantization config, evaluation results, and deployment history. Enables A/B testing via weighted routing in the gateway. Integrate with your CI/CD pipeline so model promotions are gated on evaluation thresholds.
Persistent storage (Rook-Ceph)
For RWX access patterns (multiple inference pods reading weights simultaneously), Rook-Ceph provides a self-managed distributed filesystem. Alternatively, MinIO with parallel downloads from multiple replicas works well if you cache locally on the node.
Always verify checksums after weight downloads — a corrupted weight file produces subtle, hard-to-diagnose inference errors.
Layer 6: Observability
LLMs have different failure modes than typical services. Your dashboards need to reflect that.
Key vLLM metrics
| Metric | What it tells you |
|---|---|
| vllm:num_requests_waiting | Queue depth — primary autoscaling signal |
| vllm:gpu_cache_usage_perc | KV cache pressure — if consistently >85%, add replicas or reduce context length |
| vllm:prefix_cache_hit_rate | Prefix cache effectiveness |
| vllm:e2e_request_latency | End-to-end latency histogram |
| vllm:time_to_first_token | TTFT — user-perceived responsiveness |
| vllm:time_per_output_token | TPOT — streaming speed |
GPU metrics (DCGM exporter)
Monitor GPU utilization, memory bandwidth, NVLink throughput, and GPU temperature. Throttling events (SM clock drops) indicate thermal or power issues that degrade throughput without obvious errors.
Tracing and logging
Propagate trace context from the gateway through to vLLM. Log request_id, user_id, model, token counts (prompt + completion), cache tier hit (exact/semantic/prefix), TTFT, TPOT, and any guardrail flags. This data is essential for cost attribution and debugging latency regressions.
Layer 7: The platform control plane
GitOps (ArgoCD / Flux)
Every cluster resource — deployments, ScaledObjects, policies, secrets references — lives in Git. ArgoCD syncs it. No manual kubectl apply in production. This makes rollbacks, audits, and multi-cluster management tractable.
Secrets management (HashiCorp Vault)
Dynamic secrets for database credentials, API keys, and model registry tokens. Use the Vault Agent sidecar or External Secrets Operator to inject secrets as environment variables or files. Avoid Kubernetes Secrets for anything sensitive — they’re base64-encoded in etcd, not encrypted by default.
Service mesh (Istio / Cilium)
mTLS for all east-west traffic. Zero-trust: pods cannot communicate unless explicitly permitted. Cilium (eBPF-based) has lower overhead than Istio’s sidecar model and is the better choice for latency-sensitive inference traffic. Use Istio if you need its traffic management features (circuit breaking, retry policies, mirroring).
Policy (OPA Gatekeeper)
Admission control policies that enforce:
- All GPU workloads must have resource limits set.
- No :latest image tags in production namespaces.
- All pods must carry cost-attribution labels (team, model, environment).
- GPU nodes must have the appropriate taint/toleration pair.
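For example, the cost-attribution rule can be expressed with the Gatekeeper library's K8sRequiredLabels template (this assumes that template is installed; the constraint name and namespace are illustrative):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-cost-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["inference"]
  parameters:
    labels:
      - key: team
      - key: model
      - key: environment
```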
Network policies
Restrict pod communication explicitly. Inference pods should only accept traffic from the gateway and observability scrape jobs. They should only egress to the model registry and observability collectors. Default-deny egress on inference namespaces prevents data exfiltration.
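A sketch of the ingress half of that posture (pod labels and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-allow-gateway
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gateway
      ports:
        - protocol: TCP
          port: 8000
```

A matching default-deny egress policy, plus explicit allows for the model registry and observability collectors, completes the posture.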
Full request lifecycle
Putting it together, a single inference request flows through:
1. TLS termination at the gateway — client presents its token.
2. OIDC validation — Keycloak confirms identity, scope checked against the requested model.
3. Rate limit check — token budget and request rate verified in Redis.
4. Prompt guard — LLM Guard sidecar scans for injection and PII.
5. Exact-match cache check — SHA256 lookup in Redis.
6. Semantic cache check — embedding lookup in Qdrant (on exact-match miss).
7. vLLM routing — request forwarded to the least-loaded replica.
8. Prefix cache check — vLLM checks KV block hashes for a shared prefix.
9. Prefill — prompt tokens processed, KV cache populated.
10. Decode — tokens generated and streamed back via SSE.
11. Token accounting — usage logged asynchronously for cost attribution.
12. Trace closed — span exported to Jaeger with full latency breakdown.
Latency and cost are controlled at steps 5, 6, and 8. Observability at every step.
When to evolve beyond pure vLLM deployments
The stack described here is ideal for focused, high-performance serving of a small number of models. For larger-scale scenarios, consider layering additional tooling:
KServe (with vLLM runtime or LLMInferenceService) adds a standardized control plane for multi-model governance, canary rollouts, and heterogeneous workloads (LLMs + embeddings + vision models). It keeps vLLM as the inference engine while providing higher-level abstractions for model lifecycle management.
llm-d adds advanced distributed routing and disaggregated prefill/decode separation on top of vLLM. Worth evaluating when you have dedicated hardware for prefill compute and want to maximize utilization of decode capacity separately.
These are additive layers — they don’t replace vLLM, they orchestrate it.
Gaps worth filling for your environment
This guide covers the core serving platform. Depending on your context, you’ll also need to address:
- Multi-tenancy: Dedicated namespaces per team (strong isolation, higher overhead) vs. shared inference pool with metering at the gateway (better utilization, more complex chargeback).
- Fine-tuning: Separate GPU node pools with tools like Axolotl or LitGPT. Never share fine-tuning and inference workloads on the same nodes — memory pressure from training jobs will degrade inference latency unpredictably.
- Multi-region / HA: Global load balancing across clusters + MinIO cross-region replication for weight distribution.
- Cost attribution: Token metering at the gateway + GPU-hour tracking via OpenCost or custom labels. Essential for chargeback and for identifying which teams/models are driving cost.
- A/B testing: Weighted routing in Gateway API HTTPRoute + Grafana dashboards comparing TTFT/TPOT/quality metrics between model versions.
- Security hardening: Image scanning in CI, runtime sandboxes (gVisor or Kata Containers for multi-tenant isolation), and prompt/response logging policies that avoid persisting sensitive data.
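To make the A/B testing item concrete, weighted routing in a Gateway API HTTPRoute looks like this (service names and weights are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-70b-canary
  namespace: inference
spec:
  parentRefs:
    - name: llm-gateway
      namespace: gateway
  rules:
    - backendRefs:
        - name: vllm-llama-3-70b-v2-0
          port: 8000
          weight: 90                 # 90% of traffic to the incumbent
        - name: vllm-llama-3-70b-v2-1
          port: 8000
          weight: 10                 # 10% canary
```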
Summary: the full open-source stack
| Layer | Function | Tools |
|---|---|---|
| API Gateway | Auth, rate limiting, routing | Envoy / Kong / Traefik + Gateway API |
| Identity | Authn/Authz | Keycloak |
| Caching | Reduce GPU compute | Redis, Qdrant / GPTCache, vLLM prefix |
| Inference | Token generation | vLLM (core) — optional KServe / llm-d |
| GPU management | Resource scheduling | NVIDIA GPU Operator, MIG / time-slicing, KEDA |
| Model storage | Weight distribution | MinIO, Rook-Ceph, MLflow |
| Observability | Metrics / traces / logs | Prometheus, Grafana, Jaeger, Loki |
| GitOps | Config management | ArgoCD / Flux |
| Secrets | Credential management | HashiCorp Vault |
| Network security | Zero-trust mTLS | Istio / Cilium |
| Policy | Admission control | OPA Gatekeeper |
This stack delivers full data sovereignty, predictable costs, and high performance with no per-token pricing.
The operational investment is real. You need platform engineers who understand Kubernetes, GPU workloads, and distributed systems. But for teams with strict compliance requirements, high inference volume, or custom/fine-tuned models, self-hosting on Kubernetes is overwhelmingly worthwhile — and the tooling has matured to the point where the gap with managed services has narrowed significantly.