MinIO MemKV

Purpose-Built Context Memory Store for Inference
“The industry has been papering over context loss for years. At small scale, you absorb the recompute tax and move on. At large scale, it breaks you — a GPU recomputing context it has already generated is burning memory bandwidth, adding latency, and running up cost with nothing to show for it. At a thousand GPUs or more, that's not inefficiency but a structural constraint.”
— AB Periasamy, Co-Founder and CEO, MinIO
  • 75x faster time-to-first-token at production concurrency
  • 90%+ GPU utilization
  • 40–60% lower cost per token in production inference clusters
  • Petabyte-scale shared KV cache capacity across the entire cluster

The Challenge

Modern AI inference has hit a memory wall. As LLMs evolve into multi-step agentic workflows, KV cache — the working memory of every active inference request — has grown from thousands to millions of tokens. Today’s infrastructure was not designed for this workload:


  • GPU HBM is physically constrained at 80 GB per H100, while long-context sessions consume 20 GB+ of KV cache each, forcing constant eviction under production concurrency (see the sizing sketch after this list).
  • Evicted cache must be recomputed from scratch, cutting effective GPU utilization to 30–60% in memory-bound deployments: the most expensive infrastructure in the stack generates tokens less than half the time.
  • General-purpose shared storage runs at millisecond latency, roughly 50x slower than the G3.5 tier and incompatible with real-time inference. Node-local SSDs isolate cache to a single GPU, forcing session pinning and fragmenting cluster capacity.
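To make the 20 GB figure concrete, here is a rough sizing sketch. The model shape below (a 70B-class transformer with grouped-query attention) is an assumption for illustration, not a configuration from this brief:

```python
# Back-of-envelope KV cache sizing for one long-context session.
# Assumed model shape (illustrative): 80 layers, 8 KV heads (GQA),
# head dimension 128, FP16 cache.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

# K and V tensors per layer -> 2 * layers * kv_heads * head_dim * bytes
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"{bytes_per_token / 1024:.0f} KiB per token")   # ~320 KiB

context_tokens = 64 * 1024                             # one 64K-token agentic session
gib = bytes_per_token * context_tokens / 2**30
print(f"{gib:.1f} GiB per session")                    # ~21.5 GiB
# A handful of such sessions fills an H100's 80 GB HBM
# before model weights are even accounted for.
```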

Architecture

G3.5 tier · RDMA/GPUDirect data path · NVIDIA Dynamo / NIXL integration

Solution Overview

MinIO MemKV is a purpose-built context memory store for inference and occupies the G3.5 tier: the missing layer between node-local SSDs and general-purpose shared storage. It provides a shared, petabyte-scale KV cache pool accessible by every GPU in the cluster, with microsecond retrieval latency over end-to-end RDMA and GPUDirect Storage. MemKV runs as a single native ARM64 or x86 binary embedded in the storage tier; data moves directly from NVMe to GPU HBM, bypassing CPU overhead and kernel network stacks. Native support for NVIDIA Dynamo, NIXL, vLLM, and LMCache requires no changes to model code.
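To show what "no changes to model code" looks like in practice, here is a hedged sketch of wiring vLLM to an external KV cache tier through LMCache. The MemKV endpoint, URL scheme, and YAML values are assumptions for illustration; consult the MemKV documentation for the actual connector configuration:

```python
# Hedged sketch: offloading vLLM's KV cache to a shared tier via LMCache.
import os

# LMCache reads its settings from a YAML file named by this env var.
# Illustrative lmcache-memkv.yaml (the memkv:// scheme is an assumption):
#   chunk_size: 256
#   remote_url: "memkv://memkv.example.internal:9000"
#   remote_serde: "naive"
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache-memkv.yaml"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # vLLM's LMCache connector
        kv_role="kv_both",                  # save and load KV blocks
    ),
)

# Model code is untouched: prefix KV blocks are written to and fetched
# from the shared tier transparently on each generate() call.
print(llm.generate(["Summarize the session so far."],
                   SamplingParams(max_tokens=64)))
```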


Key Highlights
  • End-to-end RDMA transport with GPU-native block sizes
  • Wire-speed fabric performance
  • Petabyte-scale KV cache capacity: compute and context memory scale independently, so you expand KV cache by adding flash nodes, not GPUs
  • No session pinning: a shared cache pool lets any GPU serve any request, enabling true load balancing and fault tolerance (see the keying sketch after this list)
  • Native to the NVIDIA inference stack: Dynamo, NIXL, vLLM, and LMCache supported out of the box
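The session-pinning point follows from how shared caches are typically keyed. A hedged sketch, assuming content-addressed keys derived from the token prefix (a common scheme in LMCache-style caches; MemKV's actual keying is not specified in this brief):

```python
# If KV blocks are keyed by a rolling hash of the token prefix, any GPU
# computes the same keys for the same session and can find its blocks in
# the shared pool -- no affinity between session and GPU is required.
import hashlib

CHUNK = 256  # tokens per cached block (assumed)

def block_keys(token_ids: list[int]) -> list[str]:
    """Content-addressed key per full prefix chunk."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % CHUNK, CHUNK):
        h.update(repr(token_ids[i:i + CHUNK]).encode())
        keys.append(h.hexdigest()[:16])  # each key covers the whole prefix so far
    return keys

# Two different GPUs derive identical keys for the same prefix, so a
# request can be load-balanced or failed over to any node and still hit.
assert block_keys(list(range(512))) == block_keys(list(range(512)))
```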

Business Value

Quantifiable Outcomes
  • 75x faster time-to-first-token: prior context is retrieved from flash over RDMA in microseconds rather than rebuilt through prefill, at any context length or session history.
  • 90%+ GPU utilization (vs. a 30–60% baseline): KV cache is offloaded to shared flash, so each GPU serves more concurrent requests without memory pressure forcing evictions.
  • 40–60% lower cost per token: higher utilization plus reduced recompute plus the flash-vs-HBM cost differential; context capacity scales independently of compute.
  • $2M recovered per 128-GPU cluster per year: from raising GPU utilization from 50% to 90%+ on a 128-GPU, 128K-token deployment; the figure scales with cluster size and context length (a worked example follows this list).
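Here is how the $2M figure can pencil out, as a hedged back-of-envelope. The hourly GPU rate is an assumption, not a number from this brief; substitute your own blended cost:

```python
# Value of GPU capacity recovered by raising utilization from 50% to 90%.
gpus = 128
util_before, util_after = 0.50, 0.90
hours_per_year = 8760
cost_per_gpu_hour = 4.50   # assumed blended $/GPU-hour (cloud or amortized capex)

recovered_hours = gpus * (util_after - util_before) * hours_per_year
value = recovered_hours * cost_per_gpu_hour
print(f"{recovered_hours:,.0f} GPU-hours ~= ${value / 1e6:.1f}M per year")
# -> 448,512 GPU-hours ~= $2.0M per year
```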
MinIO MemKV joins AIStor Objects and Tables as the third capability in the MinIO portfolio, covering training, analytics, and inference context from one platform.
Get Started
  • Executive Deep Dive: a 30-minute briefing on AI inference economics, the recompute tax, and MinIO MemKV’s role in the AI stack.
  • Technical Deep Dive: a 60-minute session covering the G3.5 architecture, the RDMA/GPUDirect data path, and NVIDIA Dynamo/NIXL integration.
  • Live Demo: a hands-on demonstration of MemKV running inference workloads, showing TTFT and GPU utilization improvements against a baseline.
  • Proof of Value: a structured evaluation in your environment with defined success criteria, benchmark methodology, and a documented TCO comparison.
