MinIO MemKV

Purpose-Built Context Memory Store for Inference
“The industry has been papering over context loss for years. At small scale, you absorb the recompute tax and move on. At large scale, it breaks you — a GPU recomputing context it has already generated is burning memory bandwidth, adding latency, and running up cost with nothing to show for it. At a thousand GPUs or more, that's not inefficiency but a structural constraint.”
— AB Periasamy, Co-Founder and CEO, MinIO
  • 75x faster time-to-first-token at production concurrency
  • 90%+ GPU utilization
  • 40–60% lower cost per token in production inference clusters
  • Petabyte-scale shared KV cache capacity across the entire cluster

The Challenge

Modern AI inference has hit a memory wall. As LLMs evolve into multi-step agentic workflows, KV cache — the working memory of every active inference request — has grown from thousands to millions of tokens. Today’s infrastructure was not designed for this workload:


  • GPU HBM is physically constrained at 80 GB per H100, while long-context sessions consume 20 GB+ of KV cache each, forcing constant eviction under production concurrency (see the sizing sketch after this list).
  • Evicted cache must be recomputed from scratch, cutting effective GPU utilization to 30–60% in memory-bound deployments: the most expensive infrastructure in the stack generates tokens less than half the time.
  • General-purpose shared storage runs at millisecond latency, roughly 50x slower than the G3.5 tier and incompatible with real-time inference. Node-local SSDs isolate cache to a single GPU, forcing session pinning and fragmenting cluster capacity.
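To make the 20 GB figure concrete, here is a rough sizing sketch. The model shape below (a 70B-class transformer with grouped-query attention) is an assumption for illustration, not a configuration from this brief:

```python
# Back-of-envelope KV cache sizing for one long-context session.
# Assumed model shape (illustrative): 80 layers, 8 KV heads (GQA),
# head dimension 128, FP16 cache.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

# K and V tensors per layer -> 2 * layers * kv_heads * head_dim * bytes
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"{bytes_per_token / 1024:.0f} KiB per token")   # ~320 KiB

context_tokens = 64 * 1024                             # one 64K-token agentic session
gib = bytes_per_token * context_tokens / 2**30
print(f"{gib:.1f} GiB per session")                    # ~21.5 GiB
# A handful of such sessions fills an H100's 80 GB HBM
# before model weights are even accounted for.
```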

Architecture

G3.5 tier · RDMA/GPUDirect data path · NVIDIA Dynamo / NIXL integration

Solution Overview

MinIO MemKV is a purpose-built context memory store for inference and occupies the G3.5 tier: the missing layer between node-local SSDs and general-purpose shared storage. It provides a shared, petabyte-scale KV cache pool accessible by every GPU in the cluster, with microsecond retrieval latency over end-to-end RDMA and GPUDirect Storage. MemKV runs as a single native ARM64 or x86 binary embedded in the storage tier; data moves directly from NVMe to GPU HBM, bypassing CPU overhead and kernel network stacks. Native support for NVIDIA Dynamo, NIXL, vLLM, and LMCache requires no changes to model code.
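To show what "no changes to model code" looks like in practice, here is a hedged sketch of wiring vLLM to an external KV cache tier through LMCache. The MemKV endpoint, URL scheme, and YAML values are assumptions for illustration; consult the MemKV documentation for the actual connector configuration:

```python
# Hedged sketch: offloading vLLM's KV cache to a shared tier via LMCache.
import os

# LMCache reads its settings from a YAML file named by this env var.
# Illustrative lmcache-memkv.yaml (the memkv:// scheme is an assumption):
#   chunk_size: 256
#   remote_url: "memkv://memkv.example.internal:9000"
#   remote_serde: "naive"
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache-memkv.yaml"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # vLLM's LMCache connector
        kv_role="kv_both",                  # save and load KV blocks
    ),
)

# Model code is untouched: prefix KV blocks are written to and fetched
# from the shared tier transparently on each generate() call.
print(llm.generate(["Summarize the session so far."],
                   SamplingParams(max_tokens=64)))
```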


Key Highlights
  • End-to-end RDMA transport with GPU-native block sizes
  • Wire-speed fabric performance
  • Petabyte-scale KV cache capacity: compute and context memory scale independently, so you expand KV cache by adding flash nodes, not GPUs
  • No session pinning: a shared cache pool lets any GPU serve any request, enabling true load balancing and fault tolerance (see the keying sketch after this list)
  • Native to the NVIDIA inference stack: Dynamo, NIXL, vLLM, and LMCache supported out of the box
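The session-pinning point follows from how shared caches are typically keyed. A hedged sketch, assuming content-addressed keys derived from the token prefix (a common scheme in LMCache-style caches; MemKV's actual keying is not specified in this brief):

```python
# If KV blocks are keyed by a rolling hash of the token prefix, any GPU
# computes the same keys for the same session and can find its blocks in
# the shared pool -- no affinity between session and GPU is required.
import hashlib

CHUNK = 256  # tokens per cached block (assumed)

def block_keys(token_ids: list[int]) -> list[str]:
    """Content-addressed key per full prefix chunk."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % CHUNK, CHUNK):
        h.update(repr(token_ids[i:i + CHUNK]).encode())
        keys.append(h.hexdigest()[:16])  # each key covers the whole prefix so far
    return keys

# Two different GPUs derive identical keys for the same prefix, so a
# request can be load-balanced or failed over to any node and still hit.
assert block_keys(list(range(512))) == block_keys(list(range(512)))
```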

Business Value

Quantifiable Outcomes
  • 75x faster time-to-first-token: prior context is retrieved from flash over RDMA in microseconds rather than rebuilt through prefill, at any context length or session history.
  • 90%+ GPU utilization (vs. a 30–60% baseline): KV cache is offloaded to shared flash, so each GPU serves more concurrent requests without memory pressure forcing evictions.
  • 40–60% lower cost per token: higher utilization plus reduced recompute plus the flash-vs-HBM cost differential; context capacity scales independently of compute.
  • $2M recovered per 128-GPU cluster per year: from raising GPU utilization from 50% to 90%+ on a 128-GPU, 128K-token deployment; the figure scales with cluster size and context length (a worked example follows this list).
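Here is how the $2M figure can pencil out, as a hedged back-of-envelope. The hourly GPU rate is an assumption, not a number from this brief; substitute your own blended cost:

```python
# Value of GPU capacity recovered by raising utilization from 50% to 90%.
gpus = 128
util_before, util_after = 0.50, 0.90
hours_per_year = 8760
cost_per_gpu_hour = 4.50   # assumed blended $/GPU-hour (cloud or amortized capex)

recovered_hours = gpus * (util_after - util_before) * hours_per_year
value = recovered_hours * cost_per_gpu_hour
print(f"{recovered_hours:,.0f} GPU-hours ~= ${value / 1e6:.1f}M per year")
# -> 448,512 GPU-hours ~= $2.0M per year
```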
MinIO MemKV joins AIStor Objects and Tables as the third capability in the MinIO portfolio, covering training, analytics, and inference context from one platform.
Get Started
  • Executive Deep Dive: a 30-minute briefing on AI inference economics, the recompute tax, and MinIO MemKV’s role in the AI stack.
  • Technical Deep Dive: a 60-minute session covering the G3.5 architecture, the RDMA/GPUDirect data path, and NVIDIA Dynamo/NIXL integration.
  • Live Demo: a hands-on demonstration of MemKV running inference workloads, showing TTFT and GPU utilization improvements against a baseline.
  • Proof of Value: a structured evaluation in your environment with defined success criteria, benchmark methodology, and a documented TCO comparison.
