Modern AI inference has hit a memory wall. As LLM applications evolve into multi-step agentic workflows, the KV cache (the working memory of every active inference request) has grown from thousands to millions of tokens. Today's infrastructure was not designed for this workload:
*Figure: G3.5 tier · RDMA/GPUDirect data path · NVIDIA Dynamo / NIXL integration*
MinIO MemKV is a purpose-built context memory store for inference, occupying the G3.5 tier: the missing layer between node-local SSDs and general-purpose shared storage. It provides a shared, petabyte-scale KV cache pool accessible by every GPU in the cluster, with microsecond retrieval latency over end-to-end RDMA and GPUDirect Storage. MemKV runs as a single native ARM64 or x86 binary embedded in the storage tier, so data moves directly from NVMe to GPU HBM, bypassing the CPU and the kernel I/O stack. Native support for NVIDIA Dynamo, NIXL, vLLM, and LMCache means no changes to model code.
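To make the integration claim concrete, here is a minimal sketch of how a vLLM engine might attach to a shared KV pool through LMCache's remote backend. The endpoint URL, port, and model name are placeholders, and the exact MemKV wiring is an assumption on our part; the connector name and configuration knobs are the standard ones vLLM and LMCache expose.

```python
# Hypothetical sketch: pointing vLLM's LMCache connector at a shared,
# cluster-wide KV cache pool. The endpoint and model are placeholders.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache reads its configuration from LMCACHE_* environment variables.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"      # tokens per stored KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "False"     # skip the local CPU-RAM tier
# Assumed endpoint for the shared pool; scheme and host are illustrative.
os.environ["LMCACHE_REMOTE_URL"] = "lm://memkv.example.internal:65432"
os.environ["LMCACHE_REMOTE_SERDE"] = "naive"  # raw tensors, no compression

# The LMCache connector ships with vLLM; kv_role="kv_both" lets this
# engine both publish KV blocks to the pool and fetch cache hits from it.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

out = llm.generate(
    ["Why did agentic workloads hit the memory wall?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(out[0].outputs[0].text)
```

Because the pool sits behind the connector, prefill results computed on one GPU become reusable by any other engine pointed at the same endpoint; model code and serving logic stay untouched.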