
An open table format is a standardized metadata layer that sits on top of data files in object storage. Think of it as the organizational system that transforms a collection of Parquet or ORC files into something you can query and manage like a database table. The format tracks which files belong to which table, maintains schema information, and handles all the coordination that lets multiple tools read and write the same data without stepping on each other.
Here's the key distinction: file formats like Parquet store your actual data—the bytes on disk. An open table format manages the metadata about that data. When Spark or Trino queries your table, it talks to the table format first to understand what data exists and where to find it, then reads the actual bytes from those Parquet files.
This article explains how open table formats work, their core capabilities like ACID transactions and time travel, the major technologies in the space, and what you need to know when implementing them on object storage.
A data lakehouse architecture stacks three layers. Object storage sits at the bottom (scalable, durable, and cost-effective). Your open table format lives in the middle as the metadata layer. Compute engines like Spark, Presto, or Dremio sit on top, reading and writing through the table format's interface.
This separation of storage and compute changes everything. You're not locked into a single analytics platform because the data and its metadata live independently. Any engine that understands the table format can work with your data.
File formats and table formats solve different problems, and you need both. Parquet handles compression, encoding, and how data is physically arranged on disk. It's optimized for columnar storage and fast reads. An open table format, by contrast, organizes those Parquet files into logical tables.
When you write data to a lake using an open table format, the format creates metadata files, managed through catalog systems, that track which data files belong to the table, the table's current schema, how the data is partitioned, and per-file statistics such as row counts and column value ranges.
Your compute engine reads this metadata first, figures out which files it needs, then pulls the actual data from those files. The table format coordinates everything so multiple engines see consistent data.
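To make this concrete, here's a minimal sketch of inspecting that metadata, assuming PySpark with an Iceberg catalog named lake and a hypothetical table lake.db.events; Iceberg exposes its bookkeeping through metadata tables such as snapshots and files:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "lake"
# (one way to configure it is shown later in this article).
spark = SparkSession.builder.appName("inspect-metadata").getOrCreate()

# Snapshot history: one row per committed write to the table.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots"
).show()

# Data files tracked by the current snapshot, with per-file statistics.
spark.sql(
    "SELECT file_path, record_count, file_size_in_bytes FROM lake.db.events.files"
).show()
```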
Open table formats bring database capabilities to data lakes through metadata management.
Schema evolution lets you change table structures without rewriting data. You can add columns, rename fields, or change data types, and the table format tracks these changes through versioned metadata. Queries automatically adapt—older data gets read with the old schema, newer data with the updated one.
This matters because data changes over time. Your application adds a new field, or you realize a column needs a different type. Without schema evolution, you'd rewrite terabytes of data. With it, you just update metadata.
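As a rough sketch, here's what those metadata-only changes look like in Spark SQL against the hypothetical Iceberg table from above; the column names are made up, and the type change relies on Iceberg's safe type promotions (such as int to bigint):

```python
# Add, rename, and widen columns; no data files are rewritten.
spark.sql("ALTER TABLE lake.db.events ADD COLUMNS (referrer string)")
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN ip TO client_ip")
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN quantity TYPE bigint")
```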
ACID transactions (atomicity, consistency, isolation, durability) prevent the chaos that happens when multiple processes write simultaneously. The table format coordinates concurrent operations, preventing conflicts and corruption. If a write fails halfway through, the format rolls back to the last stable state. Your data stays consistent.
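For example, an upsert expressed as a single MERGE either commits as one new snapshot or not at all. The sketch below assumes a DataFrame named updates holding changed rows keyed by a hypothetical event_id column:

```python
# Register the incoming changes as a view the SQL statement can reference.
updates.createOrReplaceTempView("updates")

# The whole MERGE is atomic: readers see either the old snapshot or the new one.
spark.sql("""
    MERGE INTO lake.db.events t
    USING updates u
    ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```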
Every write creates a new snapshot without modifying existing files. The table format keeps a history of these snapshots, letting you query data as it existed at any point in the past. You can audit changes, debug pipeline issues, or roll back bad writes.
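With Iceberg and Spark, for instance, time travel and rollback look roughly like this; the timestamp and snapshot id are placeholders you'd take from the table's snapshot history:

```python
# Query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2024-03-01 00:00:00'"
).show()

# Or pin a specific snapshot id from the snapshots metadata table.
spark.sql("SELECT count(*) FROM lake.db.events VERSION AS OF 8713422452563399085").show()

# Roll the table back to a known-good snapshot.
spark.sql("CALL lake.system.rollback_to_snapshot('db.events', 8713422452563399085)")
```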
Table formats record file-level statistics, such as per-column minimum and maximum values, alongside partition information. Query engines use this metadata to skip files that can't contain relevant data. If you're filtering for dates in March and a file only holds February data, the engine never reads it. This metadata pruning dramatically reduces the data scanned by large queries.
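In practice the pruning is invisible to the query itself; a plain filter is enough for the engine to skip non-matching files, as in this sketch (event_date is a hypothetical column):

```python
# Only files whose metadata says they may contain March dates are read;
# everything else is pruned before any Parquet bytes are fetched.
spark.sql("""
    SELECT user_id, count(*) AS events
    FROM lake.db.events
    WHERE event_date >= DATE '2024-03-01' AND event_date < DATE '2024-04-01'
    GROUP BY user_id
""").show()
```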
Three formats dominate the lakehouse landscape, each with different strengths.
Apache Iceberg uses a tree structure of metadata files that scales to massive tables with extensive snapshot history. The format separates metadata from data completely, enabling efficient operations at scale. Iceberg has broad ecosystem support—major cloud providers, compute engines, and analytics tools all implement native Iceberg support.
Delta Lake originated at Databricks and historically integrated tightly with Spark. The format uses a transaction log to maintain consistent views and coordinate writes. Delta Lake has evolved to support broader engine compatibility beyond Spark, though its roots show in how it handles transactions.
Apache Hudi optimizes for frequent updates and streaming scenarios. The format offers write-optimized and read-optimized storage modes—you pick based on your access patterns. Hudi includes indexing capabilities like Bloom filters that accelerate lookups in update-heavy workloads.
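As a rough illustration, a Hudi upsert written from Spark picks the table type at write time; the table name, key fields, and path below are hypothetical:

```python
# `df` is assumed to hold new and changed rows keyed by event_id.
# MERGE_ON_READ favors fast writes; COPY_ON_WRITE favors fast reads.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3a://lake/hudi/events")
)
```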
The capabilities across formats have been converging. Features that were unique to one format often appear in others over time, reflecting how the ecosystem is maturing toward interoperability.
Open table formats solve real infrastructure problems.
These formats are open standards—any engine that implements the specification can read and write your tables. You're not locked into a single vendor's ecosystem. If you want to switch from Spark to Trino, or add Dremio for BI queries, your data doesn't move. The formats enable "write once, query anywhere" across your analytics stack.
Beyond metadata pruning, file-level statistics let query engines apply filters during file selection rather than after reading gigabytes of data. Apache Iceberg also supports partition evolution (changing a table's partitioning scheme without rewriting existing data), though this capability varies across formats, with Delta Lake offering more limited support for partition changes.
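In Iceberg, for example, a partition change is a metadata operation; this sketch assumes the Iceberg Spark SQL extensions are enabled and a timestamp column named event_ts:

```python
# Switch from daily to hourly partitioning; existing files keep their old
# layout, and only new writes use the new partition spec.
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD hours(event_ts)")
```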
Operations that were painful in traditional data lakes become straightforward. You can update or delete specific rows without rewriting entire partitions. Schema changes don't require coordinated updates across all consumers—the format handles versioning. Time travel capabilities let you access previous versions of your data without scanning or reprocessing files.
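Row-level changes read as ordinary SQL; here's a sketch against the same hypothetical table, where user_id, status, and event_date are made-up columns:

```python
# Delete and update specific rows; the table format rewrites or tracks only
# the affected data files, not entire partitions.
spark.sql("DELETE FROM lake.db.events WHERE user_id = 'u-123'")
spark.sql(
    "UPDATE lake.db.events SET status = 'archived' WHERE event_date < DATE '2023-01-01'"
)
```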
Table formats reduce storage costs through smart data management. Incremental updates avoid full table rewrites. File compaction combines small files into larger ones, optimizing both storage efficiency and query performance.
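With Iceberg and Spark, for instance, compaction and snapshot cleanup are exposed as stored procedures; the target file size and cutoff timestamp here are just examples:

```python
# Compact small data files into roughly 512 MiB files.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire old snapshots so their unreferenced data files can be removed.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```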
Object storage provides the foundation for open table formats because of its scalability and cost characteristics. The formats are designed specifically for object storage semantics—immutable objects, strong read-after-write consistency, and different latency profiles than block storage.
The S3 API has become the de facto standard, creating a consistent interface for object storage. This combination creates a portable, vendor-neutral platform that works consistently whether you're on-premises, in AWS, or in another cloud.
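As one illustration, the same Spark session can target any S3-compatible endpoint; everything below (endpoint, credentials, bucket, catalog name) is a placeholder, and real deployments typically supply credentials and choose a catalog implementation (Hive, REST, Glue, Nessie) differently:

```python
from pyspark.sql import SparkSession

# Requires the Iceberg Spark runtime and hadoop-aws jars on the classpath.
spark = (
    SparkSession.builder.appName("lakehouse")
    # Point the S3A connector at an S3-compatible object store.
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Register an Iceberg catalog named "lake" backed by that storage.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .getOrCreate()
)
```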
When deploying table formats on object storage, understanding the storage layer's operational characteristics helps you architect reliable infrastructure.
Open table formats bring database capabilities to data lakes through a metadata layer that coordinates access across multiple engines. The formats enable ACID transactions, schema evolution, time travel, and performance optimization on data stored in object storage. This combination (open standards, database-like features, and object storage economics) provides the foundation for modern analytical workloads at scale.
Request a free trial of MinIO AIStor to see how our object store could work for you.