
An open table format is a standardized metadata layer that sits on top of data files in object storage. Think of it as the organizational system that transforms a collection of Parquet or ORC files into something you can query and manage like a database table. The format tracks which files belong to which table, maintains schema information, and handles all the coordination that lets multiple tools read and write the same data without stepping on each other.
Here's the key distinction: file formats like Parquet store your actual data—the bytes on disk. An open table format manages the metadata about that data. When Spark or Trino queries your table, it talks to the table format first to understand what data exists and where to find it, then reads the actual bytes from those Parquet files.
This article explains how open table formats work, their core capabilities like ACID transactions and time travel, the major technologies in the space, and what you need to know when implementing them on object storage.
A data lakehouse architecture stacks three layers. Object storage sits at the bottom (scalable, durable, and cost-effective). Your open table format lives in the middle as the metadata layer. Compute engines like Spark, Presto, or Dremio sit on top, reading and writing through the table format's interface.
This separation of storage and compute changes everything. You're not locked into a single analytics platform because the data and its metadata live independently. Any engine that understands the table format can work with your data.
File formats and table formats solve different problems, and you need both. Parquet handles compression, encoding, and how data is physically arranged on disk. It's optimized for columnar storage and fast reads. An open table format, by contrast, organizes those Parquet files into logical tables.
When you write data to a lake using an open table format, the format creates metadata files, managed through catalog systems, that track which data files belong to the table, the table's current schema, how the data is partitioned, and per-file statistics such as row counts and column value ranges.
Your compute engine reads this metadata first, figures out which files it needs, then pulls the actual data from those files. The table format coordinates everything so multiple engines see consistent data.
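To make this concrete, here's a minimal sketch of inspecting that metadata, assuming PySpark with an Iceberg catalog named lake and a hypothetical table lake.db.events; Iceberg exposes its bookkeeping through metadata tables such as snapshots and files:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "lake"
# (one way to configure it is shown later in this article).
spark = SparkSession.builder.appName("inspect-metadata").getOrCreate()

# Snapshot history: one row per committed write to the table.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots"
).show()

# Data files tracked by the current snapshot, with per-file statistics.
spark.sql(
    "SELECT file_path, record_count, file_size_in_bytes FROM lake.db.events.files"
).show()
```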
Open table formats bring database capabilities to data lakes through metadata management.
Schema evolution lets you change table structures without rewriting data. You can add columns, rename fields, or change data types, and the table format tracks these changes through versioned metadata. Queries automatically adapt—older data gets read with the old schema, newer data with the updated one.
This matters because data changes over time. Your application adds a new field, or you realize a column needs a different type. Without schema evolution, you'd rewrite terabytes of data. With it, you just update metadata.
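As a rough sketch, here's what those metadata-only changes look like in Spark SQL against the hypothetical Iceberg table from above; the column names are made up, and the type change relies on Iceberg's safe type promotions (such as int to bigint):

```python
# Add, rename, and widen columns; no data files are rewritten.
spark.sql("ALTER TABLE lake.db.events ADD COLUMNS (referrer string)")
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN ip TO client_ip")
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN quantity TYPE bigint")
```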
ACID transactions (atomicity, consistency, isolation, durability) prevent the chaos that happens when multiple processes write simultaneously. The table format coordinates concurrent operations, preventing conflicts and corruption. If a write fails halfway through, the format rolls back to the last stable state. Your data stays consistent.
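For example, an upsert expressed as a single MERGE either commits as one new snapshot or not at all. The sketch below assumes a DataFrame named updates holding changed rows keyed by a hypothetical event_id column:

```python
# Register the incoming changes as a view the SQL statement can reference.
updates.createOrReplaceTempView("updates")

# The whole MERGE is atomic: readers see either the old snapshot or the new one.
spark.sql("""
    MERGE INTO lake.db.events t
    USING updates u
    ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```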
Every write creates a new snapshot without modifying existing files. The table format keeps a history of these snapshots, letting you query data as it existed at any point in the past. You can audit changes, debug pipeline issues, or roll back bad writes.
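With Iceberg and Spark, for instance, time travel and rollback look roughly like this; the timestamp and snapshot id are placeholders you'd take from the table's snapshot history:

```python
# Query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2024-03-01 00:00:00'"
).show()

# Or pin a specific snapshot id from the snapshots metadata table.
spark.sql("SELECT count(*) FROM lake.db.events VERSION AS OF 8713422452563399085").show()

# Roll the table back to a known-good snapshot.
spark.sql("CALL lake.system.rollback_to_snapshot('db.events', 8713422452563399085)")
```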
Table formats record file-level statistics, such as per-column minimum and maximum values, alongside partition information. Query engines use this metadata to skip files that can't contain relevant data. If you're filtering for dates in March and a file only holds February data, the engine never reads it. This metadata pruning dramatically reduces the data scanned by large queries.
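In practice the pruning is invisible to the query itself; a plain filter is enough for the engine to skip non-matching files, as in this sketch (event_date is a hypothetical column):

```python
# Only files whose metadata says they may contain March dates are read;
# everything else is pruned before any Parquet bytes are fetched.
spark.sql("""
    SELECT user_id, count(*) AS events
    FROM lake.db.events
    WHERE event_date >= DATE '2024-03-01' AND event_date < DATE '2024-04-01'
    GROUP BY user_id
""").show()
```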
Three formats dominate the lakehouse landscape, each with different strengths.
Apache Iceberg uses a tree structure of metadata files that scales to massive tables with extensive snapshot history. The format separates metadata from data completely, enabling efficient operations at scale. Iceberg has broad ecosystem support—major cloud providers, compute engines, and analytics tools all implement native Iceberg support.
Delta Lake originated at Databricks and historically integrated tightly with Spark. The format uses a transaction log to maintain consistent views and coordinate writes. Delta Lake has evolved to support broader engine compatibility beyond Spark, though its roots show in how it handles transactions.
Apache Hudi optimizes for frequent updates and streaming scenarios. The format offers write-optimized and read-optimized storage modes—you pick based on your access patterns. Hudi includes indexing capabilities like Bloom filters that accelerate lookups in update-heavy workloads.
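As a rough illustration, a Hudi upsert written from Spark picks the table type at write time; the table name, key fields, and path below are hypothetical:

```python
# `df` is assumed to hold new and changed rows keyed by event_id.
# MERGE_ON_READ favors fast writes; COPY_ON_WRITE favors fast reads.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3a://lake/hudi/events")
)
```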
The capabilities across formats have been converging. Features that were unique to one format often appear in others over time, reflecting how the ecosystem is maturing toward interoperability.
Open table formats solve real infrastructure problems.
These formats are open standards—any engine that implements the specification can read and write your tables. You're not locked into a single vendor's ecosystem. If you want to switch from Spark to Trino, or add Dremio for BI queries, your data doesn't move. The formats enable "write once, query anywhere" across your analytics stack.
Beyond metadata pruning, file-level statistics let query engines apply filters during file selection rather than after reading gigabytes of data. Apache Iceberg also supports partition evolution (changing a table's partitioning scheme without rewriting existing data), though this capability varies across formats, with Delta Lake offering more limited support for partition changes.
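In Iceberg, for example, a partition change is a metadata operation; this sketch assumes the Iceberg Spark SQL extensions are enabled and a timestamp column named event_ts:

```python
# Switch from daily to hourly partitioning; existing files keep their old
# layout, and only new writes use the new partition spec.
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD hours(event_ts)")
```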
Operations that were painful in traditional data lakes become straightforward. You can update or delete specific rows without rewriting entire partitions. Schema changes don't require coordinated updates across all consumers—the format handles versioning. Time travel capabilities let you access previous versions of your data without scanning or reprocessing files.
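Row-level changes read as ordinary SQL; here's a sketch against the same hypothetical table, where user_id, status, and event_date are made-up columns:

```python
# Delete and update specific rows; the table format rewrites or tracks only
# the affected data files, not entire partitions.
spark.sql("DELETE FROM lake.db.events WHERE user_id = 'u-123'")
spark.sql(
    "UPDATE lake.db.events SET status = 'archived' WHERE event_date < DATE '2023-01-01'"
)
```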
Table formats reduce storage costs through smart data management. Incremental updates avoid full table rewrites. File compaction combines small files into larger ones, optimizing both storage efficiency and query performance.
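With Iceberg and Spark, for instance, compaction and snapshot cleanup are exposed as stored procedures; the target file size and cutoff timestamp here are just examples:

```python
# Compact small data files into roughly 512 MiB files.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire old snapshots so their unreferenced data files can be removed.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```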
Object storage provides the foundation for open table formats because of its scalability and cost characteristics. The formats are designed specifically for object storage semantics—immutable objects, strong read-after-write consistency, and different latency profiles than block storage.
The S3 API has become the de facto standard, creating a consistent interface for object storage. This combination creates a portable, vendor-neutral platform that works consistently whether you're on-premises, in AWS, or in another cloud.
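As one illustration, the same Spark session can target any S3-compatible endpoint; everything below (endpoint, credentials, bucket, catalog name) is a placeholder, and real deployments typically supply credentials and choose a catalog implementation (Hive, REST, Glue, Nessie) differently:

```python
from pyspark.sql import SparkSession

# Requires the Iceberg Spark runtime and hadoop-aws jars on the classpath.
spark = (
    SparkSession.builder.appName("lakehouse")
    # Point the S3A connector at an S3-compatible object store.
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Register an Iceberg catalog named "lake" backed by that storage.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .getOrCreate()
)
```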
When deploying table formats on object storage, understanding the storage layer's operational characteristics helps you architect reliable infrastructure.
Open table formats bring database capabilities to data lakes through a metadata layer that coordinates access across multiple engines. The formats enable ACID transactions, schema evolution, time travel, and performance optimization on data stored in object storage. This combination (open standards, database-like features, and object storage economics) provides the foundation for modern analytical workloads at scale.
Request a free trial of MinIO AIStor to see how our object store could work for you.