What is a Data Lakehouse?

A data lakehouse is an architecture that combines the capabilities and benefits of both a data warehouse and a data lake. It consists of object storage, an open table format such as Apache Iceberg that provides database-like features like ACID transactions and schema management, and one or more query engines that can access the data in object storage without migration. The result is an infrastructure that retains what worked in the past while moving data infrastructure into the future.

Understanding the Foundation: Data Warehouses and Data Lakes

To understand what a data lakehouse is, it helps to define what it isn't: it isn't a data warehouse, and it isn't a data lake.

What is a Data Warehouse?

A data warehouse is a platform designed for structured, curated data that has historically powered business intelligence and analytics. Data warehouses typically excel at performant query paths, ACID compliance, and consistent schema enforcement across all data.

However, the data warehouses of the past could not store unstructured data like images, videos, or documents. Modern iterations can store semi-structured data like JSON and XML, but often incur parsing overhead for these types that results in slower query performance. In addition, schema changes usually require significant effort, and on-prem deployments typically scale only vertically rather than horizontally. Data pipelines into data warehouses are often brittle and subject to data drift because they usually can't easily accommodate schema or data type changes. Significant effort is required to curate this data infrastructure, and as a result it has spawned a plethora of data management philosophies, including the Kimball methodology, star schema, and medallion architectures.

What is a Data Lake?

Data lakes were big data's early response to the rigid data warehouse. In contrast to a data warehouse, a data lake stores raw data, typically without curation. This architecture fostered ELT pipelines in which all data was extracted and loaded first and only later curated for use. By their nature, data lakes can handle any data type: structured tables, semi-structured JSON, unstructured text, images, videos, and streaming data. This made them ideal for early data exploration on raw data and opened the door to early machine learning.

However, this approach left many organizations with disorderly lakes, often humorously referred to as swamps. It became clear that a "store first" approach made actually cleaning, curating, and serving data to end users difficult. Governance and data quality both took a back seat to ease of ingestion. Eventually, systems and software were developed to address the problem, and data mesh and data fabric were born, but these couldn't entirely solve the inherent difficulties of the infrastructure.

What Trends Brought About the Data Lakehouse?

Several technological and business trends converged to make the lakehouse architecture possible. Open table formats like Iceberg, Delta Lake, and Hudi brought database reliability to object storage. Modern object storage achieved performance levels suitable for analytics while retaining the security and cost benefits of on-prem deployments. Modern query engines like Trino and Spark can efficiently process data through familiar interfaces and the shared lingua franca of SQL. Compute-storage separation now allows independent scaling of processing power and data storage, effectively ending the reign of tightly coupled data architectures.

At the same time, business drivers pushed toward a different solution to the problems introduced by both the data warehouse and the data lake. The AI/ML explosion requires both raw unstructured data and curated structured features. Real-time analytics demands streaming data integration alongside historical analysis. Cost optimization creates pressure to eliminate duplicate data storage and complex, failure-prone ETL pipelines. Data governance requirements call for unified security and compliance across all data.

What is a Data Lakehouse?

Enter the lakehouse, an architecture that combines the flexible storage and cost-effectiveness of data lakes with the performance, reliability, and governance of data warehouses. A lakehouse uses open table formats to provide structured database capabilities like ACID transactions, schema enforcement, and optimized query performance over data stored in object storage. This unified architecture supports diverse workloads from business intelligence and reporting to machine learning and real-time analytics, all against the same underlying tables without needing to move or duplicate data.

A data lakehouse is composed of three foundational layers: object storage, an open table format, and a query engine, supported by additional governance and orchestration components.

Object Storage 

The foundation layer stores data files in object storage, which grows elastically. On-premises object storage provides predictable costs, complete data sovereignty, and the ability to scale storage independently from compute resources while maintaining high performance for analytics workloads.
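
To make the storage layer concrete, here is a minimal sketch in Python of landing a raw Parquet file in an S3-compatible object store using pyarrow. The endpoint, credentials, bucket, and path are placeholders, not a prescribed layout.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Connect to an S3-compatible object store; the endpoint and keys are
# placeholders and would normally come from your environment or secrets store.
s3 = fs.S3FileSystem(
    endpoint_override="https://objectstore.example.internal",
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
)

# Object storage only sees immutable files; table semantics come later from
# the open table format layered on top.
table = pa.table({"event_id": [1, 2, 3], "status": ["ok", "ok", "error"]})
pq.write_table(table, "lakehouse-raw/events/part-00000.parquet", filesystem=s3)
```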

Open Table Format 

An open table layer, typically Apache Iceberg, organizes the stored files into ACID-compliant tables with schemas, time travel capabilities, and reliable concurrent writes. This layer provides database-like reliability and consistency while maintaining the flexibility to evolve schemas without expensive migrations.
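
As a minimal sketch of this layer, the PySpark snippet below creates an Iceberg table over object storage and commits a write to it. The catalog name ("lake") and namespace are assumptions; a matching catalog configuration is sketched later in the Tool Selection section.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Create an ACID-compliant Iceberg table; the files live in object storage,
# while the table format tracks schema, partitioning, and snapshots.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        event_id INT,
        status   STRING,
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Each write commits atomically as a new snapshot; readers never see a
# partially written result.
spark.sql("INSERT INTO lake.analytics.events VALUES (1, 'ok', current_timestamp())")
```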

Query Engine 

Multiple processing engines like Spark, Trino, Dremio, Flink, and DuckDB can query the same tables simultaneously without creating data copies. This polyglot approach allows teams to use the best tool for each workload, whether batch processing, interactive analytics, stream processing, or machine learning, all against the same underlying tables.
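
As an illustration of this polyglot access, the sketch below reads the same (assumed) Iceberg table from a lightweight Python stack: pyiceberg resolves the current snapshot from a REST catalog and DuckDB queries the resulting Arrow table in place. The catalog URI and table name are placeholders.

```python
import duckdb
from pyiceberg.catalog import load_catalog

# Load the table through the catalog rather than by file path, so every engine
# sees the same committed snapshot.
catalog = load_catalog(
    "lake", type="rest", uri="http://iceberg-rest.example.internal:8181"
)
events = catalog.load_table("analytics.events").scan().to_arrow()

# DuckDB can query the in-memory Arrow table directly by its variable name.
print(duckdb.sql("SELECT status, count(*) AS n FROM events GROUP BY status").fetchall())
```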

The Value of a Data Lakehouse

With schema evolution, a data lakehouse can adapt to changing business requirements with minimal downtime or complex migrations. Fine-grained access controls at the query engine and metadata catalog level (such as Iceberg or Hive catalogs) enable secure data sharing across departments and external partners.
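
Schema evolution in an Iceberg table is a metadata operation, so no data files are rewritten. Here is a sketch of common evolutions via Spark SQL, using the illustrative table from the earlier snippet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Add a column; existing rows simply read it as NULL.
spark.sql("ALTER TABLE lake.analytics.events ADD COLUMN region STRING")

# Rename a column without touching the underlying files.
spark.sql("ALTER TABLE lake.analytics.events RENAME COLUMN status TO event_status")

# Widen a type (INT -> BIGINT is one of the promotions Iceberg allows).
spark.sql("ALTER TABLE lake.analytics.events ALTER COLUMN event_id TYPE BIGINT")
```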

The architecture delivers consistent performance at scale through intelligent data clustering and automatic file optimization features provided by open table formats, with implementations varying across Iceberg's sorting and bucketing, Delta's Z-ordering, and Hudi's clustering capabilities. Modern object storage provides high reliability and availability with active-active replication and bit rot protection without requiring complex database replication strategies or maintenance downtime.
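
As one concrete example of these optimization features, Iceberg exposes a Spark procedure that compacts and sorts data files so related rows end up co-located; Delta's OPTIMIZE ... ZORDER BY and Hudi's clustering play a similar role. The catalog, table, and sort column below are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite small files into larger ones, sorted by timestamp so time-range
# queries scan fewer files.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        strategy => 'sort',
        sort_order => 'ts DESC'
    )
""")
```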

Deploying on-prem enables organizations to maintain complete data sovereignty and control over their data. Cloud deployments operate under shared responsibility models, which carry additional security considerations, but they too can provide robust security when properly implemented.

This stack is as flexible as you need: teams can implement real-time streaming ingestion alongside batch processing through the broader lakehouse ecosystem and query engines, enabling organizations to make decisions on the freshest data while maintaining historical context for trend analysis and forecasting.
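
A sketch of what that can look like in practice: a Spark Structured Streaming job appends micro-batches from Kafka into an Iceberg table that batch jobs and dashboards query as well. The broker address, topic, target table, and checkpoint path are placeholders, the target table is assumed to already exist with a matching schema, and the Kafka and Iceberg connector packages must be on Spark's classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
    .option("subscribe", "events")
    .load()
    # Keep the raw payload plus the broker timestamp for this sketch.
    .select(col("value").cast("string").alias("payload"), col("timestamp").alias("ts"))
)

# Each micro-batch commits a new Iceberg snapshot that is immediately visible
# to every other engine reading the table.
query = (
    stream.writeStream.format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "s3://lakehouse-raw/checkpoints/raw_events")
    .toTable("lake.analytics.raw_events")
)
```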

Data Types and Formats in a Lakehouse

Structured Data Formats

Parquet is a columnar format optimized for analytical queries with excellent compression. Avro provides a row-based format ideal for data exchange and schema evolution. ORC offers an optimized row-columnar format with built-in indexing.

Unstructured Data

Documents including PDFs, Word docs, and text files support content analysis and RAG applications. Media files like images, videos, and audio enable computer vision and multimodal AI. Logs from applications and system metrics power observability and anomaly detection.

Semi-Structured Data

JSON provides flexible nested data from APIs and applications. XML handles legacy data interchange formats. CSV supports simple tabular data with varying schemas.

Open Table Formats

Open table formats like Apache Iceberg, Delta Lake, and Apache Hudi sit on top of open file formats, most commonly Parquet, adding transactional capabilities, schema evolution, and metadata management. These formats extend Parquet's columnar efficiency with features like ACID transactions, time travel, and concurrent read/write operations.

Query engines can seamlessly convert other file formats into open table formats during ingestion or processing. CSV, JSON, Avro, and ORC files can be converted into Parquet-backed tables by engines like Spark, Trino, or Dremio, enabling organizations to modernize their data storage without manual migration effort. This conversion optimizes data layout and applies compression while preserving the original data structure and semantics.
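
A sketch of that conversion path with Spark: raw CSV exports in the lake are read once and written out as a Parquet-backed Iceberg table. The source path and table name are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw files as-is from object storage.
raw = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://lakehouse-raw/exports/customers/*.csv")
)

# Write them out as a Parquet-backed Iceberg table; later loads can append or
# merge into the same table instead of replacing it.
raw.writeTo("lake.analytics.customers").using("iceberg").createOrReplace()
```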

Top Use Cases for Data Lakehouses

BI & Dashboards

Replace traditional data warehouses for business reporting. Analysts query curated tables using familiar SQL tools while accessing a broader range of data sources including semi-structured logs and external APIs.

Customer 360

Unify customer data from multiple touchpoints including CRM systems, web analytics, support tickets, and transaction histories. The lakehouse enables real-time customer profiling by combining structured customer records with unstructured interaction data like call transcripts and chat logs, providing a complete view for personalization and customer experience optimization.

Log Analytics

Store and analyze massive volumes of application logs, system metrics, and security events. The lakehouse scales to petabytes while enabling both real-time monitoring and historical trend analysis.

Machine Learning

Train models on raw data while serving features from curated tables. The same platform supports data exploration, feature engineering, model training, and inference serving without data movement.

Power recommendation engines, fraud detection, and predictive analytics using structured, curated data. Traditional AI and ML workloads rely heavily on structured data formats like Parquet tables with defined features and labels. Combine transactional data, behavioral logs, and external datasets in a single queryable layer for model training and inference.
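
A minimal sketch of that pattern: a curated feature table is scanned straight from the catalog into pandas and fed to a trainer, with no export pipeline in between. The catalog URI, table, columns, and the use of scikit-learn are all illustrative assumptions.

```python
from pyiceberg.catalog import load_catalog
from sklearn.linear_model import LogisticRegression

catalog = load_catalog(
    "lake", type="rest", uri="http://iceberg-rest.example.internal:8181"
)

# Pull only the columns the model needs; the scan is served from the same
# tables that power reporting.
features = (
    catalog.load_table("analytics.customer_features")
    .scan(selected_fields=("tenure_days", "monthly_spend", "churned"))
    .to_pandas()
)

model = LogisticRegression().fit(
    features[["tenure_days", "monthly_spend"]], features["churned"]
)
```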

Generative AI

Enable retrieval-augmented generation (RAG) and large language models by working directly with unstructured data stored in the data lake. Generative AI primarily uses raw documents, images, videos, and text for training and inference, rather than structured tabular data. Store documents alongside vector embeddings and integrate with vector databases for semantic search while maintaining data lineage and governance.
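
A rough sketch of one RAG ingestion step under stated assumptions: documents already sitting in object storage are read and embedded, and the resulting vectors are kept alongside references to the source files (they could just as well be pushed to a vector database). The endpoint, bucket, and the sentence-transformers model are illustrative choices, not part of the lakehouse itself.

```python
from pyarrow import fs
from sentence_transformers import SentenceTransformer

# Credentials come from the environment; the endpoint and prefix are placeholders.
s3 = fs.S3FileSystem(endpoint_override="https://objectstore.example.internal")
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = []
for info in s3.get_file_info(fs.FileSelector("lakehouse-raw/docs/")):
    if info.is_file:
        with s3.open_input_stream(info.path) as f:
            docs.append((info.path, f.read().decode("utf-8", errors="ignore")))

# Each (path, vector) pair can be appended to a lakehouse table or a vector
# index, keeping lineage tied to the source documents.
embeddings = model.encode([text for _, text in docs])
```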

Supporting Technologies for AI:

  • Feature Stores: Online/offline feature serving (Feast, Tecton)
  • Vector Databases: Semantic search and embeddings (Pinecone, Weaviate)
  • ML Orchestration: Workflow management (Airflow, Prefect)

Best Practices for Open Table Format Lakehouses

Architecture Decisions

  • Choose Iceberg for maximum compatibility and feature richness
  • Design partitioning strategy based on query patterns (see the sketch after this list)
  • Implement proper data lifecycle management (archival, deletion)
  • Plan for multi-region replication if needed
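
As referenced above, here is a sketch of a query-pattern-driven partition spec: if most queries filter by day and region, partitioning on those transforms lets the engine prune files before scanning. Table and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.page_views (
        user_id BIGINT,
        region  STRING,
        url     STRING,
        ts      TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts), region)
""")
```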

Tool Selection

  • Start with Spark for ETL and Trino for interactive queries
  • Add specialized engines (Flink for streaming) as needed
  • Choose catalog based on ecosystem (HMS for Hadoop, Glue for AWS); a configuration sketch follows this list
  • Implement proper monitoring and observability
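
As referenced above, here is a sketch of catalog wiring for Spark against an Iceberg REST catalog. The catalog name, URI, and warehouse path are placeholders, and pointing at a Hive Metastore or Glue catalog only changes these properties, not the queries you run.

```python
from pyspark.sql import SparkSession

# The Iceberg Spark runtime jar must also be on the classpath (for example via
# spark.jars.packages); exact coordinates depend on your Spark and Iceberg versions.
spark = (
    SparkSession.builder.appName("lakehouse")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "http://iceberg-rest.example.internal:8181")
    .config("spark.sql.catalog.lake.warehouse", "s3://lakehouse-warehouse/")
    .getOrCreate()
)
```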

Governance Approaches

  • Establish data classification and retention policies
  • Implement column-level access controls
  • Create data lineage tracking
  • Document schemas and maintain data dictionaries
  • Regular data quality monitoring and validation

Performance Optimization

  • Optimize file sizes (target 100MB-1GB per file)
  • Use appropriate compression (snappy for hot data, gzip for cold)
  • Implement proper indexing strategies
  • Monitor query patterns and optimize accordingly
  • Regular table maintenance (compaction, cleanup); see the maintenance sketch after this list
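
As referenced above, here is a sketch of routine Iceberg maintenance via Spark procedures: compaction keeps files near the target size, expiring snapshots bounds metadata growth, and orphan files get cleaned up. The catalog, table, and retention values are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Binpack small files toward a ~512 MB target.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Keep a bounded history of snapshots and drop files no snapshot references.
spark.sql("CALL lake.system.expire_snapshots(table => 'analytics.events', retain_last => 30)")
spark.sql("CALL lake.system.remove_orphan_files(table => 'analytics.events')")
```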

Why the Lakehouse Architecture Matters for AI

Traditional architectures forced AI teams into complex data engineering. You'd extract features from a data warehouse, store raw training data in a lake, maintain separate feature stores, and sync everything through brittle pipelines. Schema changes broke downstream models, and reproducing experiments required reconstructing historical data states.

A data lakehouse eliminates this friction. Traditional AI and ML teams can train models using structured, curated data with reliable features and labels. Generative AI teams can work directly with raw documents, images, and unstructured content. The data lake also serves as the processing environment where raw data gets transformed into structured formats through tools like Spark.

You can keep raw documents and media for generative AI experiments while exposing curated tables for traditional ML features and labels. Use time travel to reproduce training runs exactly. Stream fresh events into the same tables that power online features. Store unstructured data next to structured data to enable both traditional AI workflows and generative AI applications like RAG.
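
A sketch of that time-travel step in Spark SQL, pinning a training set to the exact table state an experiment was built from; the table name, snapshot ID, and timestamp are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as of a snapshot ID recorded with the experiment...
training_set = spark.sql(
    "SELECT * FROM lake.analytics.customer_features VERSION AS OF 4348719350573889157"
)

# ...or as of a wall-clock timestamp.
training_set_by_time = spark.sql(
    "SELECT * FROM lake.analytics.customer_features TIMESTAMP AS OF '2024-01-01 00:00:00'"
)
```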

The result is faster iteration cycles, simpler infrastructure, and a clearer path from experimentation to production. Data scientists spend time on models instead of data engineering, while ML engineers deploy with confidence knowing their training and serving data come from the same source.

The Future

The data lakehouse represents the evolution of data architecture, eliminating the trade-offs between flexibility and performance that defined previous generations. By letting diverse data workloads run wherever it makes sense for them to run, organizations can accelerate their data initiatives while reducing infrastructure complexity and costs.