The Architect’s Guide: A Modern Data Lake Reference Architecture

Updated: May 27, 2026

An abbreviated version of this post appeared on The New Stack on March 26th, 2024.

A modern data lake is a unified data architecture that combines data warehouse and data lake capabilities on object storage, enabling both structured analytics and unstructured data management in a single platform. Businesses aiming to maximize their data assets are adopting scalable, flexible, and unified data storage and analytics approaches. This trend is driven by enterprise architects tasked with crafting infrastructures that align with evolving business demands while at the same time reducing complexity and redundancy wherever possible. A modern data lake architecture addresses this need by integrating the scalability and flexibility of a data lake with the structure and performance optimizations of a data warehouse. This post serves as a guide to understanding and implementing a modern data lake architecture.

What is a Modern Data Lake?

A modern data lake is one-half data warehouse and one-half data lake and uses object storage for everything. This may sound like a marketing trick - put two products in one package and call it a new product - but the data warehouse presented in this post is better than a conventional data warehouse. The modern data lake houses structured data (typically the domain of data warehouses) alongside unstructured data (typically the domain of data lakes) in a single store. Object storage provides the scalability and performance, while the structured, often atomic-level data sets keep the file formats typical of data warehouses. Organizations that adopt this approach pay only for what they need (facilitated by the scalability of object storage), and if blazing speed is needed, they can equip their underlying object store with NVMe drives connected by a high-end network.

Open Table Formats (OTFs) are specifications—such as Apache Iceberg, Apache Hudi, and Delta Lake—that define how to organize and manage tabular data on object storage, enabling data warehouse functionality with features like ACID transactions, schema enforcement, and time travel. The use of object storage in this fashion has been made possible by the rise of OTFs, which, once implemented, make it seamless for object storage to be used as the underlying storage solution for a data warehouse. These specifications also provide features that may not exist in a conventional data warehouse - for example, snapshots (also known as time travel), schema evolution, partitions, partition evolution, and zero-copy branching.
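To make snapshots more concrete, here is a minimal Python sketch of the idea behind time travel: every commit appends an immutable list of data files, and reading "as of" an older snapshot simply follows that older list. The `ToyTable` class and the file names are hypothetical illustrations and do not reflect the actual metadata layout of Apache Iceberg, Apache Hudi, or Delta Lake.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable list of object keys that make up the table


@dataclass
class ToyTable:
    """Hypothetical stand-in for an open table format's metadata log."""
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        # A commit never rewrites old snapshots; it appends a new one.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snap = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_as_of(self, snapshot_id=None):
        # Time travel: resolve the file list recorded by an older snapshot.
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        return self.snapshots[snapshot_id - 1].data_files


table = ToyTable()
v1 = table.commit(added=["part-0001.parquet"])
v2 = table.commit(added=["part-0002.parquet"])
v3 = table.commit(removed=["part-0001.parquet"])
```

Calling `table.files_as_of(v2)` still returns both files even though the latest snapshot dropped one of them, because no commit ever modifies an earlier snapshot.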

But, as stated above, the modern data lake is more than just a fancy data warehouse - it also contains a data lake for unstructured data. The OTFs also provide integration to external data in the data lake. This integration allows external data to be used as a SQL table if needed - or the external data can be transformed and routed to the data warehouse using high-speed processing engines and familiar SQL commands.

So the modern data lake is more than just a data warehouse and a data lake in one package with a different name. Collectively they provide more value than what can be found in a conventional data warehouse or a standalone data lake.

Modern Data Lake vs. Traditional Data Warehouse vs. Standalone Data Lake

Dimension	Modern Data Lake	Traditional Data Warehouse	Standalone Data Lake
Storage Type	Object storage for both structured and unstructured data	Proprietary or block storage	Object storage or distributed file system
Scalability	Highly scalable; pay for what you need	Limited; scaling requires significant investment	Highly scalable
Supported Data Types	Structured, semi-structured, and unstructured	Primarily structured	All data types, but limited query capability
Key Architectural Features	Open Table Formats, time travel, schema evolution, zero-copy branching, unified storage	ACID transactions, optimized for SQL queries	Schema-on-read, raw data storage
Compute-Storage Coupling	Disaggregated (independent scaling)	Tightly coupled	Disaggregated
AI/ML Support	Native support with integrated vector databases and ML clusters	Limited; requires external tools	Supports raw data storage for ML

Key Benefits

The Modern data lake approach offers several distinct advantages for organizations:

Disaggregated Compute and Storage – Multiple processing engines can query the same data warehouse storage, allowing teams to have dedicated compute resources without competing for capacity.
Unified Storage Architecture – Consolidates structured and unstructured data on a single object storage platform, eliminating data silos and reducing infrastructure complexity.
Cost Efficiency at Scale – Object storage enables pay-for-what-you-need economics, with the ability to scale storage and compute independently.
Open Table Format Features – Provides advanced capabilities like time travel (snapshots), schema evolution, partition evolution, and zero-copy branching that exceed traditional data warehouse functionality.
AI/ML Workload Support – Native integration with vector databases, ML clusters, and MLOps tools enables seamless support for machine learning and generative AI use cases.
Flexible Data Integration – External table functionality allows raw data in the data lake to be queried as SQL tables without migration, enabling complex transformations and joins.

Conceptual Architecture

The conceptual architecture provides a layered view of all the components and services needed by the Modern data lake. Layering provides a clear way to group services that provide similar functionality. It also allows for a hierarchy to be established, with consumers on top and data sources (with their raw data) on the bottom. The layers of the Modern data lake from top to bottom are:

Consumption Layer - Contains the tools used by power users to analyze data. Also contains applications and AI/ML workloads that will programmatically access the Modern data lake.
Semantic Layer - An optional metadata layer for data discovery and governance.
Processing Layer - This layer contains the compute clusters needed to query the Modern data lake. It also contains compute clusters used for distributed model training. Complex transformations can occur in the Processing layer by taking advantage of the Storage Layer’s integration between the data lake and the data warehouse.
Storage Layer - Object storage is the primary storage service for the Modern data lake; however, MLOps tools may need other storage services such as relational databases. If you are pursuing generative AI, you will need a vector database.
Ingestion Layer - Contains the services needed to receive data. Advanced ingestion layers will be able to retrieve data based on a schedule. The Modern data lake should support a variety of protocols. It should also support data arriving in streams and batches. Simple and complex data transformations can occur in the ingestion layer.
Data Sources - The data sources layer is technically not a part of the Modern data lake solution, but it is included in this post because a well-constructed Modern data lake must support a variety of data sources with varying capabilities for sending data.

The diagram below visually depicts all the layers described above and all the capabilities that may be needed to implement these layers. This is an end-to-end architecture where the heart of the platform is a Modern data lake. Rather than focusing on just the processing layer and the storage layer - this architecture also shows components needed to ingest, transform, discover, govern, and consume data. The tools needed to support important use cases that depend on a Modern data lake are also included, such as MLOps storage, vector databases, and machine learning clusters.

The conceptual nature of the approach used in this post is important. If the diagram above made use of product names, then meaning would be lost. Product names are rarely chosen for meaning - rather, they are chosen for brand awareness and memory retention. To this end, our conceptual architecture uses simple nouns where the feature provided is intuitive. The next section will provide an example of a concrete implementation for the reader familiar with the more popular big data projects and products in the market today. However, the reader is encouraged to refer to the conceptual diagram when making decisions for their organization.

Finally, there are no arrows. Arrows typically depict data flow and dependencies. Showing all possible data flows and dependencies would unnecessarily complicate the diagram. A better approach is to look at data flow and dependencies in the context of a use case. Once a few components are isolated in the context of a use case, then data flow and dependencies can be more clearly illustrated.

A Concrete Architecture

The purpose of this section is to ground the design of our reference architecture with concrete open-source examples. For the architect eager to dive in and start building, the projects and products shown below are free to use in a proof of concept. When your POC graduates to a funded project that will one day run in production, then be sure to check open source licenses and terms of use for all software used in your POC.

A Few Words on Data Sources

The applications, devices, and vendors that feed your Modern data lake come in a variety of flavors, and so does their data. Modern on-premises applications may be able to stream well-structured data in real time using formats such as Avro and Parquet. On the other hand, older legacy applications may only be able to send simple files in batches, such as XML, JSON, and CSV files. Data vendors may not send data at all - expecting their customers to retrieve data.

Mobile apps, websites, IoT devices, and social media apps will typically send application logs and other telemetry (usage statistics) to your ingestion layer. Log analytics is a popular use case for a Modern data lake. Additionally, these sources may send images and audio files to be used within AI/ML workloads.

Finally, organizations looking to take advantage of Generative AI will need to store documents found in file shares and portals such as SharePoint Portal Server and Confluence in the Data Lake.

The Modern data lake needs to be able to interface with all these data sources efficiently and reliably - getting the data to either Data Lake Storage or Data Warehouse Storage. Onboarding data is the primary purpose of the Ingestion Layer of our architecture. This requires your ingestion layer to support a variety of protocols capable of receiving streamed data and batched data. Let’s investigate the components of this layer next.

The Ingestion Layer

The Ingestion Layer is responsible for receiving data from external sources and routing it to the appropriate storage tier within the architecture. Structured data from sources that designed their feeds for the data warehouse side of the Modern data lake can bypass the data lake and send their data directly to the data warehouse. On the other hand, sources that did not design their feeds in such a fashion will need to have their data sent to the data lake, where it can be transformed before being ingested into the data warehouse.

The ingestion layer should be able to receive and retrieve data. Internal Lines of Business (LOB) applications may have been given the mandate to send their data via streaming or batching. For these applications, the ingestion layer needs to provide an endpoint for receiving the data. However, Data Vendors and other external data sources may not be so willing to deliver data. The ingestion layer should also provide scheduled retrieval capabilities. For example, a data vendor may provide new datasets at the first of every month. Scheduled retrieval capabilities will allow for the ingestion layer to connect and download data at the correct time.
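As a small illustration of scheduled retrieval, the sketch below decides when a first-of-the-month vendor drop should next be fetched and whether the current month's drop is still outstanding. The function names are hypothetical, and a production ingestion layer would more likely rely on a scheduler or orchestration tool than on hand-written date logic.

```python
from datetime import date
from typing import Optional


def next_monthly_run(today: date) -> date:
    """Return the next first-of-month on or after `today`."""
    if today.day == 1:
        return today
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def is_due(today: date, last_retrieved: Optional[date]) -> bool:
    """A retrieval is due if this month's drop has not been fetched yet."""
    this_month_drop = date(today.year, today.month, 1)
    return last_retrieved is None or last_retrieved < this_month_drop
```

Tracking the date of the last successful retrieval, rather than only firing on the first of the month, lets the ingestion layer catch up if it was offline when the vendor published.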

Streaming is the best way to transmit data to a Modern data lake or to any destination for that matter. Streaming implies the use of a messaging service deployed in a way that makes it resilient, available, and highly performant. The messaging service usually provides a queuing mechanism that acknowledges the receipt of a message only upon successful storage of the message. The service then provides “exactly once” delivery to a downstream service that is responsible for saving the data in the message to either the data warehouse or the data lake. (Note: Some message services provide “at least once” delivery, requiring downstream services to implement idempotent updates to the destination data store. It is important to check the fine print of the service you end up using.) What is especially nice about this style of ingestion is that if the downstream service fails and does not acknowledge the successful processing of a message, then the message will reappear in the queue for future ingestion. Messaging services also provide “dead letter queues” for messages that repeatedly fail.
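The interplay of acknowledgements, redelivery, idempotent writes, and dead letter queues can be sketched in a few lines of Python. This is a toy in-memory model, not the API of any particular messaging service; `MAX_ATTEMPTS`, `drain`, and `IdempotentSink` are hypothetical names chosen for illustration.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


def drain(queue, handler, dead_letters):
    """Deliver each message until the handler acknowledges it or retries run out."""
    while queue:
        msg_id, payload, attempts = queue.popleft()
        try:
            handler(msg_id, payload)  # returning normally counts as an ack
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.append((msg_id, payload))
            else:
                queue.append((msg_id, payload, attempts + 1))  # redeliver later


class IdempotentSink:
    """Stores each message id once, so redelivery does not duplicate rows."""

    def __init__(self):
        self.rows = {}
        self.fail_once = {"m2"}  # simulate a crash after the write, before the ack

    def __call__(self, msg_id, payload):
        if payload is None:
            raise ValueError("malformed message")
        self.rows[msg_id] = payload  # upsert keyed by message id
        if msg_id in self.fail_once:
            self.fail_once.discard(msg_id)
            raise RuntimeError("crashed before ack")


queue = deque([("m1", {"v": 1}, 0), ("m2", {"v": 2}, 0), ("m3", None, 0)])
dead_letters = []
sink = IdempotentSink()
drain(queue, sink, dead_letters)
```

Message `m2` is written twice because its first acknowledgement was lost, yet the sink holds a single row for it since the write is keyed by message id. Message `m3` never succeeds and ends up in `dead_letters` after three attempts.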

Streaming ingestion is great, but in many cases, real-time insights are not needed. In these situations, batch or mini-batch processing works fine and can be considerably simpler to implement. For batch uploads, the S3 API is your best option. MinIO is S3 compliant, and any data source currently sending batch data to an S3 endpoint will work “as-is” with only a connection change once you switch over to the MinIO data lake. However, many organizations may still prefer FTP/SFTP for its simplicity and ability to run in highly constrained environments. MinIO also has support for FTP and SFTP. This interface allows a data source to send data to MinIO the same way it would send data to an FTP server. From an application's or user's perspective, moving data onto MinIO using SFTP is seamless since everything, from policies to security, stays essentially the same.

The Data Storage Layer

The Data Storage Layer serves as the foundation for all other layers, providing reliable data persistence and efficient data retrieval for the entire Modern data lake.

There will be one object storage service for the data lake side of the Modern data lake and another for the data warehouse.

These two object storage services can be combined into one physical instance of an object store if needed by using buckets to keep data warehouse storage separate from data lake storage. However, consider keeping them separate and installed on different hardware if the processing layer will be putting different workloads on these two storage services. For example, a common data flow is to have all new data land in the data lake. Once in the data lake, it can be transformed and ingested into the data warehouse, where it can be consumed by other applications and used for the purpose of Data Science, Business Intelligence, and data analytics. If this is your data flow, then your Modern data lake will be putting more load on your data warehouse, and you will want to make sure it is running on high-end hardware (storage devices, storage clusters, and network).

External table functionality allows data warehouses and processing engines to read objects in the Data Lake as if they were SQL tables. If the data lake is used as the landing zone for raw data, then this capability, along with the data warehouse SQL capabilities, can be used to transform raw data before inserting it into the data warehouse. Alternatively, the external table could be used “as-is” and joined with other tables and resources inside the data warehouse without it ever leaving the data lake. This pattern can help save on migration costs and can overcome some data security concerns by keeping the data in one place while, at the same time, making it available to outside services.
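The landing-zone pattern described above can be imitated with Python's built-in `sqlite3` module. SQLite is only a stand-in here: it has no external tables over object storage, so the sketch loads raw CSV text into a staging table to play that role. The table names and sample data are hypothetical.

```python
import csv
import io
import sqlite3

# Raw object as it might land in the data lake (hypothetical content).
RAW_CSV = "order_id,amount,country\n1, 19.50 ,us\n2,5.00,DE\n3,,us\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")

# Stand-in for an external table: expose the raw rows to SQL unchanged.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
conn.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :amount, :country)", rows
)

# Transform with plain SQL on the way into the warehouse table:
# trim whitespace, cast types, normalize case, and drop rows with no amount.
conn.execute(
    """
    INSERT INTO orders
    SELECT CAST(order_id AS INTEGER),
           CAST(TRIM(amount) AS REAL),
           UPPER(country)
    FROM raw_orders
    WHERE TRIM(amount) <> ''
    """
)

clean = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

With a real open table format and processing engine, `raw_orders` would be a table definition pointing at objects in the data lake, and the same `INSERT ... SELECT` shape would move cleaned rows into data warehouse storage.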

Most MLOps tools use a combination of an object store and a relational database to support MLOps. For example, an MLOps tool should store training metrics, hyperparameters, model checkpoints, and dataset versions. Models and datasets should be stored in the data lake, while metrics and hyperparameters will be more efficiently stored in a relational database.
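The split between the two stores might look like the following sketch, where a Python dictionary stands in for a data lake bucket and an in-memory SQLite database stands in for the relational database. The schema, key layout, and helper names are assumptions made for illustration and are not taken from any specific MLOps product.

```python
import json
import sqlite3

object_store = {}  # stand-in for a data lake bucket: key -> bytes
db = sqlite3.connect(":memory:")  # stand-in for the MLOps relational database
db.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, params TEXT, checkpoint_key TEXT)")
db.execute("CREATE TABLE metrics (run_id TEXT, step INTEGER, name TEXT, value REAL)")


def log_run(run_id, params, checkpoint_bytes):
    # Large binary artifacts go to object storage; the database keeps a pointer.
    key = f"models/{run_id}/checkpoint.bin"
    object_store[key] = checkpoint_bytes
    db.execute("INSERT INTO runs VALUES (?, ?, ?)", (run_id, json.dumps(params), key))


def log_metric(run_id, step, name, value):
    # Small, frequently queried values live in the relational database.
    db.execute("INSERT INTO metrics VALUES (?, ?, ?, ?)", (run_id, step, name, value))


log_run("run-001", {"lr": 0.01, "batch_size": 32}, b"\x00fake-weights")
log_metric("run-001", 1, "loss", 0.75)
log_metric("run-001", 2, "loss", 0.5)

best = db.execute(
    "SELECT run_id, MIN(value) FROM metrics WHERE name = 'loss' GROUP BY run_id"
).fetchone()
checkpoint_key = db.execute(
    "SELECT checkpoint_key FROM runs WHERE run_id = ?", (best[0],)
).fetchone()[0]
```

Metrics can be compared across runs with ordinary SQL, and the winning run's checkpoint is then fetched from object storage by its key.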

Retrieval Augmented Generation (RAG) is a technique that enhances large language model responses by retrieving relevant documents from a knowledge base and including them as context, enabling the model to generate answers grounded in your organization's specific data. If you are pursuing Generative AI, you will need to build a custom corpus for your organization. It should contain documents with knowledge that no one else has, and only documents that are true and accurate should be used. Furthermore, your custom corpus should be built with a Vector Database. A vector database indexes, stores, and provides access to your documents alongside their vector embeddings, which are numerical representations of your documents. Vector Databases facilitate semantic search, which RAG needs in order to marry information in your custom corpus to an LLM's trained parametric memory.
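A toy version of semantic search and prompt assembly is sketched below. The three-dimensional embeddings and document names are made up for illustration; a real system would obtain embeddings from an embedding model and would query a vector database rather than scanning a dictionary.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical 3-dimensional embeddings; real embedding models produce
# vectors with hundreds or thousands of dimensions.
corpus = {
    "vacation-policy.txt": [0.9, 0.1, 0.0],
    "expense-policy.txt": [0.1, 0.9, 0.1],
    "onboarding-guide.txt": [0.5, 0.4, 0.3],
}


def top_k(query_vec, k=2):
    # Brute-force scan; vector databases typically use approximate indexes.
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, corpus[doc]), reverse=True)
    return ranked[:k]


def build_prompt(question, query_vec):
    # RAG: retrieved documents are placed in the prompt as context for the LLM.
    context = "\n".join(f"[{doc}]" for doc in top_k(query_vec))
    return f"Context:\n{context}\n\nQuestion: {question}"


results = top_k([0.8, 0.2, 0.0])
```

The query vector sits closest to the vacation policy, so that document ranks first and would be the leading piece of context handed to the model.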

The Processing Layer

The Processing Layer provides the compute resources required to query data, execute transformations, and run distributed machine learning workloads. At a high level, compute comes in two flavors: Processing engines for the data warehouse and clusters for distributed machine learning.

The data warehouse processing engine supports the distributed execution of SQL commands against the data in data warehouse storage. Transformations that are part of the ingestion process may also need the compute power in the processing layer. For example, some data warehouses may wish to use a medallion architecture - others may choose a star schema with dimensional tables. These designs often require substantial ETL against the raw data during ingestion.

The data warehouse used within a Modern data lake disaggregates compute from storage. So, if needed, multiple processing engines can exist for a single data warehouse data store. (This differs from a conventional relational database where compute and storage are tightly coupled, and there is one compute resource for every storage device.) A possible design for your processing layer is to set up one processing engine for each entity in the consumption layer. For example, a processing cluster for business intelligence, a separate cluster for data analytics, and yet another for data science. Each processing engine would query the same data warehouse storage service - however, since each team has their own dedicated cluster they do not compete with each other for compute. If the Business Intelligence team is running month-end reports that are compute-intensive, then they will not interfere with another team that may be running daily reports.

Machine Learning models, especially Large Language Models, can be trained faster if training is done in a distributed fashion. The Machine Learning Cluster supports distributed training. Distributed training should be integrated with an MLOps tool for experiment tracking and checkpointing.

The Optional Semantic Layer

The Semantic Layer translates technical data structures into business-friendly terms and provides a unified view for data discovery and governance. The semantic layer sits between the processing layer, which serves up the data from the storage layer, and the Consumption layer, which contains the tools and applications looking for data. It acts like a translator that bridges the gap between the language of the business and the technical terms used to describe data. It also helps both data professionals and business users find relevant data for either end-user reports or dataset creation for AI/ML.

In its simplest form, the Semantic layer could be a data catalog or an organized inventory of data. A data catalog typically includes the original data source location (lineage), schema, short description, and long description. A more robust Semantic layer can provide security, privacy, and governance by incorporating policies, controls, and data quality rules.
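As a rough sketch of the simplest form of a semantic layer, a data catalog can be modeled as a list of entries carrying the fields named above plus a keyword search for discovery. The `CatalogEntry` class, bucket paths, and sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    source_location: str  # lineage: where the data originally came from
    schema: dict          # column name -> type
    short_description: str
    long_description: str = ""


catalog = [
    CatalogEntry("orders", "s3://raw-zone/orders/",
                 {"order_id": "int", "amount": "double"},
                 "Customer orders",
                 "One row per order placed through the web store."),
    CatalogEntry("app_logs", "s3://raw-zone/logs/",
                 {"ts": "timestamp", "message": "string"},
                 "Application telemetry"),
]


def search(term):
    """Case-insensitive match against names, descriptions, and column names."""
    term = term.lower()
    return [e.name for e in catalog
            if term in e.name.lower()
            or term in e.short_description.lower()
            or term in e.long_description.lower()
            or any(term in col.lower() for col in e.schema)]
```

A more robust semantic layer would attach policies, controls, and data quality rules to each entry instead of descriptions alone.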

This layer is optional. Organizations that have few data sources with well-structured feeds may not need a semantic layer. A well-structured feed is a feed that contains intuitive field names and accurate field descriptions that can be easily extracted from data sources and loaded into the data warehouse. Well-structured feeds should also implement data quality checks at the source so that only quality data is transmitted to the Modern data lake.

However, large organizations that have many data sources where metadata was an afterthought when schemas and feeds were designed should consider implementing the semantic layer. Many of the products that can be used in this layer provide features that help an organization populate a metadata catalog. Also, organizations that operate in complex industries should consider a semantic layer. For example, industries like Financial Services, Healthcare and Legal make heavy use of terms that are not everyday words. When these domain-specific terms are used as table names and field names, the underlying meaning of the data can be hard to ascertain.

The Consumption Layer

The Consumption Layer contains the tools, applications, and workloads that access and derive value from the data stored in the Modern data lake. Let’s conclude our presentation of the Modern data lake layers by looking at the workloads run in this topmost layer and discussing how the layers below support their specific use cases. The names of many of the workloads below are often used interchangeably - this is unfortunate because when investigating their needs, it is better to have precise definitions. In the discussion below, I will precisely describe each function and then align it with the capabilities of the Modern data lake.

Applications - Custom applications can programmatically send SQL Queries to the Modern data lake to provide custom views for end users. These may be the same applications that submitted raw data as data sources at the bottom of the diagram. A use case that should be supported by a Modern data lake is to allow applications to submit raw data, clean it, combine it with other data and finally serve it up quickly. Applications may use models trained with data from the Modern data lake. This is another use case that the modern data lake should support. Applications should be able to send raw data to the Modern data lake, get it processed, and sent to model training pipelines - from there, the models can be used to make predictions within the application.

Data Science is the study of data. Data scientists design the datasets and potentially the models that will be trained and used for inference. Data scientists also use techniques from mathematics and statistics for the purpose of feature engineering, a technique for improving datasets used to train a model. A very slick feature that Modern data lakes possess is zero-copy branching, which allows data to be branched the same way code can be branched within a Git repository. As the name suggests, this feature does not make a copy of the data - rather, it uses the metadata layer of the open table format that implements the data warehouse to create an isolated view of the data through metadata pointers, enabling experimentation without storage overhead. Data scientists can experiment with a branch - if their experiments are successful, then they can merge their branch back into the main branch for other data scientists to use.
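The Git analogy can be illustrated with a toy metadata layer in which a branch is nothing more than a named pointer to a list of data files. The `BranchingTable` class is a hypothetical simplification; real implementations track snapshots and handle merge conflicts, which this sketch ignores.

```python
class BranchingTable:
    """Hypothetical metadata layer: branches are names that point at file lists."""

    def __init__(self, files):
        self.branches = {"main": tuple(files)}

    def create_branch(self, name, source="main"):
        # Zero-copy: only the pointer (a tuple of object keys) is duplicated.
        self.branches[name] = self.branches[source]

    def append(self, branch, new_file):
        # Writes add new files to one branch; other branches are untouched.
        self.branches[branch] = self.branches[branch] + (new_file,)

    def merge(self, source, target="main"):
        # Fast-forward-style merge: adopt the files the target lacks.
        missing = tuple(f for f in self.branches[source]
                        if f not in self.branches[target])
        self.branches[target] = self.branches[target] + missing


table = BranchingTable(["part-0001.parquet"])
table.create_branch("feature-eng")
table.append("feature-eng", "part-0002-features.parquet")
main_before_merge = table.branches["main"]
table.merge("feature-eng")
```

Until the merge, `main` still lists only the original file, so other data scientists are unaffected by the experiment. No data file is ever copied; only the file lists change.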

Business Intelligence is often retrospective, providing insights into past events. It involves the use of reporting tools, dashboards, and key performance indicators (KPIs) to provide a view into business performance. Much of the data needed for BI are aggregations which can require a fair amount of compute to create.

Data analytics, on the other hand, involves the analysis of data to extract insights, identify trends, and make predictions. It is more forward-looking and aims to understand why certain events occurred and what might happen in the future. Data analytics overlaps Data Science in that it incorporates statistical analysis and machine learning techniques.

Machine Learning - The machine learning workload is where ML teams run their experiments and MLOps teams test and promote models to production. There is often a considerable difference between the needs of teams that are using machine learning for research and prototyping vs. those that are putting models into production on a regular basis. Teams only doing research and experimental work can often get away with minimal MLOps tooling, whereas those putting models into production will need considerably more rigorous tools and processes.

Security

A Modern data lake must provide comprehensive security capabilities including identity verification, access control, and data protection both at rest and in transit. The four key security areas are:

Authentication – Verifies the identity of users and services connecting to both the data lake and data warehouse
Authorization – Controls what actions authenticated users can perform on specific resources
Encryption at Rest – Protects stored data using cryptographic keys managed by a Key Management Server
Encryption in Transit – Secures data as it moves between components and layers of the architecture

Both the data lake and the data warehouse must support an Identity and Access Management (IAM) solution that facilitates authentication and authorization. Both halves of the Modern data lake should use the same directory service for keeping track of users and groups allowing users to present their corporate credentials when signing into the user interface for both the data lake and the data warehouse. For programmatic access, since each product requires a different connection type, the credentials that need to be presented for authentication will be different. Likewise, the policies used for authorization will also be different as the underlying resources and actions are different. The data lake requires authorization for buckets and objects as well as bucket and object actions. The data warehouse, on the other hand, needs tables and table related actions to be authorized.

Data Lake Authentication - Every connection to the data lake requires verification of identity and the data lake should integrate with the organization's identity provider. Since the data lake is an object store that is S3 compliant, the AWS Signature Version 4 protocol should be used. For programmatic access, this means that each service wishing to access an administrative API or an S3 API, such as PUT, GET, and DELETE operations, must sign its requests using a valid access key and secret key.
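One part of Signature Version 4 that is easy to show in isolation is the derivation of the signing key, a chain of HMAC-SHA256 operations over the date, region, and service. The sketch below covers only that step and the final HMAC; building the canonical request and the string to sign is omitted, the credentials are placeholders, and in practice an S3 SDK performs all of this for you.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the Signature Version 4 signing key from the secret key."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign(string_to_sign: str, signing_key: bytes) -> str:
    """Return the hex signature for an already-built string to sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder credentials and scope, not real values.
key = derive_signing_key("EXAMPLE-SECRET-KEY", "20240326", "us-east-1", "s3")
signature = sign("hypothetical-string-to-sign", key)
```

Because the secret key only feeds the HMAC chain, it never travels over the wire; the server recomputes the same signature from its own copy of the secret and compares.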

Data Lake Authorization - Authorization is the act of restricting which actions an authenticated client can perform and which resources it can access in the data lake. An S3-compliant object store should use Policy-Based Access Control (PBAC), where each policy describes one or more rules that outline the permissions of a user or group of users. The data lake should support S3-specific actions and conditions when creating policies. By default, MinIO denies access to actions or resources not explicitly referenced in a user’s assigned or inherited policies.
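Deny-by-default evaluation can be sketched as follows. The policy is a hypothetical document in the general shape of an S3-style policy, and the evaluator is a deliberate simplification that ignores principals and conditions; it is not how MinIO or any other object store implements policy evaluation.

```python
from fnmatch import fnmatchcase

# Hypothetical policy in the general shape of an S3-style policy document.
POLICY = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": ["arn:aws:s3:::raw-zone/*"]},
        {"Effect": "Deny", "Action": ["s3:PutObject"],
         "Resource": ["arn:aws:s3:::raw-zone/locked/*"]},
    ]
}


def is_allowed(policy, action, resource):
    """Explicit deny wins, explicit allow is required, everything else is denied."""
    allowed = False
    for stmt in policy["Statement"]:
        action_match = any(fnmatchcase(action, a) for a in stmt["Action"])
        resource_match = any(fnmatchcase(resource, r) for r in stmt["Resource"])
        if action_match and resource_match:
            if stmt["Effect"] == "Deny":
                return False
            allowed = True
    return allowed
```

A request for `s3:DeleteObject` is refused even inside `raw-zone` because no statement mentions that action, which is the deny-by-default behavior described above.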

Data Warehouse Authentication - Similar to the data lake, every connection to the data warehouse must be authenticated and the data warehouse should integrate with the organization’s identity provider for authenticating users. A data warehouse may provide the following options for programmatic access: ODBC connection, JDBC connection, or REST session. Each will require an access token.

Data Warehouse Authorization - A data warehouse should support User, Group, and Role level access controls for tables, views, and other objects found in the data warehouse. This allows access to individual objects to be configured based on either the user’s id, a group, or a role.
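User, group, and role level access can be modeled as grants keyed by principal, where a request succeeds if any principal attached to the user holds the grant. The tables, users, and grant structure below are hypothetical; real data warehouses express the same idea through SQL `GRANT` statements or an external policy engine.

```python
GRANTS = {
    # (principal_type, principal_name) -> {table: set of allowed actions}
    ("role", "analyst"): {"sales.orders": {"SELECT"}},
    ("group", "etl"): {"sales.orders": {"SELECT", "INSERT"}},
    ("user", "dana"): {"hr.salaries": {"SELECT"}},
}

USERS = {
    "dana": {"groups": {"etl"}, "roles": set()},
    "lee": {"groups": set(), "roles": {"analyst"}},
}


def can(user, action, table):
    """Allow if the user, any of their groups, or any of their roles holds the grant."""
    info = USERS.get(user, {"groups": set(), "roles": set()})
    principals = [("user", user)]
    principals += [("group", g) for g in info["groups"]]
    principals += [("role", r) for r in info["roles"]]
    return any(action in GRANTS.get(p, {}).get(table, set()) for p in principals)
```

Here `dana` can insert into `sales.orders` through the `etl` group, while `lee` can only read it through the `analyst` role.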

Key Management Server - For security at rest and in transit, the Modern data lake uses a Key Management Server (KMS). A KMS is a service that is responsible for generating, distributing, and managing cryptographic keys used for encryption and decryption.

Frequently Asked Questions

What is the difference between a data lake and a Modern data lake?

A data lake stores raw, unstructured data without built-in query optimization or ACID transactions. A Modern data lake combines data lake capabilities with data warehouse functionality on object storage, providing structured analytics, Open Table Format features like time travel and schema evolution, and the ability to query both structured and unstructured data from a unified platform.

What are Open Table Formats and why do they matter?

Open Table Formats (OTFs) are specifications—such as Apache Iceberg, Apache Hudi, and Delta Lake—that enable data warehouse functionality on object storage. They matter because they provide features like ACID transactions, time travel, schema evolution, and zero-copy branching while allowing organizations to use cost-effective, scalable object storage instead of proprietary storage systems.

Can I use my existing data lake with a Modern data lake architecture?

Yes. The Modern data lake's external table functionality allows data warehouse and processing engines to read objects in your existing data lake as SQL tables. This enables you to query, transform, and join data lake content with data warehouse tables without migrating the underlying data.

What hardware is recommended for a Modern data lake?

Hardware requirements depend on your workload. For high-performance analytics, equip your data warehouse object storage with NVMe drives and a high-speed network. Consider separating data lake and data warehouse storage on different hardware if they will experience different workload patterns, with the data warehouse typically requiring higher-end infrastructure.

Do I need a Semantic Layer?

The Semantic Layer is optional. Organizations with few data sources and well-structured feeds with intuitive field names may not need one. However, large organizations with many data sources, legacy schemas, or operations in complex industries like Financial Services, Healthcare, or Legal should consider implementing a Semantic Layer for data discovery and governance.

Key Takeaways

A Modern data lake unifies data warehouse and data lake capabilities on object storage, providing more value than either architecture alone through features like time travel, schema evolution, and external table integration.
Open Table Formats (Apache Iceberg, Apache Hudi, Delta Lake) are the enabling technology that makes it possible to run data warehouse workloads on object storage with features that exceed those of a conventional data warehouse.
The six-layer architecture (Consumption, Semantic, Processing, Storage, Ingestion, Data Sources) provides a comprehensive framework for building end-to-end data platforms that support analytics, BI, and AI/ML workloads.
Disaggregated compute and storage is a key architectural advantage, allowing multiple processing engines to query the same data warehouse storage so teams don't compete for compute resources.
The Ingestion Layer must support multiple protocols and patterns, including streaming for real-time data, batch uploads via S3 API, and scheduled retrieval for external data sources.
Security requires a unified approach across both halves of the Modern data lake with consistent authentication through a shared identity provider while maintaining appropriate authorization policies for each storage type.
The Semantic Layer is optional but valuable for complex organizations, particularly those with many data sources, legacy schemas, or domain-specific terminology that makes data discovery challenging.

Summary

There you have it: the six layers of the reference architecture, from data sources to consumption. This post explored a conceptual reference architecture for a Modern data lake. The goal is to provide organizations with a strategic blueprint for building a platform that efficiently manages and extracts value from their vast and diverse data sets. The Modern data lake combines the strengths of traditional data warehouses and flexible data lakes, offering a unified and scalable solution for storing, processing, and analyzing data.
