Modern Data Lake with MinIO : Part 2

October 11, 2018

In the first part of this series, we saw why object storage systems like Minio are the perfect approach to build modern data lakes that are agile, cost-effective, and massively scalable.

In this post we’ll learn more about object storage, specifically Minio and then see how to connect Minio with tools like Apache Spark and Presto for analytics workloads.

Architecture of an object storage system

One of the design principles of object storage is to abstract some of the lower layers of storage away from the administrators and applications. Thus, data is exposed and managed as objects instead of files or blocks. Objects contain additional descriptive properties which can be used for better indexing or management. Administrators do not have to perform lower-level storage functions like constructing and managing logical volumes to utilize disk capacity or setting RAID levels to deal with disk failure.

Object storage also allows the addressing and identification of individual objects by more than just file name and file path. Object storage adds a unique identifier within a bucket, or across the entire system, to support much larger namespaces and eliminate name collisions.

Inclusion of rich custom metadata within the object

Object storage explicitly separates file metadata from data to support additional capabilities. As opposed to fixed metadata in file systems (filename, creation date, type, etc.), object storage provides for full function, custom, object-level metadata in order to:

Capture application-specific or user-specific information for better indexing purposes
Support data-management policies (e.g. a policy to drive object movement from one storage tier to another)
Centralized management of storage across many individual nodes and clusters
Optimize metadata storage (e.g. encapsulated, database or key value storage) and caching/indexing (when authoritative metadata is encapsulated with the metadata inside the object) independently from the data storage (e.g. unstructured binary storage)