The On-Premises Data Databricks Couldn't Reach. Until Now.

Graphic of a document with three horizontal blue lines above a large blue paragraph symbol on a light background.

Published:

June 10, 2026

The On-Premises Data Databricks Couldn't Reach. Until Now.

Consider a life sciences company running up to 2.2 million experiments every week. Their automated laboratory systems generate terabytes of microscopy images and structured experimental metadata on a daily basis. Across on-premises storage, they manage more than 20 petabytes of biological research data accumulated over years of drug discovery work. Their data scientists need to run AI and analytics in Databricks. Their data cannot move.

This is not an edge case. It is the default condition for a large and growing share of the data that matters most to enterprise AI programs. Databricks has received requests from hundreds of enterprise customers explicitly asking for on-premises and hybrid storage connectivity to Unity Catalog. The demand is not new. The solution is.

The Data That Was Never an Option

Most Databricks practitioners already know the shape of this problem. But it is worth naming the categories precisely, because each one has a different implication for how the solution needs to work.

Some data cannot move. Regulatory requirements, data sovereignty mandates, and controlled network environments create hard legal stops. No pipeline design resolves this.

Some data is too expensive to move. At petabyte scale, cloud egress fees and storage costs make replication economically irrational even where it is legally permissible.

Some data loses value in transit. Sensor telemetry, transaction events, and operational logs are only valuable when they are fresh. By the time a pipeline replicates and syncs them, the insight window has closed. Latency is not a delay. It is a disqualifier.

Some data gets degraded to move. This is the one that quietly breaks AI programs. Organizations that cannot fully replicate sensitive datasets move a compromised version: sampled subsets, masked records, tokenized identifiers. Sampled data produces models that reflect the sample, not the population. Tokenized records break join keys. The analysis runs. The results are misleading.

All four categories share the same outcome: data that matters, sitting outside the reach of Databricks. Not because no one wanted to use it, but because there was no viable path to get there.

Until now.

What the On-Premises Stack Already Looks Like

Before introducing how that changes, it is worth grounding in the architecture that already exists at most enterprises running this kind of workload. The on-premises data stack is not a gap waiting to be filled. It is a foundation waiting to be extended.

The typical pattern looks something like this. Streaming events arrive continuously via Apache Kafka: sensor readings from a manufacturing line, transaction events from a payments system, telemetry from production applications. A Spark job or lightweight consumer processes those events and writes structured records as Delta or Apache Iceberg™ tables directly into MinIO AIStor. Alongside the streaming path, operational databases like SQL Server or Postgres feed batch workloads into AIStor on scheduled intervals.

AIStor sits at the center of this architecture as the on-premises data platform, holding both unstructured objects and structured open-format tables as first-class citizens. Everything is queryable. Everything is governed. None of it goes anywhere.

This is not a staging area waiting for data to be promoted to the cloud. For many organizations, this is the system of record. The question has never been whether to keep data here. The question has been how to give Databricks a way to reach it without compromising the properties that kept it on-premises in the first place.

Enabling Databricks to Access On-Premises Data

AIStor Table Sharing builds the open OpenSharing protocol directly into the data platform. There is no separate sharing service to deploy or maintain. There is no replication job to schedule or monitor. When a table is shared, it stays exactly where it is.

As a founding member of the new Databricks Software-Defined Storage Ecosystem, it connects on-premises and hybrid storage platforms to the Databricks Data Intelligence Platform via the open OpenSharing protocol. The ecosystem reflects a structural shift in how enterprises think about data: from moving everything to the cloud, to governing it wherever it lives.

The connection is straightforward. A Databricks administrator imports a small JSON credential file containing the AIStor endpoint URL and a scoped bearer token with configurable expiration. They mount the share as a catalog in Unity Catalog. From that point, the on-premises tables appear in the Catalog Explorer alongside native Databricks catalogs. Any Databricks user with permission can run standard SQL against them. The data never leaves the on-premises environment, meaning it is never duplicated to the cloud.

*AIStor Table Sharing opens the protocol boundary. Databricks mounts the shared catalog in Unity Catalog and queries on-premises tables directly. The data never moves.*

Here is what this looks like in a live Databricks environment.

*The AIStor catalog appears in Databricks Unity Catalog under Delta Shares Received, alongside native catalogs. Type: Delta Sharing. State: Active.*

No special connectors are required, and no configuration is needed on the Databricks side beyond importing the credential file.

Two things matter to a technical audience evaluating this architecture.

First, Unity Catalog governance applies in full. Access controls, audit logs, and column-level permissions work exactly as they do for native Databricks catalogs. The external catalog participates in the governance model; it does not sit outside it. Data classification and predictive optimization settings are visible and configurable. The on-premises data becomes a governed data product within the Databricks environment, not an unmanaged external reference.

Second, AIStor Table Sharing works natively with both Delta and Apache Iceberg™ tables, with no format conversion, migration, or forced standardization required. Organizations running mixed-format environments do not need to choose one before they can connect.

As Stephen Orban, SVP of Product Ecosystem and Partnerships at Databricks, put it: "By natively integrating OpenSharing, MinIO enables enterprises to securely connect on-premises data to the Databricks Data Intelligence Platform without complex replication, accelerating time-to-insight for hybrid workloads."

Why Architecture Matters

The Databricks Software-Defined Storage Ecosystem brings several approaches to connecting on-premises data to Databricks. AIStor Table Sharing builds the open OpenSharing protocol natively into the storage layer, providing a live, continuous, zero-copy read path to on-premises tables with no additional tooling to deploy or operationalize.

For the data described in this post, that distinction is meaningful. Data that cannot move needs an architecture that never asks it to. Data that loses value in transit needs access that is live at the moment of the query, not a snapshot from the last scheduled sync. Data that would be degraded by replication needs a path that never touches it in transit at all.

No interval. No lag. No copy.

What Zero Copy and Zero Pipelines Enable

Removing data gravity as a constraint is not an incremental improvement. It changes which workloads are viable.

AI and machine learning against complete datasets. The life sciences company described at the opening of this post manages more than 20 petabytes of research data, with terabytes added daily from automated laboratory systems. Training models against a sampled or replicated subset of that data is not a temporary compromise. It is a permanent ceiling on model quality, built into the architecture by necessity. With AIStor Table Sharing, Serverless workloads can run directly against the dataset as it exists on-premises. The ceiling is gone. For the pattern this represents in production, see our life sciences case study.

Real-time operational analytics without pipeline lag. Streaming data written continuously to AIStor is queryable in Databricks as soon as it lands. There is no pipeline introducing latency, no sync job to wait on, no snapshot that is already stale by the time the query runs. For manufacturing, financial services, and telco environments where decisions are made on current data, this is the difference between analytics that inform action and analytics that describe what already happened.

Regulated and sovereign workloads, finally reachable. The data that could never move now has a path to cloud analytics. Compliance requirements stay intact because the data does not leave. Data residency mandates are satisfied. Sensitive records are never replicated, masked, or tokenized to accommodate a pipeline. The data that was never an option for Databricks is now queryable, on its own terms, without leaving the premises.

The Architecture That Was Always Implied

For years the implicit bargain was: move your data to the cloud, and governance follows. The Databricks Software-Defined Storage Ecosystem reflects a different model: govern data where it lives. AIStor Table Sharing is what makes that possible for on-premises data, connecting your existing data platform directly to Unity Catalog without moving a file.

The architecture that hybrid analytics always implied is now the architecture you can actually build.

For a deeper look at the technical mechanics, see Unlocking On-Premises Data for Databricks: Secure, Zero-Copy Sharing with AIStor Table Sharing. To see it in action, watch the on-demand webinar with Databricks. To run it in your own environment, visit min.io/product/aistor/delta-sharing. And visit booth 527 at Data + AI Summit next week to see on-premises sharing with Databricks live for yourself.

Whether you're exploring AI-native object storage or planning your next deployment, we'd love to help.

Let's start a conversation — or jump right in and try AIStor yourself.

Download AIStor