What Happens When Databricks Can Query Your On-Premises Data Directly

Use cases for manufacturing, financial services, and regulated industries

Most enterprises already know why their critical data stays on-premises. The reasons are well documented: some data is too time-sensitive to survive the latency of replication. Some data is too voluminous to replicate economically. Some data is bound by regulation or policy to stay exactly where it is.

These aren't new problems. What's new is that they no longer have to mean the same thing they used to.

Until recently, data that stayed on-premises was data that Databricks couldn't reach. If your analytics and AI workloads ran in Databricks, and your most valuable data lived on-prem, you had two options: build and maintain a replication pipeline to copy data into the cloud, or accept that certain datasets simply wouldn't participate in your cloud analytics.

Both options carry real costs. But a third option now exists: Databricks querying on-premises data directly, with no copies and no pipelines, through the open Delta Sharing protocol embedded natively in MinIO AIStor.

The architecture has been covered in depth elsewhere. Denis Dubeau of Databricks and I wrote about the rationale for embedding Delta Sharing at the storage layer, and I covered the implementation, AIStor Table Sharing, in a companion post. This post asks a different question: what actually changes for the teams in manufacturing, financial services, and regulated industries that depend on that data?

Three Teams, Before and After

The impact shows up differently depending on why the data stayed on-premises in the first place. Here are three scenarios that reflect the three most common constraints, and what shifts when those constraints stop being blockers to cloud analytics.

Manufacturing Quality Analytics: When Hours Matter

A semiconductor manufacturer runs high-volume production lines that generate continuous sensor and telemetry data. That data feeds quality models that detect defect patterns, predict equipment drift, and flag yield anomalies.

Before zero-copy sharing, the plant engineering team exported production data nightly to a cloud staging area. The data science team ran their Databricks models the following morning against data that was already 12 to 24 hours stale. If a model flagged an anomaly, the shift that produced the defect pattern had already ended. Root cause analysis started a day late. Corrective action was always retroactive.

With AIStor Table Sharing, the data science team queries live production data directly from their Databricks notebooks. The model sees current conditions, not yesterday's export. A quality issue flagged during the morning shift can be investigated during the same shift. The feedback loop between production and analytics tightens from days to hours.
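As a rough sketch of what that looks like in a notebook, a query against the shared catalog might resemble the following. The catalog, schema, table, and column names are illustrative, not the manufacturer's actual schema; the point is that the shared on-prem tables read like any other Unity Catalog table.

```python
# Illustrative only: catalog, schema, table, and column names are hypothetical.
# `spark` and `display` are provided by the Databricks notebook environment.
from pyspark.sql import functions as F

readings = spark.read.table("aistor_prod.telemetry.line_sensor_readings")

# Because the query runs against live on-prem data, the window can be
# "the last four hours" instead of "yesterday's export".
recent = (
    readings
    .filter("event_time >= current_timestamp() - INTERVAL 4 HOURS")
    .groupBy("line_id", "tool_id")
    .agg(
        F.avg("defect_rate").alias("avg_defect_rate"),
        F.max("defect_rate").alias("peak_defect_rate"),
    )
    .orderBy(F.desc("peak_defect_rate"))
)

display(recent)
```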

The data didn't change. The models didn't change. What changed is that the models can now see what's happening now, not what happened yesterday.

Financial Services Risk Modeling: When Scale Blocks Scope

A financial services firm maintains hundreds of terabytes of transaction history on-premises. That data is the foundation for risk models, fraud detection, and regulatory reporting. It stays on-prem because the economics of replicating it to the cloud don't work at that scale.

Before zero-copy sharing, the risk team maintained a replication pipeline that copied a subset of transaction data to Databricks. The subset was always a compromise. Too much data and the pipeline costs were prohibitive. Too little and the models missed edge cases. Every expansion of scope meant expanding the pipeline: more storage, more engineering time, more breakage surface. The risk team's analytical ambition was capped by the capacity of their sync infrastructure.

With AIStor Table Sharing, the full transaction history is queryable from Databricks without replication. Risk models run against complete datasets instead of sampled subsets. There is no pipeline to maintain and no duplicated storage to pay for. Expanding the scope of analysis means writing a broader query, not rebuilding infrastructure.
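To make that concrete, here is a hedged sketch, with hypothetical table and column names, of what "writing a broader query" means once the full history is shared in place:

```python
# Illustrative only: aistor_core.ledger.transactions and its columns are hypothetical.
from pyspark.sql import functions as F

transactions = spark.read.table("aistor_core.ledger.transactions")

# The old pipeline might have carried only a recent slice:
recent_slice = transactions.filter("txn_date >= date_sub(current_date(), 90)")

# Widening the analysis to the full history is a query change, not a new sync job:
exposure_by_counterparty = (
    transactions
    .groupBy("counterparty_id")
    .agg(
        F.sum("amount").alias("total_exposure"),
        F.count("txn_id").alias("txn_count"),
    )
)
```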

The constraint wasn't the analytics platform. The constraint was the cost and complexity of getting data to the analytics platform. Remove that, and the scope of what risk teams can analyze changes fundamentally.

Regulated Industries: When Data Cannot Leave

A pharmaceutical company's clinical trial data, a defense contractor's classified telemetry, a government agency's citizen records. These datasets share a common trait: regulatory or policy requirements prohibit them from leaving the on-premises environment.

Before zero-copy sharing, these organizations faced a stark choice. They could run analytics on a separate, often limited on-prem platform. Or they could forgo the analysis entirely. Databricks was available to the organization for other workloads, but it couldn't see the data that mattered most. The most sensitive data was often the most analytically valuable, and the hardest to reach.

With AIStor Table Sharing, Databricks queries the data in place. The data never crosses the perimeter. The compliance officer's requirement is satisfied by architecture, not by a policy exception that introduces risk. And the same governance and audit capabilities that apply to cloud-native Databricks catalogs, including access controls, audit logging, and column-level permissions, apply equally to the shared on-prem catalog.

The data still doesn't leave. But it's no longer invisible to the teams that need it.

This Is Incremental, Not Substitution

Each of these scenarios has something in common: they don't describe replacing an existing workflow. They describe unlocking a new one.

The data science team at the manufacturer wasn't running quality models on live data before; they were running them on stale exports. The risk team at the financial services firm wasn't analyzing the full transaction history; they were analyzing whatever the pipeline could carry. The regulated organization wasn't running Databricks on its controlled data at all.

In every case, the workloads that AIStor Table Sharing enables are workloads that weren't running before. For organizations, that means new analytical capabilities without new infrastructure. For data teams, it means running Databricks against datasets that were previously out of reach.

This is already happening at enterprise scale. A Fortune 500 semiconductor manufacturer is running Databricks AI and analytics workloads across hundreds of terabytes of on-premises production data spanning multiple manufacturing facilities, all through AIStor Table Sharing. No replication. No pipelines. Live data, queried in place.

The pattern is straightforward: every on-premises dataset that becomes accessible to Databricks represents analytical work that wasn't happening before. That's not a migration story. It's an expansion story.

Security and Governance

Governance is enforced at both ends.

On the AIStor side, administrators control which tables are shared, authenticate recipients with bearer tokens or OAuth2, set token expiration policies, and manage access through AIStor's own IAM policies and access logs. The data never leaves the on-premises environment, and access is read-only by design.
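For a sense of what that looks like from a recipient's perspective, here is a minimal sketch using the open-source delta-sharing connector. The endpoint URL and token are placeholders; the profile format itself (endpoint, bearer token, expiration) comes from the open Delta Sharing protocol.

```python
# Minimal sketch, assuming a token issued by the AIStor administrator.
# pip install delta-sharing
import json
import delta_sharing

profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://aistor.example.internal/delta-sharing",  # placeholder URL
    "bearerToken": "<token issued by the AIStor administrator>",  # placeholder
    "expirationTime": "2025-12-31T23:59:59Z",  # tokens expire by policy
}
with open("aistor.share", "w") as f:
    json.dump(profile, f)

# Listing is read-only: the protocol exposes shares, schemas, and tables, nothing more.
client = delta_sharing.SharingClient("aistor.share")
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)
```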

On the Databricks side, Unity Catalog applies its own governance layer to shared catalogs the same way it does to native catalogs: role-based access controls, audit logging, and column-level permissions. The result is two complementary security boundaries, one at the storage layer and one at the compute layer, with no gap between them. For the full treatment of how AIStor Table Sharing handles identity, encryption, and lifecycle management, see the implementation post.
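As a sketch of the Databricks side, the grants below use standard Unity Catalog SQL; the catalog, schema, table, and group names are hypothetical, and the point is simply that the shared catalog is governed with the same statements as a native one.

```python
# Illustrative only: catalog, schema, table, and group names are hypothetical.
# Run from a Databricks notebook; `spark` is provided by the environment.
spark.sql("GRANT USE CATALOG ON CATALOG aistor_prod TO `quality-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA aistor_prod.telemetry TO `quality-analysts`")
spark.sql(
    "GRANT SELECT ON TABLE aistor_prod.telemetry.line_sensor_readings "
    "TO `quality-analysts`"
)
```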

Go Deeper

This post is the third piece of a three-part picture. Denis Dubeau's Medium post explains the architectural rationale for embedding Delta Sharing at the storage layer. The MinIO blog post explains how AIStor Table Sharing implements it. This post explains what it means in practice for the people doing the work.

To see it live, join Denis Dubeau of Databricks, Dwight Evers, and me for our webinar, From On-Prem to Insight: Secure, Zero-Copy Analytics with Databricks and MinIO AIStor. We'll demo Databricks querying on-prem data in AIStor with no data movement and dig into what this architecture means for hybrid data strategies.

The infrastructure exists today. The question worth asking is what your teams could do with it.

Whether you're exploring AI-native object storage or planning your next deployment, we'd love to help.
Let's start a conversation or jump right in and try AIStor yourself.
