Clean Rooms + ML: Collaborating on Data Without Sharing It

The data collaboration problem has always been tricky. Companies want to build better models together, but sharing sensitive customer data feels risky. What if there were a way to train machine learning models on combined datasets without anyone actually seeing the raw data?

AWS Clean Rooms ML just added Parquet file format support, and this seemingly small update points to something bigger: we’re seeing the early stages of privacy-preserving collaborative analytics becoming mainstream.

The Trust Problem in Data Sharing

Traditional data partnerships require enormous trust. Company A sends their customer data to Company B, hoping it won’t be misused. Company B worries about the same thing in reverse. Legal teams draft complex agreements, data gets anonymized (sometimes poorly), and everyone crosses their fingers.

Clean Rooms flip this model. Instead of sharing data, partners create a secure environment where computations happen on combined datasets, but no one gets direct access to anyone else’s raw information. Think of it as a black box where data goes in, insights come out, but the individual records stay private.

What Changed with Parquet Support

Parquet isn’t just another file format. It’s a columnar storage format designed for analytics workloads, offering better compression and faster query performance than traditional row-based formats like CSV. More importantly for Clean Rooms ML, Parquet handles non-text data types naturally.

This matters because real-world machine learning doesn’t just work with text and numbers. Companies want to train models on images, audio files, sensor data, and other binary formats. Before this update, getting that data into Clean Rooms required workarounds and preprocessing steps that added complexity and potential security gaps.

Now teams can work with their data in its native format. A retail partnership might combine purchase history (structured data) with product images (binary data) to build better recommendation systems, all without either company exposing their proprietary datasets.

Privacy-Preserving ML in Practice

The mechanics work like this: each partner uploads their data to their own Clean Rooms ML workspace. They define what computations are allowed, set privacy controls, and specify what outputs can be shared. The system trains models using techniques like federated learning, where the algorithm learns from distributed datasets without centralizing the data.

Consider a fraud detection scenario. Three banks want to improve their models by learning from each other’s transaction patterns, but obviously can’t share customer financial data. With Clean Rooms ML, they can:

- Train a shared model that learns from all three datasets simultaneously.
- Keep each bank’s data under its own control while the model benefits from the larger, more diverse training set.
- Receive model insights and predictions, not raw transaction data.
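AWS doesn’t publish the Clean Rooms ML internals, but the federated pattern described above can be sketched in plain Python: each party fits a model on its own data, and only the fitted parameters (never the transactions) are pooled. Everything below, including the toy data, is illustrative.

```python
# Minimal federated-averaging sketch -- illustrative only, not the actual
# Clean Rooms ML protocol. Each "bank" fits a one-feature linear model
# locally; only (slope, intercept) pairs ever leave the premises.

def local_fit(transactions):
    """Least-squares slope and intercept for (amount, fraud_score) pairs."""
    n = len(transactions)
    mx = sum(x for x, _ in transactions) / n
    my = sum(y for _, y in transactions) / n
    cov = sum((x - mx) * (y - my) for x, y in transactions)
    var = sum((x - mx) ** 2 for x, _ in transactions)
    slope = cov / var
    return slope, my - slope * mx

def federated_average(updates):
    """Pool model parameters, not data: unweighted federated averaging."""
    slopes, intercepts = zip(*updates)
    return sum(slopes) / len(slopes), sum(intercepts) / len(intercepts)

# Each bank's raw data stays local.
bank_a = [(100, 0.1), (200, 0.2), (300, 0.3)]
bank_b = [(100, 0.2), (200, 0.4), (300, 0.6)]
bank_c = [(100, 0.3), (200, 0.6), (300, 0.9)]

updates = [local_fit(data) for data in (bank_a, bank_b, bank_c)]
slope, intercept = federated_average(updates)
```

The shared model reflects all three datasets, yet no party ever sees another’s records, which is the core privacy property the article describes.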

The Technical Foundation

Parquet’s columnar structure aligns well with machine learning workflows. Models typically process features (columns) rather than individual records (rows), so columnar storage reduces the amount of data that needs to be read from disk. This becomes critical when training on large datasets across multiple organizations.

The format also supports schema evolution, meaning partners can add new data types or features over time without breaking existing collaborations. A marketing partnership might start by analyzing basic demographic data, then gradually incorporate behavioral signals, purchase history, and engagement metrics as trust builds.

Where This Gets Interesting

We’re seeing similar privacy-preserving patterns emerge across the industry. Apple’s differential privacy, Google’s federated learning, and Microsoft’s confidential computing all tackle the same fundamental challenge: how do you extract value from sensitive data without compromising privacy?

Clean Rooms ML represents the enterprise version of this trend. While consumer-focused privacy tech often sacrifices some accuracy for privacy guarantees, business applications need both strong privacy protections and reliable model performance.

The Parquet update suggests AWS is betting on this approach becoming standard for B2B data collaboration. Supporting more data types and improving performance signals they expect significant adoption in scenarios where traditional data sharing agreements are too risky or legally complex.

What to Watch For

As these tools mature, we’ll likely see new collaboration patterns emerge. Cross-industry partnerships that were previously impossible due to regulatory constraints might become feasible. Healthcare organizations could collaborate with technology companies on diagnostic models without sharing patient records. Financial institutions could work with retailers on fraud detection without exposing customer transactions.

The key indicator will be whether companies start structuring their data strategies around collaborative analytics from the beginning, rather than treating it as an afterthought. Organizations that design their data architecture with Clean Rooms in mind will have advantages in forming strategic partnerships.

For now, AWS Clean Rooms ML with Parquet support gives teams a more practical way to experiment with privacy-preserving collaboration. The technology is becoming less exotic and more operational, which usually means wider adoption is coming.

The future of data collaboration might not involve sharing data at all.