PLAYBOOK · $7.5M · 24 Apr 2025 · 4 MIN

The Pick-and-Shovel Play for the AI Training Data Gold Rush

When model architectures converge, the winner is whoever trains on the best dataset – making data infrastructure the most reliable bet in the stack.

XETDATA · xetdata.com

KEY METRIC · $7.5M

LATEST ROUND · $7.5M · 09.01.2023

TOTAL RAISED · $7.5M · 1 round

OPPORTUNITY SNAPSHOT · build adjacent

ENTRY ANGLES

Data platforms with built-in integrity checking that don't constrain experimentation · Multi-tier interfaces serving both non-technical users and data engineers · Data management solutions for AI training data workflows

VERTICALS

AI/Machine Learning · Data science and engineering

CAPABILITIES

Data integrity and validation systems, User experience design for multiple skill levels, Data platform infrastructure and tooling

XETDATA FOUNDER

“Across the entire organization”

01 /The Concept

GitHub and GitLab became indispensable for developers by solving the same problem from multiple angles – version control, parallel branching, and collaborative code review – until working with code productively became an assumed baseline, not a differentiator.

But the AI boom has introduced an interesting twist. Different projects might use identical model architectures and identical training code – yet the winner will be whoever trains on the "right" dataset: large enough, representative enough, clean enough.

In other words, data is becoming at least as valuable as code. And managing it deserves the same rigor.

XetData built XetHub to do exactly that – apply the same principles GitHub/GitLab use for code to datasets, as a direct extension of standard git workflows.

All the familiar git primitives are there: version history for datasets, edit attribution, branching, collaboration. But the key play is performance: XetHub lets you mount a 50 GB remote dataset to your local filesystem in seconds.

A second major feature is built-in data visualization – tooling that lets you eyeball data quality and roughly assess which processing approaches are worth pursuing.

One particularly important capability: XetHub supports synchronized versioning of both code and data. That means you can safely modify code alongside modified datasets without worrying that a given version of the code ends up paired with the wrong version of the data in production.
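
To make the pairing idea concrete, here is a minimal Python sketch. It is not XetHub's implementation or API; it only assumes that a single snapshot records one content hash for the code and one for the data, so a mismatched pair can be detected before it ships. The names `digest`, `Snapshot`, `commit`, and `verify` are hypothetical, chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping


def digest(files: Mapping[str, bytes]) -> str:
    """Content hash over a set of files, independent of insertion order."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode("utf-8"))
        h.update(b"\0")  # separator between the path and the content hash
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record that pins one code state to one data state."""
    code_digest: str
    data_digest: str

    @property
    def commit_id(self) -> str:
        # A short identifier derived from both digests together.
        joined = f"{self.code_digest}:{self.data_digest}".encode("utf-8")
        return hashlib.sha256(joined).hexdigest()[:12]


def commit(code: Mapping[str, bytes], data: Mapping[str, bytes]) -> Snapshot:
    """Record the current code and data as a single paired snapshot."""
    return Snapshot(digest(code), digest(data))


def verify(snapshot: Snapshot, code: Mapping[str, bytes], data: Mapping[str, bytes]) -> bool:
    """True only if both the code and the data match the snapshot."""
    return snapshot == commit(code, data)


# Usage: the same code paired with a changed dataset no longer verifies.
code_v1 = {"train.py": b"print('train v1')"}
data_v1 = {"train.csv": b"x,y\n1,2\n"}
data_v2 = {"train.csv": b"x,y\n1,2\n3,4\n"}

snap = commit(code_v1, data_v1)
print(verify(snap, code_v1, data_v1))  # True
print(verify(snap, code_v1, data_v2))  # False
```

The design choice worth noting is that the snapshot identifier depends on both digests at once, so there is no way to reference "code at version A" without also fixing which data it was paired with.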

Up to 20 GB of data can be stored for free – though fine-grained access controls (typically needed inside organizations) are reserved for paid tiers. Enterprise storage caps out at 10 TB for now, with a 100 TB increase promised soon.

XetHub has only just entered open beta, yet it has already closed a $7.5M seed round.

02 /Why It Matters

AI is advancing fast enough and delivering enough measurable quality improvements across traditional industries that every software project will eventually be an AI project.

For those projects, data becomes the new gold. That gold needs to be carefully protected – but also easy to access and use across the entire organization.

Data storage platforms are a breakout trend, and XetHub is one of the first movers. It won't be the last.

"Across the entire organization" is the key phrase. Data won't just be the province of engineers – marketers, product managers, and a long list of others will need to work with it too. Every business function will eventually need to be data-driven.

XetData gets this. Their stated target audiences include not just data engineers and developers, but also managers who need access to data.

The current platform, though, sells itself on the git-native interface – which is already familiar to technical users but foreign to most managers who aren't engineers.

That's where significant untapped potential lives. How do you build a data platform that's both technically rigorous for engineers and genuinely approachable for non-technical users? These sound like contradictory requirements. But they're not. Think of macOS – built on the geeky BSD underpinnings that power users can still access via Terminal, but wrapped in an interface that millions of non-technical people find intuitive and beautiful.

Something like macOS for data will eventually have to exist.

03 /Opportunities

The general direction: powerful, fast, and genuinely usable data platforms.

Platforms that handle data integrity without sacrificing the ability to experiment freely. Platforms that expose multiple interface tiers – simple enough for non-technical users, deep enough for data engineers.

Today's XetHub is one early step in that direction – a useful reference point for finding your own path.

The demand is already forming, and it will only intensify. That much seems clear.
