TollBit lets publishers set prices for AI crawler access – turning content scraping into a revenue stream before the window closes.
ENTRY ANGLES
Licensing platform for AI training data that serves both content owners and AI developers · Focus on deep web content aggregation and licensing · First-mover strategy targeting premium content partnerships
CAPABILITIES
Content licensing and rights management infrastructure, Deep web content discovery and indexing, Two-sided marketplace platform development
"Bots are already here. They're scraping your content," TollBit warns publishers. The bots in question are crawlers deployed by AI developers to pull training data from websites across the internet.
"But we can help you get paid for it," the startup promises site owners.
The way it works: site owners set prices on the TollBit platform for bot access to their content. Different content types can carry different rates – recent news or exclusive content might be priced higher than archival material. Rates can also vary by AI developer, depending on how well-funded the site owner judges each developer to be.
The platform supports flexible, dynamic pricing. The most obvious model is volume-based: the price per request changes with the number of requests a given bot sends. Rates can also be keyed to specific search terms used in those requests.
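To make the pricing dimensions concrete, here is a minimal Python sketch of how such a schedule might be modeled. It is not TollBit's actual implementation: the `PriceSchedule` class, its field names, the bot name, and all the rates are hypothetical, and whether volume pushes the per-request price up or down is an assumption left to the multipliers the site owner picks.

```python
from dataclasses import dataclass, field


@dataclass
class PriceSchedule:
    """Hypothetical per-request price schedule set by a site owner (USD)."""

    # Content type -> base price per request.
    base_rates: dict
    # AI developer -> multiplier on the base price (default 1.0).
    developer_multipliers: dict = field(default_factory=dict)
    # (minimum monthly requests, multiplier) pairs for volume-based pricing.
    volume_tiers: list = field(default_factory=list)
    # Search term -> flat surcharge per request that contains the term.
    term_surcharges: dict = field(default_factory=dict)

    def quote(self, content_type, developer, monthly_requests, query=""):
        """Return the price of one request under this schedule."""
        price = self.base_rates[content_type]
        price *= self.developer_multipliers.get(developer, 1.0)

        # Use the multiplier of the highest tier whose threshold is reached.
        tier_multiplier = 1.0
        for threshold, multiplier in sorted(self.volume_tiers):
            if monthly_requests >= threshold:
                tier_multiplier = multiplier
        price *= tier_multiplier

        # Add a surcharge for every priced search term in the query.
        words = set(query.lower().split())
        price += sum(s for term, s in self.term_surcharges.items() if term in words)
        return round(price, 6)


# Illustrative numbers only.
schedule = PriceSchedule(
    base_rates={"news": 0.01, "archive": 0.002},
    developer_multipliers={"bigco-bot": 2.0},
    volume_tiers=[(0, 1.0), (10_000, 1.5)],
    term_surcharges={"earnings": 0.005},
)
news_price = schedule.quote("news", "bigco-bot", 20_000, "quarterly earnings")
archive_price = schedule.quote("archive", "small-lab-bot", 10)
```

In this sketch a well-funded developer pulling fresh news at high volume pays several times the archive rate, which is the kind of differentiation described above.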
On the technical side, TollBit blocks any bot it can identify – by user-agent string or by the IP ranges of the servers sending requests. Access is only granted to bots presenting a valid token. That token is issued by the platform once the AI developer has signed a licensing agreement with the site owner.
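The gating logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TollBit's code: the user-agent markers, the token value, and the function names are hypothetical, the IP range is a reserved documentation range, and answering blocked bots with HTTP 402 is a choice made for this example only.

```python
import ipaddress

# Hypothetical substrings that identify known crawler user agents.
BLOCKED_AGENT_MARKERS = ("examplebot", "samplecrawler")
# Hypothetical crawler IP range (203.0.113.0/24 is reserved for documentation).
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
# Hypothetical tokens issued after a licensing agreement is signed.
VALID_TOKENS = {"tok_demo_123"}


def is_known_bot(user_agent, remote_ip):
    """Identify a bot by user-agent string or by source IP range."""
    agent = user_agent.lower()
    if any(marker in agent for marker in BLOCKED_AGENT_MARKERS):
        return True
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


def decide(user_agent, remote_ip, token=None):
    """Return an HTTP status code: 200 to serve the page, 402 to block."""
    if not is_known_bot(user_agent, remote_ip):
        return 200  # ordinary visitors pass through untouched
    if token in VALID_TOKENS:
        return 200  # licensed bot presenting a valid token
    return 402  # identified bot without a license (status code is an assumption)


human = decide("Mozilla/5.0", "198.51.100.7")
unlicensed = decide("ExampleBot/1.0", "198.51.100.7")
licensed = decide("ExampleBot/1.0", "203.0.113.9", token="tok_demo_123")
```

Note the limit this implies: a bot that disguises its user agent and sends requests from an unlisted IP range passes as an ordinary visitor, which is why the text says the platform blocks any bot "it can identify".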
Authorized requests are tracked and tallied, and at the end of each billing period the platform automatically generates and sends invoices to the AI developers. It also enforces rate limits and ensures total request volumes stay within agreed caps.
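A rough sketch of the metering side, again in Python and again hypothetical: the `Meter` class, the flat per-request price in cents, and the single per-period cap are simplifying assumptions, not a description of TollBit's billing system.

```python
from collections import defaultdict


class Meter:
    """Hypothetical request meter with a per-period cap and flat pricing."""

    def __init__(self, price_cents, request_cap):
        self.price_cents = price_cents  # price per authorized request, in cents
        self.request_cap = request_cap  # agreed maximum requests per period
        self.counts = defaultdict(int)  # developer -> requests this period

    def record(self, developer):
        """Count one authorized request; refuse it once the cap is reached."""
        if self.counts[developer] >= self.request_cap:
            return False
        self.counts[developer] += 1
        return True

    def close_period(self):
        """Return invoice totals in cents per developer and reset the counters."""
        invoices = {
            developer: count * self.price_cents
            for developer, count in self.counts.items()
            if count
        }
        self.counts.clear()
        return invoices


meter = Meter(price_cents=1, request_cap=3)
results = [meter.record("dev-a") for _ in range(5)]
invoices = meter.close_period()
```

Requests beyond the cap are refused rather than billed, so the invoice at the end of the period never exceeds what the agreement allows.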
Connecting the platform reportedly takes 15 minutes. After that, it runs on autopilot: it automatically identifies bots attempting to scrape and sends them a notification with a ready-to-sign licensing agreement and pricing schedule.
TollBit was founded last year, has already signed its first publisher customers, and has now raised $7M.
Trouble for AI developers started in December of last year, when the New York Times sued OpenAI and Microsoft for using "millions of Times articles" to train their models without authorization.
By January, it emerged that OpenAI was in active negotiations with CNN, Fox, Time, and a dozen other publishers over content licensing deals.
In February, Google struck a deal with Reddit to license the platform's user-generated content for AI training – reportedly worth around $60M.
Around the same time, Reddit filed for its IPO. Its prospectus revealed that total data licensing agreements on the books amounted to $203M, with $66.4M expected to convert into actual revenue by the end of 2024.
AI models are only as good as the data they're trained on. And as models get more sophisticated, their appetite for data grows. Data collected by Reuters shows that AI training data requirements have quadrupled since 2022.
TollBit is targeting a large and fast-moving market that AI development essentially created from nothing.
One important caveat: the platform's current functionality covers only public web pages – the ones any bot can crawl by default.
The real opportunity lies beyond that: publicly accessible pages represent just 4% of all data that actually exists on the internet. The other 96% is out of reach for standard crawlers.
About 6% of all internet data lives on the dark web – deliberately hidden and encrypted, including private communications and various categories of illegal content.
But 90% – the vast majority – lives in what's called the deep web: perfectly legal content that's simply inaccessible to crawlers. Password-protected pages behind paywalls. Databases where information only surfaces in response to specific search queries – which would take a bot an enormous amount of time to enumerate systematically.
Owners of subscription platforms and databases might be quite willing to license that data for additional revenue. But a platform built for that use case needs to handle permissioned access and metered billing for structured, gated content – which is technically far more complex than blocking or allowing crawlers on public pages.
That means the licensing infrastructure for roughly 90% of the internet's data still hasn't been built.
Content licensing for AI training is a timely and rapidly growing space, driven by the sheer pace of AI development.
The direction for builders: a licensing platform that's easy for both content owners and AI developers to use – covering not just the public web but the deep web, where the most valuable and unique content lives.
The deep web segment is the real prize. It's where content is densest, most curated, and impossible to reach any other way.
But if this direction is appealing, speed matters enormously. These markets consolidate quickly. Publishers will integrate with one platform and stick with it once they've chosen. The competitive dynamic heavily rewards the first to sign meaningful content partners – because AI developers will follow the aggregators who have already assembled the best content libraries.
The key growth move: get to the best content owners first. After that, the AI side takes care of itself.