As more companies invest in generative AI (gen AI) for bespoke use cases and products, proprietary data is becoming increasingly important for training large language models (LLMs). Unlike ChatGPT, which was trained on billions of public data points scraped from the internet (emails, scripts, social media, papers), enterprise gen AI often needs to be tailored to a business’s own customer data.
However, data from real customers can contain personally identifiable information (PII), making it a privacy risk to use. That’s where structured synthetic data company Mostly AI comes in.
On Tuesday, the company launched a synthetic text feature that automates the process of generating synthetic data while preserving the patterns of the user’s original dataset.
By using synthetic data to train models, Mostly AI aims to help businesses avoid privacy risks without sacrificing the insights that customer data such as emails, support transcripts, and chatbot exchanges can reveal. According to the company, synthetic data can also represent more diversity than the original data.
Beyond privacy, other use cases include rebalancing a dataset to tailor it to a model or remove bias and generating mock data for software testing.
How it works
Companies upload their proprietary dataset to Mostly AI generators, which are privacy-protected reusable bundles that include metadata from the original data. Users can upload data from their local device or another external source and fine-tune their generator on Mostly AI’s platform.
Once they have confirmed the correct configuration and encoding types, users select the Mostly AI models they’d like to use, then choose from several language models, including pre-trained options from Hugging Face.
What emerges is a privacy-protected, synthesized version of the data that preserves its original statistical patterns.
This setup helps train an enterprise’s generator. Users can then compare the synthetic data with the original data using the model’s reports to check accuracy.
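To make the comparison step concrete, here is a minimal sketch of one way a synthetic column could be checked against the original. It does not use Mostly AI's platform or reproduce its reports; the function names and the ticket categories are hypothetical, and total variation distance is only one of many possible fidelity measures.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute difference in category shares.

    0.0 means the two distributions match; 1.0 means they share no categories.
    """
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    labels = set(real) | set(synthetic)
    return 0.5 * sum(
        abs(real.get(label, 0.0) - synthetic.get(label, 0.0)) for label in labels
    )


# Hypothetical support-ticket categories, made up for illustration.
real_tickets = ["billing", "billing", "login", "shipping"]
synthetic_tickets = ["billing", "login", "login", "shipping"]

distance = total_variation_distance(real_tickets, synthetic_tickets)
print(f"Total variation distance: {distance:.2f}")
```

A lower score suggests the synthetic column tracks the original more closely. A real evaluation would presumably look at many columns and their relationships, not a single categorical field.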
Mostly AI says its datasets look “just as real as a company’s original customer data with just as many details, but without the original personal data points – helping companies comply with privacy protection regulations such as GDPR and CCPA.” The company added that its synthetic text “delivers performance improvement as much as 35% compared to text generated by prompting GPT-4o-mini providing either no or just a few real-world examples.”
So, is synthetic data really the future of AI?
A Gartner report from April found that synthetic data has unrealized potential in software engineering but cautioned that it must be deployed carefully. Creating synthetic data can be resource-intensive, and using it effectively requires specific testing stages for each use case.
“Today, AI training is hitting a plateau as models exhaust public data sources and yield diminishing returns,” Mostly AI CEO Tobias Hann said in the release. “To harness high-quality, proprietary data, which offers far greater value and potential than the residual public data currently being used, global enterprises must take the leap and leverage both structured and unstructured synthetic data to safely train and deploy forthcoming generative AI solutions.”
A common concern is that the AI bubble is about to burst, in part because models are running out of publicly available data to ingest. While that’s technically not true (any human activity can be data; it may simply not be codified, collected, structured, and free), the need for more usable data to train models is real. After all, it’s much easier (and cheaper) to get really good at generating synthetic data than it is to digitize messy pages of handwritten notes. Even Meta used both human and synthetic data to train Llama 3.1 405B.
But what about model collapse — the idea that models deteriorate once they’ve ingested too much synthetic data?
Mostly AI said in an email to ZDNET that it avoids this possibility because “the synthetic data is generated once and directly applied to downstream tasks,” rather than used to repeatedly train the models.
It remains to be seen whether increased use of synthetic data across industries creates a bigger-picture threat of model collapse. Until then, enterprises interested in Mostly AI’s tool can visit its website.