As real-world data dries up or becomes too messy to use, synthetic data steps in for fields ranging from agentic AI to chemistry labs
When Google unveiled its latest world model, Genie 3, it felt like the dawn of a new way to train physical AI.
Instead of piecing together messy, hard-to-label real-world footage, the model generates its own clean, playable worlds from scratch. It’s like giving robots a custom sandbox where they can learn the rules of physics before stepping into reality. You could, in theory, spin up any environment you want and teach a robot how to move, react, and adapt inside it.
Since that launch, world models have become the talk of the AI town. But Genie 3 is just one piece of a bigger shift that is quietly reshaping how data itself is made and used.
A new kind of data shift
From startups working in niche domains like healthcare and material science to giants like Databricks, more teams are betting on synthetic data — building the data they need when it doesn’t exist yet.
“Synthetic data is one of the most effective ways to increase model performance and reliability, provided that the data itself is high-fidelity and grounded in real-world distributions,” Craig Wiley, Senior Director of Product at Databricks, told Future Nexus.
Organizations today struggle to generate quality evaluation datasets, which are critical for assessing whether an AI system is actually truthful and improving. Databricks’ Agent Bricks offering addresses this by helping users generate domain-specific synthetic data and benchmarks to continuously evaluate and refine agent performance, striking the right balance between quality and cost.
“We use domain-specific evaluation frameworks that pair synthetic data generation with ‘LLM judges,’ and task-aware metrics to measure accuracy, bias, and realism against real datasets. In practice, that means automatically scoring synthetic samples and model outputs for alignment with enterprise-specific expectations,” Wiley explained, noting that the approach helps teams make their AI agents more useful, measurable, and production-ready.
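To make that pattern concrete, here is a minimal sketch of an LLM-as-judge evaluation loop of the kind Wiley describes. The rubric, the JSON fields, and the `call_llm` stub are illustrative assumptions, not Databricks’ Agent Bricks API.

```python
# Minimal sketch of an LLM-as-judge evaluation loop for synthetic eval data.
# The rubric, JSON fields, and `call_llm` stub are illustrative assumptions,
# not Databricks' Agent Bricks API.
import json
from dataclasses import dataclass

JUDGE_PROMPT = """You are grading an AI agent's answer against a reference.
Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Score 1-5 for factual accuracy and 1-5 for domain relevance.
Respond as JSON: {{"accuracy": int, "relevance": int, "rationale": str}}"""

@dataclass
class JudgeScore:
    accuracy: int
    relevance: int
    rationale: str

def call_llm(prompt: str) -> str:
    """Placeholder: swap in whichever model endpoint your team uses."""
    raise NotImplementedError

def judge_sample(question: str, reference: str, answer: str) -> JudgeScore:
    """Ask the judge model to score one synthetic evaluation record."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    parsed = json.loads(raw)
    return JudgeScore(parsed["accuracy"], parsed["relevance"], parsed["rationale"])

def evaluate(dataset: list[dict]) -> float:
    """Average accuracy over {question, reference, answer} records."""
    scores = [judge_sample(d["question"], d["reference"], d["answer"])
              for d in dataset]
    return sum(s.accuracy for s in scores) / len(scores)
```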
Beyond training and evaluation in the AI pipeline, synthetic data also lets teams create datasets that mimic real-world data without containing any personal information. That helps teams collaborate, share useful data with partners, experiment, and model edge cases or future what-if situations, such as testing systems against rare fraud scenarios or regulatory anomalies.
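In its simplest form, the idea looks like the sketch below: fit a few summary statistics on a real table, then sample fresh rows so no actual customer record is reused, with a rare fraud class deliberately overrepresented for testing. Dedicated platforms use far more sophisticated generative models; the column names and rates here are illustrative assumptions only.

```python
# Toy illustration: fit summary statistics on a real table, then sample
# fresh rows so no actual customer record is reused, and overrepresent a
# rare class (fraud) so downstream tests actually exercise it.
# Column names and the fraud rate are illustrative assumptions.
import numpy as np
import pandas as pd

def synthesize(real: pd.DataFrame, n: int, fraud_rate: float = 0.2,
               seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    amount_mu, amount_sigma = real["amount"].mean(), real["amount"].std()
    countries = real["country"].value_counts(normalize=True)
    return pd.DataFrame({
        # numeric column: resample from a fitted distribution, not real rows
        "amount": rng.normal(amount_mu, amount_sigma, n).clip(min=0),
        # categorical column: sample from the observed frequency table
        "country": rng.choice(countries.index, size=n, p=countries.values),
        # rare event: deliberately oversampled for edge-case testing
        "is_fraud": rng.random(n) < fraud_rate,
    })
```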
Mostly AI and Tonic AI are two notable players operating in this domain, alongside Gretel, which was acquired by Nvidia in a $320M deal earlier this year.
Synthetic data for the hard problems
Because it is privacy-safe and can be generated at whatever scale a problem demands, synthetic data is also being used in niche domains to solve very specific problems where real-world data isn’t available or is too hard or time-consuming to access.
For instance, Oakland-based Albert Invent, which helps with R&D in chemistry/materials science and works with companies like Henkel, Chemours, and Nouryon, has developed Albert Breakthrough, an AI engine that simulates hundreds of thousands of potential experiments, identifying the most promising candidates for teams to pursue.
“Synthetic data is combined with empirical data and expert domain knowledge to synthesize potential experiments and guide models toward achieving specific targets. For example, chemists may already have an idea of what a valid formulation could look like, but the possible choices from that space are still enormous and varied. We generate possible experiments, predict the likelihood they will perform as expected, and it is up to the chemist to decide which ones to test in real life,” Jonathan Welch, VP, AI/ML at Albert Invent, told Future Nexus.
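Stripped to its essentials, the loop Welch describes (generate candidate formulations, score them with a model trained on lab data, hand the shortlist to a chemist) might look something like the sketch below. The random stand-in data, the ingredient grid, and the surrogate model are illustrative assumptions, not Albert Invent’s actual engine.

```python
# Sketch of a generate-predict-shortlist loop for formulation experiments.
# The random stand-in data, ingredient grid, and surrogate model are
# illustrative assumptions, not Albert Invent's engine.
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Surrogate model trained on (stand-in) empirical lab records: rows are
# ingredient fractions, the target is a measured property.
X_lab = np.random.rand(200, 3)
y_lab = X_lab @ np.array([0.5, 1.2, -0.3]) + np.random.normal(0, 0.05, 200)
surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_lab, y_lab)

# Enumerate candidate formulations on a coarse grid of mix ratios whose
# fractions sum to one.
grid = np.linspace(0.0, 1.0, 11)
candidates = np.array([c for c in itertools.product(grid, repeat=3)
                       if abs(sum(c) - 1.0) < 1e-6])

# Predict performance for every candidate and surface the most promising
# few for a chemist to actually run in the lab.
predicted = surrogate.predict(candidates)
shortlist = candidates[np.argsort(predicted)[-5:]]
print("Formulations worth testing:", shortlist)
```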
With this approach, Albert Invent claims to enable rapid, iterative formulation optimization, shortening R&D cycles for customers. One of them, Applied Molecules, used the Breakthrough engine to cut development time from three months to just two days.
Similarly, consumer research player Vurvey Labs has developed a Large Persona Model (LPM) that generates global AI populations using identifiable personality traits, demographics, psychographics, and other facets captured from more than 3 million real interviews. These agents, with customizable trait settings and the capacity to “reflect in real time,” enable brands like Unilever to capture insights on important questions instantly, instead of waiting weeks for survey data.
“By using People Model populations, brands achieve shorter time-to-insights with significantly reduced testing costs. We also see reduced bias in predicting real-world responses by providing an easily accessible perspective that differs from a researcher’s internal perspective…The most significant gains are seen in rethinking the insight and testing processes. By rethinking processes for generating and testing insights and concepts rapidly, 24/7, brands can reduce the opportunity costs often incurred by prematurely cutting or focusing on certain concepts or areas of interest,” Ben Vaughan, head of AI research at Vurvey, told Future Nexus.
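Conceptually, persona-conditioned querying of this kind can be sketched in a few lines: each synthetic respondent is parameterized by traits drawn from real interview data, and a language model answers in that persona’s voice. The trait fields, prompt wording, and `call_llm` stub below are assumptions, not Vurvey’s implementation.

```python
# Sketch of persona-conditioned querying: each synthetic respondent is
# parameterized by traits drawn from real interviews, and a language model
# answers in that persona's voice. Trait fields, prompt wording, and the
# `call_llm` stub are assumptions, not Vurvey's implementation.
from dataclasses import dataclass

@dataclass
class Persona:
    age_band: str        # e.g. "25-34"
    region: str          # e.g. "Southeast Asia"
    values: list[str]    # psychographic traits, e.g. ["sustainability"]

PERSONA_PROMPT = (
    "You are a consumer, age {age}, living in {region}, who cares about "
    "{values}. Answer the brand's question in one short paragraph, staying "
    "consistent with this profile.\n\nQuestion: {question}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for the model endpoint of your choice."""
    raise NotImplementedError

def ask_population(personas: list[Persona], question: str) -> list[str]:
    """Collect one simulated response per persona for rapid concept testing."""
    return [
        call_llm(PERSONA_PROMPT.format(
            age=p.age_band, region=p.region,
            values=", ".join(p.values), question=question))
        for p in personas
    ]
```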
The quality problem
Despite the growing potential and range of use cases, synthetic data poses challenges of its own, the biggest being quality assurance. To succeed with artificial data at scale, teams need to ensure that every piece of generated information is on point. A slight deviation here and there can throw the whole effort off track, whether you’re training an AI model, evaluating it, or testing an application’s security framework.
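In practice, that kind of quality assurance usually starts with basic fidelity checks that compare each synthetic column against its real counterpart before the data is used downstream, along the lines of this sketch (the choice of test and the threshold are illustrative):

```python
# Basic fidelity check: compare each numeric synthetic column against its
# real counterpart and flag drift before the data is used downstream.
# The choice of test (two-sample KS) and threshold are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

def fidelity_report(real: pd.DataFrame, synthetic: pd.DataFrame,
                    p_threshold: float = 0.05) -> pd.DataFrame:
    rows = []
    for col in real.select_dtypes("number").columns:
        stat, p_value = ks_2samp(real[col], synthetic[col])
        rows.append({"column": col, "ks_stat": stat,
                     "p_value": p_value, "flagged": p_value < p_threshold})
    return pd.DataFrame(rows)

# Flagged columns are where the generator has drifted from the real data
# and the synthetic set should be regenerated or filtered before use.
```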
“Some of the biggest challenges are in formulary chemistry, where there is a lack of quality and reliable data. Synthetic data cannot fully solve this. Unlike traditional sim-to-real approaches, simulating formulations from first principles, even something as simple as a hair gel, is generally computationally expensive or infeasible. To solve this, our approach is to help businesses harness their empirical organizational data at scale, integrate it into machine learning models, which help them to identify the highest value experiments to conduct so that they focus their investment on collecting the most informative data points that will accelerate their material development,” Welch from Albert Invent explained.
Vurvey’s Vaughan, meanwhile, pointed out that keeping the system that generates synthetic data (populations, in this case) updated is very difficult.
“People’s opinions and views are often fluid and changing over time, especially among younger adults. We cannot just collect some data and say they are set. We have to continually collect data to keep our finger on the pulse of various groups and detect when their views shift. Since how quickly each group evolves is unique to them, there are technical challenges in allocating the right resources to the right places,” he said.
The research head also highlighted that bias can seep in when working with demographic data.
“On the ethical side, another significant concern is ensuring we are reflecting authentic voices for different groups of people. We don’t want internet bias against Gen Z, for example, to leak into responses attributed to them. This goes back to our approach to mitigating negative bias and ensuring we stay up to date on what is happening in these groups,” he said.
A hybrid future
With AI giants gobbling up all the publicly available data and striking deals with major data sources, and with privacy concerns growing worldwide, synthetic data is bound to play a bigger role.
However, Wiley said, it won’t be an all-synthetic future but a hybrid one, with synthetic data playing a critical role across applications.
“Synthetic data is already a practical tool for supplementing incomplete or biased internal datasets, and that role will grow as enterprises focus on building high-quality data foundations and AI agents. We’re not moving toward ‘all synthetic’ systems; we’re moving toward hybrid ones. The most capable models will be trained on governed, high-fidelity enterprise data, then expanded with synthetic datasets that explore edge cases, rare events, or safety-sensitive scenarios that are hard to capture in the wild,” he added.
Building synthetic data and predictions on top of organic, real-world data, much as Vurvey and Albert Invent already do, will help teams double down on what they already trust, ensuring the downstream system works accurately and achieves the desired results.
“Over time, the organizations that get this balance right by treating real data as the ground truth and synthetic data as an amplifier will see the greatest gains in quality, fairness, and robustness,” Wiley concluded.