Synthetic Data: Advantages & Risks
The Synthetic Data Paradox
Why fabricated customer information may create more problems than it solves
Google Gemini's training corpus contains approximately 30 trillion tokens, the equivalent of 300 million novels. This staggering figure represents both an achievement and a predicament for artificial intelligence companies. As Yann LeCun noted earlier this year at a conference hosted by Nvidia, a chip manufacturer, the industry has essentially "read the whole internet." The response to this data drought has been synthetic data: algorithmically generated information that mimics real-world patterns without requiring actual human sources.
The appeal is obvious. Synthetic data sidesteps privacy regulations by avoiding personally identifiable information entirely. Organizations can take a small training set of real data, then expand it exponentially without navigating the legal and logistical complexities of collecting, storing, and protecting customer information. For companies with tightly defined business models and limited scenarios to explore, this represents a genuine efficiency gain. Speed to market accelerates; compliance headaches diminish.
Yet the solution contains a fundamental flaw that becomes apparent at scale. The problem organizations once faced was scarcity: how to acquire enough data to make meaningful decisions. Synthetic data inverts this challenge, creating what Ishmael Interactive CEO Ana Monroe calls an "embarrassment of riches." The new bottleneck is sense-making. When an AI system can generate tens of thousands of fictional customer scenarios, who consumes them? The answer, typically, is another AI system, whose output must then be interpreted by humans attempting to extract actionable insights.
This recursive loop exposes deeper issues. Synthetic data does not eliminate organizational blind spots; it amplifies them. Like a television series that spawns increasingly derivative spin-offs, synthetic data perpetuates whatever assumptions and biases existed in the original training set. If a company has not identified why something is not working, fabricated data will not reveal it. Instead, the organization accumulates vast quantities of information that validate existing hunches and justify current approaches.
The individualization problem compounds these concerns. AI tools are fundamentally solitary instruments. An employee working with synthetic data operates alone at a desk, querying an opaque system that cannot explain its reasoning when processing millions of data points. This stands in direct opposition to decades of management orthodoxy emphasizing cross-functional collaboration and breaking down silos. Synthetic data does not facilitate organizational learning; it enables individual confirmation bias at industrial scale.
The economics become questionable at a certain point. Training large language models is extraordinarily expensive, and the marginal returns from expanding a corpus from 20 trillion to 30 trillion tokens are unclear—and likely small. If humans cannot consume the output volume and must employ additional AI systems to interpret results, when does hiring imaginative people who can articulate their reasoning become more cost-effective than continuously running computational models?
The answer suggests a balanced approach: deploy synthetic data where it genuinely accelerates hypothesis testing and scenario planning, but maintain investment in traditional human research practices. Organizations require both efficiency and insight. Synthetic data offers the former; conversations with actual customers provide the latter. As we concluded in our analysis on the CX Pod, sometimes the most effective and economical solution remains remarkably simple: go talk to somebody.
What we’re into this week
Scott
I’m probably the person at Ishmael Interactive who is farthest from using synthetic data, but I see your points. As with so many of today’s inventions, synthetic data seems to have a use, but within far narrower boundaries than a lot of the marketing suggests. This reminds me of an old sketch from That Mitchell and Webb Look that helps us question the time and place for our inventions and designs. Ultimately, we need to ensure the things we create solve people’s problems today, in today's environment.
Aaron
Facts. I feel like synthetic data use will empower better organizational decisions, but only if its use is paired with practice. This actually reminds me of how Ana made the entire team take the Business Writing and Storytelling course from The Economist, which was suuuuuuuuper hard. It was weeks and weeks long and nearly killed everyone who took it, which was everyone on the team. At the time, I liked it because I usually like trainings, but I also thought it was kind of silly to “learn” how to write again. Then I noticed how my writing had improved once I was done with it. Turns out, focusing on fundamentals really works out! And I feel like organizations are going to have to focus on keeping people “in practice” with customer research if they want to continue to get the most benefit from synthetic data integration.
Ana
Yeah that course was killer, Aaron. Glad we did it; I might do it again!
Aaron
I would do it again! Is Ishmael Interactive paying?!
Ana
Lol not at this time :) Soon though! For me, the use of synthetic data and related powerful tools just reinforces the points Harvard Business Review (HBR) recently made regarding Hands-on Leaders. It’s sort of this trope that a great organizational leader is either a visionary who leads from afar or someone who rolls up their sleeves and grinds day-to-day, but I think the best path is somewhere in the middle of those extremes. With so much at stake in the integration of new tools into organizations, leaders would be well advised both to learn the tools in a somewhat applied manner and to know when to get out of the way of their teams.
CX Research—Documented
Synthetic data helps organizations game out more situations than can be researched human-to-human, but it has to be constantly injected with new data in order to continue producing great outputs. To get that data, start talking to your customers using Ishmael Interactive’s HCD Discovery Guide, the step-by-step manual for customer research that’s rigorous, replicable, and sustainable.
The Discovery Guide was developed alongside thousands of professionals working in healthcare, veteran support, education, and operations. When you open this Guide, you'll see:
- The Why and How of customer research.
- Step-by-step, modular instruction that you can dip in and out of easily.
- Plain language for working professionals.
Buy the book at Barnes and Noble, Amazon, or at Ishmael Interactive. (Psst: when you buy at Ishmael Interactive, you get the ebook version included!)
You don't need a full UX or CX team to get to know your customers: you need the HCD Discovery Guide.
Credits
Hosts: Ana Monroe, Aaron Meyers
Producer: Ana Monroe
Text: John Jay
Artwork: Basket of Flowers, c. 1622, Balthasar van der Ast. Via the National Gallery
Why this artwork: The artwork of the Northern Renaissance focused on detail in rendering paired with stark contrasts in light and shadow. The resulting works, this one by Balthasar van der Ast being one of the great examples, have an almost photorealistic effect that fools the viewer into believing that the lush flower petals are there to touch. But this is, of course, a synthetic, human-made rendering. It is beautiful, and is thus effective as the decorative object it was meant to be, but its limit is that it is not alive and never has been. Just like the outputs of synthetic data.