LLMs can answer business questions — but how do you know if their inferences are accurate? Theory Ventures Partner Andy Triedman unpacks what it would take to make AI predictions about human behavior more accurate for marketing and customer research.
Like a good friend, LLMs have opinions if you ask them. “Which of these outfits is better?” “How should I reply to that text?”
It’s no doubt tempting to also ask them valuable business questions: “Is messaging A or B better?” or “Would people choose to buy our new product?” Ask an LLM, and it will tell you. But how do you know that it’s right?
Prior to Theory Ventures, I worked at Replica – a company that uses generative models to simulate human activities. The data is used to prioritize infrastructure investments, approve construction projects, and inform policies, so its accuracy is critical. But ensuring generative models match real-world behaviors takes a lot of work (and data).
LLMs have opinions. But whose?
Generative models learn from real-world data and generate new data that looks like it. That doesn’t mean they’re limited to replicating what they’ve already seen: they can use underlying patterns to write new poems or answer new math problems. Similarly, they can form and share opinions, even for new questions, products, or services.
LLMs can provide valuable product feedback, whether from the perspective of an expert UX researcher or a typical user. But for many business questions, an opinion is only valuable to the extent that it matches what real people would think. Here, we run into two key challenges:
Calibration: LLM responses are not representative
Ask an LLM which shampoo ad is better: one that highlights its natural ingredients, or one that shows cool celebrities using it? The model says it likes the former. Of course, we know that different people will prefer one or the other, depending on their preferences, backgrounds, demographics, or other factors.
We could prompt the model with different personas: What would be the response from a nature-loving parent? What about a videogame-obsessed teen? That could make the model respond more convincingly like that type of person. But how do we know that we have the right proportion of the two, or that there aren’t videogame-obsessed parents or nature-loving teens?
Unless we have an accurate estimate of the distribution of preferences, behaviors, and attributes in the total population that we care about, it doesn’t matter if we run 1 million iterations of LLM responses: we won’t know if the answers are representative of the real-world population.
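To make the calibration problem concrete, here is a minimal sketch of persona-weighted aggregation. The personas, their simulated preferences, and the population weights are all hypothetical illustrations — and the weights are exactly the hard part: they must come from real population data, not from the LLM.

```python
def calibrated_preference(persona_responses, population_weights):
    """Weight each persona's preference for option A by that persona's
    (assumed known) share of the real-world population."""
    total = 0.0
    for persona, p_prefers_a in persona_responses.items():
        total += population_weights[persona] * p_prefers_a
    return total

# Hypothetical: fraction of each persona preferring the "natural ingredients" ad
responses = {"nature_loving_parent": 0.8, "gaming_teen": 0.3}

# These weights are the crux: without a real estimate of the population
# distribution, no number of LLM iterations makes the answer representative.
weights = {"nature_loving_parent": 0.6, "gaming_teen": 0.4}

print(calibrated_preference(responses, weights))  # 0.6*0.8 + 0.4*0.3 = 0.6
```

The arithmetic is trivial; the point is that the output is only as representative as the weights, which the LLM cannot supply on its own.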
Generalization: LLMs can’t always infer beyond what they were trained on
There is good evidence that LLMs can make inferences about people’s preferences given other information. If you said you love nature, you might be more likely to prefer natural ingredients in your shampoo. If you said you love fashionable clothes, you probably like fashionable shoes.
But there are a lot of questions that no LLM, nor human, could accurately guess if not provided with relevant preferences. If you love nature, are you more likely to prefer a pop-up toaster or toaster oven? If you love fashionable clothes, are you more likely to prefer a PPO or HMO health plan?
Ask an LLM simulation and it will certainly give you answers, but there’s no way to guarantee that they’re actually meaningful.
So how could we get LLMs to provide accurate input on business questions?
One way is to develop a generative AI-powered survey panel. Here’s my hypothesis for how this would look:
Step 1: Build your AI panel
- Assemble a broad human panel — probably at least in the tens or hundreds of thousands.
- Ask this panel a broad set of questions. You might imagine this as the most in-depth survey in history — demographics, behaviors, favorite activities, preferred brands, politics, broad-ranging preference questions (“Are you relaxed or anxious? How do you decide what to eat in the morning? Introvert or extrovert?”). However, given the length of the survey, it would be expensive to conduct. Three ways to solve this:
- Gamify the experience.
- Ask questions over a period of time.
- Use AI to narrow down the most informative questions.
- This would create a mega-panel of AI personas that mirrors population-wide demographic and behavioral attributes.
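One way to picture the output of Step 1: each panelist’s survey answers become the seed of an AI persona prompt. This is a hedged sketch — the field names and the prompt template are my own assumptions, not a production design.

```python
def persona_prompt(survey_row):
    """Turn one panelist's survey answers into a persona prompt for an LLM.
    The template is a hypothetical illustration."""
    traits = "; ".join(f"{question}: {answer}" for question, answer in survey_row.items())
    return (
        "You are simulating a survey respondent with this profile:\n"
        f"{traits}\n"
        "Answer the following question as this person would."
    )

# Hypothetical panelist drawn from the in-depth survey described above
row = {
    "age": "34",
    "region": "Midwest",
    "morning routine": "coffee before anything else",
    "temperament": "anxious",
    "introvert or extrovert": "introvert",
}
print(persona_prompt(row))
```

Repeated across tens or hundreds of thousands of real respondents, this is what would make the panel mirror population-wide demographic and behavioral attributes rather than the model’s defaults.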
Step 2: Ask it questions
- Ask the LLM to determine which preference data or inputs are most likely to influence a response, and if there is sufficient information to believe the answer.
- If not, you would need to go back and ask your panel more questions.
- If there is sufficient information from the survey, ask each AI persona to predict an answer, using the demographic and preference data we collected to extrapolate responses relevant to the target population.
- Validate your answers by asking some number of real-world people the same question to make sure the AI is generally accurate (maybe dozens to hundreds, depending on the question).
- You could do this on just a subset of questions, and potentially on a periodic or asynchronous basis.
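The validation step above could be sketched as a simple comparison between the AI panel’s prediction and a small real-world sample. The tolerance threshold and the numbers here are arbitrary illustrations, not a recommended methodology.

```python
def validate(ai_share, human_answers, tolerance=0.1):
    """Return True if the AI panel's predicted share of answer 'A'
    falls within `tolerance` of the share observed in a small human sample."""
    human_share = sum(1 for answer in human_answers if answer == "A") / len(human_answers)
    return abs(ai_share - human_share) <= tolerance

# Hypothetical: the AI panel predicts 62% choose messaging A;
# a spot-check of 50 real people finds 33 of them (66%) agree.
humans = ["A"] * 33 + ["B"] * 17
print(validate(0.62, humans))  # True: |0.62 - 0.66| <= 0.1
```

Run periodically on a subset of questions, a check like this is what would tell you whether the panel is drifting away from real-world behavior.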
Should I build it?
Perhaps. The questions you’d need to ask are whether it’s even possible — or economical — to assemble a consumer panel of this scale to develop it, and whether the kinds of questions businesses would ask are consistent enough for the panel to be useful. It may be that most valuable questions are so specific that they fall outside the model’s distribution, making the panel relatively useless.
How can you replicate this process for an industry-specific simulation?
Of course, building a full-scale AI survey panel would be a massive lift — probably the work of a dedicated startup or research lab. Most builders, even the most savvy, won’t be spinning that up tomorrow. But there are smaller-scale, more immediately actionable versions that any exec can consider right now.
In lieu of trying to replicate the entire population, you could simulate just the personas that matter most to your business. You may consider building an AI panel of your target customers — just a few archetypes representing your own buyers’ demographics, preferences, and behaviors — and leverage this in the absence of regular, full-scale consumer research studies.
While it’s not a substitute for real-world surveys, building this once and using it repeatedly can serve as a lightweight, low-cost complement to stress-test messaging, anticipate objections, or surface day-to-day insights.