User Research: Your Secret Weapon for Building AI Eval Sets That Actually Matter
Everyone's talking about evals, but how do you build them to reflect real-world use? Here's how to leverage user research to create your initial eval set and tee your AI product up for success.
Evaluations and Benchmarks - they're all the rage in the world of AI. Every day, there's a new headline touting a model that outperforms another on some benchmark. “DeepSeek-R1 matches the performance of o1, OpenAI’s frontier reasoning LLM, across math, coding and reasoning tasks” [source] is just one recent example. These benchmarks, and the evals that power them, are important for tracking progress and pushing the boundaries of what's possible. But here's the thing: when it comes to building your own AI-first products, generic model evals and benchmarks only tell part of the story.
In fact, I recently got a question on X from someone asking about how to approach evals. It's a topic a lot of people are curious about, and for good reason. Evals can feel a bit mysterious, especially when you're just starting out. But they are absolutely essential. Logan Kilpatrick recently tweeted "At the end of the day, evals are all you need." He's right, and this isn’t a new concept. Greg Brockman of OpenAI tweeted over a year ago, "evals are surprisingly often all you need."
But here's the key difference: I'm not just talking about the standard model benchmarks used to compare performance in a vacuum. I'm talking about your evals. The ones you create to measure the quality of your specific product experience. Because let's face it, just because you're using a state-of-the-art model doesn't guarantee a state-of-the-art product. The magic is in the system you build around it, and that system starts with understanding what you want your AI to do, and then evaluating it against that expectation.
The Missing Link: Building "Your" Evals for "Your" AI System
In my last post Beyond CUJs: Why "Example Prompts" Are the New North Star for AI-First Products, I talked about the shift towards "prompt-first" product development. This naturally leads to the question: how do you know if your prompts, and the AI system they power, are actually working? That's where your own, customized evals come in.
First, a quick primer on what evals are:
In the world of generative AI, we use "evals" (short for evaluations) to measure the performance and quality of our AI systems. Think of evals as a set of carefully selected inputs, the resulting outputs, and a rating or score used to measure how well your AI product works. The key pieces are:
The inputs represent expected user inputs in your product
The outputs are what comes out of your product, which encompasses your entire “AI System”
The rating or score is based on a set of evaluation criteria that you define based on what you think matters for your product.
So let’s say you’ve built a chat app that pretends to be a funny dog. Whatever a user asks it, it should always respond with a joke. A super simple eval set might look like this:
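The prompts, responses, and pass/fail calls here are all made up purely for illustration, but the structure is what matters:

```python
# A tiny, illustrative eval set for the "funny dog" chat app.
# Each case pairs a user input with the system's actual output and a score
# against one simple criterion: did it respond with a joke?
eval_set = [
    {
        "input": "What's the weather like today?",
        "output": "Ruff day for a walk! Even the cat gave me a frosty reception.",
        "responds_with_a_joke": True,
    },
    {
        "input": "Can you help me with my homework?",
        "output": "I'm sorry, I can't help with that.",
        "responds_with_a_joke": False,  # broke character: no joke
    },
    {
        "input": "Tell me about yourself.",
        "output": "I'm a good boy who works for treats and dad jokes. Mostly treats.",
        "responds_with_a_joke": True,
    },
]

# The overall score for this (very small) eval run
score = sum(case["responds_with_a_joke"] for case in eval_set) / len(eval_set)
print(f"Pass rate: {score:.0%}")  # 67% on this made-up set
```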
Clearly, my AI System has some room to improve.
Ok, now that you know what evals are, how do you actually begin to tackle all this?
There are many ways to build the initial input set. You can:
Brainstorm examples yourself.
Use AI to generate variations of your initial user prompts (see the sketch after this list).
Create synthetic datasets.
Analyze user logs from a functioning prototype.
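Here's that sketch of the second approach: a minimal example of asking a model to riff on a seed prompt. It assumes the google-generativeai Python SDK, an API key in your environment, and a purely illustrative model name and seed input.

```python
# Minimal sketch: ask a model to produce variations of a seed user prompt.
# Assumes the google-generativeai SDK and a GOOGLE_API_KEY in the environment;
# the model name and seed prompt are illustrative, not prescriptive.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

seed_input = "Tell me a joke about my cat ignoring me."
response = model.generate_content(
    "Rewrite the following user message 5 different ways, varying tone, "
    f"length, and phrasing. Return one variation per line.\n\n{seed_input}"
)

# Each non-empty line becomes a candidate input for the eval set.
candidate_inputs = [line.strip() for line in response.text.splitlines() if line.strip()]
print(candidate_inputs)
```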
And those are all valid approaches that we'll dive into in future posts. But today, I want to focus on a particularly powerful, and often overlooked, method: leveraging user research to build your initial input examples.
User Research: Your Eval Set's Secret Weapon
Think about it. User research is all about understanding how real people interact with your product. What are their goals? What are their pain points? What are their expectations? Now, with AI-first products, we can tap into that same well of knowledge to create evals that are grounded in real-world usage.
Here's how it works:
Build your AI System: Following the prompt-first product development approach, validate that your idea has a chance of working by testing out your prompts (I love using AI Studio to help with this).
Craft a Simple Pitch or Mock-Up: Once you have some conviction that your idea is feasible, mock something up that you can show potential users. This doesn't have to be fancy. A basic wireframe, a storyboard, or even just a verbal description of the intended user experience will do. The goal is to give users something concrete to react to.
Conduct User Interviews: This is where the magic happens. Show users your pitch or mocks and ask them targeted questions. These questions should aim to elicit things like:
Potential Inputs: What information would users provide in each field or at each step of the interaction?
Expected Outputs: What kind of results would they expect from the AI based on their inputs?
Expectations: What would make users enjoy the experience?
Compile and Analyze: Once you have done your research, take all of the different inputs and scenarios and start building out your initial eval set, which you can then run against your AI System (see the sketch below).
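To make that last step concrete, here's a minimal sketch (sticking with the funny-dog app from the primer) of how compiled interview notes could become a runnable eval set. The example cases are made up, and run_ai_system() and judge() are hypothetical placeholders for your own pipeline and your own criteria.

```python
# Minimal sketch: interview notes -> eval cases -> a pass rate.
# run_ai_system() and judge() are hypothetical placeholders for your own
# product pipeline and whatever evaluation criteria you've defined.

# Each case comes straight from a research session: the input a participant
# said they would type, and what they described as a good response.
eval_cases = [
    {
        "input": "Why won't my wifi work?",
        "expectation": "Stays in character as a dog and answers with a joke",
    },
    {
        "input": "I'm having a rough day.",
        "expectation": "Responds with a gentle, uplifting joke rather than advice",
    },
]

def run_ai_system(user_input: str) -> str:
    """Hypothetical: your prompt(s), model call, and any surrounding logic."""
    raise NotImplementedError

def judge(output: str, expectation: str) -> bool:
    """Hypothetical: a human rating, a rubric, or an LLM-as-a-judge check."""
    raise NotImplementedError

def run_eval_set() -> float:
    """Run every case through the system and return the pass rate."""
    passed = sum(
        judge(run_ai_system(case["input"]), case["expectation"])
        for case in eval_cases
    )
    return passed / len(eval_cases)
```

Whether judge() is a human rater, a rubric, or an LLM-as-a-judge is up to you; the important part is that every case traces back to something a real user told you.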
The Power of Real-World Input
By conducting these early user research sessions, you're not just validating your product idea; you're actively crowdsourcing your initial eval set. You're gathering a collection of real-world user inputs and expected outputs that reflect how people actually intend to use your product, along with a way to measure “what is good”.
This approach has several key benefits:
Grounds your evals in reality: You're not just guessing what users might do; you're observing actual behavior and expectations.
Uncovers edge cases early: User research will inevitably surface edge cases and scenarios you hadn't considered, making your eval set more robust.
Informs your prompt design: The specific language and details users provide in their responses can directly inform how you craft your prompts and build your AI System.
Don't Just Build, Validate: It All Starts With Your Users
Building successful AI-first products requires a new way of thinking about product development. It's about embracing a "prompt-first, eval-obsessed" mindset, where your initial user prompts and your eval set are inextricably linked. By leveraging early user research, you can create evals that go beyond generic benchmarks and actually reflect real-world user needs. It's a way to ensure that your AI system isn't just technically impressive because it uses the latest models, but truly valuable to the people who will ultimately interact with it - your users. So, before you get too far down the product development road, go talk to your users. You might be surprised at what you discover, and your eval set will be all the better for it.