Rosenverse


Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know whether your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it is hard to tell whether you are creating real value for your users, and when something goes wrong, it is hard to figure out why. Simply Put’s Peter Van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals that use an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
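
To make the LLM-as-a-judge idea from the insights above more concrete, here is a minimal Python sketch of a three-option automated eval. It is an illustration under stated assumptions, not the speaker's implementation: every name in it (`build_judge_prompt`, `parse_verdict`, `run_eval`, `fake_judge`), the prompt wording, and the sample cases are hypothetical, and `fake_judge` is an offline stand-in for a real model call.

```python
from collections import Counter
from typing import Callable

# Three allowed verdicts; a small label set helps keep judge calls consistent.
LABELS = ("yes", "no", "maybe")


def build_judge_prompt(criterion: str, user_input: str, output: str) -> str:
    """Assemble the text sent to the judge model (wording is illustrative)."""
    return (
        f"Criterion: {criterion}\n"
        f"User input: {user_input}\n"
        f"System output: {output}\n"
        f"Does the output meet the criterion? Reply with one word: {', '.join(LABELS)}."
    )


def parse_verdict(raw: str) -> str:
    """Normalize the judge's reply; anything unexpected is counted as 'maybe'."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "maybe"


def run_eval(cases: list, criterion: str, judge: Callable[[str], str]) -> Counter:
    """Ask the judge about every case and tally the three verdicts."""
    counts = Counter({label: 0 for label in LABELS})
    for case in cases:
        prompt = build_judge_prompt(criterion, case["input"], case["output"])
        counts[parse_verdict(judge(prompt))] += 1
    return counts


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call so the sketch runs offline."""
    return "Yes." if "refund" in prompt.lower() else "unsure"


# Hypothetical test cases; in practice these would come from research and domain experts.
cases = [
    {"input": "How do I get my money back?", "output": "Request a refund under Orders."},
    {"input": "How do I get my money back?", "output": "Please contact us."},
]
result = run_eval(cases, "The answer names a concrete next step.", fake_judge)
print(dict(result))
```

In a real setup, `fake_judge` would be replaced by a call to whichever model acts as the judge, and the criterion text would carry the detailed definition of "good" for the domain. Mapping unparseable replies to "maybe" is one possible design choice; flagging them for human review is another.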

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."

