Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to figure out if you are creating real value for your users, and when something goes wrong, it’s really tricky to figure out why. Simply Put’s Peter Van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using LLMs as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
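
To make the LLM-as-judge and three-option ideas above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the speaker's implementation: the names `Example`, `build_judge_prompt`, `parse_label`, `run_eval`, and `fake_llm` are all hypothetical, the sample data is invented for the demo, and `call_llm` stands in for whatever model client a team actually uses. The sketch asks the judge for one of three labels, then compares those labels against tags supplied by a domain expert.

```python
from collections import Counter
from typing import Callable, NamedTuple

# Three coarse labels, following the yes/no/maybe idea from the talk summary.
LABELS = ("yes", "no", "maybe")


class Example(NamedTuple):
    question: str
    answer: str        # output of the AI product under test
    expert_label: str  # tag assigned by a domain expert (hypothetical data)


def build_judge_prompt(criteria: str, ex: Example) -> str:
    """Ask the judge to classify an answer instead of generating one."""
    return (
        f"Criteria for a good answer: {criteria}\n"
        f"Question: {ex.question}\n"
        f"Answer: {ex.answer}\n"
        "Does the answer meet the criteria? Reply with one word: yes, no, or maybe."
    )


def parse_label(raw: str) -> str:
    """Normalize the judge's reply. Assumption: anything unrecognized becomes 'maybe'."""
    word = raw.strip().lower().rstrip(".")
    return word if word in LABELS else "maybe"


def run_eval(examples: list, criteria: str, call_llm: Callable[[str], str]) -> dict:
    """Judge every example and report label counts plus agreement with the experts."""
    counts = Counter()
    agreed = 0
    for ex in examples:
        label = parse_label(call_llm(build_judge_prompt(criteria, ex)))
        counts[label] += 1
        agreed += label == ex.expert_label
    return {"counts": dict(counts), "agreement": agreed / len(examples)}


def fake_llm(prompt: str) -> str:
    """Stub judge so the sketch runs offline; replace with a real model call."""
    answer_part = prompt.split("\nAnswer: ")[1]
    return "Yes." if "30 days" in answer_part else "no"


CRITERIA = "States the refund window accurately."
EXAMPLES = [
    Example("What is the refund window?", "You can get a refund within 30 days.", "yes"),
    Example("What is the refund window?", "Refunds are never possible.", "no"),
    Example("What is the refund window?", "Please contact support.", "maybe"),
]

result = run_eval(EXAMPLES, CRITERIA, fake_llm)
print(result)
```

In this sketch the agreement rate is the signal to watch: where the judge and the domain expert disagree, either the judge prompt or the written definition of "good" probably needs another pass. Passing the model call in as a function keeps the harness independent of any vendor.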

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
