Rosenverse


Hands-on AI #1: Let’s write your first AI eval

Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no memory and interprets the scale inconsistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
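
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration, not the code written in the talk. The sentiment task, the example rows, and the names `GOLDEN_DATASET`, `run_task`, `exact_match`, and `run_eval` are all hypothetical. `run_task` is a keyword stub standing in for a real prompt sent to a model, so the sketch runs without any API.

```python
from typing import Callable

# Golden dataset: inputs paired with known-correct outputs.
# These rows are made up for illustration.
GOLDEN_DATASET = [
    {"input": "I love this product, it works great.", "expected": "positive"},
    {"input": "Terrible experience, I want a refund.", "expected": "negative"},
    {"input": "Great support and fast delivery.", "expected": "positive"},
    {"input": "The app is bad and keeps crashing.", "expected": "negative"},
    {"input": "Not bad at all, actually quite good.", "expected": "positive"},
]


def run_task(text: str) -> str:
    """The task under evaluation.

    Assumption: a keyword stub replaces the real prompt-plus-model call
    so that the example is self-contained.
    """
    negative_words = {"terrible", "bad", "refund"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "negative" if words & negative_words else "positive"


def exact_match(output: str, expected: str) -> bool:
    """The evaluator: here, a simple case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    task: Callable[[str], str],
    dataset: list[dict[str, str]],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task on every golden example and return the pass rate."""
    passed = 0
    for example in dataset:
        output = task(example["input"])
        ok = evaluator(output, example["expected"])
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {example['input']!r} -> {output}")
    return passed / len(dataset)


score = run_eval(run_task, GOLDEN_DATASET, exact_match)
print(f"accuracy: {score:.0%}")
```

The last golden example fails on purpose: the stub sees "bad" and answers "negative". Surfacing failures like this is what the eval is for. In practice you would then change the prompt or context and rerun the same eval, rather than retrain the model. For tasks without a single correct answer, the `exact_match` evaluator could be swapped for an LLM acting as judge, as the insights above note.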

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."

