Hands-on AI #1: Let’s write your first AI eval

Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Peter Van Dijck

UX and AI builder, CEO Sputnik Legal

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

• Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.
• Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.
• UX and product teams can and should learn evals as a practical, non-technical skill.
• Creating your own golden dataset is essential and cannot be outsourced or fully automated.
• Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.
• Evaluations measure task performance, not the underlying model itself, allowing comparison across models.
• Outputting a confidence score from models is unreliable due to lack of internal memory and inconsistent scale interpretation.
• Biases are baked into models during training via evals used in post-training refinement.
• LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.
• Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
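
The three building blocks named above (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the sentiment task, the four dataset rows, and every function name here are assumptions made up for the example, and the two "tasks" are plain functions standing in for real model calls so the sketch runs offline.

```python
from typing import Callable

# Golden dataset: inputs paired with known-correct outputs.
# These four rows are hypothetical; a real dataset is written by people who know the domain.
GOLDEN_DATASET = [
    {"input": "I love this product!", "expected": "positive"},
    {"input": "This is the worst purchase I have made.", "expected": "negative"},
    {"input": "Great support, fast shipping.", "expected": "positive"},
    {"input": "Terrible quality, very disappointed.", "expected": "negative"},
]


def keyword_task(text: str) -> str:
    """Stand-in for 'prompt + model A'. A real eval would call an LLM here."""
    negative_words = ("worst", "terrible", "disappointed")
    return "negative" if any(w in text.lower() for w in negative_words) else "positive"


def always_positive_task(text: str) -> str:
    """Stand-in for 'prompt + model B': a deliberately weak baseline."""
    return "positive"


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: decides whether one output counts as correct."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    task: Callable[[str], str],
    dataset: list[dict[str, str]],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task on every golden example and return the fraction judged correct."""
    passed = sum(evaluator(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


if __name__ == "__main__":
    for name, task in [("keyword", keyword_task), ("always_positive", always_positive_task)]:
        print(f"{name}: {run_eval(task, GOLDEN_DATASET, exact_match):.0%}")
```

Because the score is attached to the task rather than to a particular model, swapping in a different prompt or model only means passing a different `task` function to `run_eval` and comparing the two numbers on the same golden dataset.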

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
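
The idea of one LLM judging another's output can be sketched as an evaluator that builds a grading prompt and parses a verdict. This is a hedged illustration under stated assumptions: the prompt wording, the criterion, and the names `judge`, `parse_verdict`, and `fake_judge_model` are all hypothetical, and `fake_judge_model` is a hard-coded stand-in for a real model call so the example runs without any API. The sketch asks for a PASS or FAIL verdict rather than a numeric score, which is one way to sidestep the inconsistent-scale problem the talk raises about confidence scores.

```python
from typing import Callable

# Hypothetical grading prompt; a real one would be refined against human-labelled examples.
JUDGE_PROMPT = """You are grading a customer-support reply.
Question: {question}
Reply: {reply}
Criterion: the reply must directly answer the question and stay polite.
Answer with exactly one word: PASS or FAIL."""


def parse_verdict(raw: str) -> bool:
    """Turn the judge's text into a boolean; anything that does not start with PASS counts as a fail."""
    return raw.strip().upper().startswith("PASS")


def judge(question: str, reply: str, call_judge_model: Callable[[str], str]) -> bool:
    """Evaluator for open-ended outputs: ask a judge model and parse its verdict."""
    prompt = JUDGE_PROMPT.format(question=question, reply=reply)
    return parse_verdict(call_judge_model(prompt))


def fake_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: fails any reply containing 'no idea'."""
    return "FAIL" if "no idea" in prompt.lower() else "PASS"


if __name__ == "__main__":
    question = "How do I reset my password?"
    print(judge(question, "Click 'Forgot password' on the login page.", fake_judge_model))
    print(judge(question, "No idea, sorry.", fake_judge_model))
```

In practice the judge itself needs checking: running it over examples that people have already labelled shows how often its verdicts agree with human judgment.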
