Hands-on AI #1: Let’s write your first AI eval
This video is featured in the Evals + Claude playlist.
Summary
If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.
Key Insights
- Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.
- Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.
- UX and product teams can and should learn evals as a practical, non-technical skill.
- Creating your own golden dataset is essential and cannot be outsourced or fully automated.
- Models are fixed once trained; improvements happen by refining prompts and context design, not by retraining the model.
- Evaluations measure task performance, not the underlying model itself, allowing comparison across models.
- Asking a model to output a confidence score is unreliable, because the model has no internal memory and interprets scales inconsistently.
- Biases are baked into models during training via the evals used in post-training refinement.
- LLMs can be used to judge other LLM outputs, enabling evaluation of tasks with non-binary answers.
- Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
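The first insight above describes the three building blocks of an eval: a task, a golden dataset, and an evaluator. A minimal sketch of that structure is shown below. The talk does not prescribe an implementation, so this is illustrative only: `run_task` is a hypothetical stub standing in for a real model call, the golden dataset is made-up example data, and the evaluator is a simple exact-match check (one of the binary evaluators the talk contrasts with LLM-as-judge).

```python
# Minimal eval sketch: task + golden dataset + evaluator.
# `run_task` is a hypothetical stub; in practice it would call an LLM API.

def run_task(prompt: str) -> str:
    """Placeholder for the task under evaluation (a real model call)."""
    canned = {
        "Capital of France?": "Paris",
        "2 + 2 = ?": "4",
        "Largest planet?": "Saturn",  # deliberately wrong, to show a failure
    }
    return canned.get(prompt, "")

# Golden dataset: inputs paired with known-correct outputs.
golden_dataset = [
    ("Capital of France?", "Paris"),
    ("2 + 2 = ?", "4"),
    ("Largest planet?", "Jupiter"),
]

def exact_match(output: str, expected: str) -> bool:
    """Evaluator: binary pass/fail via normalized string comparison."""
    return output.strip().lower() == expected.strip().lower()

def run_eval(dataset):
    """Run the task over every golden example and report a pass rate."""
    results = [exact_match(run_task(q), a) for q, a in dataset]
    return sum(results) / len(results), results

score, results = run_eval(golden_dataset)
print(f"pass rate: {score:.0%}")  # prints "pass rate: 67%" (one planted failure)
```

For tasks without a single correct answer (summaries, tone, helpfulness), the exact-match evaluator would be replaced by an LLM-as-judge call, per the insight above that LLMs can grade other LLM outputs.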
Notable Quotes
"Evals are like a way to define what good looks like."
"The model was baked and once it’s baked, it does not learn again until they bake a new one."
"You need to be looking at the data. Nobody wants to, but that’s core work."
"Without a golden dataset, you have to build the golden dataset yourself."
"We’re not teaching the model anything; we’re improving our prompts and context."
"Confidence scores from the model are not a good idea because the model has no memory."
"Biases are baked in through the evals used during post-training."
"LLMs judging other LLMs might sound crazy, but if you do it right, it works."
"Evals are a product and UX skill; learning them lets you make these systems do what you want."
"There is a large and growing capability overhang in these models we haven’t discovered yet."