Rosenverse


Hands-on AI #2: Understanding evals: LLM as a Judge

Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge, which means using one Large Language Model to evaluate the output of another. We’ll look at how that works, and also dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use an LLM as a judge, with some product sense for how the best AI products today are built and how that knowledge can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than a one-to-five rating scale because an LLM judge has no memory of its earlier ratings, so scores on a scale come out inconsistent.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Writing a constitution of principles makes the goals for AI behavior concrete and guides both prompt writing and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
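
To make the binary-scoring idea concrete, here is a minimal sketch of an LLM-as-a-judge eval in Python. It is not code from the talk. The prompt wording, the function names (`build_judge_prompt`, `parse_verdict`, `run_eval`), and the sample cases are all assumptions made for illustration, and `fake_judge` is a keyword-matching stand-in for a real model call so the sketch runs without any API.

```python
from typing import Callable, Dict, List

# Hypothetical judge prompt: one criterion, one answer, and a forced YES/NO reply.
JUDGE_PROMPT = (
    "You are grading the output of an AI assistant.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: YES or NO."
)


def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Fill the judge prompt template for a single test case."""
    return JUDGE_PROMPT.format(criterion=criterion, question=question, answer=answer)


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to True (YES) or False (NO); reject anything else."""
    words = raw.strip().lower().split()
    token = words[0].strip(".,!:;") if words else ""
    if token == "yes":
        return True
    if token == "no":
        return False
    raise ValueError(f"Judge reply was not YES or NO: {raw!r}")


def run_eval(cases: List[Dict[str, str]], criterion: str,
             judge: Callable[[str], str]) -> float:
    """Return the fraction of cases the judge marks as passing the criterion."""
    passed = 0
    for case in cases:
        prompt = build_judge_prompt(criterion, case["question"], case["answer"])
        if parse_verdict(judge(prompt)):
            passed += 1
    return passed / len(cases)


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): passes answers mentioning a doctor."""
    return "YES" if "see a doctor" in prompt.lower() else "NO"


# Made-up sample cases for illustration only.
cases = [
    {"question": "I have had chest pain for two days.",
     "answer": "That can be serious. Please see a doctor today."},
    {"question": "I have had chest pain for two days.",
     "answer": "Try resting and drinking more water."},
]
criterion = "Does the answer recommend consulting a medical professional?"

rate = run_eval(cases, criterion, fake_judge)
print(f"pass rate: {rate:.0%}")
```

In a real setup, `fake_judge` would be replaced by a function that sends the prompt to whichever model is acting as judge and returns its text reply. Each criterion stays a single yes/no question, so every run can be summarized as a pass rate and compared against the previous run.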

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

