
Hands-on AI #2: Understanding evals: LLM as a Judge

Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge — that is, using one Large Language Model to evaluate the output of another. We’ll look at how that works, but also dig into why it works at all: are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use an LLM as a judge, along with some product sense for how the best AI products are built today and how that knowledge can help you use them more effectively yourself.
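To make the LLM-as-judge idea concrete, here is a minimal sketch of the pattern described above: build a rubric prompt for the judge model, ask for a strict binary verdict, and parse the reply. The `call_llm` function is a hypothetical placeholder (stubbed here so the example is self-contained); in practice it would be a call to your model provider's chat API.

```python
def build_judge_prompt(question: str, answer: str) -> str:
    """Build a rubric prompt asking the judge model for a binary verdict."""
    return (
        "You are evaluating an AI assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Does the answer fully and correctly address the question?\n"
        "Reply with exactly YES or NO."
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stub standing in for a real chat-model call.
    return "YES"

def parse_verdict(raw: str) -> bool:
    """Map the judge's free-text reply onto pass (True) / fail (False)."""
    return raw.strip().upper().startswith("YES")

prompt = build_judge_prompt("What is the capital of France?", "Paris.")
verdict = parse_verdict(call_llm(prompt))
print(verdict)  # True with the stubbed judge
```

Forcing the judge to answer "exactly YES or NO" (rather than a 1–5 rating) is the binary-scoring approach the talk recommends, and it keeps parsing trivial.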

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than numeric rating scales because LLMs lack the internal memory and consistency needed to score on a range.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Creating a written constitution of principles helps concretize AI behavior goals and guides both prompt design and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
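Several of these insights fit together in practice: a written constitution of principles can be turned directly into a set of binary eval checks, each scored yes/no and aggregated into a per-principle pass rate. The sketch below illustrates that shape; the principle names and the keyword-based `judge` function are illustrative assumptions (a real system would pose each principle's question to an LLM judge, as in the talk).

```python
# A hypothetical "constitution": each principle becomes a yes/no question.
CONSTITUTION = {
    "no_medical_advice": "Does the output avoid giving medical advice?",
    "cites_sources": "Does the output cite at least one source?",
}

def judge(principle_question: str, output: str) -> bool:
    # Keyword heuristic standing in for an LLM-as-judge call that
    # would answer the principle question with YES or NO.
    if "source" in principle_question.lower():
        return "source:" in output.lower()
    return "take this medication" not in output.lower()

def run_evals(outputs: list[str]) -> dict[str, float]:
    """Return the share of outputs (0.0-1.0) passing each principle."""
    scores = {}
    for name, question in CONSTITUTION.items():
        passes = [judge(question, out) for out in outputs]
        scores[name] = sum(passes) / len(passes)
    return scores

outputs = [
    "Source: WHO 2023 report. Rest and hydration help mild colds.",
    "Take this medication twice a day.",
]
print(run_evals(outputs))  # 50% pass rate on each principle
```

Because each check is binary, the aggregate is a simple pass rate per principle — easy to track over time and easy for a domain expert to audit one failing example at a time.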

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

