Rosenverse

Log in or create a free Rosenverse account to watch this video.

Log in Create free account

100s of community videos are available to free members. Conference talks are generally available to Gold members.

Hands-on AI #2: Understanding evals: LLM as a Judge

Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Share the love for this talk
Hands-on AI #2: Understanding evals: LLM as a Judge
Speakers: Peter Van Dijck
Link:

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge (when we use a Large Language Model to evaluate the output of another Large Language Model). We’ll have a look at how that works, but also dig into why this even works. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands on; and there will be plenty of time for questions. You will go away understanding when and how to use LLM as a judge, and build some product sense around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than rating scales with ranges because LLMs lack internal memory and consistency.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

Ask the Rosenbot
Jon Temple
Panel: Stacks, Security, and Stakeholders: The Hidden Work of UXR Tool Procurement
2026 • Advancing Research 2026
Gold
Brianna Sylver
Lead With Purpose
2020 • Advancing Research 2020
Gold
Feyikemi Akinwolemiwa
Play to innovate: How curiosity and experimentation transform UX
2026 • Advancing Research 2026
Gold
Dana Chisnell
The Sensemaking Business
2026 • Advancing Research 2026
Gold
Heidi Trost
To Protect People, You Have to Protect Information: A Human-Centered Design Approach to Cybersecurity
2025 • Rosenfeld Community
Dave Hora
A Research Skills Evolution
2021 • Advancing Research 2021
Gold
Mary-Lynne Williams
Exit Interview #4: From Product Design Leadership to Sound Healing
2026 • Rosenfeld Community
Sheryl Cababa
Integrating Systems Thinking Into Your Practice as a Designer
2025 • Rosenfeld Community
Kavana Ramesh
Meaningful inclusion: Practicing accessibility research with confidence
2024 • DesignOps Summit 2024
Gold
Victor Udoewa
Research in the Pluriverse
2023 • Advancing Research 2023
Gold
George Zhang
UX Research Excellence Framework
2021 • Advancing Research 2021
Gold
Helen Armstrong
Augment the Human. Interrogate the System.
2023 • Enterprise UX 2023
Gold
Stephen Anderson
Puzzled? How to Coordinate Humans for Complex Challenges
2021 • Enterprise Community
Dan Willis
Enterprise Storytelling Sessions
2015 • Enterprise UX 2015
Gold
Yunyan Li
UX Best Practices
2021 • Design at Scale 2021
Gold
Matt Duignan
Atomizing Research: Trend or Trap
2020 • Advancing Research 2020
Gold

More Videos

Greg Petroff

"Take your cross-functional partners out to lunch. Build relationships because people don’t resist change; they resist what they don’t know."

Greg Petroff

Software as Material—A Redux

June 6, 2023

Brendan Jarvis

"The stories we tell are framed by the questions we ask. Our stories repeated become our beliefs and our beliefs influence our behavior."

Brendan Jarvis

Framing Tomorrow by Questioning Today

June 8, 2022

Prerna Makanawala

"When things behave the same way, users don’t have to worry about what will happen—that’s the intuitive factor."

Prerna Makanawala

Achieving Balanced Design Consistency

June 9, 2021

Tricia Wang

"Language matters — calling people 'users' reflects and reinforces the reality of reduced agency."

Tricia Wang

From Users to Shapers of AI: The Future of Research

March 25, 2024

Cheryl Platz

"If 71% of people self-identify as gamers, why aren’t you learning from the video game industry?"

Cheryl Platz

Embrace Your Fun Factor: Game Development Best Practices for Product Design

January 9, 2026

Ariel Kennan

"There’s inherent conflict that will arise when navigating many stakeholders."

Ariel Kennan

Theme Two Intro

November 17, 2022

Tricia Wang

"Companies parachuting in to 'study' marginalized groups often do more harm than good when they don't build local capacity."

Tricia Wang

The most popular design thinking strategy is BS

January 27, 2022

Kristin Wisnewski

"We’re the voice of the employee—we’re arbiters of truth, defenders of experience, and sometimes validators against manipulation."

Kristin Wisnewski

Measuring What Matters

October 23, 2019

Scott Jensen

"Sometimes it’s reasonable to consider your design as strategy and ship a piece now and other pieces later."

Scott Jensen Sarah Delaney Carmen Liu

Short Take #2: UX/Product Lessons from Your Industry Peers

December 6, 2022