Case Study 03

PoC — AI Interview Copilot

Candidate evaluation is time-consuming, inconsistent, and prone to bias. This cross-functional initiative with Design, Engineering, and AI21 Labs researched the hiring workflow end-to-end and designed an AI-augmented process that keeps the human judgment hiring requires.

Company

Thoughtworks (Internal)

Partners

AI21 Labs · Talent Acquisition · Engineering · Design

Role

UX Lead — Research, Interaction Design, Feasibility Prototyping

Type

Proof of Concept · Internal AI Product

90%
AI accuracy after model calibration — up from 83% baseline
100%
Human rater agreement (Kappa 1.0) after calibration: perfect alignment
3
AI roles identified: assistant, automator, real-time coach — each with distinct UX implications

The problem

Thoughtworks's global Talent team identified a set of compounding challenges in their interview process: high cognitive load on interviewers, inconsistent scorecard quality, difficulty maintaining a holistic view of candidates across multiple interviewers, and no systematic mechanism for interviewer feedback and improvement.

The core tension: how do you make hiring more consistent and scalable without introducing automation that removes the human judgment that makes good hiring possible?

"It would help so much with data hygiene. We had a case last month where an interviewer left the company and didn't document the interview properly — we ended up with a blind spot."

Understanding the human experience today

Before proposing any AI solution, I mapped the complete hiring workflow — from sourcing through to candidate feedback — identifying every touchpoint, pain point, and decision point. This was done using traditional research: stakeholder interviews across Talent, Engineering, and Recruiting leadership.

Pre-Interview Pain

High cognitive load

Interviewers needed to familiarize themselves with candidate information, review the criteria, and mentally prepare, all before the interview started and with no structured support.

During Interview Pain

Split attention

Interviewers were simultaneously conducting the interview, tracking questions, taking notes, and monitoring time — degrading quality across all four.

Post-Interview Pain

Low scorecard quality

Scorecards were filled from memory, often missing attribute justifications. Without good scorecards, subsequent interviewers couldn't build on previous sessions.

Systemic Pain

No feedback loop

Interviewers received no systematic feedback on their interviewing quality. Skill gaps compounded over time with no visibility or improvement mechanism.

Finding AI's role

Not every pain point warranted AI. I used an interaction model framework to identify where AI could augment (not replace) human judgment — evaluating each opportunity against four AI roles:

A

AI as Assistant

Providing contextual information, prompting interviewers with criteria, surfacing candidate data — reducing cognitive load without replacing judgment.

B

AI as Automator

Parsing interview transcripts, pre-filling scorecards based on what was said — reducing post-interview admin while leaving final judgment to the human.

C

AI as Coach

Providing live feedback on interviewer performance — question quality, attribute coverage, potential bias signals — enabling real-time improvement.

D

AI as Evaluator

Checking AI-generated recommendations against ratings from human subject-matter experts (SMEs) to measure accuracy, bias, and calibration: a governance layer, not a replacement.

Designing the PoC

I paired with developers to understand technical feasibility first, specifically how different transcript processing modes (batch vs. streaming) would lead to different user interactions and expectations. This shaped the prototype architecture before a single screen was designed.
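The batch vs. streaming distinction can be illustrated with a minimal sketch. Every name here (`process_batch`, `process_stream`, `count_questions`, the `window` parameter) is hypothetical and chosen for illustration; it is not the PoC's code, and the question counter merely stands in for whatever analysis the model would perform.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for the real analysis step (assumption, not PoC code).
Analyse = Callable[[List[str]], str]


def count_questions(utterances: List[str]) -> str:
    """Toy analysis: count utterances that end with a question mark."""
    n = sum(u.strip().endswith("?") for u in utterances)
    return f"{n} question(s) in {len(utterances)} utterance(s)"


def process_batch(transcript: List[str], analyse: Analyse) -> str:
    """Batch: wait for the complete transcript, then analyse it once."""
    return analyse(transcript)


def process_stream(
    utterances: Iterable[str], analyse: Analyse, window: int = 3
) -> Iterator[str]:
    """Streaming: emit a result after each utterance, using only recent context."""
    recent: List[str] = []
    for utterance in utterances:
        recent.append(utterance)
        recent = recent[-window:]  # keep a sliding window of the latest turns
        yield analyse(recent)


transcript = [
    "Tell me about your last project?",
    "I led a migration.",
    "What went wrong?",
    "We underestimated testing.",
]
print(process_batch(transcript, count_questions))
for update in process_stream(transcript, count_questions, window=2):
    print(update)
```

In this sketch the batch path yields one complete result after the interview, which suits a post-interview scorecard, while the streaming path yields many partial results over time, which suits in-call prompts but means the interface has to cope with incomplete context.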

The PoC integrated with Greenhouse (existing ATS) and Zoom, creating two complementary experiences: a real-time interview assistant that runs during the call, and a post-interview scorecard evaluator that processes transcripts against the job rubric automatically.

Additional functionalities — like note-taking alongside AI recommendations — were added to preserve and document the human factor, ensuring AI augmented rather than replaced interviewer judgment.

Governing and evaluating AI performance

A key part of this project was designing the evaluation framework itself — not just the product. SME raters independently graded the same interview data, and AI performance was measured against their assessments using Cohen's Kappa (a statistical measure of agreement beyond chance).
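For readers unfamiliar with the metric, here is a minimal sketch of the standard two-rater Cohen's Kappa calculation. The ratings are made up for illustration and are not data from the PoC; the function name `cohens_kappa` and the labels `"meets"` / `"not"` are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's Kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; Kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings (illustrative only, not PoC data).
sme = ["meets", "meets", "not", "meets", "not", "meets", "meets", "not", "meets", "meets"]
ai = ["meets", "meets", "not", "meets", "meets", "meets", "meets", "not", "meets", "not"]

print(round(cohens_kappa(sme, ai), 2))   # raw agreement is 0.80, Kappa is lower
print(cohens_kappa(sme, sme))            # identical ratings give 1.0
```

The point of the correction is visible in the example: two raters who agree on 8 of 10 items still score only about 0.52, because a large share of that agreement would be expected by chance when most items receive the same label.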

Baseline

83% accuracy

Without any calibration, the AI achieved 83% accuracy when evaluating whether candidates met the criteria. Human rater agreement was 81% (Kappa 0.54, moderate).
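As a rough check on why 81% raw agreement counts only as moderate, the standard Kappa definition can be rearranged to recover the implied chance agreement. This assumes the 81% figure is the raw observed agreement and that Kappa was computed with the standard two-rater formula; the symbols p_o (observed agreement) and p_e (chance agreement) are introduced here and do not appear in the original write-up.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_e = \frac{p_o - \kappa}{1 - \kappa}
    = \frac{0.81 - 0.54}{1 - 0.54}
    \approx 0.59
```

Under those assumptions, roughly 59% agreement would be expected by chance alone, so the raters were only part of the way from chance to perfect agreement.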

After Human Calibration

84% + perfect agreement

After clarifying rubric interpretation with the human raters, AI accuracy improved to 84% and agreement between human raters reached 100% (Kappa 1.0).

After Model Calibration

90% accuracy

After refining the prompts and context window, AI accuracy reached 90%, while the human raters remained in perfect agreement (Kappa 1.0).

Next Steps

User validation

The PoC identified the right scope. Next step: prioritize that scope and validate it with end users, confirming it addresses real needs before scaling.

90%
AI Accuracy
After model calibration, up from 83% baseline
1.0
Kappa Score
Perfect agreement between human raters after calibration
PoC
Validated for pilot
Next step: validation with end users across global interviewing teams

"This would make our lives so much easier."

— Early feedback from Thoughtworks interviewers during concept review
