Candidate evaluation is time-consuming, inconsistent, and prone to bias. This was a cross-functional initiative with Design, Engineering, and AI21 Labs to research the hiring workflow end to end and design an AI-augmented process without losing the human judgment that hiring requires.
Thoughtworks's global Talent team identified a set of compounding challenges in their interview process: high cognitive load on interviewers, inconsistent scorecard quality, difficulty maintaining a holistic view of candidates across multiple interviewers, and no systematic mechanism for interviewer feedback and improvement.
The core tension: how do you make hiring more consistent and scalable without introducing automation that removes the human judgment that makes good hiring possible?
"It would help so much with data hygiene. We had a case last month where an interviewer left the company and didn't document the interview properly — we ended up with a blind spot."
Before proposing any AI solution, I mapped the complete hiring workflow — from sourcing through to candidate feedback — identifying every touchpoint, pain point, and decision point. This was done using traditional research: stakeholder interviews across Talent, Engineering, and Recruiting leadership.
Pre-Interview Pain
High cognitive load
Interviewers needed to familiarize themselves with candidate info, review criteria, and mentally prepare, all before the interview started. There was no structured support.
During Interview Pain
Split attention
Interviewers were simultaneously conducting the interview, tracking questions, taking notes, and monitoring time — degrading quality across all four.
Post-Interview Pain
Low scorecard quality
Scorecards were filled from memory, often missing attribute justifications. Without good scorecards, subsequent interviewers couldn't build on previous sessions.
Systemic Pain
No feedback loop
Interviewers received no systematic feedback on their interviewing quality. Skill gaps compounded over time with no visibility or improvement mechanism.
Not every pain point warranted AI. I used an interaction model framework to identify where AI could augment (not replace) human judgment — evaluating each opportunity against four AI roles:
AI as Assistant
Providing contextual information, prompting interviewers with criteria, surfacing candidate data — reducing cognitive load without replacing judgment.
AI as Automator
Parsing interview transcripts, pre-filling scorecards based on what was said — reducing post-interview admin while leaving final judgment to the human.
AI as Coach
Providing live feedback on interviewer performance — question quality, attribute coverage, potential bias signals — enabling real-time improvement.
AI as Evaluator
Checking AI-generated recommendations against human SME ratings to measure accuracy, bias, and calibration — a governance layer, not a replacement.
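To make the Automator role concrete, here is a minimal sketch of a pre-filled scorecard entry in which the AI only drafts a suggestion and the interviewer supplies the final rating. Every name in it (`ScorecardEntry`, `ai_suggestion`, `final_rating`, `ready_to_submit`) is hypothetical and chosen for illustration; it does not describe the PoC's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScorecardEntry:
    """One rubric attribute on a scorecard (hypothetical structure)."""
    attribute: str
    ai_suggestion: str                  # drafted from the transcript, e.g. "meets"
    evidence: str                       # transcript excerpt backing the suggestion
    interviewer_notes: str = ""         # the human's own notes, kept beside the AI draft
    final_rating: Optional[str] = None  # stays None until the interviewer decides

    def confirm(self, rating: str, notes: str = "") -> None:
        """Record the interviewer's decision; it may differ from the AI draft."""
        self.final_rating = rating
        if notes:
            self.interviewer_notes = notes


def ready_to_submit(entries: list) -> bool:
    """A scorecard is complete only when a human has rated every attribute."""
    return all(entry.final_rating is not None for entry in entries)


# Illustrative values only, not data from the project.
entry = ScorecardEntry("collaboration", "meets", "We paired on the migration.")
entry.confirm("misses", notes="The answer stayed generic when probed.")
```

Keeping `ai_suggestion` and `final_rating` as separate fields means an override is preserved rather than overwritten, which is also what makes a later comparison of AI drafts against human ratings possible.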
I paired with developers to understand technical feasibility first, specifically how different transcript processing modes (batch vs. streaming) would lead to different user interactions and expectations. This shaped the prototype architecture before a single screen was designed.
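The difference between the two processing modes can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's code: `RUBRIC` is a made-up rubric, and `attributes_in` uses naive keyword matching as a stand-in for whatever model call the real system makes.

```python
from typing import Iterable, Iterator

# Hypothetical rubric: attribute name -> keywords suggesting it was discussed.
RUBRIC = {
    "collaboration": {"team", "pairing"},
    "technical depth": {"architecture", "tradeoff"},
}


def attributes_in(utterance: str) -> set[str]:
    """Stand-in for the model: keyword matching on a single utterance."""
    words = set(utterance.lower().split())
    return {attr for attr, keywords in RUBRIC.items() if words & keywords}


def process_batch(transcript: list[str]) -> set[str]:
    """Batch: runs once after the call and returns a single final result."""
    covered: set[str] = set()
    for utterance in transcript:
        covered |= attributes_in(utterance)
    return covered


def process_stream(utterances: Iterable[str]) -> Iterator[tuple[str, set[str]]]:
    """Streaming: yields (newly covered attribute, attributes still missing)
    as soon as an utterance covers something new."""
    covered: set[str] = set()
    for utterance in utterances:
        for attr in sorted(attributes_in(utterance) - covered):
            covered.add(attr)
            yield attr, set(RUBRIC) - covered


transcript = ["Tell me about a team conflict", "We debated an architecture tradeoff"]
final_coverage = process_batch(transcript)
live_updates = list(process_stream(transcript))
```

The batch function suits a post-interview experience where the interviewer waits for one complete answer. The generator suits an in-call experience, because each update can prompt the interviewer about attributes not yet covered while there is still time to ask.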
The PoC integrated with Greenhouse (the existing ATS) and Zoom, creating two complementary experiences: a real-time interview assistant that runs during the call, and a post-interview scorecard evaluator that automatically processes transcripts against the job rubric.
Additional functionalities — like note-taking alongside AI recommendations — were added to preserve and document the human factor, ensuring AI augmented rather than replaced interviewer judgment.
A key part of this project was designing the evaluation framework itself — not just the product. SME raters independently graded the same interview data, and AI performance was measured against their assessments using Cohen's Kappa (a statistical measure of agreement beyond chance).
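For readers unfamiliar with the metric, Cohen's Kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two raters. The ratings are invented for illustration and are not data from this project.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; Kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings from two SME raters on eight interview excerpts.
sme_1 = ["meets", "meets", "misses", "meets", "misses", "meets", "misses", "meets"]
sme_2 = ["meets", "misses", "misses", "meets", "misses", "meets", "meets", "meets"]
kappa = cohens_kappa(sme_1, sme_2)
```

In this example the raters agree on 6 of 8 items (p_o = 0.75), yet chance alone would produce p_e of about 0.53, so Kappa comes out near 0.47. The same logic explains how a high raw agreement can pair with only a moderate Kappa: if the 81% baseline figure is raw observed agreement, a Kappa of 0.54 would imply chance agreement of roughly 0.59.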
Baseline
83% accuracy
Without any calibration, the AI achieved 83% accuracy when evaluating whether candidates met the criteria. Human rater agreement was 81% (Kappa 0.54, moderate).
After Human Calibration
84% + perfect agreement
After clarifying rubric interpretation with human raters, accuracy improved to 84% and human agreement reached 100% (Kappa 1.0).
After Model Calibration
90% accuracy
After optimizing the prompts and context window, AI accuracy reached 90%, with perfect agreement among the human raters maintained.
Next Steps
User validation
The PoC identified the right scope. Next step: prioritize and validate with end users — ensuring it addresses real needs before scaling.
"This would make our lives so much easier."
— Early feedback from Thoughtworks interviewers during concept review