Why interviews aren't enough: the role-specific work samples we use to test skills before hiring

Work samples predict on-the-job performance better than interviews because they simulate a small, realistic version of the actual work. Good tests are time-boxed (2 to 4 hours), use anonymized fake data, focus on judgment and approach over polish, and share criteria upfront. Bad tests are abstract puzzles, multi-day take-homes, or trivia quizzes that don't reflect the real job.

Why interviews alone don't tell us enough

Interviews are famously weak predictors of on-the-job performance. Decades of organizational research have consistently shown that unstructured interviews measure how well someone communicates about their work, not whether they can actually do the work. A confident candidate with a polished narrative can interview well and deliver badly. A less polished candidate can interview worse and deliver far better.

Work samples (small, job-like tasks done under realistic conditions) outperform interviews, reference checks, and credentials as predictors of real performance. That's the research finding, but more importantly, it matches our direct experience. The candidates who excel in our interviews don't map cleanly to the ones who excel on the team. Adding a role-specific skill test closes that gap.

This article covers how we design those tests and what we run for each core role at SUFFIX. The goal isn't gatekeeping. It's making sure both sides have the clearest possible picture before committing to working together.

The core principle: test what the job actually does

A good work sample simulates a small, realistic version of the actual job. A bad one is an abstract puzzle, a trivia quiz, or a scenario so contrived that it doesn't resemble anything the person will do after they're hired.

This principle runs through every test we use. Before designing a test, we ask: what does this role actually do on a typical Tuesday? What decisions do they make? What outputs do they produce? What judgment calls do they get right or wrong? Then we design a test that exercises those specific muscles, at a scale that fits a few hours of work.

The side benefit of this principle: the tests also serve as role previews. A candidate who finds the test engaging probably enjoys the actual work. A candidate who grinds through it probably wouldn't enjoy doing it every day either. Self-selection works in both directions, and we want it to.

Role-specific tests

Each role test follows the same four-part structure: what the test is, what it simulates, what we're evaluating, and what a good response looks like.

UX/UI Designer

What the test is: A small design brief covering a specific screen or flow, delivered in Figma. The candidate completes the design and presents it to the team in a short working session.

What it simulates: The daily work of translating a brief into a structured design artifact, then communicating the thinking behind it to teammates and clients.

What we're evaluating: Design thinking (not just visual polish), Figma proficiency at a working level, the structural decisions that show up in information architecture and layout, and the ability to articulate why choices were made. Presentation quality matters because designers at SUFFIX spend meaningful time explaining decisions in client meetings.

What a good response looks like: The candidate makes defensible structural decisions, uses Figma competently (components, auto-layout, consistent spacing), and walks through their reasoning in a way that invites discussion. A rough but thoughtful design with clear reasoning beats a polished design with no articulated rationale.

Front-end Developer

What the test is: A small build task against a design spec. The candidate implements a component or page section with specific behavior and interactions.

What it simulates: The daily work of taking a design and making it real, including the decisions that aren't on the mockup (responsive behavior, state handling, accessibility considerations, code organization).

What we're evaluating: JavaScript proficiency, tool and framework familiarity appropriate to the role, how the candidate structures their code, and what they do when the spec has gaps. The last point matters: real specs always have gaps, and how a developer handles them says more than technical correctness alone.

What a good response looks like: Clean, readable code that fulfills the spec, sensible choices in ambiguous areas with the reasoning noted, and a working demo. We care about how they got there, not whether it matches our internal style guide exactly.

Back-end Developer

What the test is: A small system design or implementation task: model a data structure, implement a basic API endpoint, or reason about a specific technical problem involving database queries and data handling.

What it simulates: The judgment-heavy parts of back-end work: making decisions about data models, security, and system structure before anyone sees the interface.

What we're evaluating: Language proficiency, database and query thinking, awareness of common security considerations (input validation, authentication, data exposure), and the ability to justify trade-offs rather than default to one pattern.

What a good response looks like: A working solution, notes on what they would do differently at production scale, and awareness of the security and scalability implications of their choices. We're testing thinking, not memorization of frameworks.
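To make "input validation" and "data exposure" concrete, here is a minimal Python sketch of the kind of handling such a task might call for. It is an illustration only, not the actual SUFFIX test: the `Contact` type, the `parse_contact` function, the field names, and the length limit are all hypothetical.

```python
import re
from dataclasses import dataclass

# Deliberately simple email check for illustration; real systems often
# rely on a confirmation email rather than a strict pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class Contact:
    name: str
    email: str


def parse_contact(payload: dict) -> Contact:
    """Validate an untrusted request payload before it touches the database."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be an object")

    name = payload.get("name")
    email = payload.get("email")

    if not isinstance(name, str) or not (1 <= len(name.strip()) <= 100):
        raise ValueError("name must be 1-100 characters")
    if not isinstance(email, str) or not EMAIL_RE.match(email):
        raise ValueError("email is not valid")

    # Only the fields we expect are copied over, so extra keys such as
    # "is_admin" never reach the stored record.
    return Contact(name=name.strip(), email=email.lower())
```

A candidate's note on what this leaves out (authentication, rate limiting, stricter email handling) is the kind of production-scale awareness described above.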

Digital Marketer

What the test is: A strategic brief: given a fictional brand, propose a campaign approach with target audience, channel mix, creative angle, and how success would be measured.

What it simulates: The work of translating business goals into campaign strategy, which is what a marketer does at the start of every client engagement.

What we're evaluating: Strategic reasoning, familiarity with the current digital marketing landscape, creativity balanced with practicality, and problem-solving when constraints are unclear.

What a good response looks like: A specific, defensible strategy rather than a generic "awareness plus conversion" plan. Clear audience definition, channel choices that match the audience, measurable success criteria, and realistic resource assumptions. We're looking for judgment, not buzzwords.

Digital Media Optimizer

What the test is: A scenario analysis: given a set of fake campaign performance data across Google Ads, Meta Ads, and LinkedIn Ads, identify what's working, what isn't, how to reallocate budget, and what to recommend to the client next.

What it simulates: The weekly rhythm of a media optimizer's real job: read the numbers, find the signal, adjust the spend, explain the decision.

What we're evaluating: Analytical skill with realistic ad platform data, KPI literacy, budget allocation reasoning, and the ability to extract actionable lessons from messy performance data. Also: how the candidate communicates findings to a hypothetical non-technical client.

What a good response looks like: Clear identification of what's performing and what isn't, a reallocation recommendation backed by the data, and a short client-facing summary that explains the reasoning in plain language. Optimizers at SUFFIX need both sides: the analytical reasoning and the communication.
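As a rough illustration of the arithmetic behind "what's working and what isn't," the Python sketch below computes cost per acquisition (CPA) and return on ad spend (ROAS) for three platforms and ranks them. Every number is fabricated, and the `PlatformStats` type and helper functions are hypothetical names chosen for this example; the real exercise uses its own data set and leaves the choice of method to the candidate.

```python
from dataclasses import dataclass


@dataclass
class PlatformStats:
    name: str
    spend: float       # total spend in the period
    conversions: int   # tracked conversions
    revenue: float     # revenue attributed to the platform

    @property
    def cpa(self) -> float:
        # Cost per acquisition; infinite when nothing converted.
        return self.spend / self.conversions if self.conversions else float("inf")

    @property
    def roas(self) -> float:
        # Return on ad spend; zero when nothing was spent.
        return self.revenue / self.spend if self.spend else 0.0


# Fabricated figures, for illustration only.
campaigns = [
    PlatformStats("Google Ads", 4000.0, 80, 12000.0),
    PlatformStats("Meta Ads", 3000.0, 40, 6000.0),
    PlatformStats("LinkedIn Ads", 3000.0, 15, 4500.0),
]


def rank_by_roas(stats):
    """Return platforms ordered from highest to lowest ROAS."""
    return sorted(stats, key=lambda s: s.roas, reverse=True)


def summarize(stats):
    """One plain-language line per platform, best performer first."""
    return [f"{s.name}: CPA {s.cpa:.2f}, ROAS {s.roas:.2f}" for s in rank_by_roas(stats)]


for line in summarize(campaigns):
    print(line)
```

A ranking like this is only the starting point. The judgment we look for is in what the candidate does next: whether they question attribution, consider audience differences between platforms, and explain the reallocation in terms a client would follow.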

Account Executive

What the test is: After the interview ends, the candidate is asked to write a meeting summary of the interview itself and send it back within a defined window.

What it simulates: The daily work of an AE: listening to a conversation with a client or internal team, capturing what mattered, organizing it into decisions, action items, and open questions, and sending a clear summary to everyone involved.

What we're evaluating: Listening, synthesis, written communication, structural thinking under mild time pressure, and the ability to distinguish what mattered from what was said. All the core skills of the role, tested on a real conversation the candidate already has full context for.

What a good response looks like: A clearly structured summary with decisions, action items (with owners and dates where implied), open questions, and next steps. Not a transcript. Not a paragraph essay. The format the AE will send every day on the job.

This is our flagship example because the test is elegant: the raw material is shared, the candidate already has context, the skill is exactly what the job requires, and the output is directly comparable to what we'd send a real client. No simulation artifacts, no uncertainty about whether the test reflects the job.

Principles of a fair work-sample test

Work-sample testing is only useful if it's designed fairly. Unfair tests produce worse hires, damage the candidate experience, and select for the wrong things. Our principles:

Time-boxed: The test fits within 2 to 4 hours of focused work. Longer than that starts selecting for candidates with excess free time rather than for skill.

Paid when the effort is substantial: For anything beyond a simple exercise, we compensate the candidate for their time. A free day of work is not an acceptable ask.

Anonymized fake data: We do not use real client work as test material. The scenarios are fabricated or heavily anonymized. If we wouldn't show it to a client, we don't show it to a candidate either.

Judgment and approach over polish: We evaluate what the candidate thinks and why, not whether their output is production-ready. Rough work with strong reasoning beats polished work with weak reasoning.

Shared criteria upfront: The candidate knows what we're looking for before they start. Surprise evaluation criteria are a form of gatekeeping, not assessment.

Feedback regardless of outcome: Candidates who put meaningful time into a test deserve meaningful feedback, whether they're hired or not. Ghosting candidates after a work sample is both rude and bad for the industry.

What we deliberately don't test

Equally important: the things we decided not to include.

Portfolio-only evaluation: A portfolio shows past work, not current capability. Past work may have been collaborative, may be dated, or may not reflect how the person works under our constraints. Portfolios are useful context, not sufficient signal.

Trivia-style technical quizzes: Knowing the name of every framework or the exact syntax of a rarely-used API tests memorization, not judgment. The internet exists. What matters is how someone thinks when they hit something they don't know.

Highly contrived fake scenarios: If the test couldn't plausibly happen at the job, it's not testing for the job. Creative puzzles look rigorous and predict nothing useful.

Multi-day take-homes: Anything requiring more than half a day of focused work shifts the assessment toward who has time rather than who has skill. It also degrades the candidate experience and biases the applicant pool.

Whiteboard algorithm questions for non-algorithmic roles: These test for a specific kind of performance under artificial pressure. They don't predict whether someone can do the actual job.

What this looks like from the candidate's side

Good hiring is reciprocal. The test tells us whether the candidate can do the work. It also tells them what the work actually looks like.

This matters in both directions. A candidate who completes the test and finds the work energizing is a strong hire signal: they can do it and they enjoy it. A candidate who completes the test and realizes the work isn't what they expected can self-select out before either side commits further. That self-selection is a feature, not a failure. Mis-hires are expensive for everyone, and the test prevents them by making the work visible.

The candidate experience we aim for: clear brief, realistic scope, respectful time commitment, transparent criteria, and actual feedback when the process ends. When we do this well, candidates who don't get the role still leave with a useful sense of what working at SUFFIX would have been like, and occasionally come back for a future opening when the timing is better.

FAQ

Why use work-sample tests instead of just interviews?

Interviews mostly measure how well someone talks about their work, which is why they are weak predictors of on-the-job performance. Work samples measure how someone actually does the work, under conditions that resemble the real job. The gap between "interviews well" and "performs well" is larger than most teams think, and work samples are how you close it. That doesn't make the interview redundant: the interview handles culture and communication fit, and the work sample handles capability.

How long should a pre-hire skill test be?

Two to four hours of focused work. Longer than that starts selecting for candidates with excess free time rather than for skill, and it degrades the candidate experience to the point where strong candidates drop out of the process. If the role genuinely needs a longer evaluation, break it into stages (a short initial test and a more substantial paid project for finalists), and be explicit about what each stage is testing for and how long it should take.

Should candidates be paid for time spent on skill tests?

For short exercises under a couple of hours, it is standard not to pay, though the candidate still deserves clear scoping and timely feedback. For anything more substantial, yes. A paid day of work respects the candidate's time, attracts a broader applicant pool (including candidates who can't afford to work for free), and signals that the organization treats candidates as people, not as free labor. Paying also discourages teams from scoping tests larger than necessary.

What makes a good work-sample test for a creative or strategic role?

It should simulate the actual work, not a made-up puzzle version of it. A strategy test should give the candidate a realistic brief and ask for a strategic recommendation, including the trade-offs. A creative test should give them a real-feeling problem and ask how they would approach it, with reasoning. The evaluation focuses on judgment (why these choices, what was considered and rejected, what the trade-offs are) rather than execution polish. Creative and strategic roles especially reward tests that reveal thinking, not just output.

Writer

Account Executive

Supasuta Netrungsee