There's a version of AI-generated interview questions that makes the hiring process worse. You know it when you see it: generic behavioral questions pulled from a training set of common HR content, rephrased slightly for the role title, and delivered with no connection to what the specific job actually requires. "Tell me about a time you led a project" is not a better question just because an AI generated it.

The failure mode is structural, not incidental. Question generation models trained on broad hiring data produce questions that are statistically common, so they surface competencies that appear frequently in job descriptions in general rather than competencies specific to this role, this company, or the position's scope at this stage. The result is questions that would be plausible for almost any professional role, and questions that fit any role produce answers that are hard to compare across candidates for a specific position.

Why Generic Questions Produce Unusable Answers

The comparison problem is the practical consequence of using unanchored AI question generation. When two candidates both answer "tell me about a time you dealt with ambiguity," one might interpret the question as being about project scope, the other as being about organizational change. Both give strong answers, but to different questions. When you try to score them against each other, the answers look comparable in structure but cover entirely different domains in substance.

In our experience building the JD-driven question generator inside Proofglint, this problem becomes obvious the moment you look at what actual JDs contain versus what generic question banks ask about. A mid-market SaaS account executive role that explicitly requires "experience with multi-stakeholder procurement decisions involving IT, legal, and finance" is not well-served by a question about "building relationships with customers." The JD has told you exactly what kind of relationship-building matters. A generic question bank doesn't know that.

The question "Walk me through how you've navigated a sales process that required sign-off from IT, legal, and the economic buyer simultaneously" produces a fundamentally different signal. It requires candidates to demonstrate the specific competency the role needs, not a general one. It makes scoring straightforward — either the candidate has done this and can describe it with useful specificity, or they haven't.

What JD-Anchored Generation Actually Does

When Proofglint generates questions from a job description, the process starts with extraction rather than generation. The system reads the JD and identifies three layers:

  1. Required technical skills and tools — Explicitly named in the JD. If the JD says "proficiency in dbt and Airflow," questions should verify that proficiency specifically, not general data engineering skill.
  2. Behavioral competencies implied by the responsibilities — Not always stated directly. "Lead cross-functional initiatives to improve data quality" implies project ownership, stakeholder management, and the ability to drive action without direct authority. The JD names the responsibility; the extraction layer identifies the competencies underneath it.
  3. Context-specific constraints — Stage, team size, tooling environment. A "data engineer" at a ten-person startup has different scope than the same title at a 500-person company with a dedicated platform team. These context signals, when present in the JD, should shape the questions.

Only after that extraction does the generation step happen — and the generated questions are anchored to the specific role profile, not to a general role category. This produces questions that are harder to answer with a stock response and more likely to surface genuine role-relevant signal from candidates who have done the actual work.
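To make the extract-then-generate order concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Proofglint's actual implementation: the `RoleProfile` class, its field names, and `build_generation_prompt` are hypothetical, and the extraction step itself is assumed to have already run. The point it shows is that the generation step receives the JD's own phrasing, layer by layer, rather than a role title.

```python
from dataclasses import dataclass, field


@dataclass
class RoleProfile:
    """Hypothetical container for the three extraction layers."""
    # Layer 1: skills and tools named explicitly in the JD.
    technical_skills: list[str] = field(default_factory=list)
    # Layer 2: JD responsibility -> competencies implied by it.
    behavioral_competencies: dict[str, list[str]] = field(default_factory=dict)
    # Layer 3: stage, team size, tooling environment.
    context_constraints: list[str] = field(default_factory=list)


def build_generation_prompt(profile: RoleProfile, max_questions: int = 5) -> str:
    """Assemble a prompt that quotes the JD's own phrasing for each layer."""
    lines = [f"Write at most {max_questions} interview questions for this role."]
    lines.append("Verify these named skills specifically:")
    lines += [f'- "{skill}"' for skill in profile.technical_skills]
    lines.append("Probe the competencies implied by these responsibilities:")
    for responsibility, competencies in profile.behavioral_competencies.items():
        lines.append(f'- "{responsibility}" -> {", ".join(competencies)}')
    lines.append("Scope every question to this context:")
    lines += [f'- "{constraint}"' for constraint in profile.context_constraints]
    return "\n".join(lines)


# Example values taken from the JD phrases quoted in this article.
profile = RoleProfile(
    technical_skills=["proficiency in dbt and Airflow"],
    behavioral_competencies={
        "Lead cross-functional initiatives to improve data quality": [
            "project ownership",
            "stakeholder management",
            "driving action without direct authority",
        ]
    },
    context_constraints=["ten-person startup"],
)
prompt = build_generation_prompt(profile)
print(prompt)
```

Keeping the responsibility text verbatim as the dictionary key is a deliberate choice in this sketch: the question generator sees the hiring manager's wording, not a paraphrase of it.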

Accuracy Benchmarks for JD-Matched Questions

We ran an internal evaluation comparing JD-anchored questions to a generic question bank across 200 candidate flows processed through Proofglint during our design-partner pilot. The metric we measured was inter-rater agreement on candidate scores — the degree to which two independent reviewers, scoring the same candidate response, reached the same result.

For generic questions, the inter-rater agreement rate was 54%. For JD-anchored questions covering the same competency areas, it was 79%. That 25-point difference reflects a real difference in question quality: when the question is specific enough, the answer has clear distinguishing features that reviewers agree on. When the question is generic, reviewers bring their own mental model of what a strong answer looks like, and those models diverge.
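For readers who want to reproduce this kind of measurement, the simplest reading of "reached the same result" is raw percent agreement: the share of responses on which both reviewers gave the identical score. The Python sketch below assumes that definition; the function name and the toy scores are invented for illustration and are not pilot data.

```python
def agreement_rate(reviewer_a: list[int], reviewer_b: list[int]) -> float:
    """Share of responses on which two reviewers gave the identical score."""
    if len(reviewer_a) != len(reviewer_b):
        raise ValueError("both reviewers must score the same responses")
    if not reviewer_a:
        raise ValueError("no scores to compare")
    matches = sum(a == b for a, b in zip(reviewer_a, reviewer_b))
    return matches / len(reviewer_a)


# Toy scores on a 1-5 scale, invented for illustration only.
reviewer_a = [4, 3, 5, 2, 4]
reviewer_b = [4, 2, 5, 2, 3]
rate = agreement_rate(reviewer_a, reviewer_b)
print(f"{rate:.0%}")  # 3 of 5 scores match, so this prints 60%
```

Raw percent agreement does not correct for agreement that would occur by chance. A chance-corrected statistic such as Cohen's kappa is a common alternative when the scoring scale has only a few levels.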

We also measured question relevance ratings from hiring managers — asking them to score each question on whether it tested something they actually cared about for the role. JD-anchored questions averaged a 4.3 out of 5 on that measure. Generic questions from the same bank averaged 2.8. Hiring managers reliably knew the difference, even when they couldn't articulate the mechanism behind it.

A question that tests something generic teaches you almost nothing about whether this candidate fits this role. You need questions that can only be answered well by someone who has actually done the specific work the JD describes.

The JD as a Signal Source, Not Just a Filter

There's a way of thinking about job descriptions as purely administrative documents — legal requirement, ATS keyword stuffing, HR artifact. In our view, that framing wastes the most signal-rich document in the hiring process.

A well-written JD contains the hiring manager's mental model of what success in the role looks like. The specific language used in responsibilities ("drive adoption" vs. "implement" vs. "design and maintain") reflects expectations about ownership level, scope of influence, and seniority calibration. The combination of required and preferred qualifications tells you which skills are non-negotiable and which indicate ceiling rather than floor.

When question generation is grounded in that language — not in a paraphrase of it, but in the actual phrasing the hiring manager used — the questions reflect the hiring manager's actual model. The candidate's response is then interpretable against that model. This is the difference between interview questions that surface signal and interview questions that produce conversation.

Practical Limits and Honest Caveats

JD-anchored generation is only as good as the JD itself. If the job description is generic — and many are, especially at companies without dedicated recruiting ops — the anchoring produces better-than-nothing questions, but not the quality of questions you'd get from a specific, carefully written JD. We've seen this in our own data: JDs under 300 words produce noticeably lower-relevance question outputs than JDs over 500 words with specific responsibility language.
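A team could turn that observation into a rough pre-check before generating questions. The sketch below is a hypothetical helper, not a Proofglint feature: the function name and band labels are assumptions, and the 300 and 500 word cutoffs are simply the two figures mentioned above. Word count is only a proxy, since the specificity of the responsibility language matters as well, and the band between the two cutoffs is not characterized by our data.

```python
def jd_length_band(jd_text: str) -> str:
    """Bucket a JD by whitespace-separated word count (a rough proxy only)."""
    word_count = len(jd_text.split())
    if word_count < 300:
        return "short"      # under 300 words: expect lower-relevance questions
    if word_count > 500:
        return "detailed"   # over 500 words: more material to anchor on
    return "middle"         # 300 to 500 words: not characterized in the article


band = jd_length_band("Own the data platform roadmap. " * 20)
print(band)
```

A "short" result is best treated as a prompt to revisit the JD with the hiring manager rather than as a hard gate.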

This creates a virtuous incentive: companies that invest in writing specific job descriptions get better interview questions from the generation layer. The JD quality improvement that flows from thinking carefully about what a role actually requires is itself valuable — independent of the question generation that follows from it.

The other practical limit is question count. More questions don't produce more signal past a certain point. In our pilot data, flows with five JD-anchored questions produced as much useful candidate differentiation as flows with eight to ten questions, because the quality-per-question was higher. Three to five well-designed JD-anchored questions, tested against the role's actual requirements, will outperform a longer list of generic ones in almost every scenario we've observed.

The goal of AI-generated interview questions should never be volume. It should be relevance: questions that are genuinely hard to answer without the specific experience the role requires, and easy to score because the distinguishing characteristics are clear. JD anchoring is the mechanism that makes that possible. Without it, you're generating questions that sound professional but don't help you make a better hiring decision.