Eval harness for HR policy RAG assistants

Use AI to generate test questions, compare answer evidence, and explain source gaps. Keep employment-impacting decisions, eligibility judgments, and policy exceptions outside full automation.

AI fit: medium. Risk: high. Review: required.

Why this workflow matters

Internal HR assistants look easy until policies differ by location, employee type, effective date, or benefits provider. A useful eval harness checks whether the answer is sourced, current, permission-safe, and escalated when the assistant should not answer.

Inputs and outputs

Inputs

  • Policy question set
  • Source documents and effective dates
  • User role and location metadata
  • Expected citation and escalation rules
  • Prior incorrect answers
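
The inputs above can be captured as one record per test case. This is a minimal sketch, not a prescribed schema: the class names, field names, and the example source ID `leave-policy-uk#s3` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class UserContext:
    role: str       # hypothetical values, e.g. "employee", "manager", "contractor"
    location: str   # hypothetical values, e.g. "UK", "US"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    user: UserContext
    as_of: date                            # date the question is evaluated against
    expected_source_ids: tuple[str, ...]   # passages the answer is expected to cite
    must_escalate: bool = False            # True when a human owner should answer
    prior_wrong_answer: Optional[str] = None  # known bad answer, kept for regression


case = EvalCase(
    case_id="leave-001",
    question="How many days of parental leave do I get?",
    user=UserContext(role="employee", location="UK"),
    as_of=date(2024, 6, 1),
    expected_source_ids=("leave-policy-uk#s3",),
)
```

Keeping the user context and the evaluation date on the case itself means the same question can be repeated for several roles, locations, and dates without changing the harness.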

Outputs

  • Answer-quality report
  • Missing-source list
  • Freshness and permission failures
  • Escalation examples
  • Reviewer notes for policy owners

Current manual workflow

Start by modeling the work as it happens now.

  • Gather real employee questions, policy-owner examples, and known edge cases.
  • Attach the expected source, answer boundaries, effective dates, and escalation rules to each test case.
  • Run the assistant for different employee roles, locations, and permission contexts.
  • Score citations, freshness, refusal behavior, permission handling, and escalation.
  • Send failures to the policy owner with the source text and exact assistant answer.
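
The scoring step can be sketched as a function that turns one assistant run into pass/fail flags. The `AssistantRun` shape and the flag names are assumptions for illustration; freshness and permission checks are left out here because they depend on document metadata.

```python
from dataclasses import dataclass


@dataclass
class AssistantRun:
    cited_source_ids: set   # source IDs the assistant cited
    answered: bool          # assistant gave a substantive answer
    escalated: bool         # assistant handed off to a human owner


def score_run(expected_ids, must_escalate, run):
    """Return pass/fail flags for one test case (hypothetical flag names)."""
    return {
        # every expected passage appears among the citations
        "citation_ok": set(expected_ids) <= run.cited_source_ids,
        # a hand-off happened whenever one was required
        "escalation_ok": run.escalated or not must_escalate,
        # no substantive answer where a hand-off was required
        "refusal_ok": not (must_escalate and run.answered),
    }


run = AssistantRun(cited_source_ids={"handbook#leave"}, answered=True, escalated=False)
flags = score_run({"leave-policy-uk#s3"}, must_escalate=True, run=run)
```

In this example all three flags come back `False`: the expected passage was not cited, and the assistant answered a question it should have escalated.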

Where AI helps

Use models for the work around exceptions: generating variants, comparing evidence, and summarizing failures.

  • Generate realistic question variants from approved policy text.
  • Compare the assistant answer against cited source passages.
  • Cluster failures by stale source, missing permission, unsupported claim, or missing escalation.
  • Draft policy-owner review notes from failing cases.
  • Suggest where the knowledge base needs source cleanup or metadata.
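
Once each failure carries a label, grouping them for review is plain bookkeeping. The sketch below assumes labels were already assigned upstream (by deterministic checks or by a model); the label strings and the `needs_triage` bucket are hypothetical.

```python
from collections import defaultdict

# Hypothetical label names mirroring the four failure types listed above.
KNOWN_LABELS = {
    "stale_source",
    "missing_permission",
    "unsupported_claim",
    "missing_escalation",
}


def cluster_failures(failures):
    """Group failing case IDs by label; unknown labels go to a triage bucket."""
    clusters = defaultdict(list)
    for failure in failures:
        label = failure.get("label")
        key = label if label in KNOWN_LABELS else "needs_triage"
        clusters[key].append(failure["case_id"])
    return dict(clusters)


failures = [
    {"case_id": "leave-001", "label": "stale_source"},
    {"case_id": "leave-007", "label": "missing_escalation"},
    {"case_id": "pay-003", "label": "stale_source"},
    {"case_id": "pay-009", "label": "tone"},
]
clusters = cluster_failures(failures)
```

Routing unrecognized labels to a separate bucket keeps a model-suggested label from silently creating a new failure category.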

System pattern

Keep deterministic checks in charge of the hard boundaries.

Architecture

  • Represent each policy document with owner, effective date, employee population, location, and review cadence.
  • Run retrieval only after role and permission filters are applied.
  • Evaluate answers against source excerpts, required citations, and escalation rules.
  • Use AI to summarize failures after deterministic checks identify citation, freshness, and permission defects.
  • Route high-risk failures to HR, legal, or policy owners before the assistant is expanded.
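
A rough sketch of the first two architecture points: a policy record with the metadata fields named above, and a retrieval function that can only see documents that already passed the population and location filter. The field names and the keyword match are stand-ins, not a real retriever.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyDoc:
    doc_id: str
    owner: str
    effective_from: date
    populations: frozenset   # employee types the policy covers
    locations: frozenset     # locations the policy covers
    review_every_days: int   # review cadence
    text: str


def permitted(docs, employee_type, location):
    """Deterministic scope filter, applied before any retrieval."""
    return [d for d in docs if employee_type in d.populations and location in d.locations]


def retrieve(docs, query, employee_type, location):
    """Naive keyword match (placeholder for a real retriever) over scoped docs only."""
    terms = set(query.lower().split())
    scoped = permitted(docs, employee_type, location)
    return [d for d in scoped if terms & set(d.text.lower().split())]


DOCS = [
    PolicyDoc("handbook-leave", "hr-ops", date(2024, 1, 1),
              frozenset({"employee"}), frozenset({"UK", "US"}), 365,
              "annual leave entitlement for employees"),
    PolicyDoc("contractor-terms", "legal", date(2024, 1, 1),
              frozenset({"contractor"}), frozenset({"UK"}), 180,
              "contractors are not covered by company leave policy"),
]
hits = retrieve(DOCS, "leave policy", "contractor", "UK")
```

Because the filter runs first, the employee handbook never reaches the retriever for a contractor query, so it cannot be cited by mistake.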

Keep deterministic

  • Permission checks before retrieval.
  • Effective-date filtering.
  • Jurisdiction and employee-type routing.
  • Required escalation triggers.
  • Audit logging for reviewed answers.
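
Effective-date filtering is a good candidate for a plain date comparison. The sketch below assumes each document version carries a start date and an optional end date, with the end date treated as exclusive; those conventions and the names are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    doc_id: str
    effective_from: date
    effective_to: Optional[date] = None   # None means still in effect


def in_effect(version, as_of):
    """True when the version applies on the given date (end date exclusive)."""
    started = version.effective_from <= as_of
    not_ended = version.effective_to is None or as_of < version.effective_to
    return started and not_ended


def stale_citations(cited_versions, as_of):
    """Return IDs of cited versions that were not in effect on as_of."""
    return [v.doc_id for v in cited_versions if not in_effect(v, as_of)]


v1 = Version("leave-v1", date(2023, 1, 1), date(2024, 1, 1))
v2 = Version("leave-v2", date(2024, 1, 1))
stale = stale_citations([v1, v2], as_of=date(2024, 3, 1))
```

With an exclusive end date, exactly one version is in effect on the changeover day, which avoids two conflicting sources for the same question.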

Do not fully automate

  • Employment eligibility decisions.
  • Disciplinary or performance decisions.
  • Final interpretation of ambiguous policy language.
  • Exceptions that affect pay, leave, benefits, or protected categories.
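
One way to keep these boundaries out of the model's hands is a fixed rule that forces a hand-off. This sketch assumes an upstream step has already tagged the request type and topics; the tag names are hypothetical and would need to be agreed with HR and legal owners.

```python
# Hypothetical request types that always go to a human owner.
DECISION_TYPES = frozenset({
    "eligibility_decision",
    "disciplinary_decision",
    "performance_decision",
    "ambiguous_policy_interpretation",
})

# Hypothetical topic tags where any exception request needs human review.
PROTECTED_TOPICS = frozenset({"pay", "leave", "benefits", "protected_category"})


def must_escalate(request_type, topics, is_exception_request):
    """Deterministic escalation rule; no model output is consulted."""
    if request_type in DECISION_TYPES:
        return True
    return is_exception_request and bool(PROTECTED_TOPICS & set(topics))
```

For example, `must_escalate("policy_lookup", {"leave"}, False)` is `False`, while the same lookup flagged as an exception request returns `True`.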

Evaluation and controls

A useful workflow design explains how to check the work.

  • Supported-answer rate: answers cite the right source text for the user context.
  • Stale-source rate: no answer relies on a document outside its effective date.
  • Escalation recall: sensitive or ambiguous questions route to a human owner.
  • Permission failure rate: no answer exposes policy content outside the user's access context.
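
The four measures can be computed from per-case results. This is a sketch under assumed field names (`citation_ok`, `stale_source`, `must_escalate`, `escalated`, `permission_failure`); escalation recall is taken over the cases that required escalation and is `None` when there are none.

```python
def summarize(results):
    """Aggregate per-case flags into the four rates (hypothetical field names)."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    required = [r for r in results if r["must_escalate"]]
    recall = (
        sum(r["escalated"] for r in required) / len(required) if required else None
    )
    return {
        "supported_answer_rate": sum(r["citation_ok"] for r in results) / n,
        "stale_source_rate": sum(r["stale_source"] for r in results) / n,
        "escalation_recall": recall,
        "permission_failure_rate": sum(r["permission_failure"] for r in results) / n,
    }


results = [
    {"citation_ok": True, "stale_source": False, "must_escalate": False,
     "escalated": False, "permission_failure": False},
    {"citation_ok": True, "stale_source": False, "must_escalate": True,
     "escalated": True, "permission_failure": False},
    {"citation_ok": True, "stale_source": True, "must_escalate": True,
     "escalated": False, "permission_failure": False},
    {"citation_ok": False, "stale_source": False, "must_escalate": False,
     "escalated": False, "permission_failure": False},
]
report = summarize(results)
```

Here three of four answers are supported, one of four relies on a stale source, and one of the two required escalations was missed.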

  • Permission prefilter (HR systems): retrieval is scoped before model generation starts.
  • Policy owner review (People operations): failed cases map back to a named policy owner.
  • Escalation boundary (Legal or HR lead): questions affecting pay, benefits, discipline, or eligibility can require human review.
  • Freshness cadence (Knowledge owner): documents have effective dates and review dates before they are used.
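
The freshness cadence can be enforced with a simple overdue check that also names the owner to notify. The tuple layout, owner names, and cadence values below are illustrative assumptions.

```python
from datetime import date, timedelta


def review_overdue(last_reviewed, review_every_days, today):
    """True when the document has passed its review date."""
    return today > last_reviewed + timedelta(days=review_every_days)


def overdue_docs(docs, today):
    """Return (doc_id, owner) pairs for documents that should not be used yet."""
    return [
        (doc_id, owner)
        for doc_id, owner, last_reviewed, cadence in docs
        if review_overdue(last_reviewed, cadence, today)
    ]


docs = [
    ("leave-uk", "hr-ops", date(2023, 1, 10), 365),
    ("expenses", "finance", date(2024, 3, 1), 180),
]
blocked = overdue_docs(docs, today=date(2024, 6, 1))
```

Returning the owner alongside the document ID keeps every freshness failure tied to a named person or team.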

Pilot checklist

Test the workflow before widening automation.

  • Choose one low-volume HR policy area with a clear source owner.
  • Create 50-100 questions across employee roles, locations, and common edge cases.
  • Label expected source passages and escalation cases.
  • Run the assistant with permission and effective-date metadata enabled.
  • Review failures with HR and legal before broadening the assistant.
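
Before running the pilot, it can help to check that the question set actually covers the role and location combinations in scope. A minimal sketch, assuming each case is a dict with `role` and `location` keys:

```python
from collections import Counter
from itertools import product


def coverage_gaps(cases, roles, locations):
    """Return (role, location) combinations with no test case."""
    seen = Counter((c["role"], c["location"]) for c in cases)
    return [combo for combo in product(roles, locations) if seen[combo] == 0]


def size_ok(cases, low=50, high=100):
    """Check the set against the 50-100 question range suggested above."""
    return low <= len(cases) <= high


cases = [
    {"role": "employee", "location": "UK"},
    {"role": "employee", "location": "US"},
    {"role": "contractor", "location": "UK"},
]
gaps = coverage_gaps(cases, roles=["employee", "contractor"], locations=["UK", "US"])
```

In this toy set the contractor/US combination has no question, and three cases are well short of the suggested range.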

Synthetic example

An employee asks whether a leave policy applies to contractors in a specific location. The assistant finds a general employee handbook answer, but the eval harness flags a population mismatch and missing escalation. The fix is a deterministic employee-type filter plus an HR owner review path.
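
The two findings in this example can be expressed as a small post-hoc check on the answer's cited document. The function and finding names are hypothetical; the point is only that both flags come from metadata comparisons, not from model judgment.

```python
def check_case(user_type, cited_populations, escalated, must_escalate):
    """Return the list of findings for one answer (hypothetical finding names)."""
    findings = []
    # The cited document does not cover this employee type.
    if user_type not in cited_populations:
        findings.append("population_mismatch")
    # A hand-off was required but the assistant answered on its own.
    if must_escalate and not escalated:
        findings.append("missing_escalation")
    return findings


# Contractor question answered from a handbook that covers employees only.
findings = check_case(
    user_type="contractor",
    cited_populations={"employee"},
    escalated=False,
    must_escalate=True,
)
```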

Sources and review notes

Source context matters when the workflow touches risk.

This page is not employment, privacy, or legal advice. HR assistants can affect employee rights and should be reviewed by qualified HR, privacy, and legal owners before use.

  • Guidance on AI and data protection (UK Information Commissioner's Office): data-protection guidance for organizations using AI systems.
  • Responsible AI in Recruitment (GOV.UK): official guidance on responsible procurement and assurance of AI systems in HR and recruitment.
  • AI Risk Management Framework (NIST): general framework for AI risk management and lifecycle controls.
