The 90% Problem: Why GenAI Alone Can’t Solve Prior Authorization

GenAI can produce a solid first draft of policy-reasoning extracts, but in prior authorization (PA) the remaining 10% is a patient-safety, compliance, and financial disaster. The fix is GenAI with evidence: a Human-in-the-Loop (HITL) QA Sidecar, so every output is citable, auditable, and measurably correct.


The “90% is not good enough” reality

In prior authorization, a single determination often rests on many specific criteria (diagnosis, labs, time windows, step therapies, exclusions, prescriber specialty, quantities, and renewal thresholds). Even if a model is “90% accurate” on each criterion, the probability that every criterion is handled correctly plummets:

  • 10 criteria at 90% each → 0.9^10 ≈ 35% overall correctness.
  • 15 criteria at 90% each → 0.9^15 ≈ 21% overall correctness.

Now flip the target: if you drive per-criterion accuracy to 99.9%, then 10-criterion decisions are 0.999^10 ≈ 99.0% correct.
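As a quick sanity check of that arithmetic (assuming criterion-level errors are roughly independent):

```python
def overall_accuracy(per_criterion: float, n_criteria: int) -> float:
    """Probability that all n independent criteria are handled correctly."""
    return per_criterion ** n_criteria

print(f"{overall_accuracy(0.90, 10):.0%}")   # ~35%
print(f"{overall_accuracy(0.90, 15):.0%}")   # ~21%
print(f"{overall_accuracy(0.999, 10):.1%}")  # ~99.0%
```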


Why “pure GenAI” breaks in PA

  • Narrative PDFs, not data. Policies are prose—time-boxed, exception-heavy, and plan-specific.
  • No provenance. Free-text answers often lack line-level citations back to the exact clause.
  • Non-determinism. Small prompt changes yield different outputs, making audits riskier. Strict regex parsing breaks too, because templates can be misformatted, contain misspellings, or a policy may not be self-contained.
  • Hidden edge cases. Negative conditions (“not for…”, “unless…”) and nested logic (AND/OR) are easy to miss.
  • Compounding error. Even “good” clause recall becomes unacceptable when multiplied across a full determination.

Conclusion: GenAI alone is a great first draft—and an unacceptable final decision system.


The solution: GenAI with Human-grade Verification

We pair GenAI with evidence and guardrails:

  1. Gold Standard dataset (target 99.9% clause accuracy—like Clorox, we can’t claim 100%).
    • Normalize and atomize each policy into clause-level “coverage atoms” (e.g., cnid_lab_a1c_ge_7_90d, cnid_step_any_two_orals_n2, cnid_specialist_endocrinology); see the sketch after this list.
    • Provenance by design: every atom links to {plan, uid, pdf_url}.
    • Versioned snapshots & drift: clause and CNID prevalence over time for each plan/drug/indication.
  2. HITL QA Sidecar (the reviewer cockpit)
    • Side-by-side view: model policy on the left; exact cited clauses on the right, with edit permissions.
    Figure 1 — HITL QA Sidecar (Streamlit). Left: model claims and structured fields (plan, drug, indication, CNIDs). Right: the exact cited clauses with {plan, uid, clause_id, pdf_url} and time windows highlighted. Reviewers accept or correct each field; approvals are versioned and hash logged.
    • Pass/fail rubric: reviewers check scope (plan/drug/indication), logic (AND/OR), thresholds (A1c ≥ 7), auth windows (≤ 90 days), ICD-10 codes, and renewal criteria.
    • Structured outputs only: if a set of criteria doesn’t match, it can be sent back into an earlier phase of the data pipeline.
    • One-click corrections: reviewers can swap in the correct clause; corrections are logged to improve future runs.
  3. Small deterministic math on top
    • Counts, ladders, prevalence: these can all be computed across plans, not guessed.
    • Pre-Check bundles: Likely / Borderline / Unlikely CNIDs per plan-drug pairing (a planned future feature).
    • Doc Pack: auto-generated appeal packets and flags for unmet criteria.
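Returning to the coverage atoms in item 1: the snippet below is a minimal Python sketch of what a clause-level atom and its provenance might look like. The CNID naming and the {plan, uid, pdf_url} tuple come from the description above; every other field name, the example values, and the SHA-256 hashing choice are illustrative assumptions rather than the production schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CoverageAtom:
    # Only the CNID naming and {plan, uid, pdf_url} come from the article;
    # the remaining fields are illustrative assumptions.
    cnid: str          # e.g., "cnid_lab_a1c_ge_7_90d"
    plan: str
    uid: str           # policy document identifier
    pdf_url: str
    clause_id: str     # line-level pointer into the source policy
    statement: str     # normalized clause text

    @property
    def statement_hash(self) -> str:
        # Deterministic hash so audits can confirm the exact approved wording.
        return hashlib.sha256(self.statement.encode("utf-8")).hexdigest()


atom = CoverageAtom(
    cnid="cnid_lab_a1c_ge_7_90d",
    plan="UHC",
    uid="uhc-glp1-2024-q3",                    # hypothetical identifier
    pdf_url="https://example.com/policy.pdf",  # placeholder URL
    clause_id="c-012",
    statement="A1c >= 7% documented within the past 90 days.",
)
print(json.dumps({**asdict(atom), "statement_hash": atom.statement_hash}, indent=2))
```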

Quality you can measure (and promise)

Auditability KPIs

  • Traceability: every published claim retains {plan, uid, clause_id, statement_hash, pdf_url}.
  • Snapshot integrity: criteria and clause hashes match the drug × plan × indication combination at first pass, and each answer is linked to a specific snapshot version (a sketch of this check follows).
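A minimal sketch of how that snapshot-integrity check could work, assuming each published claim carries the provenance tuple and statement_hash described above; the function name and the snapshot layout are hypothetical.

```python
import hashlib


def verify_claim(claim: dict, snapshot: dict[tuple, str]) -> bool:
    """Re-hash the cited clause from the current snapshot and compare it with
    the statement_hash recorded when the claim was published."""
    key = (claim["plan"], claim["uid"], claim["clause_id"])
    clause_text = snapshot.get(key)
    if clause_text is None:
        return False  # cited clause is missing from this snapshot (drift)
    return hashlib.sha256(clause_text.encode("utf-8")).hexdigest() == claim["statement_hash"]
```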

A concrete example (GLP-1s)

Draft claim: “UHC requires two oral agent trials within 90 days and A1c ≥ 7% for initial approval for drug X; renewal requires ≥ 1% A1c reduction.”

Sidecar check items

  • Step therapy atoms present? cnid_step_any_two_orals_n2, window_days=90
  • Lab threshold atom present? cnid_lab_a1c_ge_7_90d
  • Renewal atom present? cnid_renewal_a1c_drop_ge_1
  • Scope correct? (GLP-1, adult T2D)
  • Citations present? Show the exact UHC clause lines; verify no clause contradicts.
  • Decision: Approve or request edit (e.g., step window is 120 days, not 90).

Outcome: The claim ships only if every atom is supported by a cited clause from the current snapshot. Otherwise, it is corrected or rejected.
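As a rough sketch, the “every atom must be supported” gate for this example can reduce to a set comparison. The required CNIDs come from the check items above; the function name and return shape are illustrative.

```python
REQUIRED_ATOMS = {
    "cnid_step_any_two_orals_n2",   # step therapy: two oral agent trials
    "cnid_lab_a1c_ge_7_90d",        # A1c >= 7% within 90 days
    "cnid_renewal_a1c_drop_ge_1",   # renewal: >= 1% A1c reduction
}


def gate_claim(cited_cnids: set[str]) -> tuple[bool, set[str]]:
    """Ship only if every required atom is backed by a cited clause;
    otherwise return the unmet atoms for reviewer correction."""
    missing = REQUIRED_ATOMS - cited_cnids
    return (not missing, missing)


ok, missing = gate_claim({"cnid_step_any_two_orals_n2", "cnid_lab_a1c_ge_7_90d"})
print(ok, missing)  # False {'cnid_renewal_a1c_drop_ge_1'} -> request edit, do not ship
```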


Why this builds trust with clinicians, payers, and manufacturers

  • No black boxes. Every claim is tied to line-level evidence and normalized into a deterministic schema.
  • Deterministic “math”. Counts, restrictiveness ladders, prevalence, and pre-checks are computed from clinical ground truth, not guessed (a small sketch follows this list).
  • Auditable by design. Hashes, timestamps, and snapshot versioning make compliance and root-cause analysis straightforward. Hash-awareness and idempotency are critical facets of provenance.
  • Human-grade standards. The Sidecar enforces a repeatable QA rubric so “90% drafts” become “99%+ decisions.”
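To make the “computed, not guessed” point concrete, here is a small sketch of counting CNID prevalence across plans. The (plan, CNID) observations are invented for illustration; in practice they would come from the versioned snapshots.

```python
from collections import Counter

# Hypothetical (plan, cnid) rows pulled from versioned policy snapshots.
observations = [
    ("UHC",   "cnid_lab_a1c_ge_7_90d"),
    ("UHC",   "cnid_step_any_two_orals_n2"),
    ("Aetna", "cnid_lab_a1c_ge_7_90d"),
    ("Cigna", "cnid_lab_a1c_ge_7_90d"),
]

plans = {plan for plan, _ in observations}
prevalence = Counter(cnid for _, cnid in observations)

for cnid, n in prevalence.most_common():
    print(f"{cnid}: {n}/{len(plans)} plans ({n / len(plans):.0%})")
```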

Call to action

See it live: Request a 10-minute walkthrough of the HITL QA Sidecar (we’ll review a GLP-1 example end to end).

To request the demo, email pharmularyproject@gmail.com.


Bottom line: GenAI gets you a fast, credible first draft. In prior authorization, safety and compliance demand evidence, guardrails, and humans, so accuracy compounds and trust is earned.

© Pharmulary Project

Andrew Vargas, PharmD

About the Author

Andrew Vargas, PharmD is a Clinical Coding Pharmacist and founder of Pharmacist Write, where he translates managed-care and GLP-1 policy into practical insights for patients and professionals.
