The 90% Problem: Why GenAI Alone Can’t Solve Prior Authorization

GenAI can produce a solid first draft of policy-reasoning extracts, but in prior authorization (PA) the remaining 10% error rate is a patient-safety, compliance, and financial disaster. The fix is GenAI with evidence: a Human-in-the-Loop (HITL) QA Sidecar, so every output is citable, auditable, and measurably correct.


The “90% is not good enough” reality

In prior authorization, a single determination often rests on many specific criteria (diagnosis, labs, time windows, step therapies, exclusions, prescriber specialty, quantities and renewal thresholds). Even if a model is “90% accurate”, the combined probability that all criteria are correct plummets:

  • 10 criteria at 90% each → 0.9¹⁰ ≈ 35% overall correctness.
  • 15 criteria at 90% each → 0.9¹⁵ ≈ 21% overall correctness.

Now flip the target: if you drive per-criterion accuracy to 99.9%, then 10-criterion decisions are 0.999¹⁰ ≈ 99.0% correct.
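
The arithmetic is easy to verify. A minimal sketch in Python (the criterion counts and accuracies are the same figures as above):

```python
# Compound correctness: every criterion must be right for the decision to be right.
def decision_accuracy(per_criterion: float, n_criteria: int) -> float:
    return per_criterion ** n_criteria

for acc, n in [(0.90, 10), (0.90, 15), (0.999, 10)]:
    print(f"{n} criteria at {acc:.1%} each → {decision_accuracy(acc, n):.1%} overall")

# 10 criteria at 90.0% each → 34.9% overall
# 15 criteria at 90.0% each → 20.6% overall
# 10 criteria at 99.9% each → 99.0% overall
```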


Why “pure GenAI” breaks in PA

  • Narrative PDFs, not data. Policies are prose—time-boxed, exception-heavy, and plan-specific.
  • No provenance. Free-text answers often lack line-level citations back to the exact clause.
  • Non-determinism. Small prompt changes yield different outputs, which makes audits riskier. Strict regex parsing breaks too: templates can be misformatted, contain misspellings, or a policy may not be self-contained.
  • Hidden edge cases. Negative conditions (“not for…”, “unless…”) and nested logic (AND/OR) are easy to miss.
  • Compounding error. Even “good” clause recall becomes unacceptable when multiplied across a full determination.

Conclusion: GenAI alone is a great first draft—and an unacceptable final decision system.


The solution: GenAI with Human-grade Verification

We pair GenAI with evidence and guardrails:

  1. Gold Standard dataset (target 99.9% clause accuracy—like Clorox, we can’t claim 100%).
    • Normalize and atomize each policy into clause-level “coverage atoms” (e.g., cnid_lab_a1c_ge_7_90d, cnid_step_any_two_orals_n2, cnid_specialist_endocrinology); a minimal sketch of one atom appears after this list.
    • Provenance by design: every atom links to {plan, uid, pdf_url}.
    • Versioned snapshots & drift: clause and CNID prevalence over time for each plan/drug/indication.
  2. HITL QA Sidecar (the reviewer cockpit)
    • Side-by-side view: model policy on the left; exact cited clauses on the right, with edit permissions.
    Figure 1 — HITL QA Sidecar (Streamlit). Left: model claims and structured fields (plan, drug, indication, CNIDs). Right: the exact cited clauses with {plan, uid, clause_id, pdf_url} and time windows highlighted. Reviewers accept or correct each field; approvals are versioned and hash-logged.
    • Pass/fail rubric: reviewers check scope (plan/drug/indication), logic (AND/OR), thresholds (A1c ≥ 7), auth windows (≤ 90 days), ICD-10 codes, and renewal criteria.
    • Structured outputs only: if a set of criteria doesn’t match the expected schema, it is routed back to an earlier phase of the data pipeline.
    • One-click corrections: reviewers can swap in the correct clause; corrections are logged to improve future runs.
  3. Small deterministic math on top
    • Counts, ladders, prevalence: these can all be computed across plans, not guessed (see the second sketch after this list).
    • Pre-Check bundles: Likely / Borderline / Unlikely CNIDs per plan-drug pairing (a planned feature).
    • Doc Pack: auto-generated appeal packets and flags for unmet criteria.
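
To make the clause-level “coverage atom” concrete, here is a minimal sketch of one atom as a Python dataclass. The provenance fields (plan, uid, clause_id, pdf_url) and the CNID naming come from the list above; the class name, the hashing recipe, and all example values are illustrative assumptions, not the production schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CoverageAtom:
    """One clause-level criterion with full provenance (illustrative schema)."""
    cnid: str        # canonical criterion id, e.g. "cnid_lab_a1c_ge_7_90d"
    plan: str        # plan the clause was extracted from
    uid: str         # policy-document uid
    clause_id: str   # line-level pointer into the policy PDF
    pdf_url: str     # source document
    statement: str   # normalized clause text

    @property
    def statement_hash(self) -> str:
        """Stable hash of the normalized clause text, for audit and drift checks."""
        return hashlib.sha256(self.statement.strip().lower().encode()).hexdigest()[:16]

atom = CoverageAtom(
    cnid="cnid_lab_a1c_ge_7_90d",
    plan="UHC",
    uid="uhc-glp1-2025-03",   # hypothetical snapshot uid
    clause_id="p4-l12",       # hypothetical line-level pointer
    pdf_url="https://example.com/uhc_glp1_policy.pdf",  # placeholder URL
    statement="A1c >= 7% documented within the past 90 days",
)
print(atom.statement_hash)  # identical clause text → identical hash across runs
```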
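
The “small deterministic math” layer is then plain counting over those atoms, with no model in the loop. A second sketch, assuming atoms have already been extracted for a few plans (the plan-to-CNID mapping below is made up for illustration):

```python
from collections import Counter

# Hypothetical extraction results: which CNIDs each plan's policy contains.
atoms_by_plan = {
    "UHC":   {"cnid_lab_a1c_ge_7_90d", "cnid_step_any_two_orals_n2"},
    "Aetna": {"cnid_lab_a1c_ge_7_90d"},
    "Cigna": {"cnid_lab_a1c_ge_7_90d", "cnid_step_any_two_orals_n2",
              "cnid_specialist_endocrinology"},
}

# CNID prevalence across plans: counted, not guessed.
prevalence = Counter(cnid for cnids in atoms_by_plan.values() for cnid in cnids)
for cnid, n in prevalence.most_common():
    print(f"{cnid}: required by {n}/{len(atoms_by_plan)} plans")

# A naive restrictiveness ladder: more criteria → more restrictive plan.
ladder = sorted(atoms_by_plan, key=lambda p: len(atoms_by_plan[p]), reverse=True)
print("Most → least restrictive:", ladder)
```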

Quality you can measure (and promise)

Auditability KPIs

  • Traceability: every published claim retains {plan, uid, clause_id, statement_hash, pdf_url}.
  • Snapshot integrity: criteria and clause hashes match the drug × plan × indication combination at first pass. Each answer is linked to a specific version.
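
A minimal sketch of the snapshot-integrity check, assuming claims and snapshots carry the {plan, uid, clause_id, statement_hash} fields listed above (the data structures themselves are illustrative):

```python
# Pinned snapshot: (plan, uid, clause_id) → recorded clause hash.
snapshot = {
    ("UHC", "uhc-glp1-2025-03", "p4-l12"): {"statement_hash": "ab12cd34ef56ab78"},
}

# A published claim carrying its provenance fields.
claim = {
    "plan": "UHC", "uid": "uhc-glp1-2025-03", "clause_id": "p4-l12",
    "statement_hash": "ab12cd34ef56ab78",
    "pdf_url": "https://example.com/uhc_glp1_policy.pdf",  # placeholder URL
}

def verify_claim(claim: dict, snapshot: dict) -> bool:
    """Pass only if the cited clause exists in the pinned snapshot and its
    recorded hash matches the hash stored with the claim."""
    clause = snapshot.get((claim["plan"], claim["uid"], claim["clause_id"]))
    return clause is not None and clause["statement_hash"] == claim["statement_hash"]

print(verify_claim(claim, snapshot))  # True; any edit to the clause flips this to False
```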

A concrete example (GLP-1s)

Draft claim: “UHC requires two oral agent trials within 90 days and A1c ≥ 7% for initial approval for drug X; renewal requires ≥ 1% A1c reduction.”

Sidecar check items

  • Step therapy atoms present? cnid_step_any_two_orals_n2, window_days=90
  • Lab threshold atom present? cnid_lab_a1c_ge_7_90d
  • Renewal atom present? cnid_renewal_a1c_drop_ge_1
  • Scope correct? (GLP-1, adult T2D)
  • Citations present? Show the exact UHC clause lines; verify no clause contradicts.
  • Decision: Approve or request edit (e.g., step window is 120 days, not 90).

Outcome: The claim ships only if every atom is supported by a cited clause from the current snapshot. Otherwise, it is corrected or rejected.
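
As a sketch, the Sidecar’s check items reduce to a deterministic comparison between the atoms the claim asserts and the atoms extracted from the cited clauses. The required and cited sets below are hypothetical; the 90- versus 120-day discrepancy is the edit case from the example:

```python
# Atoms the draft claim asserts (CNID → expected parameters).
required = {
    "cnid_step_any_two_orals_n2": {"window_days": 90},
    "cnid_lab_a1c_ge_7_90d": {},
    "cnid_renewal_a1c_drop_ge_1": {},
}

# Atoms actually extracted from the cited UHC clauses (hypothetical snapshot).
cited = {
    "cnid_step_any_two_orals_n2": {"window_days": 120},  # policy says 120, not 90
    "cnid_lab_a1c_ge_7_90d": {},
    "cnid_renewal_a1c_drop_ge_1": {},
}

def check_claim(required: dict, cited: dict) -> list[str]:
    """Return human-readable failures; an empty list means approve."""
    failures = []
    for cnid, params in required.items():
        if cnid not in cited:
            failures.append(f"missing atom: {cnid}")
            continue
        for key, want in params.items():
            got = cited[cnid].get(key)
            if got != want:
                failures.append(f"{cnid}: {key} is {got}, claim says {want}")
    return failures

for failure in check_claim(required, cited):
    print("REQUEST EDIT:", failure)
# REQUEST EDIT: cnid_step_any_two_orals_n2: window_days is 120, claim says 90
```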


Why this builds trust with clinicians, payers, and manufacturers

  • No black boxes. Every claim is tied to line-level evidence and normalized into a deterministic schema.
  • Deterministic “math”. Counts, restrictiveness ladders, prevalence, and pre-checks are computed from clinical ground truth, not guessed.
  • Auditable by design. Hashes, timestamps, and snapshot versioning make compliance and root-cause analysis straightforward. Hash-awareness and idempotency are critical facets of provenance.
  • Human-grade standards. The Sidecar enforces a repeatable QA rubric so “90% drafts” become “99%+ decisions.”

Call to action

See it live: Request a 10-minute walkthrough of the HITL QA Sidecar (we’ll review a GLP-1 example end to end).

To request the demo, email pharmularyproject@gmail.com.


Bottom line: GenAI gets you a fast, credible first draft. In prior authorization, safety and compliance demand evidence, grammar, and humans—so accuracy compounds and trust is earned.

© Pharmulary Project

About the Author

Andrew Vargas, PharmD is a Pharmacist practicing watchdoggery and founder of Pharmacist Write. He builds coverage intelligence tools and writes about what pharmacy benefits managers would prefer stayed invisible—turning policy into something patients, consultants, and purchasers can actually use.
