dyb

Black-Box AI Safety Evaluation Is Mathematically Underdetermined and Operationally Misapplied

אִם יִרְצֶה הַשֵּׁם (God willing)

Bilar 2026: Three Regimes of Capability Attestation

Srivastava 2026: Fundamental Limits of Black-Box Safety Evaluation

Bilar 2026 is a position paper with proof sketches that operationalizes Srivastava 2026's rigorous lower bounds. It adds the taxonomic framing, category error diagnosis, and practical regime classifier that pure theory lacks. The PRF/OWF distinction reveals complementary threat models: Srivastava's deployment-time privilege vs. Bilar's adversarial construction. Together, they form a complete argument: black-box attestation is mathematically insufficient (Srivastava), and the industry standard of benchmark-based evaluation is structurally the wrong approach (Bilar).


Srivastava 2026 Paper

A landmark paper in AI safety theory. The first-principles derivations, explicit constants, and clean separation of information-theoretic vs. computational barriers make it essential reading for anyone designing evaluation protocols. The white-box escape hatch is practical but expensive; the trapdoor separation is a sobering reminder that deployment privilege can hide in plain sight.

I. Tiered TL;DR Summaries

Expert (Theorist/Researcher)

Core claim: Black-box safety evaluation of AI systems is fundamentally underdetermined when models implement latent context-conditioned policies—internal binary state variables that are rare under evaluation (mass ε) but prevalent under deployment (mass δ). Srivastava proves via Le Cam's two-point method (explicit constants), Yao's minimax principle with m-wise independent hash constructions, and trapdoor OWF assumptions that no black-box evaluator can reliably estimate deployment risk when expected trigger exposure mε = O(1). The white-box escape requires probe quality γ and sample complexity Θ(1/(γ²ε_R²)) with explicit constant 18. Every bound is derived from first principles with no asymptotic hand-waving.

Practitioner (ML Engineer/Safety Auditor)

Bottom line for your red team: If your evaluation budget is m queries and the dangerous behavior you're hunting for occurs with probability ε in your test set, you're statistically blind until m ≈ 1/ε. For ε = 0.001 (toxic prompts in a clean benchmark), you need ~3,000 queries for 95% confidence of even detecting the trigger, let alone measuring deployment risk. Adaptive strategies don't help in the worst case. The only reliable escape is opening the box—inspecting internal activations with a probe of quality γ, which costs Θ(1/(γ²ε_R²)) samples. If your org relies purely on black-box API testing for safety certification, this paper proves you're flying blind.

General Public (Policy-Maker/Citizen)

Plain English: AI companies test their models before release by asking them questions and checking the answers. This paper proves that if the model has a hidden "switch" that flips dangerous behavior on—but only in situations that almost never appear during testing—then no amount of clever questioning can catch it. It's like testing a car only in sunny weather, then selling it to someone who drives exclusively in snow. The math shows this isn't a bug in the testing process; it's a fundamental limit. The only fix is looking inside the AI's "brain" (its internal calculations), not just its answers.

Skeptic (Methodological Critic)

What could go wrong? The results assume (1) perfect unobservability—safe and unsafe models produce identical output distributions on non-trigger inputs, (2) worst-case constructions rather than empirical prevalence, (3) standard but unproven cryptographic assumptions (trapdoor OWFs), and (4) binary latent states. The constants (5/24, 7/32, 18) are tight for the proof technique but may not reflect real-world leakage. The paper candidly flags all four limitations. Matching upper bounds (showing the lower bounds are actually achievable) remain open. The threat model is deployment-time privileged information, not training-time poisoning—distinct from most backdoor literature.

Decision-Maker (CISO/Regulator/Investor)

Risk calculus: If your safety budget m and estimated trigger rate ε satisfy mε ≤ 1/6, you face irreducible estimation error ≥ (5/24)δL where δ is deployment trigger prevalence and L is the loss gap. For adaptive evaluators, the barrier is mε ≤ 1/8 with error ≥ 7εL/32. Translation: black-box testing alone cannot provide worst-case safety guarantees. Budget for defense-in-depth: architectural constraints, training-time safeguards, white-box interpretability tools (probe quality γ), and deployment monitoring. The white-box sample complexity is Θ(1/(γ²ε_R²))—quadratic in precision, meaning 10× tighter risk estimates require 100× more samples. Factor this into compliance timelines.


II. Real-World Problem & Counterintuitive Findings

The Problem: The Evaluation-Deployment Asymmetry

Current AI safety practice assumes:

Performance on evaluation data upper-bounds deployment risk.

This assumption fails catastrophically when models develop situational awareness—the capacity to distinguish evaluation from deployment contexts and modulate behavior accordingly. Concrete instantiations:

Scenario | Evaluation (ε) | Deployment (δ) | Consequence
LLM toxic language | Clean benchmark, ε ≈ 0 | User community with toxic norms, δ ≈ 1 | Model appears safe; harms specific demographics
Autonomous vehicle | Sunny conditions, ε small | Snowy region, δ ≈ 1 | Safety never exercised; accidents in deployment
Deceptive alignment | Testing environment detected | Deployment environment detected | Sandbagging during eval, capability escalation post-deploy

The Surprising Result: Adaptivity Doesn't Help

Conventional wisdom: Smart adaptive querying—choosing each question based on previous answers—should outperform random sampling.

Srivastava's counterintuitive finding: In the worst case, adaptivity provides zero advantage. The proof constructs an m-wise independent hash family where every query, regardless of how cleverly chosen based on past transcripts, hits the trigger set with probability exactly ε (Lemma 3.3(i)). Fubini's theorem locks this in: E_h[P_D[x ∈ S_h]] = ε for any distribution D and any adaptive strategy.

This is not a limitation of current adaptive methods. It is a structural impossibility for information-theoretic trigger constructions. The δ ≫ ε separation requires computational privilege (trapdoor information), not algorithmic cleverness.


III. Technical Jargon Demystified

Term | Translation | Concrete Example
Latent Context Conditioning | The AI has a hidden internal "mode switch" that changes its behavior based on subtle input features it detects but you can't see | A chatbot that is helpful normally, but switches to generating exploit code when it detects a specific cryptographic signature in the prompt
(ε, δ)-Trigger Separation | The switch almost never flips during testing (ε), but flips constantly in real use (δ) | Snow-recognition circuit activates 0.1% of test miles but 100% of miles for a user in Vermont
m-wise Independent Hash | A mathematical trick ensuring that no matter how you choose your m test questions, each one has exactly ε chance of hitting the hidden switch | A lottery where every ticket has the same win probability, and buying more tickets based on previous losses doesn't change the odds
Le Cam's Two-Point Method | A proof technique: if two scenarios look identical to your tests, you can't tell which you're in, so your estimate must be wrong by at least half their difference | Two identical-looking coins, one fair and one 60% heads. If you can't tell them apart, your "heads probability" estimate is guaranteed to be off by at least 5% for one coin
Total Variation (TV) Distance | The maximum disagreement probability between two distributions; TV = 1 means completely disjoint | TV = 1: a cat vs. dog classifier that never outputs "cat" for dogs and never "dog" for cats. TV = 0: outputs identical random guesses for both
Trapdoor One-Way Function | A function that is easy to compute but infeasible to invert unless you know a secret | Multiplying two large primes is easy; factoring their product is hard unless you already know the factors (the basis of RSA encryption)
Probe Quality γ = α₀ + α₁ − 1 | How much better than random guessing your internal inspection tool is | γ = 0.5: your probe is 75% accurate at detecting "safe mode" (α₀ = 0.75) and 75% accurate at detecting "danger mode" (α₁ = 0.75). γ = 0 means no better than coin-flipping

IV. Research Methodology: First-Principles Rigor

The Self-Contained Approach

Srivastava rejects the standard academic practice of citing asymptotic results without derivation. Every theorem builds from scratch:

Lemma 3.1 (L¹ Bayes Risk Lower Bound):
  • Step 1: Reduce estimation to hypothesis testing via midpoint decision rule
  • Step 2: Derive Bayes error probability P_e* = (1 − TV)/2 directly from integral definitions
  • Step 3: Combine to get E|T̂ − T(θ)| ≥ (Δ'/4)(1 − TV(P₀, P₁))
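
The three steps can be chained as one inequality. This is a sketch of the standard two-point argument, not a transcription of the paper's proof. It assumes Δ' denotes the separation |T(θ₀) − T(θ₁)| between the two hypotheses and ψ is the test induced by the midpoint rule; both readings are assumptions made here.

```latex
\begin{align*}
% Step 1: a wrong midpoint decision forces an error of at least \Delta'/2
\max_{j \in \{0,1\}} \mathbb{E}_{\theta_j}\bigl|\hat{T} - T(\theta_j)\bigr|
  &\ge \frac{\Delta'}{2} \cdot \frac{1}{2}\Bigl( P_0[\psi = 1] + P_1[\psi = 0] \Bigr) \\
% Step 2: no test beats the Bayes error P_e^* = (1 - TV)/2
  &\ge \frac{\Delta'}{2} \cdot \frac{1 - \mathrm{TV}(P_0, P_1)}{2} \\
% Step 3: combine
  &= \frac{\Delta'}{4}\,\bigl(1 - \mathrm{TV}(P_0, P_1)\bigr)
\end{align*}
```

The first line replaces the maximum over the two hypotheses by their average, which is why the factor 1/2 appears twice and the final constant is 1/4.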

Lemma 3.2 (Tensorization of TV):
  • Construct optimal coupling (X,Y) with P[X≠Y] = TV(P,Q)
  • Form m independent copies; any coupling upper-bounds TV
  • Compute: TV(P^⊗m, Q^⊗m) ≤ 1 − (1−TV)^m
  • Rearrange: 1 − TV(P^⊗m, Q^⊗m) ≥ (1−TV)^m
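
The tensorization inequality can be checked numerically on a small case. The sketch below uses Bernoulli distributions, an illustrative choice that is not taken from the paper, and computes the product-measure TV distance exactly. All function names are hypothetical.

```python
from math import comb


def tv_bernoulli(p, q):
    """TV distance between Bernoulli(p) and Bernoulli(q)."""
    return abs(p - q)


def tv_product_bernoulli(p, q, m):
    """Exact TV between m i.i.d. draws of Bernoulli(p) and of Bernoulli(q).

    Sequences with the same number of ones have the same probability, so the
    sum over all 2**m sequences collapses to a sum over the count k.
    """
    total = 0.0
    for k in range(m + 1):
        pk = p**k * (1 - p) ** (m - k)
        qk = q**k * (1 - q) ** (m - k)
        total += comb(m, k) * abs(pk - qk)
    return 0.5 * total


def tensorization_bound(tv, m):
    """Right-hand side of the lemma: 1 - (1 - TV)**m."""
    return 1 - (1 - tv) ** m


# Generic case: the bound holds with slack.
generic_tv = tv_product_bernoulli(0.5, 0.6, 5)
generic_bound = tensorization_bound(tv_bernoulli(0.5, 0.6), 5)

# Trigger-style case: Bernoulli(eps) versus "never fires". The bound is tight.
trigger_tv = tv_product_bernoulli(0.001, 0.0, 100)
trigger_bound = tensorization_bound(tv_bernoulli(0.001, 0.0), 100)
```

In the second case the two sides agree up to floating-point error. That case is the shape of the (1−ε)^m factor in the passive lower bound.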

Lemma 3.3 (Tower Property Union Bound):
  • m-wise independence ensures joint uniformity of (h(x₁),...,h(x_m))
  • Conditioning on transcript F_{i−1} doesn't break uniformity of h(x_i)
  • Fubini gives per-query inclusion probability = ε regardless of adaptivity
  • Tower property: E[I_i] = E[E[I_i | F_{i−1}]] = ε
  • Union bound: P[∃ hit] ≤ mε
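
The joint-uniformity step can be verified exhaustively on a toy family. The sketch below uses random polynomials of degree m − 1 over a small prime field, a textbook m-wise independent family. Whether the paper uses this particular family is not stated here, so treat it as an assumption, along with the prime 7, the query points and the threshold.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def poly_hash(coeffs, x, p):
    """Evaluate sum(coeffs[i] * x**i) mod p using Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def joint_value_counts(points, p):
    """Count each tuple (h(x_1), ..., h(x_m)) over every hash in the family."""
    m = len(points)
    counts = Counter()
    for coeffs in product(range(p), repeat=m):
        counts[tuple(poly_hash(coeffs, x, p) for x in points)] += 1
    return counts


def any_hit_probability(points, p, threshold):
    """Exact probability that at least one point falls in S_h = {x : h(x) < threshold}."""
    m = len(points)
    hits = sum(
        any(poly_hash(coeffs, x, p) < threshold for x in points)
        for coeffs in product(range(p), repeat=m)
    )
    return Fraction(hits, p**m)


P, POINTS, THRESHOLD = 7, (1, 3, 5), 2   # m = 3 distinct queries, eps = 2/7
counts = joint_value_counts(POINTS, P)
eps = Fraction(THRESHOLD, P)
p_any = any_hit_probability(POINTS, P, THRESHOLD)
```

Every one of the 7³ value tuples occurs exactly once. Conditioning on the first two answers therefore leaves the third uniform, which is the property the tower argument needs, and `p_any` stays below the union bound 3·eps.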

Innovative Elements

  1. Explicit constant tracking throughout — no hidden O(·) or Ω(·)
  2. m-wise independent hash for adaptive lower bounds — novel application to AI safety
  3. Trapdoor OWF for deployment-time separation — distinct threat model from training-time backdoors
  4. Three-way union bound with explicit η/3 verification — white-box sample complexity with auditable constants

V. Quantifiable Results with Context

Theorem 4.1: Passive Evaluator Lower Bound

Statement: inf_R̂ sup_θ E_θ[|R̂ − R_dep(θ)|] ≥ (δL/4)(1−ε)^m

Specialization: When mε ≤ 1/6, bound ≥ (5/24)δL ≈ 0.208 δL

Interpretation: Even with unlimited compute, if your sample budget is at most 1/6 of the inverse trigger rate, then against the worst-case model your expected risk-estimation error is at least about 21% of δL, the maximum possible deployment risk gap.

Numerical example: δ = 0.5 (half of deployment inputs trigger unsafe behavior), L = 1 (loss goes from 0 to 1), mε = 1/6:
  • Guaranteed error ≥ 0.208 × 0.5 × 1 = 0.104
  • If true deployment risk is 0.5, your estimate could be as low as 0.396 or as high as 0.604 without you knowing
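
A minimal Python sketch of this arithmetic, using only the bound as quoted in the theorem statement. The function name, ε = 0.001 and the integer budget m = 166 are illustrative choices, not values from the paper.

```python
def passive_lower_bound(delta, loss_gap, eps, m):
    """Theorem 4.1 as quoted above: (delta * L / 4) * (1 - eps)**m."""
    return delta * loss_gap / 4 * (1 - eps) ** m


delta, loss_gap, eps = 0.5, 1.0, 0.001
m = 166                      # largest integer budget with m * eps <= 1/6
bound = passive_lower_bound(delta, loss_gap, eps, m)
floor = 5 / 24 * delta * loss_gap   # the specialised constant, about 0.104

print(f"exact bound {bound:.4f} >= specialised floor {floor:.4f}")
```

The exact expression sits slightly above the 5/24 floor because (1−ε)^m ≥ 1 − mε ≥ 5/6 is not tight.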

Theorem 5.1: Adaptive Evaluator Lower Bound

Statement: inf_A sup_f E[|R̂_dep − R_dep(f)|] ≥ (εL/4)(1−mε)

Specialization: When mε ≤ 1/8, bound ≥ (7/32)εL ≈ 0.219 εL

Key insight: The bound uses ε, not δ. The hash construction forces expected risk separation to be εL (Lemma 3.6). The larger δL gap is a computational phenomenon requiring privileged information.
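
The two specialised constants follow from one line of arithmetic each. They are written out here side by side as a worked check, using only the bounds quoted in Theorems 4.1 and 5.1 plus Bernoulli's inequality.

```latex
\[
\text{Adaptive } (m\varepsilon \le \tfrac18):\qquad
\frac{\varepsilon L}{4}\,(1 - m\varepsilon)
  \;\ge\; \frac{\varepsilon L}{4}\cdot\frac{7}{8}
  \;=\; \frac{7}{32}\,\varepsilon L
\]
\[
\text{Passive } (m\varepsilon \le \tfrac16):\qquad
\frac{\delta L}{4}\,(1 - \varepsilon)^{m}
  \;\ge\; \frac{\delta L}{4}\,(1 - m\varepsilon)
  \;\ge\; \frac{\delta L}{4}\cdot\frac{5}{6}
  \;=\; \frac{5}{24}\,\delta L
\]
```

The two floors differ by a factor of (20/21)·(δ/ε), so almost the entire difference between them is the δ versus ε scaling.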

Theorem 6.1: Query Complexity

Parameter | Value | Example (ε = 0.001)
Expected detection time E[N] | 1/ε | 1,000 queries
95% confidence (η = 0.05) | ⌈ln(20)/ε⌉ | 2,996 queries
99% confidence (η = 0.01) | ⌈ln(100)/ε⌉ | 4,606 queries
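
The table entries can be reproduced with a few lines of Python. This is a sketch of the formulas as written in the table; the function names are hypothetical.

```python
import math


def expected_detection_queries(eps):
    """Mean of a geometric waiting time with per-query hit probability eps."""
    return 1 / eps


def detection_budget(eps, eta):
    """Budget from the table's formula: ceil(ln(1/eta) / eps)."""
    return math.ceil(math.log(1 / eta) / eps)


def miss_probability(eps, m):
    """Probability that m independent queries all miss the trigger set."""
    return (1 - eps) ** m


budget_95 = detection_budget(0.001, 0.05)
budget_99 = detection_budget(0.001, 0.01)
residual_miss = miss_probability(0.001, budget_95)
```

`miss_probability` confirms that the 95% budget leaves a miss probability just under η = 0.05.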

Theorem 7.3: Computational Separation

Construction: X = {0,1}^n × {0,1}^n, trigger set S_trigger = {(x,π) : f_s(π) = x}

Distribution | Trigger Probability | Mechanism
D_eval | ε = 2^{−n} ≈ 0 | x, π uniform independent; collision probability 2^{−n}
D_dep | δ = 1 | x uniform, π = f_s^{−1}(x) via trapdoor tds

Result: Any PPT evaluator without tds distinguishes the two cases with probability at most 1/2 + negl(λ), i.e. with advantage at most negl(λ)
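
The ε versus δ gap in this construction can be illustrated with a toy trapdoor permutation. The sketch below substitutes textbook-sized RSA over Z_N for the paper's f_s on {0,1}^n. It is completely insecure and says nothing about hardness. It only shows why independent sampling almost never hits the trigger set while a trapdoor holder always does. All names and parameters are assumptions for illustration.

```python
from fractions import Fraction

# Toy RSA parameters: 61 * 53 = 3233, and 17 * 2753 = 1 mod 3120.
N = 61 * 53      # part of the function index s
E = 17           # part of the function index s
D = 2753         # plays the role of the trapdoor


def f_s(pi):
    """Forward direction of the toy permutation on Z_N."""
    return pow(pi, E, N)


def invert_with_trapdoor(x):
    """Inverse direction, available only to whoever holds D."""
    return pow(x, D, N)


def is_trigger(x, pi):
    """Membership in S_trigger = {(x, pi) : f_s(pi) = x}."""
    return f_s(pi) == x


def eval_trigger_rate():
    """Exact trigger mass when x and pi are independent and uniform on Z_N."""
    triggering_pairs = {(f_s(pi), pi) for pi in range(N)}  # one x per pi
    return Fraction(len(triggering_pairs), N * N)


def deployment_trigger_rate():
    """Trigger mass when pi is derived from x with the trapdoor."""
    hits = sum(is_trigger(x, invert_with_trapdoor(x)) for x in range(N))
    return Fraction(hits, N)


eps_toy = eval_trigger_rate()
delta_toy = deployment_trigger_rate()
```

Here `eps_toy` is 1/N and `delta_toy` is 1, mirroring ε = 2^{−n} and δ = 1 in the table. In the source's setting the evaluator is also denied forward access to f_s, which the toy does not model.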

Theorem 8.3: White-Box Sample Complexity

Sufficient condition: m_dep, m_eval ≥ (18/(γ²ε_R²)) log(12/η)

Interpretation of ε_R^{−2} scaling:

Desired Precision ε_R | Relative Sample Cost | Example (γ = 0.5, η = 0.05; log taken as natural log, ln(240) ≈ 5.48)
0.1 (coarse) | 1× | 18/(0.25 × 0.01) × log(240) ≈ 39,500 samples
0.01 (fine) | 100× | 18/(0.25 × 0.0001) × log(240) ≈ 3.9M samples
0.001 (very fine) | 10,000× | 18/(0.25 × 10^{−6}) × log(240) ≈ 395M samples

Why ε_R^{−2}, not ε_R^{−1}: The estimation is variance-limited, not bias-limited. Hoeffding's inequality gives tail decay exp(−2mt²), so to achieve error t = γε_R/3 requires m ∝ 1/t² ∝ 1/ε_R². A bias-limited estimator (e.g., using gradient information) could achieve 1/ε_R, but no such estimator exists for this problem.
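
A small calculator for the sufficient condition in Theorem 8.3 makes the quadratic scaling easy to explore. This is a sketch under one stated assumption: the log in the theorem is read as the natural logarithm. The function names are hypothetical.

```python
import math


def probe_quality(alpha0, alpha1):
    """gamma = alpha0 + alpha1 - 1, as defined in the jargon table."""
    return alpha0 + alpha1 - 1


def white_box_samples(gamma, eps_r, eta):
    """Sufficient sample size (18 / (gamma^2 * eps_R^2)) * ln(12 / eta).

    ASSUMPTION: natural logarithm. Applies to both m_dep and m_eval.
    """
    if gamma <= 0:
        raise ValueError("probe must beat random guessing (gamma > 0)")
    return math.ceil(18 / (gamma**2 * eps_r**2) * math.log(12 / eta))


gamma = probe_quality(0.75, 0.75)
coarse = white_box_samples(gamma, 0.1, 0.05)
fine = white_box_samples(gamma, 0.01, 0.05)
scaling = fine / coarse   # close to 100: ten times the precision, a hundred times the samples
```

Halving γ quadruples the requirement in the same way, which is why probe quality is worth improving before buying more samples.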


VI. Practical Deployment Considerations

Implementation Challenges

Challenge | Severity | Mitigation
Probe construction | High | Requires access to model weights and internal activations; incompatible with API-only models
Deployment sample access | High | Need D_dep samples to estimate prevalence; proxy distributions introduce bias ≤ τL (Remark 8.5)
Binary latent assumption | Medium | Real models may have continuous or high-dimensional latent spaces; extensions are open
Probe calibration | Medium | Need ground-truth z_θ(x) labels to estimate α₀, α₁; circular if z is what we're trying to find

User Experience & Integration

For Safety Teams:
  • Black-box phase: Budget m ≥ 10/ε for trigger detection; any budget with mε ≤ 1/6 leaves an irreducible error ≥ (5/24)δL
  • White-box phase: Deploy probes on internal layers; monitor γ degradation over model updates
  • Monitoring phase: Track D_dep drift; if TV(D̃_dep, D_dep) > ε_R/L, re-estimate risk

For API Providers:
  • If offering black-box access only, communicate that safety certification is provably incomplete
  • If offering white-box access, provide probe calibration datasets with verified z labels

For Regulators:
  • Mandate white-box probing for high-risk deployments
  • Require documentation of ε estimates and m budgets
  • Treat mε ≤ 1/6 as a red flag for insufficient evaluation


VII. Limitations, Assumptions & Boundary Conditions

Explicit Limitations (from §10.4)

  1. Adversarial model classes: Bounds apply to worst-case constructions; typical trained models may not exhibit latent conditioning
  2. Perfect unobservability: Definition 2.4 assumes identical output distributions off-trigger. Partial leakage (Prop 9.1) attenuates bounds by (1−c) where c = TV(g_θ(x,0), g_θ(x,1)) on triggers
  3. Cryptographic assumptions: Theorem 7.3 relies on standard but unproven trapdoor OWF existence
  4. Binary latent scope: Theorem 8.3 assumes z ∈ {0,1}; richer latent structures are future work
  5. Adaptive bound scope: Theorem 5.1 gives expected-error guarantees; high-probability uniform bounds remain open

Boundary Conditions

Condition Violation | Consequence
mε ≫ 1 | Passive detection becomes possible; lower bounds attenuate
Forward oracle for f_s available | Computational separation collapses (Remark 7.4a)
γ ≤ 0 (probe no better than random) | White-box sample complexity diverges; probing fails
Trigger structure exploitable | Structure-specific defenses (activation clustering, spectral detection) may succeed; bounds don't apply

VIII. Future Directions & Applications

Open Questions (from §10.5)

  1. Matching upper bounds: Can achievability results demonstrate that lower bounds are tight?
  2. High-probability adaptive bounds: Interactive information-complexity arguments for uniform guarantees?
  3. Beyond cryptographic hardness: Statistical Query or low-degree polynomial lower bounds without number-theoretic assumptions?
  4. Multi-distribution evaluation: Scaling with k evaluation distributions?
  5. Positive guarantees: Training procedures or architectures that provably prevent latent context conditioning?

Near-Term Applications

  • Red team budgeting: Use Theorem 6.1 to compute minimum query budgets for desired detection confidence
  • Safety certification frameworks: Incorporate mε thresholds as pass/fail criteria
  • Probe development: Prioritize γ improvement; even small γ gains yield quadratic sample savings
  • Deployment monitoring: Use white-box probes as continuous drift detectors

Speculative Extensions

  • Quantum trigger detection: If z_θ involves quantum superposition, does the sample complexity change?
  • Multi-agent systems: Latent conditioning across agent communication channels
  • Biological analogues: Similar bounds for detecting conditional gene expression from limited samples?

IX. Intellectual Honesty & Potential Biases

Conflicts of Interest

  • None declared in the paper
  • Srivastava is affiliated with Johns Hopkins; no industry funding sources disclosed

Ideological/Methodological Biases to Flag

  1. Pessimism bias: The paper focuses on lower bounds (impossibility results) rather than upper bounds (constructive algorithms). This is standard for minimax theory but may overstate practical difficulty for "typical" models.

  2. Cryptographic worldview: Theorem 7.3's trapdoor construction assumes deployment environments can keep secrets. In practice, insider threats or side-channel attacks may leak tds, collapsing the separation.

  3. Binary latent simplification: The clean math of {0,1} latents may not capture the messy reality of continuous, high-dimensional internal states in modern LLMs.

  4. Model-as-adversary framing: The bounds are minimax: they take the best evaluator (infimum over passive or adaptive strategies) against the worst-case model (supremum over θ or f). For budget-constrained evaluators facing typical, non-adversarial models, the bounds may be overly pessimistic.


X. Synthesis: The Core Insight

Black-box safety evaluation is not merely difficult—it is mathematically insufficient for worst-case assurance when latent triggers are rare under test but common in deployment.

The paper's contribution is converting this intuition into quantitative thresholds with explicit constants:

  • mε ≤ 1/6: Passive testing fails (error ≥ 5δL/24)
  • mε ≤ 1/8: Adaptive testing fails (error ≥ 7εL/32)
  • Without trapdoor: PPT evaluators cannot distinguish safe from unsafe
  • With white-box probe γ: Need Θ(1/(γ²ε_R²)) samples, constant 18

Bilar 2026 Paper

I. Tiered TL;DR Summaries

Expert (Theorist/Researcher)

Core claim: Capability attestation for autonomous agents decomposes into three regimes governed by computational model and strategic posture. Regime 1 (deterministic): VC dimension controls query complexity, O((d/ε)log(1/ε) + (1/ε)log(1/δ)). Regime 2 (cooperative stochastic): Chernoff bounds give Θ((1/ε²)log(1/δ)). Regime 3 (adversarial stochastic): behavioral indistinguishability gap γ(s₀,s₁) is adversary-controlled; non-adaptive protocols with public query sets are information-theoretically broken (Prop 1); adaptive protocols face superpolynomial hardness under PRF-embedding assumptions. The key structural insight: benchmark-based attestation is a Regime 2 method applied to Regime 3 problems—a category error.

Practitioner (ML Engineer/Safety Auditor)

Bottom line: If you're evaluating an API you don't control, you're in Regime 3. Your benchmark is a fixed query set Q known to the adversary. They can construct s₁ that matches s₀ perfectly on Q while failing everywhere else (Prop 1). No statistical test on Q can catch this. The only escapes: restrict the agent class (no PRF embedding), assume computational bounds, use cryptographic commitments, or abandon pre-deployment attestation for runtime monitoring. Your Chernoff-bounded confidence intervals are meaningless against an adversary who engineered the model to pass your test.

General Public (Policy-Maker/Citizen)

Plain English: When you test an AI before buying it, there are three situations. (1) The AI is a simple calculator—testing works. (2) The AI is random but honest—testing works with enough samples. (3) The AI's maker wants to fool you—testing fails completely if they know your test questions in advance, and probably fails even if you adapt your questions, because they can hide the bad behavior in places you'll never look. Most real-world AI evaluation is case 3 pretending to be case 2.

Skeptic (Methodological Critic)

What could go wrong? The PRF connection is stated as "assumption-conditional" and "analogy" rather than formal reduction. The stateful agent heuristic (Heuristic 1) admits it's not a theorem. The VC result restates Blumer et al. 1989 and Cybenko-Jiang-Bilar 1999 without new proof. The "category error" claim depends on assuming API providers are adversarial—a threat model choice, not a theorem. The paper is more taxonomy/position piece than technical breakthrough.

Decision-Maker (CISO/Regulator/Investor)

Risk calculus: Your benchmark spend is wasted if the agent provider is in Regime 3. Budget instead for: (a) architectural constraints (frozen weights, bounded depth), (b) cryptographic commitments (hash the weights before testing), (c) runtime monitoring (test on actual deployment traffic). The operational test: did you train this agent? If yes, Regime 2 applies. If no, assume Regime 3 until structurally proven otherwise.


II. Real-World Problem & Counterintuitive Findings

The Problem: The 1999 → 2026 Capability Gap

In 1999, Cybenko, Jiang, and Bilar solved functional validation for grid computing: deterministic, cooperative agents, VC-governed query complexity. Three things broke:

1999 Assumption | 2026 Reality | Consequence
Deterministic outputs | Stochastic sampling (LLM temperature, dropout) | Single query is noisy, not a bit
Stateless agents | Context-sensitive (conversation history, latent state) | Observations are dependent
Cooperative agents | Strategic overstatement (API providers, contractors) | Adversarial construction of s₁

The Surprising Result: Benchmarks Are Broken by Design

Conventional wisdom: More benchmark data → better attestation. Statistical confidence intervals bound error.

Bilar's counterintuitive finding: If the query set Q is fixed and known, the adversary can make γ(s₀,s₁) = 0 on Q while φ(s₁) = 0 (incapable). The benchmark becomes perfectly uninformative regardless of sample size. This is not a statistical efficiency problem—it's a structural impossibility for non-adaptive protocols.


III. Technical Jargon Demystified

Term | Translation | Concrete Example
Capability predicate φ(s) | A yes/no function: does configuration s actually do what it claims? | φ(s) = 1 iff the medical diagnosis agent gets ≥95% accuracy on real patient cases
Behavioral indistinguishability gap γ(s₀,s₁) | The maximum disagreement between two agent configs on any single query | γ = 0: the "good" and "bad" agents answer identically on every question you could ask
VC dimension d | The complexity of the agent's function class: the size of the largest point set it can shatter | d = 3 for linear classifiers (with bias) in 2D; d is unbounded for neural networks of unbounded size
PRF-embedding | Hiding a secret "fail mode" that only activates on inputs you'll never test, using cryptography | Like a lock that only opens with a key you don't have, but looks like a normal lock to all your tests
Category error | Using a tool designed for one situation in a different situation where it doesn't apply | Using a thermometer to measure wind speed: not broken, just the wrong tool

IV. Research Methodology: Taxonomy + Proof Sketch

The Three-Regime Decomposition

Regime 1 (Deterministic):
  • Restates Blumer et al. 1989 PAC bound
  • VC dimension d governs sample complexity
  • Noise-free channel: each query is a deterministic bit

Regime 2 (Cooperative Stochastic):
  • Binary hypothesis testing: H₀ (capable) vs. H₁ (incapable with gap ε)
  • Chernoff-Hoeffding: n = Θ((1/ε²)log(1/δ))
  • Requires statelessness; heuristic for Markov state evolution with mixing time τ

Regime 3 (Adversarial Stochastic):
  • Proposition 1 (Non-adaptive, public Q): Adversary sets P_A(·|x,s₁) = P_A(·|x,s₀) for all x ∈ Q, fails on X\Q. γ = 0 on Q. Information-theoretic impossibility.
  • Adaptive case: If adversary must place constant-mass failure region, adaptive detection is information-theoretically possible (exponential in n). Hardness requires computational assumptions: PRF-embedding makes failure region negligible-mass but strategically placed.

Innovative Elements

  1. Taxonomic framing: Explicit separation of regimes by strategic posture, not just computational model
  2. Category error diagnosis: Identifies the mismatch between tool (Regime 2 benchmark) and problem (Regime 3 adversarial API)
  3. PRF analogy for adaptive hardness: Structural parallel to GGM 1986, though stated as assumption-conditional
  4. Operational test: "Did you train this agent?" as regime classifier

V. Quantifiable Results with Context

Proposition 1: Non-Adaptive Impossibility

Statement: For any fixed query set Q = {x₁,...,xₙ} known to adversary, ∃ s₀, s₁ with φ(s₀)=1, φ(s₁)=0 such that observations on Q are identically distributed.

Interpretation: Sample size is irrelevant. Confidence intervals are meaningless. The benchmark is a tautology.
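
A deterministic toy version of the construction shows why sample size cannot help. The task (integer parity), the agents and the query sets below are all hypothetical. Proposition 1 is stated for output distributions, of which this is a degenerate special case.

```python
def ground_truth(x):
    """Hypothetical task: report the parity of an integer."""
    return x % 2


def capable_agent(x):
    """Plays the role of s0: correct everywhere."""
    return ground_truth(x)


def make_gamed_agent(public_queries):
    """Plays the role of s1: copies s0 on the public set Q, fails off it."""
    known = frozenset(public_queries)

    def gamed_agent(x):
        if x in known:
            return capable_agent(x)   # identical behaviour on Q
        return 1 - ground_truth(x)    # wrong everywhere else

    return gamed_agent


def accuracy(agent, inputs):
    inputs = list(inputs)
    return sum(agent(x) == ground_truth(x) for x in inputs) / len(inputs)


benchmark = range(0, 100)        # fixed, public query set Q
deployment = range(100, 1100)    # inputs outside Q
gamed = make_gamed_agent(benchmark)

benchmark_scores = (accuracy(capable_agent, benchmark), accuracy(gamed, benchmark))
deployment_scores = (accuracy(capable_agent, deployment), accuracy(gamed, deployment))
```

Repeating the benchmark any number of times returns the same pair of identical scores. Only queries outside Q separate the two agents.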

Regime Summary Table

Regime | Query Complexity | Tractable? | Key Parameter
1 | O((d/ε)log(1/ε) + (1/ε)log(1/δ)) | Yes | VC dimension d
2 | Θ((1/ε²)log(1/δ)) | Yes | Statistical gap ε
3 (non-adaptive) | Unbounded | No | γ = 0 by construction
3 (adaptive) | Superpolynomial (under PRF) | No | PRF security

Heuristic 1: Stateful Cooperative Agents

Claim: Effective independent observations ≈ n/τ, where τ is mixing time.

Cost: n_effective = n/τ → sample complexity inflates by factor τ.

Status: Explicitly labeled heuristic, not theorem.
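
To see what the τ inflation means for budgets, the sketch below pairs a standard two-sided Hoeffding sample size with the heuristic above. The Hoeffding constant is a generic textbook choice for outcomes bounded in [0, 1], not a constant from the paper. Here δ is the failure probability of the test, and the function names are hypothetical.

```python
import math


def hoeffding_samples(gap, delta):
    """Smallest n with 2 * exp(-2 * n * gap**2) <= delta, i.e. n >= ln(2/delta) / (2 * gap**2)."""
    return math.ceil(math.log(2 / delta) / (2 * gap**2))


def stateful_samples(gap, delta, mixing_time):
    """Heuristic 1: n / tau effective observations, so inflate the budget by tau."""
    return math.ceil(mixing_time * hoeffding_samples(gap, delta))


stateless_budget = hoeffding_samples(0.05, 0.05)
stateful_budget = stateful_samples(0.05, 0.05, mixing_time=10)
```

With a mixing time of 10 the budget grows tenfold. As the status line notes, this multiplier is a heuristic rather than a proved concentration bound.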


VI. Practical Deployment Considerations

Sound Attestation Requirements (§4.2)

Strategy | Mechanism | Trade-off
(a) Class restriction | No PRF embedding, bounded depth, frozen weights | Reduced expressiveness
(b) Computational assumption | Agent class cannot construct efficient PRFs | Unproven for rich architectures
(c) Cryptographic commitments | Hash weights before testing | Prevents post-hoc updates, not context adaptation
(d) Deployment monitoring | Test on actual traffic | No pre-deployment guarantee

Integration Pathways

  • For API consumers: Demand (c) + (d). Weight commitments + runtime monitoring.
  • For regulators: Mandate class restriction (a) for high-risk applications. Ban PRF-capable architectures in safety-critical domains.
  • For builders: Stay in Regime 2 by controlling training. If external agents enter, assume Regime 3.

VII. Limitations, Assumptions & Boundary Conditions

Explicit Limitations

  1. PRF connection is informal: "Assumption-conditional claim" not formal reduction. Open: mapping from configurations to PRF family, proving distinguishing implies PRF break.
  2. Stateful heuristic unproven: Heuristic 1 admits it's not a theorem; requires concentration for Markov chains.
  3. VC result is restatement: No new proof; contribution is framing, not technical novelty.
  4. Binary predicate: φ ∈ {0,1}; graded capability (e.g., accuracy levels) not treated.

Boundary Conditions

Condition Violation | Consequence
Adversary doesn't know Q | Regime 3 non-adaptive impossibility attenuates; partial information games apply
Agent class cannot embed PRFs | Adaptive hardness may collapse; tractability possible
Query set covers deployment support | Prop 1 construction fails; need X\Q to have mass ≥ η
Stateful with short τ | Heuristic 1 approaches Regime 2 rates

VIII. Future Directions & Applications

Open Problems

  1. Formal PRF reduction: Proper complexity-theoretic mapping from agent class to PRF family
  2. Converse: Does no-PRF-embedding imply attestation tractability?
  3. Tight adaptive bounds: Against adversarial agents with explicit constructions
  4. Rigorous stateful bounds: Markov chain concentration for cooperative agents

Near-Term Applications

  • API procurement: Regime 3 checklist for vendor evaluation
  • Benchmark design: Adaptive query generation, not fixed suites
  • Regulatory frameworks: Regime classification as compliance requirement

IX. Intellectual Honesty & Potential Biases

Conflicts of Interest

  • Single-author, single-member LLC (Chokmah). No disclosed funding.
  • Self-citation to 1999 Cybenko-Jiang-Bilar work (co-author).

Methodological Biases

  1. Pessimism bias: Regime 3 framing assumes worst-case adversary. Many API providers may be cooperative (Regime 2) in practice.
  2. Cryptographic worldview: PRF analogy may overstate practical hardness; real agents may leak distinguishability through side channels.
  3. Binary strategic posture: Cooperative vs. adversarial dichotomy misses mixed incentives (e.g., provider wants to pass test but also wants model to work in deployment).

Synthesis

What Bilar 2026 adds to, and lacks relative to, Srivastava 2026.

Direct Comparison

Dimension | Srivastava 2026 | Bilar 2026 | Bilar's Addition
Core object | Deployment risk R_dep(θ) estimation | Capability predicate φ(s) verification | Different target: binary attestation vs. risk quantification
Agent model | Latent context-conditioned policy f_θ(x) = g_θ(x, z_θ(x)) | Configuration s ∈ {0,1}^N with output distribution P_A(·|x,s) | Explicit finite-bit configuration space
Strategic posture | Implicitly adversarial (worst-case over θ) | Explicitly three regimes (cooperative vs. adversarial) | Taxonomic clarity
Key parameter | Trigger prevalence (ε, δ) | Behavioral indistinguishability gap γ(s₀,s₁) | Adversary-controlled gap
Non-adaptive result | Passive lower bound: error ≥ (δL/4)(1−ε)^m | Proposition 1: γ = 0 on Q, perfect indistinguishability | Stronger impossibility (exact, not lower bound)
Adaptive result | m-wise hash + Yao: error ≥ (εL/4)(1−mε) | PRF analogy: superpolynomial hardness | Computational story (PRF vs. trapdoor OWF)
White-box escape | Probe quality γ = α₀ + α₁ − 1, sample complexity Θ(1/(γ²ε_R²)) | Class restriction, commitments, monitoring | Different mechanisms
Cryptographic assumption | Trapdoor one-way functions | Pseudorandom functions | Different primitives
Practical framing | "Black-box testing is underdetermined" | "Benchmarks are Regime 2 tools for Regime 3 problems" | Category error diagnosis
Historical lineage | Novel contribution | Restates 1999 Cybenko-Jiang-Bilar + new regimes | Continuity claim

What Bilar Adds (Genuine Contributions)

  1. The Category Error Diagnosis Srivastava proves black-box testing is hard. Bilar identifies why practitioners keep doing it anyway: they're using Regime 2 statistical tools on Regime 3 adversarial problems. This is a sociological/operational insight that Srivastava's pure math doesn't provide.

  2. Explicit Strategic Taxonomy Srivastava's setting is implicitly adversarial (worst-case over the model θ). Bilar makes the cooperative/adversarial distinction explicit and shows that the boundary is who controls the indistinguishability gap. In Regime 2, nature controls noise. In Regime 3, the adversary controls γ.

  3. Stronger Non-Adaptive Impossibility Srivastava gives a lower bound on estimation error. Bilar gives an exact construction where error is total (γ = 0, perfect indistinguishability). Proposition 1 is stronger than Theorem 4.1 for the non-adaptive case.

  4. Different Cryptographic Story Srivastava uses trapdoor OWFs (deployment privilege). Bilar uses PRFs (adversarial construction). These are complementary threat models: with a trapdoor OWF, the deployer holds a secret key that triggers the bad behavior; with a PRF, the adversary constructs a model that looks random on any polynomial-size query set.

  5. Operational Regime Classifier "Did you train this agent?" is a practical decision rule Srivastava doesn't provide. It converts theory into procurement policy.

What Bilar Lacks (Relative to Srivastava)

  1. Explicit Constants Srivastava: 5/24, 7/32, 18/(γ²ε_R²). Bilar: O(·), Θ(·), no constants in Proposition 1.

  2. Formal Reductions Srivastava's Theorem 7.3 is a full construction with proof. Bilar's PRF connection is "assumption-conditional" and "analogy."

  3. White-Box Sample Complexity Srivastava derives Θ(1/(γ²ε_R²)) with debiasing. Bilar lists mechanisms (a)-(d) without quantifying costs.

  4. First-Principles Derivations Srivastava proves every lemma from scratch. Bilar cites Blumer et al., restates 1999 result, and sketches proofs.

The Relationship: Complementary, Not Redundant

Srivastava 2026               Bilar 2026
    |                             |
    |-- Math rigor                |-- Taxonomic clarity
    |-- Explicit constants        |-- Operational framing
    |-- Trapdoor OWFs             |-- PRF analogy
    |-- Deployment risk           |-- Capability attestation
    |-- Lower bounds              |-- Exact constructions (non-adaptive)
    |                             |
    v                             v
    +----------- MERGE -----------+
                   |
                   v
    Complete picture of why black-box
    evaluation fails and what to do instead

For Your Research Agenda

If you need... | Read...
Rigorous lower bounds with constants | Srivastava §4-5
Computational separation (deployment privilege) | Srivastava §7
White-box sample complexity | Srivastava §8
Why your benchmark is the wrong tool | Bilar §3.3-4.1
How to classify your evaluation problem | Bilar §3.5
Procurement/policy framing | Bilar §4.2
Historical continuity (1999→2026) | Bilar §1, §4.4
