dyb

Can a Compiler's Inlining Heuristic Be Reframed as Distributed Optimization?

Early Results from Instrumenting MIR with Shadow Prices


There's a line of thinking in network optimization that goes something like this: when you have a system of agents making local decisions subject to shared resource constraints, and each agent responds to a price signal reflecting the scarcity of those resources, the system as a whole can converge on a globally efficient allocation. This is the core insight behind Network Utility Maximization (NUM), formalized by Low, Chiang, and others in the context of TCP congestion control.

We've been exploring whether this framing applies to compilers. Specifically: can a compiler's inlining pass — which decides, function by function, whether the benefit of inlining justifies the code-size cost — be understood as an instance of dual decomposition? If so, each function's "shadow price" would represent its marginal utility per unit of code size, and the inlining threshold would scale in response.

This post describes a first experiment testing that idea. The results are preliminary but interesting enough to share.


The Setup

We chose MIR, Vladimir Makarov's lightweight JIT compiler infrastructure, as our testbed. MIR is small enough to instrument without getting lost in infrastructure, and its inlining logic is straightforward: a function is inlined if its IR instruction count falls below a cost threshold.

We modified MIR in two ways. First, we added a shadow_price field to the MIR_func struct — a double-precision value representing the utility gain per IR instruction for that function. Second, we wrote a NUM-aware inlining decision function that scales the baseline threshold by this shadow price:

adjusted_threshold = baseline_threshold × (shadow_price / 100.0)

A shadow price of 100.0 is neutral (threshold unchanged). A price of 1000.0 means the system values this function at 10× the baseline — it's willing to tolerate 10× the code-size cost to inline it. A price of 10.0 means the opposite: the function is cheap and inlining should be conservative.
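The scaling rule above can be sketched in a few lines of Python (the actual patch is C against MIR; this is an illustrative reimplementation, with the strict less-than following the "falls below a cost threshold" description of MIR's inliner):

```python
NEUTRAL_PRICE = 100.0  # shadow price at which the threshold is unchanged

def adjusted_threshold(baseline, shadow_price):
    # Scale the baseline inlining threshold by the normalized shadow price.
    return baseline * (shadow_price / NEUTRAL_PRICE)

def should_inline_with_num(ir_count, baseline, shadow_price):
    # Inline iff the function's IR count falls strictly below the
    # adjusted threshold.
    return ir_count < adjusted_threshold(baseline, shadow_price)
```

For example, with a baseline threshold of 50, a 200-instruction function at shadow price 1000.0 is inlined (adjusted threshold 500), while the same function at the neutral price 100.0 is not.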

We then generated 250 synthetic functions (50 functions × 5 sizes: 10, 50, 100, 200, and 500 IR instructions) and ran them through four experimental conditions:

  • Baseline: all shadow prices set to 100.0 (neutral)
  • Uniform: all shadow prices set to 100.0 (control, identical to baseline)
  • Skewed: alternating high (1000.0) and low (10.0) shadow prices
  • Perturbed: same as skewed but with ±20% uniform noise on the prices

Each function's inlining decision was logged to a CSV along with the shadow price, IR count, baseline threshold, and adjusted threshold.
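The price assignment for the four conditions is simple enough to sketch. The function and parameter names below are hypothetical (the post does not show the harness code); the price levels and noise range match the description above:

```python
import random

SIZES = [10, 50, 100, 200, 500]      # IR instruction counts per size class
FUNCS_PER_SIZE = 50                  # 50 functions x 5 sizes = 250 total

def assign_prices(condition, n=FUNCS_PER_SIZE * len(SIZES), seed=0):
    """Per-function shadow prices for one experimental condition."""
    if condition in ("baseline", "uniform"):
        return [100.0] * n           # neutral everywhere
    # Alternating high (1000.0) and low (10.0) prices.
    skewed = [1000.0 if i % 2 == 0 else 10.0 for i in range(n)]
    if condition == "skewed":
        return skewed
    if condition == "perturbed":
        rng = random.Random(seed)    # +/-20% uniform multiplicative noise
        return [p * rng.uniform(0.8, 1.2) for p in skewed]
    raise ValueError(f"unknown condition: {condition}")
```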


What We Measured

Three pre-registered statistical tests, chosen before running the experiment:

  1. Spearman rank correlation between shadow price and inlining outcome in the skewed condition. Threshold: ρ > 0.5 with p < 0.05.

  2. Cohen's d effect size between the inlining rates of "hot" (high shadow price) and "cold" (low shadow price) functions. Threshold: |d| > 0.5.

  3. Agreement rate between skewed and perturbed conditions, measuring whether adding noise to the shadow prices changes the inlining decisions.
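All three statistics can be computed without heavy dependencies. The sketch below is an illustrative reimplementation (the actual analysis scripts presumably use a statistics library; the Spearman p-value is omitted here since it needs scipy or a permutation test):

```python
from statistics import mean, stdev

def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

def agreement(d1, d2):
    """Fraction of matched decisions that are identical."""
    return sum(x == y for x, y in zip(d1, d2)) / len(d1)
```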


Results

Test 1: Correlation

Spearman ρ = 0.82, p = 4.25 × 10⁻⁶¹. The correlation between shadow price and inlining decisions is strong and highly significant across all 250 functions.

Test 2: Effect Size

Hot functions (shadow price = 1000.0) were inlined at 80%. Cold functions (shadow price = 10.0) were inlined at 0%. Cohen's d = 2.82, which is very large.

Looking at this by function size tells a more nuanced story:

IR Count   Hot Inlining Rate   Cold Inlining Rate   Difference
10         100%                0%                   100 pp
50         100%                0%                   100 pp
100        100%                0%                   100 pp
200        100%                0%                   100 pp
500        0%                  0%                   0 pp
For reference, in the baseline condition (shadow price = 100, threshold = 50), only 10-instruction functions were inlined; everything else met or exceeded the threshold. The skewed condition's high shadow price expands the threshold to 500 (= 50 × 10.0), bringing functions of up to 200 IR instructions into inlining range. Functions of 500 IR instructions sit exactly at the maximum adjusted threshold and, under the strict less-than comparison, are still not inlined.

This is the expected behavior under the NUM framing: shadow prices shift the decision boundary rather than eliminating it. Very large functions carry code-size costs that exceed any reasonable utility signal.
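As a check on the size-stratified numbers, the decision boundary at each price level can be enumerated directly from the scaling formula (baseline threshold 50, strict less-than, as described earlier; `inlined_sizes` is an illustrative helper, not code from the patch):

```python
BASELINE = 50
SIZES = [10, 50, 100, 200, 500]

def inlined_sizes(shadow_price):
    # Adjusted threshold per the scaling formula; a size class is inlined
    # when its IR count falls strictly below that threshold.
    threshold = BASELINE * shadow_price / 100.0
    return [s for s in SIZES if s < threshold]
```

Here `inlined_sizes(1000.0)` yields [10, 50, 100, 200], matching the hot column; `inlined_sizes(100.0)` yields [10], matching the baseline condition; and `inlined_sizes(10.0)` yields [], matching the cold column.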

Test 3: Robustness

Skewed and perturbed conditions produced identical inlining decisions on all 250 matched function pairs (100% agreement). The ±20% noise on shadow prices was not large enough to push any function across the decision boundary. This is consistent with the large gap between the hot and cold price levels — the noise magnitude (±200 on a base of 1000) stays well within the region where the threshold comfortably exceeds or falls below each function's IR count.


What This Shows (and What It Doesn't)

We can say with confidence that a linearly scaled inlining threshold responds to shadow-price signals in the way NUM dual decomposition predicts. The response is monotonic, large in effect size, and robust to moderate noise. All three pre-registered tests pass.

What we cannot yet claim:

We tested our own decision function, not MIR's native inliner. The should_inline_with_num() function we wrote implements the linear scaling formula directly. We called it on each function rather than routing through MIR's full MIR_gen() compilation pipeline. So the results demonstrate that the formula works as designed — which is useful but not the same as demonstrating that MIR's existing heuristic implicitly solves a NUM problem.

The distinction matters. The interesting question is whether real compilers, without being explicitly instrumented with shadow prices, already exhibit NUM-like behavior in their cost-benefit tradeoffs. Our experiment doesn't answer that yet. What it does is establish the experimental infrastructure and show that the NUM threshold scaling mechanism produces sensible, statistically robust decisions when applied to synthetic IR.

The function sizes are synthetic and uniform. Each function is a sequence of identical ADD instructions. Real code has branches, loops, nested calls, and varying instruction mixes that affect inlining cost in ways our synthetic IR doesn't capture. The clean separation between hot and cold inlining rates may be partly an artifact of this uniformity.

We ran at a single optimization level. The baseline threshold of 50 IR instructions is a reasonable moderate setting, but real compilers vary their inlining aggressiveness across optimization levels. We have stratified runs (O0/O1/O2) in progress that will show whether the NUM signal holds at conservative and aggressive settings.


What Comes Next

Two extensions are underway. First, running the experiment at three optimization levels with different baseline thresholds (20, 50, 100) to test whether shadow-price sensitivity is level-dependent. Second, a more detailed size-stratified analysis to identify exactly which function sizes sit at the "decision boundary" where shadow prices change the inlining outcome.

Longer term, the more substantive test is whether MIR's actual compilation pipeline — not our standalone decision function — responds to shadow prices injected into the MIR_func struct. That would require hooking into MIR_gen() and observing whether the inlining pass reads the shadow price field during real optimization. If it does, we're looking at evidence that the compiler's existing architecture already implements something like dual decomposition. If it doesn't, we've at least demonstrated that the architecture could be extended to do so, with measurable benefits.

The full experimental infrastructure — patches, test harness, analysis scripts, and raw data — is available for reproduction.


Takeaway

This is a proof-of-concept, not a proof. We've shown that a simple shadow-price-scaled threshold produces the correlation structure that NUM theory predicts, which is a necessary condition for the compiler-as-NUM hypothesis. It is not a sufficient condition. The hard work of connecting this to real compiler behavior is ahead.

But the results are clean enough to suggest the direction is worth pursuing. A Spearman ρ of 0.82 and a Cohen's d of 2.82 leave room for the effect to weaken substantially under more realistic conditions and still remain significant. That's exactly the kind of margin you want before committing to a more expensive experiment.


This work is part of a two-paper project: a theoretical reframing of compiler architecture through the NUM lens, and an empirical companion validating the formalism against MIR. The theory paper makes no empirical claims; the empirical paper (this work) tests whether the formalism generates testable predictions. Both papers are designed to stand independently regardless of the experimental outcome.