The tokio multi-thread scheduler has a benchmark called spawn_many_remote_busy2 that measures remote task spawn latency when all workers are busy. On a clean checkout at c6d58ce, it takes ~42 ms. That is a 4× degradation over the single-worker-busy case, which suggested the scheduler’s idle state machine and inject queue were the bottlenecks.
I ran an iterative mutation loop: one change, compile, benchmark, keep or discard, record, repeat. After fourteen iterations I had a stack of commits with a cumulative 96.8% reduction on busy2. Stacked results tell a story but obscure causality, so I cherry-picked the most promising change onto a clean baseline and re-measured the full suite. Two observations emerged.
Observation 1: fetch_add(0, SeqCst) can be a load(SeqCst)
Location: tokio/src/runtime/scheduler/multi_thread/idle.rs:154
Measured effect (isolated): ~30% on busy2
notify_should_wakeup reads a u8 atomic state with:
| |
On ARM64, fetch_add(0, SeqCst) compiles into __aarch64_ldset8_acq_rel, a function call. load(SeqCst) compiles to ldar, a single instruction. The difference is measurable.
Why it’s controversial. The codebase explicitly pushes back on this:
“Note: we ‘read’ the counters using
fetch_add(0, SeqCst)rather thanload(SeqCst)because read-write-modify operations are guaranteed to observe the latest value, while the load is not.” —tokio/tests/task_hooks.rs:167
What I did. Before this work, idle.rs had zero loom tests. I wrote two that simulate the exact double-check pattern from worker_to_notify:
worker_notify_double_check_race— one worker exits searching (fetch_sub) while a spawner concurrently runs:notify_should_wakeup()→ lock →notify_should_wakeup()→unpark_one()→ pop sleeper. After both threads settle, the test asserts thatnum_searching,num_unparked, and the sleeper queue are mutually consistent.worker_notify_stress_multi_exit— two workers exit searching concurrently while a spawner tries to wake one. After all threads settle, the test asserts no corrupted counters.
Both pass with LOOM_MAX_PREEMPTIONS=2. I also ran them with LOOM_MAX_PREEMPTIONS=5 — still pass.
I also tried to break them. I swapped the SeqCst load in the simulated notify_should_wakeup to Relaxed, re-ran with LOOM_MAX_PREEMPTIONS=5, and both tests still passed.
At first I suspected the tests weren’t stressful enough. Then I found the real reason in Loom’s own documentation:
“SeqCst accesses (e.g. load, store, ..): They are regarded as AcqRel.” — tokio-rs/loom docs
Loom does not implement the full C11 memory model. A SeqCst load in Loom is modeled as Acquire. A SeqCst store is modeled as Release. The total-order guarantee that distinguishes SeqCst from AcqRel — the exact property task_hooks.rs cites as the reason to prefer fetch_add(0, SeqCst) — is not modeled.
This means: no loom test can validate or invalidate the SeqCst-specific ordering claim. It is not a question of test quality or coverage depth. It is a fundamental limitation of the tool. What the tests prove is that the double-check pattern is structurally sound at the AcqRel level. What no loom test can prove is whether load(SeqCst) provides stronger cross-atomic ordering than fetch_add(0, SeqCst) in the full C11 model.
My take: the fetch_add(0) → load change is not ready for a PR. The ARM64 codegen win is real, but the codebase documents a design principle against it (task_hooks.rs:167), and the specific ordering property it claims to defend — the SeqCst total-order guarantee — is not testable in Loom. It needs maintainer review, not more loom tests.
Observation 2: Batch-pop on global queue ticks
Location: tokio/src/runtime/scheduler/multi_thread/worker.rs:1062
Measured effect (isolated): busy2 drops from 42.1 ms to 2.6 ms (−94%)
What the change does. Every global_queue_interval ticks, each worker checks the inject queue for remotely spawned tasks. The original path pulls a single task:
| |
The patched path pulls a batch proportional to queue depth ÷ worker count, pushes the remainder into the local queue, and only then falls back to local work:
| |
Why it helps. The busy2 benchmark saturates every worker. Without batch-pop, each worker pulls one task at a time from the shared inject queue, acquiring the synced lock on every tick. The result is lock thrashing:
flowchart LR
subgraph before["Before: single task pull"]
A[Worker 0] -->|lock| I0[Inject queue]
B[Worker 1] -->|lock| I0
C[Worker 2] -->|lock| I0
D[Worker 3] -->|lock| I0
end
flowchart LR
subgraph after["After: batch pop"]
A2[Worker 0] -->|lock once, grab n tasks| I1[Inject queue]
B2[Worker 1] -->|lock once, grab n tasks| I1
C2[Worker 2] -->|lock once, grab n tasks| I1
D2[Worker 3] -->|lock once, grab n tasks| I1
end
style after fill:#f0f8e8
Batch-pop converts O(n) lock acquisitions into O(n / batch_size). The improvement is structural, not a constant factor.
What about the other benchmarks?
I ran the full rt_multi_threaded suite on baseline and on batch-pop, repeating the benchmarks most likely to show movement. The result:
| Benchmark | Baseline (c6d58ce) | Batch-pop only | Δ |
|---|---|---|---|
spawn_many_local | 10.02 ms | 9.55 ms | noise |
spawn_many_remote_idle | 6.11 ms | 5.63 ms | noise |
spawn_many_remote_busy1 | 7.84 ms | 7.59 ms | noise |
spawn_many_remote_busy2 | 42.1 ms | 2.6 ms | −94% |
ping_pong | 1.05 ms | 1.04 ms | noise |
yield_many | 17.28 ms | 17.69 ms | noise |
Only busy2 moves. Everything else is within criterion variance.
The TARGET_GLOBAL_QUEUE_INTERVAL tuning (halved from 200 µs to 100 µs to 75 µs during the iterative loop) is a magic-number tweak with no principled justification. The batch-pop change carries ~36 ms of the total win; interval tuning adds maybe 0.5 ms on top. It should be dropped.
Takeaways
Batch-pop is the only change with a clear, isolated effect. It has a massive effect on the worst-case benchmark and no consistent regression on the others. The change is purely structural — same data structures, different batching behavior — so it is auditable. Whether it is appropriate for upstream is a question for tokio maintainers.
fetch_add(0)→loadis an open question. The ARM64 codegen win is real, but the codebase documents a principled objection, and the cross-atomic ordering claim cannot be validated in a unit test. It needs maintainer review, not more benchmarks.Isolation testing is the only honest way to attribute effects. Stacked commits produced a 96.8% cumulative number. Cherry-picking onto a clean baseline revealed that batch-pop alone carries ~94% of the total improvement. Everything else was noise.
Method
Iterative benchmark-driven loop on an ARM64 machine (Debian). Each iteration: one change, compile in release mode, run criterion, log result, keep or discard. Failed experiments were discarded immediately.
After the loop, I isolated the top change on a clean baseline branch and re-measured the full suite. Stacked results tell a story; isolation testing tells you what actually happened.
Raw data
| Benchmark | Baseline (c6d58ce) | Batch-pop only | n |
|---|---|---|---|
spawn_many_local | 10.02 ms | 9.55 ms | 3 |
spawn_many_remote_idle | 6.11 ms | 5.63 ms | 3 |
spawn_many_remote_busy1 | 7.84 ms | 7.59 ms | 3 |
spawn_many_remote_busy2 | 42.1 ms | 2.6 ms | 3 |
ping_pong | 1.05 ms | 1.04 ms | 1 |
yield_many | 17.28 ms | 17.69 ms | 1 |
chained_spawn | 224.7 µs | 224.9 µs | 1 |
threaded_scheduler_spawn | 4.74 µs | 3.99 µs | 1 |
basic_scheduler_spawn | 821 ns | 839 ns | 1 |
basic_scheduler_spawn10 | 4.87 µs | 4.92 µs | 1 |
threaded_scheduler_spawn10 | 14.70 µs | 14.80 µs | 1 |
References
| Resource | Location |
|---|---|
| Tokio repo | ~/rust-repos/tokio |
| Loom test worktree | ~/rust-repos/tokio-loom-test |
| Score log | ~/rust-repos/tokio/.opt_log.txt |
| Benchmark sweep data | rust-optimization-workflow skill |