The tokio multi-thread scheduler has a benchmark called spawn_many_remote_busy2 that measures remote task spawn latency when all workers are busy. On a clean checkout at c6d58ce, it takes ~42 ms. That is a 4× degradation over the single-worker-busy case, which suggested the scheduler’s idle state machine and inject queue were the bottlenecks.
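For context, the pattern busy2 exercises looks roughly like the following — a hand-rolled sketch, not the actual criterion benchmark from tokio's bench suite; the worker count, task count, and busy-loop body are all assumptions:

use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::sync::Arc;
use std::time::Instant;
use tokio::runtime::Builder;

fn main() {
    let rt = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();

    // Keep every worker saturated with tasks that never finish.
    let stop = Arc::new(AtomicBool::new(false));
    for _ in 0..4 {
        let stop = stop.clone();
        rt.spawn(async move {
            while !stop.load(Relaxed) {
                tokio::task::yield_now().await;
            }
        });
    }

    // "Remote" spawns: issued from this thread, outside the runtime, so
    // every task goes through the shared inject (global) queue.
    let start = Instant::now();
    let handles: Vec<_> = (0..100_000).map(|_| rt.spawn(async {})).collect();
    rt.block_on(async {
        for handle in handles {
            handle.await.unwrap();
        }
    });
    println!("remote spawn burst while busy: {:?}", start.elapsed());

    stop.store(true, Relaxed);
}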

I ran an iterative mutation loop: one change, compile, benchmark, keep or discard, record, repeat. After fourteen iterations I had a stack of commits with a cumulative 96.8% reduction on busy2. Stacked results tell a story but obscure causality, so I cherry-picked the most promising change onto a clean baseline and re-measured the full suite. Two observations emerged.


Observation 1: fetch_add(0, SeqCst) can be a load(SeqCst)

Location: tokio/src/runtime/scheduler/multi_thread/idle.rs:154
Measured effect (isolated): ~30% improvement on busy2

notify_should_wakeup reads the shared atomic state with:

// Before
let state = State(self.state.fetch_add(0, SeqCst));
// After
let state = State(self.state.load(SeqCst));

On ARM64, fetch_add(0, SeqCst) compiles into __aarch64_ldset8_acq_rel, a function call. load(SeqCst) compiles to ldar, a single instruction. The difference is measurable.
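A minimal standalone snippet for reproducing the comparison — the function names are mine, not tokio's; compile it with optimizations for an aarch64 target and inspect the generated assembly:

// Hypothetical repro, not tokio code: compare the lowering of a zero
// fetch_add against a plain SeqCst load on an AtomicUsize.
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};

// Read via a read-modify-write; on AArch64 this stays an atomic RMW
// (an outline-atomics helper call or an LSE instruction), never a bare load.
pub fn read_via_rmw(state: &AtomicUsize) -> usize {
    state.fetch_add(0, SeqCst)
}

// Read via a plain load; on AArch64 this is a single ldar.
pub fn read_via_load(state: &AtomicUsize) -> usize {
    state.load(SeqCst)
}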

Why it’s controversial. The codebase explicitly pushes back on this:

“Note: we ‘read’ the counters using fetch_add(0, SeqCst) rather than load(SeqCst) because read-write-modify operations are guaranteed to observe the latest value, while the load is not.” — tokio/tests/task_hooks.rs:167

What I did. Before this work, idle.rs had zero loom tests. I wrote two that simulate the exact double-check pattern from worker_to_notify:

  1. worker_notify_double_check_race — one worker exits searching (fetch_sub) while a spawner concurrently runs: notify_should_wakeup() → lock → notify_should_wakeup() → unpark_one() → pop sleeper. After both threads settle, the test asserts that num_searching, num_unparked, and the sleeper queue are mutually consistent. A simplified model of this pattern is sketched after the list.
  2. worker_notify_stress_multi_exit — two workers exit searching concurrently while a spawner tries to wake one. After all threads settle, the test asserts no corrupted counters.
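
For reference, here is a stripped-down version of the first test's shape — a sketch only, not the test itself: the packed idle state is reduced to a single searching counter and the sleeper list to a plain Vec:

// Simplified loom model of the double-check pattern (not the actual
// idle.rs code). `searching` stands in for the packed num_searching /
// num_unparked word; `sleepers` is the lock-protected parked-worker list.
use loom::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use loom::sync::Mutex;
use loom::thread;
use std::sync::Arc;

#[test]
fn worker_notify_double_check_race() {
    loom::model(|| {
        let searching = Arc::new(AtomicUsize::new(1)); // one worker currently searching
        let sleepers = Arc::new(Mutex::new(vec![0usize])); // one parked worker

        // Worker: exits the searching state.
        let worker = {
            let searching = searching.clone();
            thread::spawn(move || {
                searching.fetch_sub(1, SeqCst);
            })
        };

        // Spawner: notify_should_wakeup -> lock -> re-check -> unpark_one.
        let spawner = {
            let (searching, sleepers) = (searching.clone(), sleepers.clone());
            thread::spawn(move || {
                if searching.load(SeqCst) == 0 {            // the load under test
                    let mut sleepers = sleepers.lock().unwrap();
                    if searching.load(SeqCst) == 0 {        // double-check under the lock
                        if sleepers.pop().is_some() {
                            searching.fetch_add(1, SeqCst); // the woken worker will search
                        }
                    }
                }
            })
        };

        worker.join().unwrap();
        spawner.join().unwrap();

        // Invariant: exactly one worker is either still parked or newly searching.
        let parked = sleepers.lock().unwrap().len();
        assert_eq!(searching.load(SeqCst) + parked, 1);
    });
}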

Both pass with LOOM_MAX_PREEMPTIONS=2. I also ran them with LOOM_MAX_PREEMPTIONS=5 — still pass.

I also tried to break them. I swapped the SeqCst load in the simulated notify_should_wakeup to Relaxed, re-ran with LOOM_MAX_PREEMPTIONS=5, and both tests still passed.

At first I suspected the tests weren’t stressful enough. Then I found the real reason in Loom’s own documentation:

“SeqCst accesses (e.g. load, store, ..): They are regarded as AcqRel.” — tokio-rs/loom docs

Loom does not implement the full C11 memory model. A SeqCst load in Loom is modeled as Acquire. A SeqCst store is modeled as Release. The total-order guarantee that distinguishes SeqCst from AcqRel — the exact property task_hooks.rs cites as the reason to prefer fetch_add(0, SeqCst) — is not modeled.
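
To make the missing guarantee concrete, here is the classic store-buffering litmus test — illustration only, unrelated to tokio's code:

// Store-buffering litmus test: the property that separates SeqCst from
// Acquire/Release.
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use std::thread;

static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let a = thread::spawn(|| {
        X.store(1, SeqCst);
        Y.load(SeqCst) // ry
    });
    let b = thread::spawn(|| {
        Y.store(1, SeqCst);
        X.load(SeqCst) // rx
    });
    let (ry, rx) = (a.join().unwrap(), b.join().unwrap());

    // With SeqCst there is a single total order over all four accesses, so at
    // least one thread must observe the other's store: (rx, ry) == (0, 0) is
    // forbidden. Weaken the stores to Release and the loads to Acquire and
    // (0, 0) becomes a legal outcome -- and since loom treats SeqCst as
    // AcqRel, it cannot tell the two apart.
    assert!(!(rx == 0 && ry == 0));
}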

This means: no loom test can validate or invalidate the SeqCst-specific ordering claim. It is not a question of test quality or coverage depth. It is a fundamental limitation of the tool. What the tests prove is that the double-check pattern is structurally sound at the AcqRel level. What no loom test can prove is whether load(SeqCst) provides stronger cross-atomic ordering than fetch_add(0, SeqCst) in the full C11 model.

My take: the fetch_add(0) → load change is not ready for a PR. The ARM64 codegen win is real, but the codebase documents a design principle against it (task_hooks.rs:167), and the specific ordering property it claims to defend — the SeqCst total-order guarantee — is not testable in Loom. It needs maintainer review, not more loom tests.


Observation 2: Batch-pop on global queue ticks

Location: tokio/src/runtime/scheduler/multi_thread/worker.rs:1062
Measured effect (isolated): busy2 drops from 42.1 ms to 2.6 ms (−94%)

What the change does. Every global_queue_interval ticks, each worker checks the inject queue for remotely spawned tasks. The original path pulls a single task:

// On global queue tick
worker.handle.next_remote_task().or_else(|| self.next_local_task())

The patched path pulls a batch proportional to queue depth ÷ worker count, pushes the remainder into the local queue, and only then falls back to local work:

// Split the inject queue's backlog across the workers, +1 so a short
// queue still yields at least one task; `cap` bounds the batch size.
let n = usize::min(
    worker.inject().len() / worker.handle.shared.remotes.len() + 1,
    cap,
);
// One lock acquisition pops the whole batch.
let mut tasks = unsafe { worker.inject().pop_n(&mut synced.inject, n) };
// Run the first task now; the rest go into this worker's local queue.
let ret = tasks.next();
self.run_queue.push_back(tasks);

Why it helps. The busy2 benchmark saturates every worker. Without batch-pop, each worker pulls one task at a time from the shared inject queue, acquiring the synced lock on every tick. The result is lock thrashing:

flowchart LR
    subgraph before["Before: single task pull"]
        A[Worker 0] -->|lock| I0[Inject queue]
        B[Worker 1] -->|lock| I0
        C[Worker 2] -->|lock| I0
        D[Worker 3] -->|lock| I0
    end

    subgraph after["After: batch pop"]
        A2[Worker 0] -->|lock once, grab n tasks| I1[Inject queue]
        B2[Worker 1] -->|lock once, grab n tasks| I1
        C2[Worker 2] -->|lock once, grab n tasks| I1
        D2[Worker 3] -->|lock once, grab n tasks| I1
    end

    style after fill:#f0f8e8

Batch-pop converts O(n) lock acquisitions into O(n / batch_size), and because the batch size scales with inject queue depth rather than being fixed, lock contention stops growing in step with the number of queued tasks. The improvement is structural, not a constant factor.
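
A worked example of the batch-size formula above — cap is left as a parameter here because the snippet does not define it, and the concrete numbers are only for illustration:

// Stand-alone illustration of the batch-size formula used in the patch.
fn batch_size(inject_len: usize, num_workers: usize, cap: usize) -> usize {
    usize::min(inject_len / num_workers + 1, cap)
}

fn main() {
    // 4 workers, 1,000 tasks queued on the inject queue, assumed cap of 256:
    // one lock acquisition now moves 251 tasks instead of exactly one.
    assert_eq!(batch_size(1_000, 4, 256), 251);

    // An almost-empty queue still pops at most a single task, so the
    // light-load behavior matches the original single-pop path.
    assert_eq!(batch_size(0, 4, 256), 1);
    assert_eq!(batch_size(3, 4, 256), 1);
}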

What about the other benchmarks?

I ran the full rt_multi_threaded suite on baseline and on batch-pop, repeating the benchmarks most likely to show movement. The result:

| Benchmark | Baseline (c6d58ce) | Batch-pop only | Δ |
| --- | --- | --- | --- |
| spawn_many_local | 10.02 ms | 9.55 ms | noise |
| spawn_many_remote_idle | 6.11 ms | 5.63 ms | noise |
| spawn_many_remote_busy1 | 7.84 ms | 7.59 ms | noise |
| spawn_many_remote_busy2 | 42.1 ms | 2.6 ms | −94% |
| ping_pong | 1.05 ms | 1.04 ms | noise |
| yield_many | 17.28 ms | 17.69 ms | noise |

Only busy2 moves. Everything else is within criterion variance.

The TARGET_GLOBAL_QUEUE_INTERVAL tuning (lowered stepwise from 200 µs to 100 µs to 75 µs during the iterative loop) is a magic-number tweak with no principled justification. The batch-pop change carries ~36 ms of the total win; interval tuning adds maybe 0.5 ms on top. It should be dropped.


Takeaways

  1. Batch-pop is the only change with a clear, isolated effect. It has a massive effect on the worst-case benchmark and no consistent regression on the others. The change is purely structural — same data structures, different batching behavior — so it is auditable. Whether it is appropriate for upstream is a question for tokio maintainers.

  2. fetch_add(0) → load is an open question. The ARM64 codegen win is real, but the codebase documents a principled objection, and the cross-atomic ordering claim cannot be validated in a unit test. It needs maintainer review, not more benchmarks.

  3. Isolation testing is the only honest way to attribute effects. Stacked commits produced a 96.8% cumulative number. Cherry-picking onto a clean baseline revealed that batch-pop alone carries ~94% of the total improvement. Everything else was noise.


Method

Iterative benchmark-driven loop on an ARM64 machine (Debian). Each iteration: one change, compile in release mode, run criterion, log result, keep or discard. Failed experiments were discarded immediately.

After the loop, I isolated the top change on a clean baseline branch and re-measured the full suite. Stacked results tell a story; isolation testing tells you what actually happened.


Raw data

| Benchmark | Baseline (c6d58ce) | Batch-pop only | Runs (n) |
| --- | --- | --- | --- |
| spawn_many_local | 10.02 ms | 9.55 ms | 3 |
| spawn_many_remote_idle | 6.11 ms | 5.63 ms | 3 |
| spawn_many_remote_busy1 | 7.84 ms | 7.59 ms | 3 |
| spawn_many_remote_busy2 | 42.1 ms | 2.6 ms | 3 |
| ping_pong | 1.05 ms | 1.04 ms | 1 |
| yield_many | 17.28 ms | 17.69 ms | 1 |
| chained_spawn | 224.7 µs | 224.9 µs | 1 |
| threaded_scheduler_spawn | 4.74 µs | 3.99 µs | 1 |
| basic_scheduler_spawn | 821 ns | 839 ns | 1 |
| basic_scheduler_spawn10 | 4.87 µs | 4.92 µs | 1 |
| threaded_scheduler_spawn10 | 14.70 µs | 14.80 µs | 1 |

References

| Resource | Location |
| --- | --- |
| Tokio repo | ~/rust-repos/tokio |
| Loom test worktree | ~/rust-repos/tokio-loom-test |
| Score log | ~/rust-repos/tokio/.opt_log.txt |
| Benchmark sweep data | rust-optimization-workflow skill |