The tokio multi-thread scheduler has a benchmark called spawn_many_remote_busy2 that measures remote task spawn latency when all workers are busy. On a clean checkout at c6d58ce, it takes ~42 ms. That is a 4× degradation over the single-worker-busy case, which suggested the scheduler’s idle state machine and inject queue were the bottlenecks.
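For context, the pattern busy2 exercises looks roughly like the following — a hand-rolled sketch, not the actual criterion benchmark from tokio's bench suite; the worker count, task count, and busy-loop body are all assumptions:

use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::sync::Arc;
use std::time::Instant;
use tokio::runtime::Builder;

fn main() {
    let rt = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();

    // Keep every worker saturated with tasks that never finish.
    let stop = Arc::new(AtomicBool::new(false));
    for _ in 0..4 {
        let stop = stop.clone();
        rt.spawn(async move {
            while !stop.load(Relaxed) {
                tokio::task::yield_now().await;
            }
        });
    }

    // "Remote" spawns: issued from this thread, outside the runtime, so
    // every task goes through the shared inject (global) queue.
    let start = Instant::now();
    let handles: Vec<_> = (0..100_000).map(|_| rt.spawn(async {})).collect();
    rt.block_on(async {
        for handle in handles {
            handle.await.unwrap();
        }
    });
    println!("remote spawn burst while busy: {:?}", start.elapsed());

    stop.store(true, Relaxed);
}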

I ran an iterative mutation loop: one change, compile, benchmark, keep or discard, record, repeat. After fourteen iterations I had a stack of commits with a cumulative 96.8% reduction on busy2. Stacked results tell a story but obscure causality, so I cherry-picked the most promising change onto a clean baseline and re-measured the full suite. Two observations emerged.


Observation 1: fetch_add(0, SeqCst) can be a load(SeqCst)

Location: tokio/src/runtime/scheduler/multi_thread/idle.rs:154
Measured effect (isolated): ~30% improvement on busy2

notify_should_wakeup reads the shared atomic state with:

// Before
let state = State(self.state.fetch_add(0, SeqCst));
// After
let state = State(self.state.load(SeqCst));

On ARM64, fetch_add(0, SeqCst) compiles into __aarch64_ldset8_acq_rel, a function call. load(SeqCst) compiles to ldar, a single instruction. The difference is measurable.
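A minimal standalone snippet for reproducing the comparison — the function names are mine, not tokio's; compile it with optimizations for an aarch64 target and inspect the generated assembly:

// Hypothetical repro, not tokio code: compare the lowering of a zero
// fetch_add against a plain SeqCst load on an AtomicUsize.
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};

// Read via a read-modify-write; on AArch64 this stays an atomic RMW
// (an outline-atomics helper call or an LSE instruction), never a bare load.
pub fn read_via_rmw(state: &AtomicUsize) -> usize {
    state.fetch_add(0, SeqCst)
}

// Read via a plain load; on AArch64 this is a single ldar.
pub fn read_via_load(state: &AtomicUsize) -> usize {
    state.load(SeqCst)
}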

Why it’s controversial. The codebase explicitly pushes back on this:

“Note: we ‘read’ the counters using fetch_add(0, SeqCst) rather than load(SeqCst) because read-write-modify operations are guaranteed to observe the latest value, while the load is not.” — tokio/tests/task_hooks.rs:167

What I did. Before this work, idle.rs had zero loom tests. I wrote two that simulate the exact double-check pattern from worker_to_notify:

  1. worker_notify_double_check_race — one worker exits searching (fetch_sub) while a spawner concurrently runs: notify_should_wakeup() → lock → notify_should_wakeup() → unpark_one() → pop sleeper. After both threads settle, the test asserts that num_searching, num_unparked, and the sleeper queue are mutually consistent. A simplified model of this pattern is sketched after the list.
  2. worker_notify_stress_multi_exit — two workers exit searching concurrently while a spawner tries to wake one. After all threads settle, the test asserts no corrupted counters.
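
For reference, here is a stripped-down version of the first test's shape — a sketch only, not the test itself: the packed idle state is reduced to a single searching counter and the sleeper list to a plain Vec:

// Simplified loom model of the double-check pattern (not the actual
// idle.rs code). `searching` stands in for the packed num_searching /
// num_unparked word; `sleepers` is the lock-protected parked-worker list.
use loom::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use loom::sync::Mutex;
use loom::thread;
use std::sync::Arc;

#[test]
fn worker_notify_double_check_race() {
    loom::model(|| {
        let searching = Arc::new(AtomicUsize::new(1)); // one worker currently searching
        let sleepers = Arc::new(Mutex::new(vec![0usize])); // one parked worker

        // Worker: exits the searching state.
        let worker = {
            let searching = searching.clone();
            thread::spawn(move || {
                searching.fetch_sub(1, SeqCst);
            })
        };

        // Spawner: notify_should_wakeup -> lock -> re-check -> unpark_one.
        let spawner = {
            let (searching, sleepers) = (searching.clone(), sleepers.clone());
            thread::spawn(move || {
                if searching.load(SeqCst) == 0 {            // the load under test
                    let mut sleepers = sleepers.lock().unwrap();
                    if searching.load(SeqCst) == 0 {        // double-check under the lock
                        if sleepers.pop().is_some() {
                            searching.fetch_add(1, SeqCst); // the woken worker will search
                        }
                    }
                }
            })
        };

        worker.join().unwrap();
        spawner.join().unwrap();

        // Invariant: exactly one worker is either still parked or newly searching.
        let parked = sleepers.lock().unwrap().len();
        assert_eq!(searching.load(SeqCst) + parked, 1);
    });
}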

Both pass with LOOM_MAX_PREEMPTIONS=2. I also ran them with LOOM_MAX_PREEMPTIONS=5 — still pass.

I also tried to break them. I swapped the SeqCst load in the simulated notify_should_wakeup to Relaxed, re-ran with LOOM_MAX_PREEMPTIONS=5, and both tests still passed.

At first I suspected the tests weren’t stressful enough. Then I found the real reason in Loom’s own documentation:

“SeqCst accesses (e.g. load, store, ..): They are regarded as AcqRel.” — tokio-rs/loom docs

Loom does not implement the full C11 memory model. A SeqCst load in Loom is modeled as Acquire. A SeqCst store is modeled as Release. The total-order guarantee that distinguishes SeqCst from AcqRel — the exact property task_hooks.rs cites as the reason to prefer fetch_add(0, SeqCst) — is not modeled.
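
To make the missing guarantee concrete, here is the classic store-buffering litmus test — illustration only, unrelated to tokio's code:

// Store-buffering litmus test: the property that separates SeqCst from
// Acquire/Release.
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use std::thread;

static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let a = thread::spawn(|| {
        X.store(1, SeqCst);
        Y.load(SeqCst) // ry
    });
    let b = thread::spawn(|| {
        Y.store(1, SeqCst);
        X.load(SeqCst) // rx
    });
    let (ry, rx) = (a.join().unwrap(), b.join().unwrap());

    // With SeqCst there is a single total order over all four accesses, so at
    // least one thread must observe the other's store: (rx, ry) == (0, 0) is
    // forbidden. Weaken the stores to Release and the loads to Acquire and
    // (0, 0) becomes a legal outcome -- and since loom treats SeqCst as
    // AcqRel, it cannot tell the two apart.
    assert!(!(rx == 0 && ry == 0));
}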

This means: no loom test can validate or invalidate the SeqCst-specific ordering claim. It is not a question of test quality or coverage depth. It is a fundamental limitation of the tool. What the tests prove is that the double-check pattern is structurally sound at the AcqRel level. What no loom test can prove is whether load(SeqCst) provides stronger cross-atomic ordering than fetch_add(0, SeqCst) in the full C11 model.

My take: the fetch_add(0) → load change is not ready for a PR. The ARM64 codegen win is real, but the codebase documents a design principle against it (task_hooks.rs:167), and the specific ordering property it claims to defend — the SeqCst total-order guarantee — is not testable in Loom. It needs maintainer review, not more loom tests.


Observation 2: Batch-pop on global queue ticks

Location: tokio/src/runtime/scheduler/multi_thread/worker.rs:1062
Measured effect (isolated): busy2 drops from 42.1 ms to 2.6 ms (−94%)

What the change does. Every global_queue_interval ticks, each worker checks the inject queue for remotely spawned tasks. The original path pulls a single task:

// On global queue tick
worker.handle.next_remote_task().or_else(|| self.next_local_task())

The patched path pulls a batch proportional to queue depth ÷ worker count, pushes the remainder into the local queue, and only then falls back to local work:

// Split the inject queue's backlog across the workers, +1 so a short
// queue still yields at least one task; `cap` bounds the batch size.
let n = usize::min(
    worker.inject().len() / worker.handle.shared.remotes.len() + 1,
    cap,
);
// One lock acquisition pops the whole batch.
let mut tasks = unsafe { worker.inject().pop_n(&mut synced.inject, n) };
// Run the first task now; the rest go into this worker's local queue.
let ret = tasks.next();
self.run_queue.push_back(tasks);

Why it helps. The busy2 benchmark saturates every worker. Without batch-pop, each worker pulls one task at a time from the shared inject queue, acquiring the synced lock on every tick. The result is lock thrashing:

flowchart LR
    subgraph before["Before: single task pull"]
        A[Worker 0] -->|lock| I0[Inject queue]
        B[Worker 1] -->|lock| I0
        C[Worker 2] -->|lock| I0
        D[Worker 3] -->|lock| I0
    end

    subgraph after["After: batch pop"]
        A2[Worker 0] -->|lock once, grab n tasks| I1[Inject queue]
        B2[Worker 1] -->|lock once, grab n tasks| I1
        C2[Worker 2] -->|lock once, grab n tasks| I1
        D2[Worker 3] -->|lock once, grab n tasks| I1
    end

    style after fill:#f0f8e8

Batch-pop converts O(n) lock acquisitions into O(n / batch_size), and because the batch size scales with inject queue depth rather than being fixed, lock contention stops growing in step with the number of queued tasks. The improvement is structural, not a constant factor.
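
A worked example of the batch-size formula above — cap is left as a parameter here because the snippet does not define it, and the concrete numbers are only for illustration:

// Stand-alone illustration of the batch-size formula used in the patch.
fn batch_size(inject_len: usize, num_workers: usize, cap: usize) -> usize {
    usize::min(inject_len / num_workers + 1, cap)
}

fn main() {
    // 4 workers, 1,000 tasks queued on the inject queue, assumed cap of 256:
    // one lock acquisition now moves 251 tasks instead of exactly one.
    assert_eq!(batch_size(1_000, 4, 256), 251);

    // An almost-empty queue still pops at most a single task, so the
    // light-load behavior matches the original single-pop path.
    assert_eq!(batch_size(0, 4, 256), 1);
    assert_eq!(batch_size(3, 4, 256), 1);
}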

What about the other benchmarks?

I ran the full rt_multi_threaded suite on baseline and on batch-pop, repeating the benchmarks most likely to show movement. The result:

| Benchmark | Baseline (c6d58ce) | Batch-pop only | Δ |
| --- | --- | --- | --- |
| spawn_many_local | 10.02 ms | 9.55 ms | noise |
| spawn_many_remote_idle | 6.11 ms | 5.63 ms | noise |
| spawn_many_remote_busy1 | 7.84 ms | 7.59 ms | noise |
| spawn_many_remote_busy2 | 42.1 ms | 2.6 ms | −94% |
| ping_pong | 1.05 ms | 1.04 ms | noise |
| yield_many | 17.28 ms | 17.69 ms | noise |

Only busy2 moves. Everything else is within criterion variance.

The TARGET_GLOBAL_QUEUE_INTERVAL tuning (lowered stepwise from 200 µs to 100 µs to 75 µs during the iterative loop) is a magic-number tweak with no principled justification. The batch-pop change carries ~36 ms of the total win; interval tuning adds maybe 0.5 ms on top. It should be dropped.


Takeaways

  1. Batch-pop is the only change with a clear, isolated effect. It has a massive effect on the worst-case benchmark and no consistent regression on the others. The change is purely structural — same data structures, different batching behavior — so it is auditable. Whether it is appropriate for upstream is a question for tokio maintainers.

  2. fetch_add(0) → load is an open question. The ARM64 codegen win is real, but the codebase documents a principled objection, and the cross-atomic ordering claim cannot be validated in a unit test. It needs maintainer review, not more benchmarks.

  3. Isolation testing is the only honest way to attribute effects. Stacked commits produced a 96.8% cumulative number. Cherry-picking onto a clean baseline revealed that batch-pop alone carries ~94% of the total improvement. Everything else was noise.


Method

Iterative benchmark-driven loop on an ARM64 machine (Debian). Each iteration: one change, compile in release mode, run criterion, log result, keep or discard. Failed experiments were discarded immediately.

After the loop, I isolated the top change on a clean baseline branch and re-measured the full suite. Stacked results tell a story; isolation testing tells you what actually happened.


Raw data

| Benchmark | Baseline (c6d58ce) | Batch-pop only | Runs (n) |
| --- | --- | --- | --- |
| spawn_many_local | 10.02 ms | 9.55 ms | 3 |
| spawn_many_remote_idle | 6.11 ms | 5.63 ms | 3 |
| spawn_many_remote_busy1 | 7.84 ms | 7.59 ms | 3 |
| spawn_many_remote_busy2 | 42.1 ms | 2.6 ms | 3 |
| ping_pong | 1.05 ms | 1.04 ms | 1 |
| yield_many | 17.28 ms | 17.69 ms | 1 |
| chained_spawn | 224.7 µs | 224.9 µs | 1 |
| threaded_scheduler_spawn | 4.74 µs | 3.99 µs | 1 |
| basic_scheduler_spawn | 821 ns | 839 ns | 1 |
| basic_scheduler_spawn10 | 4.87 µs | 4.92 µs | 1 |
| threaded_scheduler_spawn10 | 14.70 µs | 14.80 µs | 1 |

References

| Resource | Location |
| --- | --- |
| Tokio repo | ~/rust-repos/tokio |
| Loom test worktree | ~/rust-repos/tokio-loom-test |
| Score log | ~/rust-repos/tokio/.opt_log.txt |
| Benchmark sweep data | rust-optimization-workflow skill |