Machine: oci-saulire — ARM64 (aarch64), 4 vCPUs, 24 GB RAM, Debian stable
Compiler: rustc 1.95.0, release mode
Tokio commit: c6d58ce (baseline)
Benchmark harness: criterion via cargo bench --bench rt_multi_threaded
Patch: tokio-batch-pop.patch

Tokio’s scheduler is work-stealing1: each worker owns a local deque and steals from peers when idle. Tasks spawned from outside the runtime — or that overflow a local queue — land in a separate inject queue (MPSC). Workers poll it on an adaptive interval that targets ~61 tasks per 200 μs, clamped between 2 and 1272. Under burst loads the inject queue becomes a serialization point: every task contends for the same mutex.

Batching the global tick

On a global queue tick, baseline pulls a single task and returns to local work:

1
worker.handle.next_remote_task().or_else(|| self.next_local_task())

The idle path, by contrast, was already batching: it pops n tasks, executes the first, and queues the rest locally. The patch extracts that logic into a Core::batch_from_inject helper and reuses it on both paths. The tick path caps the batch at 32; the idle path caps at half the local queue to avoid overflow:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
fn next_task(&mut self, worker: &Worker) -> Option<Notified> {
    if self.tick % self.global_queue_interval == 0 {
        self.tune_global_queue_interval(worker);

        const MAX_BATCH: usize = 32;

        self.batch_from_inject(worker, MAX_BATCH)
            .or_else(|| self.next_local_task())
    } else {
        self.next_local_task()
            .or_else(|| self.batch_from_inject(worker, self.run_queue.max_capacity() / 2))
    }
}

batch_from_inject pops a batch proportional to inject_len / num_workers + 1, bounded by the caller’s cap, pushes the remainder into the local queue, and returns the first task directly:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
fn batch_from_inject(&mut self, worker: &Worker, max_batch: usize) -> Option<Notified> {
    if worker.inject().is_empty() {
        return None;
    }

    let cap = self.run_queue.remaining_slots().min(max_batch).max(1);

    let n = (worker.inject().len() / worker.handle.shared.remotes.len() + 1)
        .clamp(1, cap);

    let mut synced = worker.handle.shared.synced.lock();
    let mut tasks = unsafe { worker.inject().pop_n(&mut synced.inject, n) };

    let ret = tasks.next();
    self.run_queue.push_back(tasks);
    ret
}

MAX_BATCH = 32 matches crossbeam-deque’s steal_batch size. With 4 workers and LOCAL_QUEUE_CAPACITY = 2563, the formula bottoms out at inject_len / 4 + 1, capped at 32.

Fairness note: injected tasks land at the back of the local queue, but the worker pops from the back (LIFO), so they run before older local work that sits deeper in the queue. With cap=32 this head-of-line blocking is bounded to at most 31 tasks per tick. The worker cycles back to original-local work well before the next global tick.

Cap choice

I benchmarked different caps on busy2:

capbusy2 medianvs baseline
217.28 ms−49%
411.58 ms−66%
86.48 ms−81%
163.99 ms−88%
322.74 ms−92%
642.42 ms−93%
1282.23 ms−93%

The throughput knee is at 32 — everything past it is diminishing returns. But the larger reason for the cap is fairness: with 128, a single global tick can grab 128 remote tasks, execute 1, push 127 to the back of the local queue, and bury existing local work behind a wall of converted-remote tasks. Cap=32 captures 92% of the throughput win without that risk.

Below inject_len = 124 (on 4 workers) the formula never hits the cap, so shallow queues see smaller batches and the fairness bound is even tighter.

Prior art

Tokio already batch-moves tasks. push_overflow in queue.rs3 moves LOCAL_QUEUE_CAPACITY / 2 tasks into the inject queue in one go. The inject queue also exposes push_batch for batch insert. The API is there; the scheduler simply never called pop_n from the global tick path.

Crossbeam’s Chase-Lev deque4 — the de facto Rust work-stealing implementation — batches by design. Its steal_batch grabs roughly half the queue (capped at 32) in a single CAS5. Same principle: one synchronization, multiple items moved.

Results

BenchmarkBaselineBatch-popΔ
spawn_many_local10.05 ms9.74 msnoise
spawn_many_remote_idle6.99 ms7.12 msnoise
spawn_many_remote_busy17.36 ms6.92 msnoise
spawn_many_remote_busy233.96 ms2.74 ms−92%
ping_pong1.17 ms1.16 msnoise
yield_many11.30 ms11.70 msnoise

busy2 spawns a large burst from outside the runtime while all workers are saturated. Baseline acquires the inject mutex once per task; batch-pop acquires it once per batch. The improvement is structural and reproducible.

Risks

  • Fairness: bounded to at most 31 injected tasks ahead of existing local work per tick. Larger caps would risk burying local work indefinitely.
  • inject().len() is an AtomicUsize::load(Acquire) — no lock, but still a cache-coherence operation on every tick.
  • Low-volume traffic: a small steady trickle of remote tasks could see slightly worse latency if batch-pop defers them into local queues. The +1 floor and cap keep this bounded.

Batch-pop uses tokio’s existing pop_n API and applies a standard work-stealing amortization pattern. Whether upstream accepts it depends on the fairness trade-off.



  1. Carl Lerche, “Making the Tokio scheduler 10x faster,” Tokio Blog, October 13, 2019. https://tokio.rs/blog/2019-10-scheduler ↩︎

  2. tokio/src/runtime/scheduler/multi_thread/stats.rs, lines 33–39 and worker.rs:1063. https://github.com/tokio-rs/tokio/blob/c6d58ce/tokio/src/runtime/scheduler/multi_thread/stats.rs#L33 ↩︎

  3. tokio/src/runtime/scheduler/multi_thread/queue.rs, push_overflow method. https://github.com/tokio-rs/tokio/blob/c6d58ce/tokio/src/runtime/scheduler/multi_thread/queue.rs#L246 ↩︎ ↩︎

  4. David Chase and Yossi Lev. 2005. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures (SPAA ‘05). ACM, 21–28. https://doi.org/10.1145/1073970.1073974 ↩︎

  5. crossbeam-deque v0.8, Stealer::steal_batch documentation. https://docs.rs/crossbeam-deque/0.8/crossbeam_deque/struct.Stealer.html#method.steal_batch ↩︎