Wander Lairson Costa

Modernizing stalld with eBPF

• kernel

Last year I started working on stalld, a daemon that prevents thread starvation in real-time Linux systems. On systems with isolated CPUs running high-priority RT tasks, essential kernel threads can starve indefinitely. stalld monitors CPU run queues and temporarily boosts starving threads using SCHED_DEADLINE (or SCHED_FIFO as fallback).
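
The boost itself is a temporary policy change through sched_setattr(2). Here is a minimal sketch of that call, not stalld's actual boost code (which also restores the original policy once the boost expires); glibc ships no wrapper, so the attr struct is declared by hand:

#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* sched_setattr(2) has no glibc wrapper, so declare the attr struct here. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Hypothetical helper: grant a starving task runtime_ns of CPU every period_ns. */
static int boost_deadline(pid_t pid, uint64_t runtime_ns, uint64_t period_ns)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_DEADLINE;
    attr.sched_runtime = runtime_ns;
    attr.sched_deadline = period_ns;
    attr.sched_period = period_ns;

    return syscall(SYS_sched_setattr, pid, &attr, 0);
}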

The main goal was switching from the old sched_debug backend to a new queue_track backend. The old approach reads /sys/kernel/debug/sched/debug to get task information: file I/O, parsing, relatively high overhead. The new approach uses BPF programs to track tasks in real time. Lower overhead, better precision, but a complete rewrite of the core monitoring logic.
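
To make the old backend's cost concrete, here is a rough sketch of what every polling iteration has to do, re-reading and scanning the whole debugfs file (this is not stalld's actual parser, and the file format varies between kernel versions):

#include <stdio.h>
#include <string.h>

/* Rough sketch of the old approach: re-read and re-scan the whole debugfs
   file on every polling interval, looking for the per-CPU
   "runnable tasks:" sections. */
static int count_runnable_sections(void)
{
    FILE *f = fopen("/sys/kernel/debug/sched/debug", "r");
    char line[1024];
    int sections = 0;

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f))
        if (strstr(line, "runnable tasks:"))
            sections++;

    fclose(f);
    return sections;
}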

This post covers what I learned along the way: eBPF verifier fights, kernel portability headaches, race conditions, and production bugs.

eBPF and the verifier

Building the queue_track backend meant diving deep into eBPF. The old sched_debug backend had to read and parse text files on every iteration. With BPF, I track tasks in the kernel as events happen and aggregate the results before copying to userspace. The overhead reduction is significant, especially on systems with many CPUs [1].
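
Conceptually, the kernel side keeps its state in BPF maps and userspace only reads the aggregated result. A minimal sketch of what such a map could look like; the names and layout here are illustrative, not stalld's actual definitions:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_QUEUED_TASKS 1024

/* Illustrative per-task entry; stalld's real layout may differ. */
struct queued_task {
    __u32 pid;
    __u32 tgid;
    __u64 enqueue_ts;   /* when the task became runnable */
    __u32 cpu;          /* userspace groups entries per CPU */
    char  comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_QUEUED_TASKS);
    __type(key, __u32);                 /* pid */
    __type(value, struct queued_task);
} queued_tasks SEC(".maps");

char LICENSE[] SEC("license") = "GPL";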

The eBPF verifier presents its own challenges. It exists for good reasons — arbitrary code in the kernel would be a security nightmare. But when it rejects a program, the error messages can be difficult to diagnose.

I encountered this with logging. On some kernel versions, the verifier refused to load the program with “Argument list too long”. The fix was dropping one parameter from the logging macro:

// Original: rejected by verifier on some kernels
log_task_prefix("enqueue", p->comm, p->pid, p->tgid, "rt", cpu);
// Fixed: passes verification
log_task_prefix("enqueue", p->comm, p->pid, p->tgid, cpu);

With eBPF, writing correct code is not enough. The code must also convince the verifier it is correct. And verifier behavior changes between kernel versions.

Then there is the kernel itself changing under you. Data structures evolve. Fields move or disappear entirely. The cpu field in struct task_struct, for example, was folded into thread_info in recent kernels.

This is where BPF CO-RE becomes essential. You define legacy struct variants, use bpf_core_field_exists() to check which fields exist on the running kernel, and adapt accordingly:

struct task_struct___legacy {
    int cpu;
};

static __always_inline int task_cpu(const struct task_struct *p)
{
    const struct task_struct___legacy *legacy = (void *)p;

    if (bpf_core_field_exists(legacy->cpu))
        return BPF_CORE_READ(legacy, cpu);

    return BPF_CORE_READ(p, thread_info.cpu);
}

I used this pattern again for __state vs state when checking if tasks are running. Once you’ve written the adaptation logic, the BPF loader does the rest. Different kernels, same binary.
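
For reference, a sketch of the same trick for the renamed state field (kernels before 5.14 expose task_struct.state, newer ones call it __state); the helper names are illustrative and may differ from stalld's:

struct task_struct___old {
    long state;
} __attribute__((preserve_access_index));

static __always_inline long get_task_state(const struct task_struct *p)
{
    const struct task_struct___old *old = (void *)p;

    /* Old field name on kernels before the rename. */
    if (bpf_core_field_exists(old->state))
        return BPF_CORE_READ(old, state);

    return BPF_CORE_READ(p, __state);
}

static __always_inline bool task_is_running(const struct task_struct *p)
{
    return get_task_state(p) == 0; /* TASK_RUNNING */
}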

Tracepoints

The queue_track backend needed a way to hook into the scheduler and track task state changes. Early on, I used fentry probes on scheduler functions like enqueue_task_rt() and dequeue_task_rt(). This worked in testing, but fentry requires a synchronize_rcu_tasks() call during attachment. On a busy system with a high-priority RT task monopolizing a CPU, that call can block indefinitely.

The result was ironic: the daemon designed to prevent starvation was itself getting starved during startup.

// fentry: requires synchronize_rcu_tasks(), can block
SEC("fentry/enqueue_task_rt")
int BPF_PROG(enqueue_task_rt_enter, struct rq *rq, struct task_struct *p)

// tracepoint: no blocking synchronization required
SEC("tp_btf/sched_wakeup")
int BPF_PROG(handle__sched_wakeup, struct task_struct *p)

Moving to tracepoints resolved the issue. Tracepoints are stable kernel APIs and do not require blocking synchronization. Startup became immediate, even on heavily loaded systems.
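
On the userspace side, attaching the tp_btf programs through a libbpf skeleton is a single non-blocking call. A sketch, assuming a skeleton generated from an object named queue_track.bpf.o (the actual file and function names in stalld may differ):

#include "queue_track.skel.h"   /* assumed name; generated by "bpftool gen skeleton" */

static struct queue_track_bpf *skel;

static int start_queue_track(void)
{
    skel = queue_track_bpf__open_and_load();
    if (!skel)
        return -1;

    /* Attaches every tp_btf program in one call. Unlike the fentry
       approach, this does not block behind synchronize_rcu_tasks(),
       so it returns promptly even with a runaway RT task on an
       isolated CPU. */
    if (queue_track_bpf__attach(skel)) {
        queue_track_bpf__destroy(skel);
        skel = NULL;
        return -1;
    }

    return 0;
}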

However, I then discovered that the task tracking was incomplete. Tasks do more than wake up and sleep: they migrate between CPUs, get created, and exit. Missing any of these events causes stalld to accumulate stale entries (tasks that no longer exist on a given CPU) and lose track of tasks that migrated.

I added handlers for the remaining scheduler events. sched_wakeup_new catches newly created tasks from their first wakeup. sched_migrate_task handles CPU migration — dequeue from the source CPU and re-enqueue on the destination:

SEC("tp_btf/sched_migrate_task")
int BPF_PROG(handle__sched_migrate_task, struct task_struct *p, int dest_cpu)
{
    int src_cpu = task_cpu(p);

    if (dequeue_task(src_cpu, p))
        enqueue_task(dest_cpu, p);

    return 0;
}
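
The new-task and exit handlers follow the same shape. A sketch, reusing the enqueue_task(), dequeue_task(), and task_cpu() helpers from above (the exact tracepoints and handler names in stalld may differ):

SEC("tp_btf/sched_wakeup_new")
int BPF_PROG(handle__sched_wakeup_new, struct task_struct *p)
{
    /* First wakeup of a freshly created task. */
    enqueue_task(task_cpu(p), p);
    return 0;
}

SEC("tp_btf/sched_process_exit")
int BPF_PROG(handle__sched_process_exit, struct task_struct *p)
{
    /* Task is going away: drop it so stalld never tracks a dead PID. */
    dequeue_task(task_cpu(p), p);
    return 0;
}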

The most subtle problem? stalld only tracked tasks that changed state after it started. Anything already on a runqueue was invisible until it rescheduled. The fix was a BPF task iterator that walks all tasks at startup:

SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
    struct task_struct *task = ctx->task;
    if (!task)
        return 0;

    if (task_is_running(task))
        enqueue_task(task_cpu(task), task);

    return 0;
}

Now stalld has a complete system snapshot from the moment it starts.
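
Userspace drives that walk once at startup: attach the iter/task program, create the iterator, and read it to completion. A sketch using libbpf's iterator API, with the program handle coming from the skeleton and error handling trimmed:

#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Run the iter/task program once so already-runnable tasks are captured. */
static int snapshot_existing_tasks(struct bpf_program *dump_task_prog)
{
    struct bpf_link *link;
    char buf[128];
    int iter_fd;

    link = bpf_program__attach_iter(dump_task_prog, NULL);
    if (!link)
        return -1;

    iter_fd = bpf_iter_create(bpf_link__fd(link));
    if (iter_fd < 0) {
        bpf_link__destroy(link);
        return -1;
    }

    /* Reading the fd is what actually walks the tasks and runs dump_task(). */
    while (read(iter_fd, buf, sizeof(buf)) > 0)
        ;

    close(iter_fd);
    bpf_link__destroy(link);
    return 0;
}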

Bug fixes

Some bugs manifest clearly. Consider this scenario: a task starts starving, stalld detects it and attempts to boost it, but the task exits before the boost completes. The get_current_policy() call fails because the task no longer exists. However, stalld was not cleaning up the stale entry, causing it to retry the same dead PID on every iteration:

[stalld] Failed to get policy for PID 12345
[stalld] Failed to get policy for PID 12345
[stalld] Failed to get policy for PID 12345
...

The fix was straightforward once identified. If the syscall fails, the task has likely exited. Remove the entry and continue:

if (get_current_policy(task->pid, &policy) < 0) {
    cleanup_starving_task_info(task);
    continue;
}

if (boost_with_deadline(task->pid, &policy) < 0) {
    cleanup_starving_task_info(task);
    continue;
}

Other bugs were more subtle. The merge_taks_info() function should clear the starving vector and rebuild it on every iteration. However, it only cleared the vector when there were zero new tasks. If new tasks existed but none were starving, the vector retained stale data. stalld would then boost tasks that were no longer starving:

// Bug: only clears when no new tasks
if (new_nr_rt_running == 0) {
    update_cpu_starving_vector(cpu, 0);
}

// Fix: unconditionally clear before rebuilding
update_cpu_starving_vector(cpu, 0);

A one-line fix, but finding it required understanding the state machine.

DL-server integration

Recent kernels include the DL-server, which provides built-in starvation handling. This raises the question: is stalld obsolete?

The answer is no. stalld is now “DL-server aware.” It monitors tasks system-wide but avoids interfering with tasks that DL-server handles. Instead, stalld addresses what DL-server cannot: starving SCHED_FIFO tasks, and starving tasks on CPUs where DL-server was manually disabled. The two mechanisms complement each other.
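
The per-task decision boils down to something like the following sketch. It is not stalld's code: dl_server_active_on_cpu() is a hypothetical stand-in for however the DL-server state is actually detected on a given CPU:

#include <sched.h>
#include <stdbool.h>

/* Hypothetical helper: is the DL-server enabled on this CPU? */
bool dl_server_active_on_cpu(int cpu);

/* Sketch of the "DL-server aware" decision for one starving task. */
static bool should_boost(int policy, int cpu)
{
    /* The DL-server only serves fair tasks; a starving SCHED_FIFO
       task is still stalld's problem. */
    if (policy == SCHED_FIFO)
        return true;

    /* Fair tasks are the DL-server's job, unless it was disabled on
       this CPU; then stalld steps back in. */
    if (!dl_server_active_on_cpu(cpu))
        return true;

    return false;
}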

My initial approach was conservative. If the kernel has DL-server, stalld would switch to log-only mode automatically — observe and log, but do not boost. This avoided potential conflicts and deferred to the built-in solution. Later, I refined this to the complementary approach described above.

Summary

The migration from sched_debug to queue_track is complete. Instead of parsing debugfs files, stalld now tracks tasks in real time with BPF. Lower overhead, better data, and better scaling on systems with many CPUs.

Multiple bugs were resolved along the way, along with four major eBPF improvements: the tracepoint migration, the startup task iterator, CO-RE portability, and verifier compatibility.

[1] Tested on kernels 5.14 and later. The queue_track backend requires BPF CO-RE support, available in kernels 5.2+ with appropriate libbpf versions.
