Modernizing stalld with eBPF
• kernel
Last year I started working on
stalld, a daemon
that prevents thread starvation in real-time Linux systems. On systems with
isolated CPUs running high-priority RT tasks, essential kernel threads can
starve indefinitely. stalld monitors CPU run queues and temporarily boosts
starving threads using SCHED_DEADLINE (or SCHED_FIFO as fallback).
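For context, a SCHED_DEADLINE boost is a sched_setattr(2) call granting the starving task a small runtime budget inside a period. Here is a minimal sketch of the mechanism; the helper name and the runtime/period values are mine, not stalld's actual defaults:
/* Minimal sketch of a SCHED_DEADLINE boost via sched_setattr(2).
 * glibc has no wrapper, so the struct and raw syscall are spelled out;
 * the runtime/period values below are illustrative, not stalld's defaults. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* CPU time granted per period, in ns */
    uint64_t sched_deadline;  /* relative deadline, in ns */
    uint64_t sched_period;    /* period, in ns */
};

static int boost_pid_deadline(pid_t pid)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  = 20 * 1000;      /* 20 us of CPU time... */
    attr.sched_deadline = 1000 * 1000;    /* ...every 1 ms */
    attr.sched_period   = 1000 * 1000;

    return syscall(SYS_sched_setattr, pid, &attr, 0);
}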
The main goal was switching from the old sched_debug backend to a new
queue_track backend. The old approach reads /sys/kernel/debug/sched/debug
to get task information — file I/O, parsing, relatively high overhead. The
new approach uses BPF programs to track tasks in real-time. Lower overhead,
better precision, but a complete rewrite of the core monitoring logic.
This post covers what I learned along the way: eBPF verifier fights, kernel
portability headaches, race conditions, and production bugs.
eBPF and the verifier
Building the queue_track backend meant diving deep into eBPF. The old
sched_debug backend had to read and parse text files on every iteration.
With BPF, I track tasks in the kernel as events happen and aggregate the
results before copying to userspace. The overhead reduction is significant,
especially on systems with many CPUs [1].
The eBPF verifier presents its own challenges. It exists for good reasons —
arbitrary code in the kernel would be a security nightmare. But when it rejects
a program, the error messages can be difficult to diagnose.
I encountered this with logging. On some kernel versions, the verifier rejected the program with “Argument list too long” (the E2BIG errno). The fix was dropping one parameter from the logging macro:
// Original: rejected by verifier on some kernels
log_task_prefix("enqueue", p->comm, p->pid, p->tgid, "rt", cpu);

// Fixed: passes verification
log_task_prefix("enqueue", p->comm, p->pid, p->tgid, cpu);
With eBPF, writing correct code is not enough. The code must also convince
the verifier it is correct. And verifier behavior changes between kernel
versions.
Then there is the kernel itself changing under you. Data structures evolve.
Fields move or disappear entirely. The cpu field in struct task_struct
was removed in recent kernels.
This is where BPF CO-RE
becomes essential. You define legacy struct variants, use bpf_core_field_exists()
to check which fields exist on the running kernel (the check is resolved when the program is loaded), and adapt accordingly:
/* Flavored variant: older kernels kept a cpu field directly in task_struct */
struct task_struct___legacy {
    int cpu;
};

static __always_inline int task_cpu(const struct task_struct *p)
{
    const struct task_struct___legacy *legacy = (void *)p;

    /* Resolved by CO-RE against the running kernel's BTF */
    if (bpf_core_field_exists(legacy->cpu))
        return BPF_CORE_READ(legacy, cpu);

    /* Newer kernels keep the CPU number in the embedded thread_info */
    return BPF_CORE_READ(p, thread_info.cpu);
}
I used this pattern again for __state vs state when checking if tasks
are running. Once you’ve written the adaptation logic, the BPF loader does
the rest. Different kernels, same binary.
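For completeness, here is roughly what that second adaptation can look like. This is a sketch rather than stalld's exact helper: it assumes the surrounding BPF code already includes vmlinux.h, bpf_helpers.h and bpf_core_read.h, and relies on the field having been renamed from state to __state in v5.14, with 0 still meaning TASK_RUNNING:
/* Flavored variant for kernels where the field is still called "state" */
struct task_struct___pre514 {
    long state;
};

static __always_inline bool task_is_running(const struct task_struct *p)
{
    const struct task_struct___pre514 *old = (void *)p;
    long state;

    if (bpf_core_field_exists(old->state))
        state = BPF_CORE_READ(old, state);
    else
        state = BPF_CORE_READ(p, __state);

    return state == 0; /* TASK_RUNNING */
}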
Tracepoints
The queue_track backend needed a way to hook into the scheduler and track
task state changes. Early on, I used
fentry
probes on scheduler functions like enqueue_task_rt() and dequeue_task_rt().
This worked in testing, but fentry requires a synchronize_rcu_tasks() call
during attachment. On a busy system with a high-priority RT task
monopolizing a CPU, that call can block indefinitely.
The result was ironic: the daemon designed to prevent starvation was itself getting starved during startup.
// fentry: requires synchronize_rcu_tasks(), can block
SEC("fentry/enqueue_task_rt")
int BPF_PROG(enqueue_task_rt_enter, struct rq *rq, struct task_struct *p)

// tracepoint: no blocking synchronization required
SEC("tp_btf/sched_wakeup")
int BPF_PROG(handle__sched_wakeup, struct task_struct *p)
Moving to tracepoints resolved the issue. Tracepoints are stable kernel APIs and do not require blocking synchronization. Startup became immediate, even on heavily loaded systems.
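For reference, the userspace side is the usual libbpf open/load/attach sequence; nothing about tp_btf programs changes it. A sketch with plain libbpf calls (the object file name is made up, and stalld's real loader code may differ):
#include <stddef.h>
#include <bpf/libbpf.h>

static struct bpf_object *load_and_attach(void)
{
    struct bpf_object *obj;
    struct bpf_program *prog;

    obj = bpf_object__open_file("queue_track.bpf.o", NULL);
    if (!obj)
        return NULL;

    if (bpf_object__load(obj)) {
        bpf_object__close(obj);
        return NULL;
    }

    /* Attaching tp_btf programs does not go through the fentry
     * trampoline path, so this returns promptly even under RT load */
    bpf_object__for_each_program(prog, obj) {
        if (!bpf_program__attach(prog)) {
            bpf_object__close(obj);
            return NULL;
        }
    }

    return obj;
}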
However, I then discovered incomplete task tracking. Tasks do more than wake
up and sleep — they migrate between CPUs, get created, and exit. Missing any
of these events causes stalld to accumulate stale entries (tasks that no
longer exist on a given CPU) and lose track of tasks that migrated.
I added handlers for the remaining scheduler events. sched_wakeup_new catches
newly created tasks from their first wakeup. sched_migrate_task handles CPU
migration — dequeue from the source CPU and re-enqueue on the destination:
SEC("tp_btf/sched_migrate_task")
int BPF_PROG(handle__sched_migrate_task, struct task_struct *p, int dest_cpu)
{
int src_cpu = task_cpu(p);
if (dequeue_task(src_cpu, p))
enqueue_task(dest_cpu, p);
return 0;
}
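The enqueue_task() and dequeue_task() helpers above are the BPF-side bookkeeping. Their real layout in stalld may differ; a minimal sketch, assuming a single hash map keyed by (cpu, pid) and the usual vmlinux.h/bpf_helpers.h/bpf_core_read.h includes:
struct queued_task_key {
    __u32 cpu;
    __u32 pid;
};

struct queued_task_info {
    __u64 enqueued_at_ns;   /* when the task became runnable */
    char  comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct queued_task_key);
    __type(value, struct queued_task_info);
} queued_tasks SEC(".maps");

static __always_inline int enqueue_task(int cpu, struct task_struct *p)
{
    struct queued_task_key key = { .cpu = cpu, .pid = BPF_CORE_READ(p, pid) };
    struct queued_task_info info = { .enqueued_at_ns = bpf_ktime_get_ns() };

    BPF_CORE_READ_STR_INTO(&info.comm, p, comm);
    return bpf_map_update_elem(&queued_tasks, &key, &info, BPF_ANY);
}

static __always_inline int dequeue_task(int cpu, struct task_struct *p)
{
    struct queued_task_key key = { .cpu = cpu, .pid = BPF_CORE_READ(p, pid) };

    /* Non-zero when the task really was tracked on this CPU */
    return bpf_map_delete_elem(&queued_tasks, &key) == 0;
}
In this sketch, userspace could walk the map and flag any entry whose enqueue timestamp is older than the starvation threshold; stalld's actual aggregation is more involved.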
The most subtle problem? stalld only tracked tasks that changed state after
it started. Anything already on a runqueue was invisible until it rescheduled.
The fix was a BPF task iterator that walks all tasks at startup:
SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
struct task_struct *task = ctx->task;
if (!task)
return 0;
if (task_is_running(task))
enqueue_task(task_cpu(task), task);
return 0;
}
Now stalld has a complete system snapshot from the moment it starts.
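Driving the iterator from userspace is just a read on the iterator file descriptor; since the program writes nothing to the seq file, a single pass runs dump_task() over every task. A sketch with plain libbpf calls (the function and variable names are mine):
#include <unistd.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

static int run_task_snapshot(struct bpf_program *dump_task_prog)
{
    struct bpf_link *link;
    char buf[256];
    int iter_fd;

    link = bpf_program__attach_iter(dump_task_prog, NULL);
    if (!link)
        return -1;

    iter_fd = bpf_iter_create(bpf_link__fd(link));
    if (iter_fd < 0) {
        bpf_link__destroy(link);
        return -1;
    }

    /* Reading the fd runs dump_task() for every task in the system */
    while (read(iter_fd, buf, sizeof(buf)) > 0)
        ;

    close(iter_fd);
    bpf_link__destroy(link);
    return 0;
}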
Bug fixes
Some bugs manifest clearly. Consider this scenario: a task starts starving,
stalld detects it and attempts to boost it, but the task exits before the
boost completes. The get_current_policy() call fails because the task
no longer exists. However, stalld was not cleaning up the stale entry,
causing it to retry the same dead PID on every iteration:
[stalld] Failed to get policy for PID 12345
[stalld] Failed to get policy for PID 12345
[stalld] Failed to get policy for PID 12345
...
The fix was straightforward once identified. If the call fails, the task has likely exited. Remove the entry and continue:
if (get_current_policy(task->pid, &policy) < 0) {
    cleanup_starving_task_info(task);
    continue;
}

if (boost_with_deadline(task->pid, &policy) < 0) {
    cleanup_starving_task_info(task);
    continue;
}
Other bugs were more subtle. The merge_taks_info() function should clear
the starving vector and rebuild it on every iteration. However, it only
cleared the vector when there were zero new tasks. If new tasks existed but
none were starving, the vector retained stale data. stalld would then
boost tasks that were no longer starving:
// Bug: only clears when no new tasks
if (new_nr_rt_running == 0) {
    update_cpu_starving_vector(cpu, 0);
}

// Fix: unconditionally clear before rebuilding
update_cpu_starving_vector(cpu, 0);
A one-line fix, but finding it required understanding the state machine.
DL-server integration
Recent kernels include the DL-server, which provides built-in starvation
handling. This raises the question: is stalld obsolete?
The answer is no. stalld is now “DL-server aware.” It monitors tasks
system-wide but avoids interfering with tasks that DL-server handles.
Instead, stalld addresses what DL-server cannot: starving SCHED_FIFO
tasks, and starving tasks on CPUs where DL-server was manually disabled.
The two mechanisms complement each other.
My initial approach was conservative. If the kernel has DL-server, stalld
would switch to log-only mode automatically — observe and log, but do not
boost. This avoided potential conflicts and deferred to the built-in solution.
Later, I refined this to the complementary approach described above.
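Condensed into code, the resulting policy is small. This is not stalld's actual implementation, just the decision described above with the relevant inputs passed in as parameters:
#include <sched.h>
#include <stdbool.h>

/* Should stalld boost this starving task, or leave it to the DL-server? */
static bool should_boost(int policy, bool dl_server_active_on_cpu)
{
    /* The DL-server only serves the fair class; a starving SCHED_FIFO
     * task still needs stalld */
    if (policy == SCHED_FIFO)
        return true;

    /* Otherwise, only step in where the DL-server is absent or disabled */
    return !dl_server_active_on_cpu;
}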
Summary
The migration from sched_debug to queue_track is complete. Instead of
parsing debugfs files, stalld now tracks tasks in real-time with BPF.
Lower overhead, better data, and it scales better on systems with lots of CPUs.
Multiple bugs were resolved along the way, plus four major eBPF improvements: the tracepoint migration, the task iterator, CO-RE portability, and verifier compatibility.
Going deeper
[1] Tested on kernels 5.14 and later. The queue_track backend requires BPF CO-RE support, available in kernels 5.2+ with appropriate libbpf versions.