Wander Lairson Costa

Deep Dive: Modernizing stalld


If you’ve worked with real-time Linux systems — especially DPDK deployments where you’ve got isolated CPUs running single busy-loop RT tasks — you know this problem well: kernel threads starve. A high-priority RT task sits on a CPU, doing its thing, and meanwhile essential system threads never get scheduled. Eventually things start breaking. Sometimes you get degradation, sometimes you get hangs.

That’s where stalld comes in. It’s a daemon that watches CPU run queues, spots threads that are starving, and gives them a temporary boost using SCHED_DEADLINE (or SCHED_FIFO if that’s not available). For production real-time systems, it’s critical infrastructure. Which means it needs to actually work, be efficient, and not break when you upgrade your kernel.
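
To make the boost concrete, here is a minimal sketch of what a temporary SCHED_DEADLINE boost looks like at the syscall level. This is not stalld's actual code: the runtime/period numbers are purely illustrative, and the struct is defined locally because glibc ships no sched_setattr() wrapper.

#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* glibc does not expose struct sched_attr; define the fields we need
 * (layout matches the kernel UAPI, 48 bytes). */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Illustrative boost: reserve 20us of CPU every 1ms for the starving
 * thread. stalld computes its own parameters and later restores the
 * original policy; the numbers here are just for the example. */
static int boost_thread_deadline(pid_t tid)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  = 20 * 1000ULL,     /* 20us, in ns */
        .sched_deadline = 1000 * 1000ULL,   /* 1ms */
        .sched_period   = 1000 * 1000ULL,   /* 1ms */
    };

    /* No glibc wrapper for sched_setattr; use the raw syscall. */
    return syscall(SYS_sched_setattr, tid, &attr, 0);
}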

Over the past year, we’ve made big changes to stalld, focused on one main goal: switching from the old sched_debug backend to a new queue_track backend. The old approach reads /sys/kernel/debug/sched/debug to get task information — file I/O, parsing, relatively high overhead. The new approach uses BPF programs to track tasks in real-time. Lower overhead, better precision, but a complete rewrite of the core monitoring logic.

eBPF verifier fights, kernel portability headaches, race conditions, production bugs — you name it, we hit it. Here’s what we learned.

eBPF in Production

Building the queue_track backend meant diving deep into eBPF. Why BPF instead of just reading debugfs? Performance. The old sched_debug backend had to read and parse /sys/kernel/debug/sched/debug every time it wanted task information. File I/O, text parsing, high overhead. With BPF, we track tasks in the kernel as events happen and just read the results. Much lower overhead, especially on systems with lots of CPUs.
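
To give a feel for the shape of that tracking, here is a rough sketch of the kind of BPF map and helpers involved. This is my own simplification, not stalld's actual layout: the map name, the queued_task struct, and the enqueue_task()/dequeue_task() helpers (which the snippets below also reference) are illustrative.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Hypothetical per-task record: which CPU the task is queued on and
 * when it was enqueued, so userspace can compute how long it has waited. */
struct queued_task {
    int   cpu;
    __u64 enqueued_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);                 /* pid */
    __type(value, struct queued_task);
} queued_tasks SEC(".maps");

static __always_inline int enqueue_task(int cpu, struct task_struct *p)
{
    __u32 pid = BPF_CORE_READ(p, pid);
    struct queued_task info = {
        .cpu         = cpu,
        .enqueued_ns = bpf_ktime_get_ns(),
    };

    return bpf_map_update_elem(&queued_tasks, &pid, &info, BPF_ANY);
}

/* Returns true if the task was tracked on @cpu and has been removed. */
static __always_inline bool dequeue_task(int cpu, struct task_struct *p)
{
    __u32 pid = BPF_CORE_READ(p, pid);
    struct queued_task *info = bpf_map_lookup_elem(&queued_tasks, &pid);

    if (!info || info->cpu != cpu)
        return false;

    bpf_map_delete_elem(&queued_tasks, &pid);
    return true;
}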

But here’s the thing: the eBPF verifier is frustrating. Don’t get me wrong, it’s there for good reasons — you really don’t want arbitrary code running in your kernel. But when it rejects your program and you’re trying to figure out why, it’s maddening.

We hit this with our logging. On some kernel versions, the verifier just refused to load the program. “Argument list too long” it said. We were passing too many arguments to our logging macro. The fix? Drop one parameter. That’s it. But it taught us something important: with eBPF, you’re not just writing correct code. You’re writing code that can convince the verifier it’s correct. And the verifier changes between kernel versions.

// Before: Too many arguments for some verifiers
log_task_prefix("enqueue", p->comm, p->pid, p->tgid, "rt", cpu);

// After: Streamlined to pass verification
log_task_prefix("enqueue", p->comm, p->pid, p->tgid, cpu);

Then there’s the kernel itself changing under you. Data structures evolve. Fields move around or disappear entirely. The cpu field in struct task_struct? Gone in modern kernels. Build fails. Great.

This is where BPF CO-RE saves your life. You define legacy struct variants, use bpf_core_field_exists() to check at runtime what’s actually there, and adapt. Try the modern location first, fall back to the legacy field if needed.

// Define a "legacy" task_struct variant for older kernels
struct task_struct___legacy {
    int cpu;
};

// Runtime adaptation: check if field exists
static __always_inline int task_cpu(const struct task_struct *p)
{
    const struct task_struct___legacy *legacy = (void *)p;

    // Try the legacy field first
    if (bpf_core_field_exists(legacy->cpu))
        return BPF_CORE_READ(legacy, cpu);

    // Use the modern method
    return BPF_CORE_READ(p, thread_info.cpu);
}

We used this pattern again for __state vs state when checking if tasks are running. Once you’ve written the adaptation logic, the BPF loader does the rest. Different kernels, same binary. It actually works.
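
For reference, here is roughly what that second adaptation can look like. This is a sketch, not necessarily stalld's exact code: the task_struct___old flavor name is mine, and it is also the kind of helper the task iterator further down relies on.

// Older kernels call the run-state field "state"; newer ones renamed it
// to "__state". Same trick: define a variant and probe at runtime.
struct task_struct___old {
    long state;
};

static __always_inline bool task_is_running(struct task_struct *p)
{
    const struct task_struct___old *old = (void *)p;
    long state;

    // Prefer the legacy field if this kernel still has it
    if (bpf_core_field_exists(old->state))
        state = BPF_CORE_READ(old, state);
    else
        state = BPF_CORE_READ(p, __state);

    return state == 0;  /* TASK_RUNNING */
}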

Tracepoint Migration

The queue_track backend needed a way to hook into the scheduler and track task state changes. Early on, we used fentry probes on scheduler functions like enqueue_task_rt() and dequeue_task_rt(). Worked fine in testing. But there was a problem: fentry needs a synchronize_rcu_tasks() call during attachment. On a busy system with a high-priority RT task monopolizing a CPU, that call can hang. Forever.

So the daemon designed to prevent starvation was getting starved during startup. Yeah, the irony wasn’t lost on us.

// Old: fentry probes (blocking startup)
SEC("fentry/enqueue_task_rt")
int BPF_PROG(enqueue_task_rt_enter, struct rq *rq, struct task_struct *p)

// New: tracepoints (non-blocking)
SEC("tp_btf/sched_wakeup")
int BPF_PROG(handle__sched_wakeup, struct task_struct *p)

Moving to tracepoints fixed it. They are stable kernel APIs and don’t need that blocking synchronization. Startup became instant, even on loaded systems.

But then we realized we weren’t tracking the full picture. Tasks do more than just wake up and sleep. They migrate between CPUs. They get created. They exit. Miss any of these events and stalld’s view of the system is incomplete. You get ghost tasks. You lose tracking of real tasks.

So we added handlers for everything. sched_wakeup_new catches newly created tasks right from their first wakeup. sched_migrate_task became critical — when a task moves CPUs, we dequeue it from the source and re-enqueue on the destination.

SEC("tp_btf/sched_migrate_task")
int BPF_PROG(handle__sched_migrate_task, struct task_struct *p, int dest_cpu)
{
    int src_cpu = task_cpu(p);

    // Dequeue from source CPU
    if (dequeue_task(src_cpu, p)) {
        // Only enqueue on dest if it was tracked
        enqueue_task(dest_cpu, p);
    }
    return 0;
}
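
The handler for brand-new tasks is simpler. A sketch of what it can look like, reusing the same enqueue helper (again, not necessarily stalld's exact code):

SEC("tp_btf/sched_wakeup_new")
int BPF_PROG(handle__sched_wakeup_new, struct task_struct *p)
{
    // A freshly forked task hits this tracepoint on its first wakeup;
    // start tracking it on the CPU it was placed on.
    enqueue_task(task_cpu(p), p);
    return 0;
}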

The most subtle problem? stalld only tracked tasks that changed state after it started. Anything already on a runqueue was invisible until it rescheduled. The fix was a BPF task iterator that walks all tasks at startup.

SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
    struct task_struct *task = ctx->task;
    if (!task)
        return 0;

    // Add already-running tasks to our tracking
    if (task_is_running(task))
        enqueue_task(task_cpu(task), task);

    return 0;
}
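
One detail worth spelling out: a task iterator only runs when userspace reads from it, so at startup the daemon attaches the iterator and drains it once. Roughly, with libbpf (the skeleton type and program name here are illustrative, not stalld's actual symbols):

// Hypothetical helper: attach the iterator and read it once so that
// dump_task() runs for every task currently in the system.
static int drain_task_iterator(struct stalld_bpf *skel)
{
    struct bpf_link *link;
    char buf[256];
    int iter_fd;

    link = bpf_program__attach_iter(skel->progs.dump_task, NULL);
    if (!link)
        return -1;

    iter_fd = bpf_iter_create(bpf_link__fd(link));
    if (iter_fd < 0) {
        bpf_link__destroy(link);
        return -1;
    }

    // Each read() drives the iterator forward; we discard the output,
    // the side effect of populating the map is what we want.
    while (read(iter_fd, buf, sizeof(buf)) > 0)
        ;

    close(iter_fd);
    bpf_link__destroy(link);
    return 0;
}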

Now stalld has a complete system snapshot from the moment it starts. No more blind spots.

Bug Hunting: The Devil in the Details

Some bugs are loud. Here’s a fun one: a task starts starving, stalld detects it, goes to boost it, but the task exits right before we can act. The get_current_policy() call fails — the task is dead. But stalld didn’t clean up the entry. Next iteration? Try the same dead PID again. And again. And again.

The logs just filled up:

[stalld] Failed to get policy for PID 12345
[stalld] Failed to get policy for PID 12345
[stalld] Failed to get policy for PID 12345
...

Forever. Every iteration. Every dead task we’d ever detected.

The fix was obvious once you saw it. If the syscall fails, the task is probably dead. Clean it up. Move on.

// If we can't get the policy, task is probably dead
if (get_current_policy(task->pid, &policy) < 0) {
    cleanup_starving_task_info(task);
    continue;  // Don't retry
}

// Same for boost failure
if (boost_with_deadline(task->pid, &policy) < 0) {
    cleanup_starving_task_info(task);
    continue;
}

Other bugs were sneakier. The merge_taks_info() function is supposed to clear the starving vector and rebuild it every iteration. But it only cleared the vector if there were zero new tasks. If you had new tasks but none of them were actually starving? The vector didn’t get cleared. Stale data stuck around. stalld would boost tasks that weren’t starving anymore.

// Before: Only reset if no new tasks
if (new_nr_rt_running == 0) {
    update_cpu_starving_vector(cpu, 0);  // Clear
}

// After: Always reset before computing new state
update_cpu_starving_vector(cpu, 0);  // Unconditional clear

A one-line fix, but finding it meant really understanding the state machine. In RT systems, there’s no such thing as a small bug.

The DL-Server

Recent kernels added the DL-server — a built-in mechanism that reserves a small slice of CPU time for fair-class (SCHED_OTHER) tasks even when an RT task would otherwise monopolize the CPU. So is stalld obsolete now? Not at all.

stalld is “DL-server aware.” It monitors tasks system-wide but doesn’t touch the starving tasks that DL-server handles. Instead, stalld focuses on what DL-server can’t: starving SCHED_FIFO tasks, and starving tasks on CPUs where DL-server was manually disabled. The two mechanisms complement each other rather than conflict.
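
As a rough illustration of what “DL-server aware” can mean in practice: the kernel exposes per-CPU DL-server knobs under debugfs, and a runtime of zero means the server is disabled on that CPU. The debugfs path below is my assumption about where recent kernels put those knobs, not something taken from stalld's source; the point is just the shape of the check.

#include <stdbool.h>
#include <stdio.h>

// Hedged sketch: treat the DL-server as active on a CPU if its per-CPU
// runtime knob exists and is non-zero. The path is an assumption.
static bool dl_server_active(int cpu)
{
    char path[128];
    long long runtime = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/kernel/debug/sched/fair_server/cpu%d/runtime", cpu);

    f = fopen(path, "r");
    if (!f)
        return false;   /* no knob: kernel likely has no DL-server */

    if (fscanf(f, "%lld", &runtime) != 1)
        runtime = 0;
    fclose(f);

    return runtime > 0;
}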

But when we first implemented DL-server detection, we weren’t sure how stalld should behave.

Our first instinct was to be defensive. If the kernel has DL-server, stalld would switch itself to log-only mode automatically. Just watch and log, don’t do any boosting. Seemed reasonable — avoid any potential conflicts, let the built-in solution do its thing. We later moved to the DL-server-aware approach described above.

Impact and Lessons Learned

The migration from sched_debug to queue_track is complete and on its way to production. Instead of parsing debugfs files, we’re now tracking tasks in real time with BPF. Lower overhead, better data, and it scales better on systems with lots of CPUs.

Multiple bugs resolved along the way. Four major eBPF improvements — tracepoint migration, task iterator, CO-RE portability, verifier compatibility. Every change made things more reliable without breaking what already worked.

Debugging is detective work. The symptoms are clues, but the real problem is usually one level deeper than where you’re looking. The error message tells you what broke, not why. The stack trace shows where the code was, not where it went wrong. You have to work backwards, question everything, and be ready to find out the problem is somewhere completely different than you thought.

stalld is infrastructure. When it works, nobody notices. When it fails, systems break. Every bug fix, every edge case handled, every optimization — it all matters. And honestly, that’s what makes systems programming rewarding. Somewhere out there, a DPDK deployment is running smoothly because of a one-line fix made on some random Tuesday. That’s pretty cool.

