Wander Lairson Costa

Improving the Kernel TTY throughput

• kernel

It is incredible how we can find room for improvement even in the oldest and most battle-tested codebases out there. This post is about one of those cases.

I was investigating a soft lockup bug report that only happened on an HP ProLiant DL380 Gen9. The trigger for the issue is loading the scsi_debug module with a few hundred virtual devices:

$ modprobe scsi_debug virtual_gb=1 add_host=2 num_tgts=600
Message from syslogd@storageqe-25 at Jul 15 10:29:35 ...
 kernel:watchdog: BUG: soft lockup - CPU#33 stuck for 22s! [modprobe:35056]

Message from syslogd@storageqe-25 at Jul 15 10:30:23 ...
 kernel:watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [migration/25:140]

Message from syslogd@storageqe-25 at Jul 15 10:30:23 ...
 kernel:watchdog: BUG: soft lockup - CPU#18 stuck for 23s! [modprobe:35056]

Below is the corresponding dmesg output:

watchdog: BUG: soft lockup - CPU#33 stuck for 22s! [modprobe:35056]
rcu: 	14-...!: (1 GPs behind) idle=24a/1/0x4000000000000002 softirq=12191/12192 fqs=5030
Modules linked in:
rcu: 	33-...!: (20094 ticks this GP) idle=f9e/1/0x4000000000000000 softirq=8891/8891 fqs=5031
 scsi_debug
	(detected by 14, t=80246 jiffies, g=346621, q=270790)
 loop
NMI backtrace for cpu 14
 dm_service_time
CPU: 14 PID: 0 Comm: swapper/14 Tainted: G               X --------- ---
 dm_multipath
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 02/17/2017
 rfkill
Call Trace:
 intel_rapl_msr
 <IRQ>
 intel_rapl_common sb_edac
 dump_stack+0x64/0x7c
...
...
CPU: 18 PID: 35056 Comm: modprobe Tainted: G             L X --------- ---
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 02/17/2017
RIP: 0010:number+0x21b/0x340
Code: 4c 8d 44 28 01 48 39 c3 76 03 c6 00 20 48 ...
RSP: 0018:ffffb03ec3650b98 EFLAGS: 00000046
RAX: 0000000000000000 RBX: ffffb03f43650cff RCX: 0000000000000038
RDX: ffffb03ec3650baf RSI: 0000000000000001 RDI: 0000000000000000
RBP: 00000000fffffffd R08: ffffb03ec3650d06 R09: 0000000000000004
R10: ffffb03ec3650c98 R11: ffffb03ec3650d06 R12: 0000000000000020
R13: 0000000000000000 R14: 0000000000004a0e R15: 0000000000000a00
FS:  00007f519cfb1740(0000) GS:ffff9b096fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a8b65190b8 CR3: 0000000550bec004 CR4: 00000000001706e0
Call Trace:
 <IRQ>
 vsnprintf+0x363/0x560
 sprintf+0x56/0x70
 info_print_prefix+0x7b/0xd0
 record_print_text+0x52/0x150
 console_unlock+0x144/0x330
 vprintk_emit+0x14d/0x230
 printk+0x58/0x6f
 print_modules+0x62/0xaf
 watchdog_timer_fn+0x190/0x200
 ? lockup_detector_update_enable+0x50/0x50
 __hrtimer_run_queues+0x12a/0x270
 hrtimer_interrupt+0x110/0x2c0
 __sysvec_apic_timer_interrupt+0x5f/0xd0
 sysvec_apic_timer_interrupt+0x6d/0x90
 </IRQ>

Notice printk() and, in particular, console_unlock() in the stack trace. This is not a coincidence: they showed up every single time.

I found out that scsi_debug sends a lot of information to the console output, causing the kernel to spend most of its time flushing data to the serial TTY. vprintk_emit() disables preemption before calling console_unlock()[1]. Because preemption is disabled and the serial TTY driver holds a spinlock while writing to the port, the RCU stall detector reports a stall.
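The relevant part of that path can be sketched as follows. This is a heavily simplified illustration, not the actual code in kernel/printk/printk.c (the real implementation uses a spinning trylock variant and can hand console ownership over to another CPU):

/* Heavily simplified sketch of what vprintk_emit() does after storing
 * the message in the ring buffer; not the actual kernel source. */
#include <linux/console.h>
#include <linux/preempt.h>

static void printk_flush_sketch(void)
{
        preempt_disable();          /* don't get scheduled out while printing  */
        if (console_trylock())      /* become the console owner...             */
                console_unlock();   /* ...and flush the whole ring buffer to
                                     * every registered console, the slow
                                     * serial one included                     */
        preempt_enable();
}

A slow serial console causing an RCU stall is well known and documented: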

"Booting Linux using a console connection that is too slow to keep up with the boot-time
console-message rate. For example, a 115Kbaud serial console can be way too slow to keep
up with boot-time message rates and will frequently result in RCU CPU stall warning
messages. Especially if you have added debug printk()s."

In the hrtimer interrupt handler, the kernel calls check_cpu_stall(), which calls printk() to dump the stall information. At this point, the interrupt handler becomes the console owner and keeps flushing the remaining data to the serial console[2]. The CPU then gets stuck in the interrupt routine, which triggers a soft lockup.

I suspected there was something odd with the serial port. It is configured for 115200 bps with 8 data bits. Counting one start bit and one stop bit, each byte costs 10 bits on the wire, which gives an expected throughput of 115200 / 10 = 11520 bytes per second. Let's use bpftrace while loading the scsi_debug driver to verify whether that is indeed the case:

$ bpftrace -e '
BEGIN {
	@ = 0;
}

kfunc:serial8250_console_putchar {
	@++;
}

interval:s:1 {
	printf("%d B/s\n", @);
	@ = 0;
}'
Attaching 3 probes...
2522 B/s
2517 B/s
2518 B/s
2507 B/s
2510 B/s
2414 B/s
...

serial8250_console_putchar() is the function that actually writes a character to the serial port; the script counts how many times it is called per second.

Maybe the tracing machinery itself adds overhead and hurts the serial performance? Let's try another approach:

$ bpftrace -e '
BEGIN {
	@ = (uint32) 0;
}

kfunc:uart_console_write {
	@ += args->count;
}

interval:s:1 {
	printf("%u B/s\n", @);
	@ = 0;
}'
Attaching 3 probes...
1838 B/s
2460 B/s
2536 B/s
2415 B/s
2461 B/s
2247 B/s
2521 B/s
2441 B/s
2409 B/s
2378 B/s
2353 B/s
2398 B/s
2376 B/s
2414 B/s
...

This approach has less overhead, since the probe no longer fires for every single character sent. Still, the throughput is not what we would expect.

The scripts above report the overall system throughput. The next step is to determine whether the suboptimal performance comes from system overhead or from the serial TTY code itself. The following bpftrace script measures the throughput of the serial console driver more precisely:

$ bpftrace -e '
kfunc:uart_console_write {
  @c[cpu] = args->count;
  @t[cpu] = nsecs;
}

kretfunc:uart_console_write {
  $elapsed = nsecs - @t[cpu];
  $rate = @c[cpu] * 1000000000 / $elapsed;
  printf("%ld B/s\n", $rate);
}'
Attaching 2 probes...
2206 B/s
2502 B/s
2585 B/s
2553 B/s
2559 B/s
2537 B/s
2555 B/s
2557 B/s
2554 B/s
2557 B/s
2561 B/s
2542 B/s
2534 B/s
2255 B/s
2555 B/s
2482 B/s
...

No luck. To make my tests easier, I created a small driver plus a userspace app that call the console driver the same way printk() does. After loading the serco driver, run sertest -n <N> <device path> to send N bytes to the console; a rough sketch of the driver's write path is shown below.
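I will not reproduce the full serco source here, but a minimal sketch of the idea, assuming a misc character device whose write handler hands the user buffer straight to the registered console's ->write() hook (the same callback the printk() path ends up invoking), could look like this. Locking and most error handling are elided:

/* Rough sketch only, not the actual serco driver: a misc device whose
 * write() handler pushes the user buffer to every enabled console's
 * ->write() hook, i.e. the same callback console_unlock() invokes for
 * printk() messages. */
#include <linux/console.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/uaccess.h>

static ssize_t serco_write(struct file *file, const char __user *ubuf,
                           size_t count, loff_t *ppos)
{
        char kbuf[256];
        struct console *con;

        if (count > sizeof(kbuf))
                count = sizeof(kbuf);
        if (copy_from_user(kbuf, ubuf, count))
                return -EFAULT;

        /* Feed the data to every enabled console, like the printk() path. */
        for_each_console(con)
                if ((con->flags & CON_ENABLED) && con->write)
                        con->write(con, kbuf, count);

        return count;
}

static const struct file_operations serco_fops = {
        .owner = THIS_MODULE,
        .write = serco_write,
};

static struct miscdevice serco_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "serco",
        .fops  = &serco_fops,
};

module_misc_device(serco_dev);
MODULE_LICENSE("GPL");

Let's test it: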

$ time ./sertest -n 2500 /dev/serco

real    0m0.997s
user    0m0.000s
sys     0m0.997s

Ok, 2500 bytes sent in approximately 1 second. That matches our previous results. With the help of the function tracer, we can see what is going on in more detail:

$ trace-cmd record -p function_graph -g serial8250_console_write \
   ./sertest -n 1 /dev/serco

$ trace-cmd report

            |  serial8250_console_write() {
 0.384 us   |    _raw_spin_lock_irqsave();
 1.836 us   |    io_serial_in();
 1.667 us   |    io_serial_out();
            |    uart_console_write() {
            |      serial8250_console_putchar() {
            |        wait_for_xmitr() {
 1.870 us   |          io_serial_in();
 2.238 us   |        }
 1.737 us   |        io_serial_out();
 4.318 us   |      }
 4.675 us   |    }
            |    wait_for_xmitr() {
 1.635 us   |      io_serial_in();
            |      __const_udelay() {
 1.125 us   |        delay_tsc();
 1.429 us   |      }
...
...
...
 1.683 us   |      io_serial_in();
            |      __const_udelay() {
 1.248 us   |        delay_tsc();
 1.486 us   |      }
 1.671 us   |      io_serial_in();
 411.342 us |    }

So it takes 411 us to send one single byte. serial8250_console_write() is the function responsible for dispatching the TTY data to the serial port (it calls uart_console_write()). It writes a character and then waits for the chip to process it before sending the next one. Modern serial controllers, however, have a transmit FIFO of at least 16 bytes, so maybe we can exploit that and improve the situation. Well, that is what I did.
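I will not paste the actual change here, but the idea can be sketched roughly as follows. Assume wait_for_lsr() stands in for the driver's existing wait_for_xmitr()-style polling helper, up->tx_loadsz is the number of bytes the transmit FIFO accepts, and LF-to-CRLF expansion is left out for brevity:

/* Sketch of the idea only, not the upstream change: wait for the
 * transmitter to go completely idle once, then burst a whole FIFO's
 * worth of bytes instead of polling the line status register after
 * every single character. */
static void fifo_console_write(struct uart_8250_port *up,
                               const char *s, unsigned int count)
{
        const char *end = s + count;
        unsigned int fifosize = up->tx_loadsz;  /* bytes the TX FIFO accepts */
        unsigned int i;

        while (s != end) {
                /* Wait until both the FIFO and the shift register are empty. */
                wait_for_lsr(up, UART_LSR_TEMT);

                /* Now load up to a full FIFO without polling in between. */
                for (i = 0; i < fifosize && s != end; i++)
                        serial_out(up, UART_TX, *s++);
        }
}

Waiting for UART_LSR_TEMT (transmitter completely empty) rather than just UART_LSR_THRE keeps the sketch simple: it guarantees the FIFO can accept a full burst. Let's see how things go now: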

$ time ./sertest -n 2500 /dev/serco

real    0m0.112s
user    0m0.056s
sys     0m0.055s

Wow! Much better! Now, if we send 11500 bytes, it should take approximately one second:

$ time ./sertest -n 11500 /dev/serco

real    0m3.142s
user    0m0.000s
sys     0m3.142s

[image: confused reaction]

Let’s run trace-cmd one more time, but I will only show the piece of the trace output that interests us:

sertest-21649 [008]  2360.845784: funcgraph_exit:         1.942 us   |        }
sertest-21649 [008]  2360.845784: funcgraph_entry:        1.710 us   |        io_serial_in();
sertest-21649 [008]  2360.845786: funcgraph_entry:                   |        __const_udelay() {
sertest-21649 [008]  2360.845786: funcgraph_entry:                   |          delay_tsc() {
sertest-21649 [008]  2360.845786: funcgraph_entry:        0.128 us   |            preempt_count_add();
sertest-21649 [008]  2360.845787: funcgraph_entry:        0.130 us   |            preempt_count_sub();
sertest-21649 [008]  2360.845787: funcgraph_entry:        0.130 us   |            preempt_count_add();
sertest-21649 [008]  2360.845787: funcgraph_entry:        0.131 us   |            preempt_count_sub();
sertest-21649 [008]  2360.845787: funcgraph_entry:        0.131 us   |            preempt_count_add();
sertest-21649 [008]  2360.845788: funcgraph_entry:        0.130 us   |            preempt_count_sub();
sertest-21649 [008]  2360.845788: funcgraph_exit:         1.666 us   |          }
sertest-21649 [008]  2360.845788: funcgraph_exit:         1.915 us   |        }
sertest-21649 [008]  2360.845788: funcgraph_entry:      + 12.787 us  |        io_serial_in();
sertest-21649 [008]  2360.845801: funcgraph_entry:                   |        __const_udelay() {
sertest-21649 [008]  2360.845801: funcgraph_entry:                   |          delay_tsc() {
sertest-21649 [008]  2360.845801: funcgraph_entry:        0.152 us   |            preempt_count_add();
sertest-21649 [008]  2360.845802: funcgraph_entry:        0.151 us   |            preempt_count_sub();
sertest-21649 [008]  2360.845802: funcgraph_entry:        0.151 us   |            preempt_count_add();
sertest-21649 [008]  2360.845802: funcgraph_entry:        0.143 us   |            preempt_count_sub();
sertest-21649 [008]  2360.845803: funcgraph_entry:        0.145 us   |            preempt_count_add();
sertest-21649 [008]  2360.845803: funcgraph_entry:        0.147 us   |            preempt_count_sub();
sertest-21649 [008]  2360.845803: funcgraph_exit:         1.909 us   |          }
sertest-21649 [008]  2360.845803: funcgraph_exit:         2.204 us   |        }
sertest-21649 [008]  2360.845803: funcgraph_entry:        1.743 us   |        io_serial_in();
sertest-21649 [008]  2360.845805: funcgraph_entry:                   |        __const_udelay() {

io_serial_in() usually completes in less than 2 us, but every now and then something odd happens and the controller takes more than 10 us to reply. That is why we cannot reach the expected 115200 bps for this amount of data.

In the original modprobe scsi_debug scenario, we see a 25% throughput boost. Even though we could not reach the port's maximum performance, we improved things considerably.

[1]: console_unlock() will flush the printk() ring buffer before unlocking the console.

[2]: This behavior is vital to ensure essential information is flushed from the ring buffer in the case of fatal errors.
