Kernel Profiling

Lab Objectives

  • Familiarize yourself with the basics of Linux kernel profiling
  • Understanding basic profiling tools
  • Learning profiling methodologies and good practices

Overview

Up until now we have studied how the different components of the Linux kernel work, and how to write drivers that interface with them in order to provide support for devices or protocols. This has helped us understand how the Linux kernel works, but most people will not get to write kernel drivers.

Nonetheless, the skills learned will help us to write applications that better integrate with the whole operating system. In order to do this, one has to have a good view of both the user space and the kernel space.

This session aims to merge the work we have done up until now in the kernel space with real world use cases where we do not write kernel space code, but we look through the kernel using profiling tools, in order to debug issues that we're having when writing regular, low-level, applications.

Another focus of this session will be learning a general methodology for debugging software issues, and we will approach some tools that give us insight from the kernel on the way our application runs.

Profiling Tools

The main tool that we will focus our attention on is perf, which offers support for tracing applications, and also inspecting general aspects of the system. We will also be using debugging tools that most people have used in their day to day life, such as htop, ps, lsof and others.

perf

perf is a tool that instruments the CPU using tracepoints, kprobes and uprobes. This tool allows us to take a look at what functions are being called at a given point. This allows us to take a peak at where the kernel is pending the most time, print out call stacks of functions, and in general log what the CPU is running.

perf integrates modules such as: * static tracing * dynamic tracing * resource monitoring

The tracing interface that is offered by perf can be used by itself, using the perf command together with its subcommands.

root@qemux86:~# ./skels/kernel_profiling/perf

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints

 See 'perf help COMMAND' for more information on a specific command.

In the output above we can see all of perf's subcommands together with a description of their functionality, the most significant of which are:

  • stat - displays statistics such as the number of context switches and page faults;
  • top - an interactive interface where we can inspect the most frequent function calls and their caller. This interface allows us direct feedback while profiling;
  • list - lists the static trace point that we can instrument inside the kernel. These are useful when trying to get an insight from inside the kernel;
  • probe - add a dynamic trace point that instruments a function call in order to be recorded by perf;
  • record - records function calls and stack traces based on tracing points defined by the user; It can also record specific function calls and their stack traces. The record is saved in a file, named perf.data by default;
  • report - displays the information saved in a perf recording.

Another way to use perf's interface is through scripts that wrap over perf that offer a higher level way of looking at events or data, without needing to know the intricacies of the command. An example of this is the iosnoop.sh script, which displays what I/O transfers are taking place.

ps

ps is the Linux tool that allows us to monitor the processes that are running at a given time on the machine, including the kernel threads. This is a simple and easy to use way of checking at a glance what processes are running on the CPU, and what is their CPU and memory usage.

In order to list all the processes running, we use to ps aux command in the following way:

TODO
root@qemux86:~/skels/kernel_profiling/0-demo# cd
 root@qemux86:~# ps aux
 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 root         1  0.0  0.5   2004  1256 ?        Ss   12:06   0:12 init [5]
 root         2  0.0  0.0      0     0 ?        S    12:06   0:00 [kthreadd]
 [...]
 root       350  4.5  4.4  11132 10688 hvc0     T    12:07  17:21 ./io-app
 root      1358  0.0  0.0      0     0 ?        I    14:30   0:00 [kworker/u2:1-e
 root      2293  0.1  1.5   5516  3704 ?        Ss   18:18   0:00 sshd: root@pts/
 root      2295  0.0  1.3   3968  3232 pts/0    Ss+  18:19   0:00 -sh
 root      2307  0.0  0.0      0     0 ?        I    18:19   0:00 [kworker/u2:2-e
 root      2350  0.0  0.7   3032  1792 hvc0     R+   18:26   0:00 ps aux
 root      2392  2.6  0.0      0     0 ?        D    18:31   0:00 test-script

One information of note is that the 7th column represents the that of the process, S meaning suspended, D suspended due to I/O, and R meaning running.

time

The time command allows us to inspect the amount of time spent by a process in I/O, running the application code, or running code in kernel space. This can be useful in order to find out whether an application's issue comes from running too much in kernel space, so it has some overhead when it does system calls, or the issue is in the user code.

root@qemux86:~# time dd if=/dev/urandom of=./test-file bs=1K count=10
10+0 records in
10+0 records out
10240 bytes (10 kB, 10 KiB) copied, 0.00299749 s, 3.4 MB/s

real        0m0.020s
user        0m0.001s
sys 0m0.015s

In the output above we timed the generation of a file using dd. The result of the timing is displayed at the bottom of output. The values outputted by the tool are the following:

  • real - the amount of time has passed from the start of the application to its finishing;
  • user - time spent running the dd code;
  • sys - time spent running kernel code on behalf of the process.

We see that the sum of the user and sys values doesn't add up to the real value. This happens either when the application runs on multiple cores, in which case the sum might be higher, or the application sleeps, in which case the sum is lower.

top

top is an application that is found on most systems which lists in real time the applications that are running on the system. top runs interactively, and it auto-refreshes its output, as opposed to ps. We use this tool when we want a high level of continuous monitoring.

Profiling Methodology

When doing profiling, our goal is to identify the cause of a problem. Usually this problem is observed by someone when their application doesn't work as expected. When we say that an application did not work as expected, this can mean different things for different people. For example, one person might complain that the application has a slowdown, while another might say that the application runs on the CPU, but it doesn't output anything.

The first step in any problem solving context is to understand the default behaviour of the application we're trying to debug, and to make sure that it is now not running in the expected parameters.

Exercises

Important

To solve exercises, you need to perform these steps:

  • prepare skeletons from templates
  • build modules
  • copy modules to the VM
  • start the VM and test the module in the VM.

The current lab name is kernel_profiling. See the exercises for the task name.

The skeleton code is generated from full source examples located in tools/labs/templates. To solve the tasks, start by generating the skeleton code for a complete lab:

tools/labs $ make clean
tools/labs $ LABS=<lab name> make skels

You can also generate the skeleton for a single task, using

tools/labs $ LABS=<lab name>/<task name> make skels

Once the skeleton drivers are generated, build the source:

tools/labs $ make build

Then, copy the modules and start the VM:

tools/labs $ make copy
tools/labs $ make boot

The modules are placed in /home/root/skels/kernel_profiling/<task_name>.

Alternatively, we can copy files via scp, in order to avoid restarting the VM. For additional details about connecting to the VM via the network, please check Connecting to the Virtual Machine.

Review the Exercises section for more detailed information.

Warning

Before starting the exercises or generating the skeletons, please run git pull inside the Linux repo, to make sure you have the latest version of the exercises.

If you have local changes, the pull command will fail. Check for local changes using git status. If you want to keep them, run git stash before pull and git stash pop after. To discard the changes, run git reset --hard master.

If you already generated the skeleton before git pull you will need to generate it again.

Note

This session will require us to use the perf tracing tool. When running natively on our systems, we have to install the linux-tools-<version>-generic package using a package manager in order to run it. Because in our visual machine we don't have access to a package manager, we will be downloading the perf binary from this link. Download the application in the skels/kernel_profiling directory, and grant in execution permissions.

Warning

When running perf, make sure that you're running the downloaded version, not the version in the PATH variable.

Note

When going through this session's exercises, we will have to run command in parallel. In order to do this, we will have to connect to the virtual machine using SSH. We recommend using the core-image-sato-sdk-qemu image, since it has the tools that we need. To run the virtual machine using the core-image-sato-sdk-qemu file system, uncomment line 16 in the qemu/Makefile file.

Note

If you wish to run the perf-tools based scripts that we have included in the repository, such as iosnoop.sh, you will have to grant it execution privilleges, in order to be copied to the virtual machine file system.

Note

In order to improve the course of SO2, its components and the way it is conducted, your opinions are very useful to us. Please fill the feedback form on curs.upb.ro platform.

The form is anonymous and is active between May 22 and June 2, 2023. The results will be visible to the SO2 team after all the grades have been marked.

We invite you to evaluate the activity of the SO2 team and specify its strengths and weaknesses and your suggestions for improving the subject. Your feedback is very important to us to increase the quality of the subject in the coming years.

We are particularly interested in:

  • What did you not like and what do you think did not go well?
  • Why didn't you like it and why do you think it didn't go well?
  • What should we do to make things better?

0. Demo: Profiling I/O Problems

When working with I/O, we have to keep in mind that it is one of the slowest systems in the operating system, compared to memory, which is an order of magnitude faster, and scheduling, which deals with what is currently running on the CPU.

Because of this, I/O operations have do be thought out, because you might starve you application by saturating the system with requests. Another issue that you might face is that the I/O's slow speed might affect your application's responsiveness, if it waits for the I/O operations to finish.

Let's take a look at an application and debug its issues.

We are going to run the io-app application, from the 0-demo directory.

In order to inspect what is running on the CPU, and look at the stack of the process, we can use the perf record subcommand in the following way:

root@qemux86:~# ./perf record -a -g
Couldn't synthesize bpf events.
^C[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.724 MB perf.data (8376 samples) ]

perf will record values indefinitely, but we can close it using the Ctrl+c hotkey. We used the -a option in order to probe all CPUs, and -g option, which record the whole call stack.

To visualize the recorded information, we will use the perf report command, which will bring up a pager which will display the most frequent function calls that were found on the CPU, and their call stack.

root@qemux86:~# ./perf report --header -F overhead,comm,parent
# Total Lost Samples: 0
#
# Samples: 8K of event 'cpu-clock:pppH'
# Event count (approx.): 2094000000
#
# Overhead  Command          Parent symbol
# ........  ...............  .............
#
    58.63%  io-app           [other]
            |
             --58.62%--__libc_start_main
                       main
                       __kernel_vsyscall
                       |
                        --58.61%--__irqentry_text_end
                                  do_SYSENTER_32
                                  do_fast_syscall_32
                                  __noinstr_text_start
                                  __ia32_sys_write
                                  ksys_write
                                  vfs_write
                                  |
                                   --58.60%--ext4_file_write_iter
                                             ext4_buffered_write_iter
[...]

We have used the --header in order to print the table header, and -F overhead,comm,parent, in order to print the percentage of time where the call stack, the command and the caller.

We can see that the io-app command is doing some writes in the file system, and this contributes to much of the load on the system.

Armed with this information, we know that there are many I/O calls being done by the application. In order to look at the size of these requests, we can use the iosnoop.sh script in order to see how big these requests are.

root@qemux86:~/skels/kernel_profiling# ./iosnoop.sh 1
Tracing block I/O. Ctrl-C to end.
COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
io-app       889    WS   254,0    4800512      1310720     2.10
io-app       889    WS   254,0    4803072      1310720     2.04
io-app       889    WS   254,0    4805632      1310720     2.03
io-app       889    WS   254,0    4808192      1310720     2.43
io-app       889    WS   254,0    4810752      1310720     3.48
io-app       889    WS   254,0    4813312      1310720     3.46
io-app       889    WS   254,0    4815872      524288     1.03
io-app       889    WS   254,0    5029888      1310720     5.82
io-app       889    WS   254,0    5032448      786432     5.80
jbd2/vda-43  43     WS   254,0    2702392      8192       0.22
kworker/0:1H 34     WS   254,0    2702408      4096       0.40
io-app       889    WS   254,0    4800512      1310720     2.60
io-app       889    WS   254,0    4803072      1310720     2.58
[...]

From this output we see that the io-app is reading in a loop from the fact that the first block 4800512 is repeating, and that it is doing big reads, since it is reading one megabyte fer request. This constant looping adds the load to the system that we're experiencing.

1. Investigating Reduced Responsiveness

The io.ko module, located in the kernel_profiling/1-io directory, decreases the system's responsiveness when inserted. We see that the command line stutters when typing commands, but when running top, we see that the system's load is not high, and there aren't any processes that are hogging resources.

Find out what the io.ko module is doing and why is it leading to the stuttering effect that we experience.

Hint

Trace all the functions being called and check where the CPU is spending most of its time. In order to do this, you can run either perf record and perf report to view the output, or perf top.

2. Launching New Threads

We want to run the same function in a loop 100 times in parallel. We have implemented two solutions inside the scheduling binary file, located in the kernel_profiling/2-scheduling directory.

When executing the scheduling binary, it prints a message in parallel from 100 running instances. We can tune this execution by running the application either with the first parameter 0 or 1.

Find out which solution is better, and why.

3. Tuning cp

Our goal is to write a copy of the cp tool integrated in Linux, which has been implemented by the memory binary, in the kernel_profiling/3-memory directory. It implements two approaches that we can take for the copy operation:

  • reading the contents of the source file in a buffer in memory using the read() system call, and writing that buffer to the destination file using the write() system call;
  • mapping the source and destination files to memory using the mmap system call, and copying the contents of the source file to the destination in memory.

Another tunable parameter that we're going to use is the block size of to copies that we're going to make, either through reads/writes or in memory.

1) Investigate which of the two copying mechanisms is faster. For this step, you will use the 1024 block size.

2) Once you have found which copying mechanism is faster, change the block size parameter and see which value gives you the best copies. Why?

4. I/O Latency

We have written a module that reads the content of a disk. Insert the bio.ko module, located in the 4-bio module, we see a large spike in the system's load, as can be seen in the top command, but we see that the system is still responsive.

Investigate what is causing the increased load to the system. Is it an I/O issue, or is it a scheduling issue?

Hint

Try to trace the I/O operations using perf, or use the iosnoop.sh script in order to inspect what I/O is happening at a certain point.

5. Bad ELF

Note

This is a bonus exercise that has been tested on a native Linux system. It may run under the QEMU virtual machine, but the behavior was weird in our testing. We recommend you used a native (or VirtualBox or VMware) Linux system.

We managed to build (as part of a Unikraft build) an ELF file that is valid when doing static analysis, but that can't be executed. The file is bad_elf, located in the 5-bad-elf/ folder.

Running it triggers a segmentation fault message. Running it using strace show an error with execve().

... skels/kernel_profiling/5-bad-elf$ ./bad_elf
Segmentation fault

... skels/kernel_profiling/5-bad-elf$ strace ./bad_elf
execve("./bad_elf", ["./bad_elf"], 0x7ffc3349ba50 /* 70 vars \*/) = -1 EINVAL (Invalid argument)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=NULL} ---
+++ killed by SIGSEGV +++
Segmentation fault (core dumped)

The ELF file itself is valid:

... skels/kernel_profiling/5-bad-elf$ readelf -a bad_elf

The issue is to be detected in the kernel.

Use either perf, or, better yet ftrace to inspect the kernel function calls done by the program. Identify the function call that sends out the SIGSEGV signal. Identify the cause of the issue. Find that cause in the manual page elf(5).