Deferred work¶
Lab objectives¶
- Understanding deferred work (i.e. code scheduled to be executed at a later time)
- Implementation of common tasks that uses deferred work
- Understanding the peculiarities of synchronization for deferred work
Keywords: softirq, tasklet, struct tasklet_struct, bottom-half handlers, jiffies, HZ, timer, struct timer_list, spin_lock_bh, spin_unlock_bh, workqueue, struct work_struct, kernel thread, events/x
Background information¶
Deferred work is a class of kernel facilities that allows one to schedule code to be executed at a later timer. This scheduled code can run either in the process context or in interruption context depending on the type of deferred work. Deferred work is used to complement the interrupt handler functionality since interrupts have important requirements and limitations:
- The execution time of the interrupt handler must be as small as possible
- In interrupt context we can not use blocking calls
Using deferred work we can perform the minimum required work in the interrupt handler and schedule an asynchronous action from the interrupt handler to run at a later time and execute the rest of the operations.
Deferred work that runs in interrupt context is also known as bottom-half, since its purpose is to execute the rest of the actions from an interrupt handler (top-half).
Timers are another type of deferred work that are used to schedule the execution of future actions after a certain amount of time has passed.
Kernel threads are not themselves deferred work, but can be used to complement the deferred work mechanisms. In general, kernel threads are used as "workers" to process events whose execution contains blocking calls.
There are three typical operations that are used with all types of deferred work:
- Initialization. Each type is described by a structure whose fields will have to be initialized. The handler to be scheduled is also set at this time.
- Scheduling. Schedules the execution of the handler as soon as possible (or after expiry of a timeout).
- Masking or Canceling. Disables the execution of the handler. This action can be either synchronous (which guarantees that the handler will not run after the completion of canceling) or asynchronous.
Attention
When doing deferred work cleanup, like freeing the structures associated with the deferred work or removing the module and thus the handler code from the kernel, always use the synchronous type of canceling the deferred work.
The main types of deferred work are kernel threads and softirqs. Work queues are implemented on top of kernel threads and tasklets and timers on top of softirqs. Bottom-half handlers were the first implementation of deferred work in Linux, but in the meantime it was replaced by softirqs. That is why some functions presented contain bh in their name.
Softirqs¶
softirqs can not be used by device drivers, they are reserved for various kernel subsystems. Because of this there is a fixed number of softirqs defined at compile time. For the current kernel version we have the following types defined:
enum {
HI_SOFTIRQ = 0,
TIMER_SOFTIRQ,
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
BLOCK_SOFTIRQ,
IRQ_POLL_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
RCU_SOFTIRQ,
NR_SOFTIRQS
};
Each type has a specific purpose:
- HI_SOFTIRQ and TASKLET_SOFTIRQ - running tasklets
- TIMER_SOFTIRQ - running timers
- NET_TX_SOFIRQ and NET_RX_SOFTIRQ - used by the networking subsystem
- BLOCK_SOFTIRQ - used by the IO subsystem
- BLOCK_IOPOLL_SOFTIRQ - used by the IO subsystem to increase performance when the iopoll handler is invoked;
- SCHED_SOFTIRQ - load balancing
- HRTIMER_SOFTIRQ - implementation of high precision timers
- RCU_SOFTIRQ - implementation of RCU type mechanisms [1]
[1] | RCU is a mechanism by which destructive operations (e.g. deleting an element from a chained list) are done in two steps: (1) removing references to deleted data and (2) freeing the memory of the element. The second setup is done only after we are sure nobody uses the element anymore. The advantage of this mechanism is that reading the data can be done without synchronization. For more information see Documentation/RCU/rcu.txt. |
The highest priority is the HI_SOFTIRQ type softirqs, followed in order by the other softirqs defined. RCU_SOFTIRQ has the lowest priority.
Softirqs are running in interrupt context which means that they can not call blocking functions. If the sofitrq handler requires calls to such functions, work queues can be scheduled to execute these blocking calls.
Tasklets¶
A tasklet is a special form of deferred work that runs in interrupt
context, just like softirqs. The main difference between sofirqs and tasklets
is that tasklets can be allocated dynamically and thus they can be used
by device drivers. A tasklet is represented by struct
tasklet
and as many other kernel structures it needs to be
initialized before being used. A pre-initialized tasklet can be defined
as following:
void handler(unsigned long data);
DECLARE_TASKLET(tasklet, handler, data);
DECLARE_TASKLET_DISABLED(tasklet, handler, data);
If we want to initialize the tasklet manually we can use the following approach:
void handler(unsigned long data);
struct tasklet_struct tasklet;
tasklet_init(&tasklet, handler, data);
The data parameter will be sent to the handler when it is executed.
Programming tasklets for running is called scheduling. Tasklets are running from softirqs. Tasklets scheduling is done with:
void tasklet_schedule(struct tasklet_struct *tasklet);
void tasklet_hi_schedule(struct tasklet_struct *tasklet);
When using tasklet_schedule, a TASKLET_SOFTIRQ softirq is scheduled and all tasklets scheduled are run. For tasklet_hi_schedule, a HI_SOFTIRQ softirq is scheduled.
If a tasklet was scheduled multiple times and it did not run between schedules, it will run once. Once the tasklet has run, it can be re-scheduled, and will run again at a later timer. Tasklets can be re-scheduled from their handlers.
Tasklets can be masked and the following functions can be used:
void tasklet_enable(struct tasklet_struct * tasklet);
void tasklet_disable(struct tasklet_struct * tasklet);
Remember that since tasklets are running from softirqs, blocking calls can not be used in the handler function.
Timers¶
A particular type of deferred work, very often used, are timers. They
are defined by struct timer_list
. They run in interrupt
context and are implemented on top of softirqs.
To be used, a timer must first be initialized by calling timer_setup()
:
#include <linux/sched.h>
void timer_setup(struct timer_list * timer,
void (*function)(struct timer_list *),
unsigned int flags);
The above function initializes the internal fields of the structure and associates function as the timer handler. Since timers are planned over softirqs, blocking calls can not be used in the code associated with the treatment function.
Scheduling a timer is done with mod_timer()
:
int mod_timer(struct timer_list *timer, unsigned long expires);
Where expires is the time (in the future) to run the handler function. The function can be used to schedule or reschedule a timer.
The time unit is jiffie. The absolute value of a jiffie
is dependent on the platform and it can be found using the
HZ
macro that defines the number of jiffies for 1 second. To
convert between jiffies (jiffies_value) and seconds (seconds_value),
the following formulas are used:
jiffies_value = seconds_value * HZ ;
seconds_value = jiffies_value / HZ ;
The kernel maintains a counter that contains the number of jiffies
since the last boot, which can be accessed via the jiffies
global variable or macro. We can use it to calculate a time in the
future for timers:
#include <linux/jiffies.h>
unsigned long current_jiffies, next_jiffies;
unsigned long seconds = 1;
current_jiffies = jiffies;
next_jiffies = jiffies + seconds * HZ;
To stop a timer, use del_timer()
and del_timer_sync()
:
int del_timer(struct timer_list *timer);
int del_timer_sync(struct timer_list *timer);
These functions can be called for both a scheduled timer and an
unplanned timer. del_timer_sync()
is used to eliminate the
races that can occur on multiprocessor systems, since at the end of
the call it is guaranteed that the timer processing function does not
run on any processor.
A frequent mistake in using timers is that we forget to turn off timers. For example, before removing a module, we must stop the timers because if a timer expires after the module is removed, the handler function will no longer be loaded into the kernel and a kernel oops will be generated.
The usual sequence used to initialize and schedule a one-second timeout is:
#include <linux/sched.h>
void timer_function(struct timer_list *);
struct timer_list timer ;
unsigned long seconds = 1;
timer_setup(&timer, timer_function, 0);
mod_timer(&timer, jiffies + seconds * HZ);
And to stop it:
del_timer_sync(&timer);
Locking¶
For synchronization between code running in process context (A) and code running in softirq context (B) we need to use special locking primitives. We must use spinlock operations augmented with deactivation of bottom-half handlers on the current processor in (A), and in (B) only basic spinlock operations. Using spinlocks makes sure that we don't have races between multiple CPUs while deactivating the softirqs makes sure that we don't deadlock in the softirq is scheduled on the same CPU where we already acquired a spinlock.
We can use the local_bh_disable()
and
local_bh_enable()
to disable and enable softirqs handlers (and
since they run on top of softirqs also timers and tasklets):
void local_bh_disable(void);
void local_bh_enable(void);
Nested calls are allowed, the actual reactivation of the softirqs is done only when all local_bh_disable() calls have been complemented by local_bh_enable() calls:
/* We assume that softirqs are enabled */
local_bh_disable(); /* Softirqs are now disabled */
local_bh_disable(); /* Softirqs remain disabled */
local_bh_enable(); /* Softirqs remain disabled */
local_bh_enable(); /* Softirqs are now enabled */
Attention
These above calls will disable the softirqs only on the local processor and they are usually not safe to use, they must be complemented with spinlocks.
Most of the time device drivers will use special versions of spinlocks
calls for synchronization like spin_lock_bh()
and
spin_unlock_bh()
:
void spin_lock_bh(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);
Workqueues¶
Workqueues are used to schedule actions to run in process context. The base unit with which they work is called work. There are two types of work:
struct work_struct
- it schedules a task to run at a later timestruct delayed_work
- it schedules a task to run after at least a given time interval
A delayed work uses a timer to run after the specified time
interval. The calls with this type of work are similar to those for
struct work_struct
, but has _delayed in the functions
names.
Before using them a work item must be initialized. There are two types of macros that can be used, one that declares and initializes the work item at the same time and one that only initializes the work item (and the declaration must be done separately):
#include <linux/workqueue.h>
DECLARE_WORK(name , void (*function)(struct work_struct *));
DECLARE_DELAYED_WORK(name, void(*function)(struct work_struct *));
INIT_WORK(struct work_struct *work, void(*function)(struct work_struct *));
INIT_DELAYED_WORK(struct delayed_work *work, void(*function)(struct work_struct *));
DECLARE_WORK()
and DECLARE_DELAYED_WORK()
declare and
initialize a work item, and INIT_WORK()
and
INIT_DELAYED_WORK()
initialize an already declared work item.
The following sequence declares and initiates a work item:
#include <linux/workqueue.h>
void my_work_handler(struct work_struct *work);
DECLARE_WORK(my_work, my_work_handler);
Or, if we want to initialize the work item separately:
void my_work_handler(struct work_struct * work);
struct work_struct my_work;
INIT_WORK(&my_work, my_work_handler);
Once declared and initialized, we can schedule the task using
schedule_work()
and schedule_delayed_work()
:
schedule_work(struct work_struct *work);
schedule_delayed_work(struct delayed_work *work, unsigned long delay);
schedule_delayed_work()
can be used to plan a work item for
execution with a given delay. The delay time unit is jiffies.
Work items can not be masked but they can be canceled by calling
cancel_delayed_work_sync()
or cancel_work_sync()
:
int cancel_work_sync(struct delayed_work *work);
int cancel_delayed_work_sync(struct delayed_work *work);
The call only stops the subsequent execution of the work item. If the work item is already running at the time of the call, it will continue to run. In any case, when these calls return, it is guaranteed that the task will no longer run.
Attention
While there are versions of these functions that are
not synchronous (.e.g. cancel_work()
) do not
use them when you are performing cleanup work otherwise
race condition could occur.
We can wait for a workqueue to complete running all of its work items by calling flush_scheduled_work()
:
void flush_scheduled_work(void);
This function is blocking and, therefore, can not be used in interrupt
context. The function will wait for all work items to be completed.
For delayed work items, cancel_delayed_work
must be called
before flush_scheduled_work()
.
Finally, the following functions can be used to schedule work items on
a particular processor (schedule_delayed_work_on()
), or on all
processors (schedule_on_each_cpu()
):
int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay);
int schedule_on_each_cpu(void(*function)(struct work_struct *));
A usual sequence to initialize and schedule a work item is the following:
void my_work_handler(struct work_struct *work);
struct work_struct my_work;
INIT_WORK(&my_work, my_work_handler);
schedule_work(&my_work);
And for waiting for termination of a work item:
flush_scheduled_work();
As you can see, the my_work_handler function receives the task as
the parameter. To be able to access the module's private data, you can
use container_of()
:
struct my_device_data {
struct work_struct my_work;
// ...
};
void my_work_handler(struct work_struct *work)
{
struct my_device_data * my_data;
my_data = container_of(work, struct my_device_data, my_work);
// ...
}
Scheduling work items with the functions above will run the handler in the context of a kernel thread called events/x, where x is the processor number. The kernel will initialize a kernel thread (or a pool of workers) for each processor present in the system:
$ ps -e
PID TTY TIME CMD
1? 00:00:00 init
2 ? 00:00:00 ksoftirqd / 0
3 ? 00:00:00 events / 0 <--- kernel thread that runs work items
4 ? 00:00:00 khelper
5 ? 00:00:00 kthread
7? 00:00:00 kblockd / 0
8? 00:00:00 kacpid
The above functions use a predefined workqueue (called events), and they run in the context of the events/x thread, as noted above. Although this is sufficient in most cases, it is a shared resource and large delays in work items handlers can cause delays for other queue users. For this reason there are functions for creating additional queues.
A workqueue is represented by struct workqueue_struct
. A new
workqueue can be created with these functions:
struct workqueue_struct *create_workqueue(const char *name);
struct workqueue_struct *create_singlethread_workqueue(const char *name);
create_workqueue()
uses one thread for each processor in the
system, and create_singlethread_workqueue()
uses a single
thread.
To add a task in the new queue, use queue_work()
or
queue_delayed_work()
:
int queue_work(struct workqueue_struct * queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue,
struct delayed_work * work , unsigned long delay);
queue_delayed_work()
can be used to plan a work for execution
with a given delay. The time unit for the delay is jiffies.
To wait for all work items to finish call flush_workqueue()
:
void flush_workqueue(struct worksqueue_struct * queue);
And to destroy the workqueue call destroy_workqueue()
void destroy_workqueue(struct workqueue_struct *queue);
The next sequence declares and initializes an additional workqueue, declares and initializes a work item and adds it to the queue:
void my_work_handler(struct work_struct *work);
struct work_struct my_work;
struct workqueue_struct * my_workqueue;
my_workqueue = create_singlethread_workqueue("my_workqueue");
INIT_WORK(&my_work, my_work_handler);
queue_work(my_workqueue, &my_work);
And the next code sample shows how to remove the workqueue:
flush_workqueue(my_workqueue);
destroy_workqueue(my_workqueue);
The work items planned with these functions will run in the context of
a new kernel thread called my_workqueue, the name passed to
create_singlethread_workqueue()
.
Kernel threads¶
Kernel threads have emerged from the need to run kernel code in process context. Kernel threads are the basis of the workqueue mechanism. Essentially, a kernel thread is a thread that only runs in kernel mode and has no user address space or other user attributes.
To create a kernel thread, use kthread_create()
:
#include <linux/kthread.h>
struct task_struct *kthread_create(int (*threadfn)(void *data),
void *data, const char namefmt[], ...);
- threadfn is a function that will be run by the kernel thread
- data is a parameter to be sent to the function
- namefmt represents the kernel thread name, as it is displayed in ps/top ; Can contain sequences %d , %s etc. Which will be replaced according to the standard printf syntax.
For example, the following call:
kthread_create (f, NULL, "%skthread%d", "my", 0);
Will create a kernel thread with the name mykthread0.
The kernel thread created with this function will be stopped (in the
TASK_INTERRUPTIBLE state). To start the kernel thread, call the
wake_up_process()
:
#include <linux/sched.h>
int wake_up_process(struct task_struct *p);
Alternatively, you can use kthread_run()
to create and run a
kernel thread:
struct task_struct * kthread_run(int (*threadfn)(void *data)
void *data, const char namefmt[], ...);
Even if the programming restrictions for the function running within the kernel thread are more relaxed and scheduling is closer to scheduling in userspace, there are, however, some limitations to be taken into account. We will list below the actions that can or can not be made from a kernel thread:
- can't access the user address space (even with copy_from_user, copy_to_user) because a kernel thread does not have a user address space
- can't implement busy wait code that runs for a long time; if the kernel is compiled without the preemptive option, that code will run without being preempted by other kernel threads or user processes thus hogging the system
- can call blocking operations
- can use spinlocks, but if the hold time of the lock is significant, it is recommended to use mutexes
The termination of a kernel thread is done voluntarily, within the
function running in the kernel thread, by calling do_exit()
:
fastcall NORET_TYPE void do_exit(long code);
Most of the implementations of kernel threads handlers use the same model and it is recommended to start using the same model to avoid common mistakes:
#include <linux/kthread.h>
DECLARE_WAIT_QUEUE_HEAD(wq);
// list events to be processed by kernel thread
struct list_head events_list;
struct spin_lock events_lock;
// structure describing the event to be processed
struct event {
struct list_head lh;
bool stop;
//...
};
struct event* get_next_event(void)
{
struct event *e;
spin_lock(&events_lock);
e = list_first_entry(&events_list, struct event*, lh);
if (e)
list_del(&e->lh);
spin_unlock(&events_lock);
return e
}
int my_thread_f(void *data)
{
struct event *e;
while (true) {
wait_event(wq, (e = get_next_event));
/* Event processing */
if (e->stop)
break;
}
do_exit(0);
}
/* start and start kthread */
kthread_run(my_thread_f, NULL, "%skthread%d", "my", 0);
With the template above, the kernel thread requests can be issued with:
void send_event(struct event *ev)
{
spin_lock(&events_lock);
list_add(&ev->lh, &events_list);
spin_unlock(&events_lock);
wake_up(&wq);
}
Further reading¶
Exercises¶
Important
We strongly encourage you to use the setup from this repository.
- To solve exercises, you need to perform these steps:
- prepare skeletons from templates
- build modules
- start the VM and test the module in the VM.
The current lab name is deferred_work. See the exercises for the task name.
The skeleton code is generated from full source examples located in
tools/labs/templates
. To solve the tasks, start by generating
the skeleton code for a complete lab:
tools/labs $ make clean
tools/labs $ LABS=<lab name> make skels
You can also generate the skeleton for a single task, using
tools/labs $ LABS=<lab name>/<task name> make skels
Once the skeleton drivers are generated, build the source:
tools/labs $ make build
Then, start the VM:
tools/labs $ make console
The modules are placed in /home/root/skels/deferred_work/<task_name>.
You DO NOT need to STOP the VM when rebuilding modules! The local skels directory is shared with the VM.
Review the Exercises section for more detailed information.
Warning
Before starting the exercises or generating the skeletons, please run git pull inside the Linux repo, to make sure you have the latest version of the exercises.
If you have local changes, the pull command will fail. Check for local changes using git status
.
If you want to keep them, run git stash
before pull
and git stash pop
after.
To discard the changes, run git reset --hard master
.
If you already generated the skeleton before git pull
you will need to generate it again.
0. Intro¶
Using LXR, find the definitions of the following symbols:
jiffies
struct timer_list
spin_lock_bh function()
1.Timer¶
We're looking at creating a simple kernel module that displays a message at TIMER_TIMEOUT seconds after the module's kernel load.
Generate the skeleton for the task named 1-2-timer and follow the sections marked with TODO 1 to complete the task.
Hint
Use pr_info(...). Messages will be displayed on the
console and can also be viewed using dmesg. When scheduling
the timer we need to use the absolute time of the system (in
the future) in number of ticks. The current time of the
system in the number of ticks is given by jiffies
.
Thus, the absolute time we need to pass to the timer is
jiffies + TIMER_TIMEOUT * HZ
.
For more information review the Timers section.
2. Periodic timer¶
Modify the previous module to display the message in once every TIMER_TIMEOUT seconds. Follow the section marked with TODO 2 in the skeleton.
3. Timer control using ioctl¶
We plan to display information about the current process after N seconds of receiving a ioctl call from user space. N is transmitted as ioctl parameter.
Generate the skeleton for the task named 3-4-5-deferred and follow the sections marked with TODO 1 in the skeleton driver.
You will need to implement the following ioctl operations.
- MY_IOCTL_TIMER_SET to schedule a timer to run after a number of seconds which is received as an argument to ioctl. The timer does not run periodically. * This command receives directly a value, not a pointer.
- MY_IOCTL_TIMER_CANCEL to deactivate the timer.
Note
Review ioctl for a way to access the ioctl argument.
Note
Review the Timers section for information on enabling / disabling a timer. In the timer handler, display the current process identifier (PID) and the process executable image name.
Hint
You can find the current process identifier using the pid and comm fields of the current process. For details, review proc-info.
Hint
To use the device driver from userspace you must create the device character file /dev/deferred using the mknod utility. Alternatively, you can run the 3-4-5-deferred/kernel/makenode script that performs this operation.
Enable and disable the timer by calling user-space ioctl operations. Use the 3-4-5-deferred/user/test program to test planning and canceling of the timer. The program receives the ioctl type operation and its parameters (if any) on the command line.
Hint
Run the test executable without arguments to observe the command line options it accepts.
To enable the timer after 3 seconds use:
./test s 3
To disable the timer use:
./test c
Note that every time the current process the timer runs from is swapper/0 with PID 0. This process is the idle process. It is running when there is nothing else to run on. Because the virtual machine is very light and does not do much it is natural to see this process most of the time.
4. Blocking operations¶
Next we want to see what happens when we perform blocking operations in a timer routine. For this we try to call in the timer-handling routines a function called alloc_io() that simulates a blocking operation.
Modify the module so that when you receive MY_IOCTL_TIMER_ALLOC
command the timer handler will call alloc_io()
. Follow the
sections marked with TODO 2 in the skeleton.
Use the same timer. To differentiate functionality in the timer handler, use a flag in the device structure. Use the TIMER_TYPE_ALLOC and TIMER_TYPE_SET macros defined in the code skeleton. For initialization, use TIMER_TYPE_NONE.
Run the test program to verify the functionality of task 3. Run the
test program again to call alloc_io()
.
Note
The driver causes an error because a blocking function is called in the atomic context (the timer handler runs interrupt context).
5. Workqueues¶
We will modify the module to prevent the error observed in the previous task.
To do so, lets call alloc_io()
using workqueues. Schedule a
work item from the timer handler In the work handler (running in
process context) call the alloc_io()
. Follow the sections
marked with TODO 3 in the skeleton and review the Workqueues
section if needed.
Hint
Add a new field with the type struct work_struct
in your device structure. Initialize this field. Schedule
the work from the timer handler using schedule_work()
.
Schedule the timer handler aften N seconds from the ioctl.
6. Kernel thread¶
Implement a simple module that creates a kernel thread that shows the current process identifier.
Generate the skeleton for the task named 6-kthread and follow the TODOs from the skeleton.
Note
There are two options for creating and running a thread:
kthread_run()
to create and run the threadkthread_create()
to create a suspended thread and then start it running withwake_up_process()
.
Review the Kernel Threads section if needed.
Attention
Synchronize the thread termination with module unloading:
- The thread should finish when the module is unloaded
- Wait for the kernel thread to exit before continuing with unloading
Hint
For synchronization use two wait queues and two flags.
Review waiting-queues on how to use waiting queue.
Use atomic variables for flags. Review Atomic variables.