How the Linux kernel works

Opinion English How the Linux kernel works

In depth: Oxford Dictionary defines a kernel as "a softer, usually edible part of a nut" but offers as a second meaning: "The central or most important part of something." (Incidentally, it's this first definition that gives rise to the contrasting name 'shell', meaning, in Linux-speak, a command interpreter.) In case you're a bit hazy on what a kernel actually does, we'll start with a bit of theory.

The kernel is a piece of software that, roughly speaking, provides a layer between the hardware and the application programs running on a computer. In a strict, computer-science sense, the term 'Linux' refers only to the kernel - the bit that Linus Torvalds wrote in the early 90s.

All the other pieces you find in a Linux distribution - the Bash shell, the KDE window manager, web browsers, the X server, Tux Racer and everything else - are just applications that happen to run on Linux and are emphatically not part of the operating system itself. To give some sense of scale, a fresh installation of RHEL5 occupies about 2.5GB of disk space (depending, obviously, on what you choose to include). Of this, the kernel, including all of its modules, occupies 47MB, or about 2%.

Inside the kernel

But what does the kernel actually do? The diagram below shows the big picture. The kernel makes its services available to the application programs that run on it through a large collection of entry points, known technically as system calls.

The kernel uses system calls such as 'read' and 'write' to provide an abstraction of your hardware.

From a programmer's viewpoint, these look just like ordinary function calls, although in reality a system call involves a distinct switch in the operating mode of the processor from user space to kernel space. Together, the repertoire of system calls provides a 'Linux virtual machine', which can be thought of as an abstraction of the underlying hardware.

One of the more obvious abstractions provided by the kernel is the filesystem. By way of example, here's a short program (written in C) that opens a file and copies its contents to standard output:

#include 
int main()
{
    int fd, count; char buf[1000];
    fd=open("mydata", O_RDONLY);
    count = read(fd, buf, 1000);
    write(1, buf, count);
    close(fd);
}

Here, you see examples of four system calls - open, read, write and close. Don't fret over the details of the syntax; that's not important right now. The point is this: through these system calls (and a few others) the Linux kernel provides the illusion of a 'file' - a sequence of bytes of data that has a name - and protects you from the underlying details of tracks and sectors and heads and free block lists that you'd have to get into if you wanted to talk to the hardware directly. That's what we mean by an abstraction.

As you'll see from the picture above, the kernel has to work hard to maintain this same abstraction when the filesystem itself might be stored in any of several formats, on local storage devices such as hard disks, CDs or USB memory sticks - or might even be on a remote system and accessed through a network protocol such as NFS or CIFS.

There may even be an additional device mapper layer to support logical volumes or RAID. The virtual filesystem layer within the kernel enables it to present these underlying forms of storage as a collection of files within a single hierarchical filesystem.

Behind the scenes

The filesystem is one of the more obvious abstractions provided by the kernel. Some features are not so directly visible. For example, the kernel is responsible for process scheduling. At any one time, there are likely to be several processes (programs) waiting to run.

The kernel's scheduler allocates CPU time to each one, so that if you look over a longer timescale (a few seconds) you have the illusion that the computer is running several programs at the same time. Here's another little C program:

#include 
main()
{
  if (fork()) {
    write(1, "Parent\n", 7);
    wait(0);
    exit(0);
  }
  else {
    write(1, "Child\n", 6);
    exit(0);
  }
}

This program creates a new process; the original process (the parent) and the new process (the child) each write a message to standard output, then terminate. Again, don't stress about the syntax. Just notice that the system calls fork(), exit() and wait() perform process creation, termination and synchronisation respectively. These are elegantly simple calls that hide the underlying compexities of process management and scheduling.

An even less visible function of the kernel, even to programmers, is memory management. Each process runs under the illusion that it has an address space (a valid range of memory addresses) to call its own. In reality, it's sharing the physical memory of the computer with many other processes, and if the system is running low on memory, some of its address space may even be parked out on the disk in the swap area.

Another aspect of memory management is that it prevents one process from accessing the address space of another - a necessary precaution to preserve the integrity of a multi-processing operating system.

The kernel also implements networking protocols such as IP, TCP and UDP that provide machine-to-machine and process-to-process communication over a network. Again, this is all about illusions. TCP provides the illusion of a permanent connection between two processes - like a piece of copper wire connecting two telephones - but in reality no permanent connection exists. Note that specific application protocols such as FTP, DNS or HTTP are implemented by user-level programs and aren't part of the kernel.

Linux (like Unix before it) has a good reputation for security. It's the kernel that tracks the user ID and group ID of each running process and uses these to provide a yes/no decision each time an application attempts to access a resource (such as opening a file for writing), by checking the access permissions on the file. This access control model is ultimately responsible for the security of Linux systems as a whole.

Finally (apologies to the many programmers who've written pieces of the kernel that do things that aren't on this brief list), the kernel provides a large collection of modules that know how to handle the low-level details of talking to hardware devices - how to read a sector from a disk, how to retrieve a packet from a network interface card and so on. These are sometimes called device drivers.

The modular kernel

Now we have some idea of what the kernel does, let's look briefly at its physical organisation. Early versions of the Linux kernel were monolithic - that is, all the bits and pieces were statically linked into one (rather large) executable file.

In contrast, modern Linux kernels are modular: a lot of the functionality is contained in modules that are loaded into the kernel dynamically. This keeps the core of the kernel small and makes it possible to load or replace modules in a running kernel without rebooting.

The core of the kernel is loaded into memory at boot time from a file in the /boot directory called something like vmlinuz-KERNELVERSION, where KERNELVERSION is, of course, the kernel version. (To find out what kernel version you have, run the command uname -r.) The kernel's modules are under the directory /lib/modules/KERNELVERSION. All of these pieces were copied into place when the kernel was installed.

Managing modules

For the most part, Linux manages its modules without your help, but there are commands to examine and manage the modules manually, should the need arise. For example, to find out which modules are currently loaded into the kernel, use lsmod. Here's a sample of the output:

# lsmod
pcspkr              4224  0 
hci_usb            18204  2 
psmouse            38920  0 
bluetooth          55908  7 rfcomm,l2cap,hci_usb
yenta_socket       27532  5 
rsrc_nonstatic     14080  1 yenta_socket
isofs              36284  0

The fields in this output are the module's name, its size, its usage count and a list of the modules that are dependent on it. The usage count is important to prevent unloading a module that's currently active. Linux will only enable a module to be removed if its usage count is zero.

You can manually load and unload modules using modprobe. (There are two lower-level commands called insmod and rmmod that do the job, but modprobe is easier to use because it automatically resolves module dependencies.) For example, the output of lsmod on our machine shows a loaded module called isofs, which has a usage count of zero and no dependent modules. (isofs is the module that supports the ISO filesystem format used on CDs.) The kernel is happy to let us unload the module, like this:

# modprobe -r isofs

Now isofs doesn't show up on the output of lsmod and, for what it's worth, the kernel is using 36,284 bytes less memory. If you put in a CD and let it automount, the kernel will automatically reload the isofs module and its usage count will rise to 1. If you try to remove the module now, you won't succeed because it's in use:

# modprobe -r isofs 
FATAL: Module isofs is in use.

Whereas lsmod just lists the modules that are currently loaded, modprobe -l will list all the available modules. The output essentially shows all the modules living under /lib/modules/KERNELVERSION; be prepared for a long list!

In reality, it would be unusual to load a module manually with modprobe, but if you did you could pass parameters to the module via the modprobe command line. Here's an example:

# modprobe usbcore blinkenlights=1

No, we haven't just invented blinkenlights - it's a real parameter for the usbcore module.

The tricky bit is knowing what parameters a module accepts. You could phone a friend or even ask the audience, but a better approach is to use the modinfo command, which lists a variety of information about the module.

Here's an example for the module snd-hda-intel. We've pruned the output somewhat in the interests of brevity:

# modinfo snd-hda-intel 
filename:       /lib/modules/2.6.20-16-generic/kernel/sound/pci/hda/snd-hda-intel.ko
description:    Intel HDA driver
license:        GPL
srcversion:     A3552B2DF3A932D88FFC00C
alias:          pci:v000010DEd0000055Dsv*sd*bc*sc*i*
alias:          pci:v000010DEd0000055Csv*sd*bc*sc*i*
depends:        snd-pcm,snd-page-alloc,snd-hda-codec,snd
vermagic:       2.6.20-16-generic SMP mod_unload 586 
parm:           index:Index value for Intel HD audio interface. (int)
parm:           id:ID string for Intel HD audio interface. (charp)
parm:           model:Use the given board model. (charp)
parm:           position_fix:Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size). (int)
parm:           probe_mask:Bitmask to probe codecs (default = -1). (int)
parm:           single_cmd:Use single command to communicate with codecs (for debugging only). (bool)
parm:           enable_msi:Enable Message Signaled Interrupt (MSI) (int)
parm:           enable:bool

The lines of interest to us here are those starting with parm: - these show the parameters accepted by that module. These descriptions are terse, to say the least. To go hunting for further documentation, install the kernel source code. Then you'll find a directory called something like /usr/src/KERNELVERSION/Documentation.

There's some interesting stuff under here; for example, the file /usr/src/KERNELVERSION/Documentation/sound/alsa/ALSA-Configuration.txt describes the parameters recognised by many of the ALSA sound modules. The file /usr/src/KERNELVERSION/Documentation/kernel-parameters.txt is also helpful.

An example of needing to pass parameters to a module came up quite recently on one of the Ubuntu forums (see https://help.ubuntu.com/community/HdaIntelSoundHowto). Essentially the point was that the snd-hda-intel module needed a little help in driving the sound hardware correctly and would sometimes hang when it loaded at boot time. Part of the fix was to supply the option probe_mask=1 to the module. So, if you were loading the module manually, you'd type:

# modprobe snd-hda-intel probe_mask=1

More likely, you'd place a line in the file /etc/modprobe.conf like this:

options snd-hda-intel probe_mask=1

This tells modprobe to include the probe_mask=1 option every time it loads the snd-hda-intel module. Some recent Linux distrubutions split this information up into multiple files under /etc/modprobe.d rather than putting it all in modprobe.conf.

The /proc filesystem

The Linux kernel also exposes a great deal of information via the /proc filesystem. To make sense of /proc we need to broaden our concept of what a file is.

Instead of thinking of a file as permanent information stored on a hard drive or a CD or a memory stick, we need to think of it as any information that can be accessed via traditional system calls such as the open/read/write/close calls we saw earlier, and which can, therefore, be accessed by ordinary programs such as cat or less.

The 'files' under /proc are entirely a figment of the kernel's imagination and provide a view into many of the kernel's internal data structures.

In fact, many Linux reporting tools present nicely formatted versions of the information they find in the files under /proc. As an example, a listing of /proc/modules will show you a list of currently loaded modules that's strangely reminiscent of the output from lsmod.

In a similar vein, the contents of /proc/meminfo provides more detail about the current status of the virtual memory system than you could shake a stick at, whereas tools such as vmstat and top provide some of this information in a (marginally) more accessible format. As another example, /proc/net/arp shows the current contents of the system's ARP cache; from the command line, arp -a shows the same information.

Of particular interest are the 'files' under /proc/sys. As an example, the setting under /proc/sys/net/ipv4/ip_forward says whether the kernel will forward IP datagrams - that is, whether it will function as a gateway. Right now, the kernel is telling us that this is turned off:

# cat /proc/sys/net/ipv4/ip_forward 
0

It gets much more interesting when you discover that you can write to these files, too. Continuing our example:

# echo 1 > /proc/sys/net/ipv4/ip_forward

...will turn on IP forwarding in the running kernel.

Instead of using cat and echo to examine and modify the settings under /proc/sys, you can also use the sysctl command:

# sysctl net.ipv4.ip_forward 
net.ipv4.ip_forward = 0

Which is equivalent to:

# cat /proc/sys/net/ipv4/ip_forward 
0

And:

# sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1

...is the same as

# echo 1 > /proc/sys/net/ipv4/ip_forward

Notice that the pathnames you supply to sysctl use a full stop (.) to separate the components instead of the usual forward slash (/), and that the paths are all relative to /proc/sys.

Be aware that settings you change in this way only affect the current running kernel - they will not survive a reboot. To make settings permanent, put them into the file /etc/sysctl.conf. At boot time, sysctl will automatically re-establish any settings it finds in this file.

A line in /etc/sysctl.conf might look like this:

net.ipv4.ip_forward=1

Performance tuning

The writeable parameters under /proc/sys have spawned a whole sub-culture of Linux performance tuning. Personally, I think this is overrated, but here are a few examples should you wish to try it.

The installation instructions for Oracle 10g ask you to set a number of parameters, including:

kernel.shmmax=2147483648

...which sets the maximum shared memory segment size to 2GB. (Shared memory is an inter-process communication mechanism that enables a memory segment to be visible within the address space of multiple processes.)

The IBM 'Redpaper' on Linux performance and tuning guidelines (www.redbooks.ibm.com/abstracts/redp4285.html) makes many suggestions for adjusting parameters under /proc/sys, including this:

vm.swappiness=100

This parameter controls how aggressively memory pages are swapped to disk.

Some parameters may be adjusted to improve security.

net.ipv4.icmp_echo_ignore_broadcasts=1

...which tells the kernel not to respond to broadcast ICMP ping requests, making your network less vulnerable to a type of denial-of-service attack known as a Smurf attack.

Here's another example:

net.ipv4.conf.all.rp_filter=1

That tells the kernel to enforce sanity checking, also called ingress filtering or egress filtering. The point is to drop a packet if the source and destination IP addresses in the IP header don't make sense when considered in light of the physical interface on which it arrived.

So, is there any documentation on all these parameters? Well, the command

# sysctl -a

will show you all their names and current values. It's a long list, but it gives you no clue what any of them actually do. So what else is there? As it turns out, O'Reilly has published a book, written by Olivier Daudel and called /proc et /sys. Oui, mes amis, it's in French, and we're not aware of an English translation.

Another useful reference is the Red Hat Enterprise Linux Reference Guide, which devotes an entire chapter to the subject. The definitive book about the Linux kernel is Understanding the Linux Kernel by Bovet and Cesati (O'Reilly), but be aware that this is mainly about kernel internals and is probably more of interest to wannabe kernel developers and computer science students rather than system administrators.

It's also possible to configure and build your own kernel. For this, you might try Greg Kroah-Hartman's Linux Kernel in a Nutshell, an O'Reilly title that makes a delightful but presumably unintended play on words. But, of course, you have to be nuts to make a kernel.

Is performance tuning worth it?

My father's first car was a Wolseley 1500, registration 49 RNU, though how I come to remember such an obscure and ancient detail is beyond me. Anyway, he loved tinkering, and spent hours making minute adjustments to things like the ignition timing and mixture setting.

Occasionally he'd remove the spark plugs and adjust the gaps. While the plugs were out, he'd pour redex into the cylinders as part of some mysterious process of colonic irrigation. After this, the car would produce satisfyingly robust clouds of black smoke out the back as the redex burned off.

The trouble was that he had no objective way of measuring what improvements his efforts had made. He kept meticulous records of petrol purchases and mileages and calculated fuel consumption to several decimal places, and there was a specific hill he'd drive up in third gear to "see how it went", but it wasn't what you'd call scientific.

Many Linux system administrators find themselves in a similar position. They know there are all kinds of parameters they can tweak that might improve performance, but have little idea of what most of them do and no good way to measure performance. So, our advice is: unless you know what you're doing, and/or have a way to measure performance, leave these settings alone!