StormByte++: Decreasing production server downtime with kexec

When managing a production server, one of the most important things to balance is the tradeoff between downtime and keeping the server's software up to date.

While most updates can be applied with little to no downtime, a kernel update is always problematic, since it typically requires a full reboot and significant downtime. To avoid that, many servers do not apply kernel updates as often as they should, especially cheap rented servers.

On the other hand, some administrators like to boast about their servers' high uptime. While that might look good, it is in fact quite the opposite: a high uptime means the server's software might not have been updated!

So I will introduce kexec, along with a benchmark showing how it can reduce downtime by shortening reboot time. But first, let's look at how a Unix-like system boots and shuts down.

In a typical boot/shutdown cycle, these are (approximately) the steps the machine goes through:

Boot

BIOS stage
Bootloader load
Kernel load
INIT

Kernel init
Hardware initialisation
Checking and mounting partitions

Start services

Shutdown

Stop services
Sync discs
Unmount partitions
Hardware stop
Hardware power off

With kexec, some of those steps are skipped, since it switches kernels from within a running system, bypassing the BIOS and bootloader stages. These are (approximately) the steps for a kexec reboot:

INIT

Kernel init
Checking and (re)mounting partitions

Start services

To show that reboot time decreases, I wrote a small bash script to measure downtime (testTime.sh) and tested it on my personal server, which runs Gentoo.

To use the provided script, run it right after Apache has been stopped:

time ./testTime.sh SERVER_WWW_URI > /dev/null 2>&1
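
The script itself is not reproduced here. As a rough idea of what such a probe can look like, here is a minimal sketch that assumes curl is installed; the function name wait_until_up and the one-second polling interval are illustrative placeholders, not taken from testTime.sh. It simply retries the URL until the web server answers, so that time reports roughly how long the site was unreachable:

```shell
#!/usr/bin/env bash
# Hypothetical downtime probe (a sketch, not the original testTime.sh).
# Polls a URL until it answers, so `time` reports how long the wait lasted.

# wait_until_up URL [INTERVAL]
# Retries until curl gets a successful HTTP response from URL.
wait_until_up() {
    local url=$1 interval=${2:-1}
    # --fail: treat HTTP errors (>= 400) as failures
    # --max-time: give up on each attempt after 2 seconds instead of hanging
    until curl --silent --fail --max-time 2 --output /dev/null "$url"; do
        sleep "$interval"
    done
}

# Only start polling when a URL was passed on the command line
if [ "$#" -ge 1 ]; then
    wait_until_up "$@"
fi
```

The measured time is only as precise as the polling interval, so a shorter interval gives a finer reading at the cost of more requests.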

The commands I used for this benchmark are (via SSH):

Normal Reboot: /etc/init.d/apache2 stop && echo "Now you can exec time measurement script" && reboot

kexec reboot: kexec -l KERNELIMAGE --reuse-cmdline && /etc/init.d/apache2 stop && echo "Now you can exec time measurement script" && kexec -e
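
The kexec sequence above can also be wrapped in a small helper that prints the commands before running them. This is only a sketch: the run and kexec_reboot names and the DRY_RUN variable are hypothetical, the optional --initrd argument applies only if your kernel boots with an initramfs, and the extra sync is a precaution added here because kexec -e jumps straight into the new kernel without going through the normal shutdown scripts.

```shell
#!/usr/bin/env bash
# Sketch only: previews (or runs) the kexec reboot sequence.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# kexec_reboot KERNELIMAGE [INITRD]
kexec_reboot() {
    local kernel=$1 initrd=${2:-}
    # Load the new kernel, reusing the command line of the running one
    if [ -n "$initrd" ]; then
        run kexec -l "$kernel" --initrd="$initrd" --reuse-cmdline || return
    else
        run kexec -l "$kernel" --reuse-cmdline || return
    fi
    run /etc/init.d/apache2 stop || return
    run sync        # flush pending writes to disk before switching kernels
    run kexec -e    # jump into the loaded kernel
}
```

With DRY_RUN left at its default, calling kexec_reboot KERNELIMAGE only prints the four commands; setting DRY_RUN=0 and running as root executes them.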

These are the results I got:

Full reboot:
real    1m21.996s
user    0m3.241s
sys     0m2.833s
kexec reboot:
real    0m31.415s
user    0m1.872s
sys     0m1.684s
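
To put the two "real" values side by side, the difference can be worked out with a few lines of shell. The to_seconds helper is a hypothetical name used only for this illustration; it converts the format printed by time into plain seconds using awk:

```shell
#!/usr/bin/env bash
# to_seconds converts a `time`-style value such as 1m21.996s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

full_time=$(to_seconds 1m21.996s)    # full reboot, "real" value from above
kexec_time=$(to_seconds 0m31.415s)   # kexec reboot, "real" value from above

# Absolute and relative difference between the two measurements
awk -v f="$full_time" -v k="$kexec_time" \
    'BEGIN { printf "saved %.3fs (%.1f%% shorter)\n", f - k, (f - k) / f * 100 }'
```

For the figures above this works out to about 50.6 seconds saved, or roughly 62% less downtime, on this one machine and this single pair of runs.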

So to sum up: although a kernel update still takes some time, the downtime is reduced significantly, so for most servers out there, reboot time is no longer an excuse for not keeping the system updated!

Wednesday, April 04, 2012
