Wednesday, April 04, 2012

Decreasing production server downtime with kexec

    When managing a production server, one of the most important things is the tradeoff between server downtime and keeping the server's software updated.

    While most updates can be applied with little to no downtime, a kernel update is always problematic, since it typically requires a full reboot and therefore significant downtime. To avoid that, many servers do not apply kernel updates as often as they should, especially cheap rented servers.

    On the other hand, there are servers whose owners like to boast of a high uptime. While that might look good, it can in fact mean quite the opposite: a high uptime means the server's software, the kernel in particular, might not have been updated!

    So I will introduce kexec, along with a benchmark showing how it can reduce downtime by shortening reboot time. But first, let's look at how a Unix-like system boots and shuts down.

    In a typical boot/shutdown cycle, these are (approximately) the steps the machine goes through:

  • Boot
    1. BIOS stage
    2. Bootloader load
    3. Kernel load
    4. INIT
      1. Kernel init
      2. Hardware initialisation
      3. Checking and mounting partitions
    5. Start services
  • Shutdown
    1. Stop services
    2. Sync discs
    3. Unmount partitions
    4. Hardware stop
    5. Hardware power off
    With kexec, some of those steps are skipped, since it switches kernels from within the running system, without going through the BIOS and the bootloader. These are (approximately) the steps for a kexec reboot:

  1. INIT
    1. Kernel init
    2. Checking and (re)mounting partitions
  2. Start services
    To verify that the reboot time decreases, I wrote a small bash script to measure downtime (testTime.sh) and tested it on my personal server, which runs Gentoo.

    To use the script, run it once Apache has been stopped:
time ./testTime.sh SERVER_WWW_URI > /dev/null 2>&1
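    A downtime-measuring script along these lines can be sketched as follows. This is not the original testTime.sh, only a minimal stand-in that assumes curl is installed; the function names probe and wait_for_up are made up for illustration. It polls the given URL until the web server answers again, so that wrapping it in time gives the approximate downtime.

```shell
#!/bin/bash
# Hypothetical stand-in for testTime.sh (not the original script).
# Blocks until the given URL answers successfully again.

# probe URL: succeeds only if the server returns a successful HTTP response
# within 2 seconds (assumes curl is available).
probe() {
    curl --silent --fail --output /dev/null --max-time 2 "$1"
}

# wait_for_up URL [INTERVAL]: poll URL every INTERVAL seconds (default 1)
# until probe succeeds, then print how many attempts failed.
wait_for_up() {
    local url=$1
    local interval=${2:-1}
    local attempts=0
    until probe "$url"; do
        attempts=$((attempts + 1))
        sleep "$interval"
    done
    echo "$attempts"
}

# Only start polling when a URL was passed on the command line.
if [ "$#" -gt 0 ]; then
    wait_for_up "$1"
fi
```

    The measured time is only as precise as the polling interval plus the probe timeout, so differences of a second or two between runs should not be read as meaningful.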
    The commands I used for this benchmark are (via SSH):
Normal Reboot: /etc/init.d/apache2 stop && echo "Now you can exec time measurement script" && reboot
kexec reboot: kexec -l KERNELIMAGE --reuse-cmdline && /etc/init.d/apache2 stop && echo "Now you can exec time measurement script" && kexec -e
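    Before running kexec -e, it can be worth confirming that kexec -l really staged a kernel. On Linux kernels built with kexec support, /sys/kernel/kexec_loaded reads 1 when a kernel is loaded. A small sketch of such a check; the function name kexec_ready is hypothetical, and the optional file argument exists only to make the function easy to try out:

```shell
#!/bin/bash
# kexec_ready [FILE]: succeed only if FILE (default: the kernel's
# kexec_loaded flag in sysfs) is readable and contains "1".
kexec_ready() {
    local flag_file=${1:-/sys/kernel/kexec_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}
```

    In the kexec command above, the final step could then be written as kexec_ready && kexec -e, so the jump into the new kernel is only attempted when one is loaded.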

    These are the results I got:
Full reboot:
real    1m21.996s
user    0m3.241s
sys     0m2.833s
kexec reboot:
real    0m31.415s
user    0m1.872s
sys     0m1.684s
    So, to sum up: although a kernel update still takes some time, the downtime is reduced significantly (in this test from about 82 seconds to about 31 seconds, roughly 62% less), so for most servers out there downtime is no longer an excuse for not keeping the system updated!
