Roman Sandals

July 29, 2008

Super-user excuses

Filed under: musing, sysadmin, technology — Craig Lawton @ 1:12 pm

System administrators always get super-user access. Third parties, increasingly located wherever, are often granted super-user access as well, usually to smooth project implementations. Super-user access is thrown around willy-nilly because it’s a hell of a lot easier than documenting privileges, which is really, really boring work.

This leads to poor outcomes: downtime, systems in “undocumentable” states, security holes etc.

The horrible truth is that somebody somewhere must be able to gain super-user access when required. It can’t be avoided.

The other horrible truth is that when you allow super-user access only because properly defining a particular role is hard, you are, in effect, giving up control of your environment. This is amplified when more than one team shares super-user access. It only takes one cowboy, or an innocent slip-up, to undermine confidence in an environment.

In this increasingly abstracted IT world, where architecture mandates shared, re-usable applications, where global resourcing mandates virtual, remotely-located teams, where IT use and server numbers increase exponentially, and where businesses increasingly interact through gateways, security increasingly looks like a feature tacked on at the last minute.

Security costs a lot and adds nothing to the bottom line – though lack of it can and will lead to some big bottom line subtractions.

The mainframe guys had this licked ages ago. The super-user excuse is looking rather thin.

The Age of Authorization is upon us…

Update: An amazing story from San Francisco, which outlines how a lack of IT knowledge at the top of an organisation, and too much power devolved to too few IT staff, can cause much grief.

May 23, 2008

A language for sysadmin testing

Filed under: sysadmin, test-driven sysadmin — rchanter @ 4:13 pm

This article is part of a series on Test-Driven Systems Administration.

Having decided that we’re going to test the hell out of everything, we need to settle on a language for both the tests we want to run, and the data we want to collect. In trying to build a systems test tool, there is a series of design decisions that flows from having a systems admin perspective.

Basic Design Principles

  • Low friction: it should be simple to write tests.
  • Didactic: Examining existing tests should yield something meaningful. They should serve as a knowledge-sharing channel.
  • Language independence: it should be easy to use your language of choice to write tests.
  • Ubiquity: run-time dependencies should not be an obstacle to deploying and using a tool.
  • Safety: it will be common for tests to run with elevated privileges.

This post is about all of these things, but principally about the testing language. At the highest level, we can divide language into vocabulary and grammar. It’s worth considering these separately when we look at systems testing.

It occurs to me that most sysadmins already have a perfectly good vocabulary for systems testing: shell one-liners and scripts. There are a few … ah, let’s say … improvement opportunities there. I’ll start by confessing that I love one-liners. I would reckon that 80% of what I want to do to assess a Unix system can be done with one-liners. Preliminary health checks on most servers are done with a handful of standard commands. Putting together a 6-command pipeline usually gets me cackling with glee at my own ingenuity. So one-liners are probably going to be a big part of my testing toolkit.
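To give a flavour of what I mean, here are the sort of health-check one-liners I reach for first (illustrative only; exact flags and output columns vary between GNU and POSIX variants of these utilities):

```shell
#!/bin/sh
# A few typical health-check one-liners (illustrative; flags and output
# formats vary across platforms).

# Filesystems over 90% full; df -P keeps the columns stable
df -P | awk 'NR > 1 && $5 + 0 > 90 { print $6, $5 }'

# Five biggest processes by resident memory
ps -eo pid,rss,comm | sort -k2 -rn | head -5

# How long the box has been up, and the load averages
uptime
```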

As for shell scripts, there are a few classic shell scripting patterns. In general, they violate my “Low Friction” principle.

  • One-offs, ranging from a simple for-loop at the shell prompt to single-purpose throwaway scripts (say, up to a couple of dozen lines).
  • Write a 20-line shell script that basically just builds and executes one command. By the time you get to the bit that executes the command, it looks something like “$DO_CMD $CMD_OPTS $ARGS”, so you need to trace the script logic, include debug code, or insert “set -x” all over the place to figure out what it’s doing. My personal favourite is a 65-line shell script that does an rsync in a for-loop and subversion commit. This violates my “Didactic” principle.
  • Write a 1000-line shell script that is effectively 100 little scripts in one, and sends you blind if you try and maintain it. Because it’s a shell script interpreted top-to-bottom, you have to put all your functions at the top and no one can find where the main loop starts. This violates the “Sanity” principle.
  • Come to your senses, abandon the shell script, and rewrite it in Perl (Python, Ruby, whatever your systems scripting language of choice is).
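The opaque-wrapper pattern from the second bullet, in miniature (a contrived sketch; a real script would spend its 20 lines assembling these variables from options and environment):

```shell
#!/bin/sh
# Contrived sketch of the opaque-wrapper anti-pattern. Reading only the
# final line tells you nothing about what actually runs; you have to
# trace the assignments above it (or sprinkle "set -x") to find out.
DO_CMD="ls"
CMD_OPTS="-ld"
ARGS="/etc /tmp"

$DO_CMD $CMD_OPTS $ARGS
```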

In practice, shell scripting tends to involve reusing a relatively small set of common idioms over and over. If you’re lucky, you’ll have a set of common libraries. But shell libraries have a tendency to be a bit opaque and non-portable in their own right (for example, useful as some of the things in Red Hat’s /etc/init.d/functions might be, nobody in their right mind is going to use them for portable shell scripts). If you’re less lucky, you’ll have some skeleton scripts that you can plug your specifics into. If you’re less lucky still, you do it from scratch every time, so no two scripts work quite alike (or automating something costs you more of a productivity hit than it should). There are a few well-known problems with shell scripts in general, but I’m not necessarily going to attack them head-on just yet.

  • Portability is awful. You have GNU and POSIX variants of utilities, varying directory locations, and no guarantees about the output format (see, for example, what Red Hat aliases “ls” to by default). GNU systems tend to let you get away with Bash-isms even when called as /bin/sh. Linux doesn’t have a “real” Korn shell. Solaris paths can be ugly. The list goes on.
  • Shell quoting rules can get ugly.

There is another useful idiom for shell scripts, and that’s the “foo.d/” directory that gets executed by a “run-parts” or similar calling mechanism. That’s good, and goes a long way to solving the 1000-line script problem, but doesn’t solve the problem of writing the same 20-line script with minor variations over and over.
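A minimal caller in that style is only a few lines (a sketch of the idiom; the tests.d directory name and the pass/fail accounting are my own illustration, not any particular run-parts implementation):

```shell
#!/bin/sh
# Minimal run-parts style test runner: execute every executable file in
# a directory, counting passes and failures by exit code.
TESTDIR="${1:-./tests.d}"
pass=0 fail=0
for t in "$TESTDIR"/*; do
    [ -x "$t" ] || continue     # skip non-executables (and an empty glob)
    if "$t"; then
        pass=$((pass + 1))
    else
        fail=$((fail + 1))
        echo "FAIL: $t" >&2
    fi
done
echo "passed=$pass failed=$fail"
[ "$fail" -eq 0 ]
```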

The general idea of putting a bunch of scripts in a directory and pointing a test-runner at it is goodness. Which leads me to another design decision: use the file system as a database. This satisfies the “Ubiquity” principle, and besides, I’m generally in the Databases-Are-Evil camp. That’s not to say that storing tests in a database couldn’t come later, but it’s certainly not necessary.

The second aspect of this vocabulary is how to interpret the results of running some command. The most obvious is exit codes. This is not without its problems (quick, what exit code does /usr/bin/host return for NXDOMAIN replies on your system?), but in a controlled environment it’s a good place to start. It’s also as applicable to more serious systems glue languages as it is to shell. There’s also the presence or absence of output, the contents of the output, whether anything gets written to STDERR, or the reply codes for application protocols. We should be able to deal with all of these.
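As a quick worked example of the exit-code approach, grep documents 0 for a match, 1 for no match, and greater than 1 for trouble, which maps neatly onto test results:

```shell
#!/bin/sh
# Map grep's documented exit codes (0 match, 1 no match, >1 error)
# onto test-style OK / FAIL / ERROR results.
grep -q '^root:' /etc/passwd
case $? in
    0) result="OK root account present" ;;
    1) result="FAIL root account missing" ;;
    *) result="ERROR grep itself blew up" ;;
esac
echo "$result"
```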

So to summarize the “vocabulary”, we have a pretty standard set of building blocks: One-liners, shell/perl/python/whatever scripts, exit codes, command output, and protocol reply codes.

By “grammar” in a systems testing language, I really mean the file formats and system APIs we expect to deal with.

If we’re going to move our most common logic idioms back from the test cases into the harness, we need to settle on some sort of format to describe the specifics of a test. An example might be:

command: /bin/grep 'my\.dns\.server' /etc/resolv.conf
  0: OK I have the right nameserver
  1: FAIL my.dns.server missing from resolv.conf
  2: FAIL something went wrong with grep

This example encapsulates pretty much everything we want to check, with no program logic or environment variables getting in the way.

The most likely candidate for the test description format is a data-serialisation format, like YAML, JSON, Perl’s Storable or Data::Dumper, or, God Forbid, XML. Remember, since we’re abstracting all the logic out into the harness, all we need in the test description is the thing to be tested (usually a command to run) and a little bit of metadata, such as how to interpret the results. For concise representation of not-too-deeply-nested data structures, YAML seems like the best fit to me as a starting point:

  • It’s better for human consumption than XML, so it’s a good choice for human-editable inputs.
  • It’s not executable (unlike JSON or the native Perl serialisation formats), which is in line with the “Safety” principle.
  • It’s relatively ubiquitous; there are good-quality YAML libraries for all the popular systems glue languages.
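Sticking with the resolv.conf check from earlier, a YAML rendering of that test description might look something like this (the key names here are purely illustrative, not a settled schema):

```yaml
# Hypothetical test description; key names are illustrative only.
name: resolver points at my.dns.server
command: /bin/grep 'my\.dns\.server' /etc/resolv.conf
exit_codes:
  0: "OK I have the right nameserver"
  1: "FAIL my.dns.server missing from resolv.conf"
  2: "FAIL something went wrong with grep"
```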

The other thing a test harness needs to do is produce output you can use. Pass/fail counters on STDOUT are fine, but not so useful for examining the test output in more detail, for audit trails, for capturing trends, or for tarting up into web pages. So I want something that can produce different styles of output for the different presentation contexts. The same set of data serialisation formats I mentioned above would be a good start, along with syslog and pretty(-ish) STDOUT output. There are also specific testing protocols like TAP, TET, or others which would be useful to implement.
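To make the TAP option concrete, minimal TAP output is simple enough to emit even from plain shell (a sketch of the wire format only, not a harness):

```shell
#!/bin/sh
# Emit minimal TAP (Test Anything Protocol) output: a plan line ("1..N")
# followed by one "ok"/"not ok" line per test.
echo "1..2"
if grep -q '^root:' /etc/passwd; then
    echo "ok 1 - root account present"
else
    echo "not ok 1 - root account present"
fi
if [ -e /etc/resolv.conf ]; then
    echo "ok 2 - resolv.conf exists"
else
    echo "not ok 2 - resolv.conf exists"
fi
```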

Next post: test formats.

February 14, 2008

A basic systems testing toolkit

Filed under: sysadmin, test-driven sysadmin — rchanter @ 9:41 am

Looking around for examples of test-driven sysadmin, all I can find is people recommending, rather sensibly, that you test systems changes before deploying them to production. I’m interested in both a broader and more narrow view of a systems testing toolkit.

Broader in the sense that I want to test many more things than just planned change. If we’re talking about a test-driven approach (and, ultimately, a behaviour-driven approach), then we should apply a testing mindset to all the activities of systems administration: incident management, problem management, and change management, as well as more specific repeated activities like simple health-checks, verifying the correctness of data changes (as opposed to configuration changes), and so on.

Narrower in the sense that I want a simple, flexible toolkit that lets me express tests with a common language and collect results in a common format.

In all of this, simplicity and ubiquity are key considerations. They influence choice of language (and even of coding style), choice of file formats, and infrastructure design (hint: there isn’t any). I’ll go into more detail in my next post.

Anyway, developers have long had unit testing toolkits available to them: JUnit, Test::Harness, RSpec, the list goes on. While none of them are a great fit for systems testing, there is plenty of inspiration we can take from looking at them.

January 15, 2008

Thinking About Test-Driven Systems Administration

Filed under: sysadmin, test-driven sysadmin — Tags: , — rchanter @ 1:59 pm

So I’ve been thinking about, and working on, this for a little while.

It was prompted mainly by the title of a paper Geoff Halprin gave at last year’s SAGE-AU conference. Not having had the time to attend the conference, I have no idea whether my approach bears any resemblance at all to Geoff’s.

Broadly speaking, systems administration consists of two main tasks: managing planned change to systems, and managing unplanned incidents on systems. Everything else we do is just arranging affairs so that change is simpler and more deterministic, and incidents are shorter and less frequent.

How, then, can a test-driven approach help with this? It seems to me that we need two things:

  • A test-first workflow that makes sense for systems management
  • A language and toolkit for expressing tests and collecting the output.

This seems like a pretty simple task. I’ll explore it a bit in the next few weeks.

Neat systems management hack using clamav

Filed under: sysadmin — Tags: , — rchanter @ 1:56 pm

From the clamav-users list, last year (yes, I’m behind on my reading):

Clamav was used in Debian to discover statically linked
copies of zlib that needed a security update.
