Roman Sandals

May 23, 2008

A language for sysadmin testing

Filed under: sysadmin, test-driven sysadmin — rchanter @ 4:13 pm

This article is part of a series on Test-Driven Systems Administration.

Having decided that we’re going to test the hell out of everything, we need to settle on a language for both the tests we want to run and the data we want to collect. In trying to build a systems test tool, there is a series of design decisions that flow from having a systems admin perspective.

Basic Design Principles

  • Low friction: it should be simple to write tests.
  • Didactic: examining existing tests should yield something meaningful. They should serve as a knowledge-sharing channel.
  • Language independence: it should be easy to use your language of choice to write tests.
  • Ubiquity: run-time dependencies should not be an obstacle to deploying and using a tool.
  • Safety: it will be common for tests to run with elevated privileges.

This post is about all of these things, but principally about the testing language. At the highest level, we can divide language into vocabulary and grammar. It’s worth considering these separately when we look at systems testing.
Vocabulary

It occurs to me that most sysadmins already have a perfectly good vocabulary for systems testing: shell one-liners and scripts. There are a few … ah, let’s say … improvement opportunities there. I’ll start by confessing that I love one-liners. I would reckon that 80% of what I want to do to assess a Unix system can be done with one-liners. Preliminary health checks on most servers are done with a handful of standard commands. Putting together a 6-command pipeline usually gets me cackling with glee at my own ingenuity. So one-liners are probably going to be a big part of my testing toolkit.
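
To make that concrete, here’s the sort of one-liner I have in mind, recast as a check (the df flags and the 90% threshold are just illustrative):

df -P | awk 'NR > 1 && $5+0 > 90 { bad=1; print "WARN: " $6 " is at " $5 } END { exit bad }'

It prints a warning for any filesystem over 90% full and exits non-zero if it found one, which already hints at the exit-code vocabulary discussed below.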

As for shell scripts, there are a few classic shell scripting patterns. In general, they violate my “Low Friction” principle.

  • One-offs, ranging from a simple for-loop at the shell prompt to single-purpose throwaway scripts (say, up to a couple of dozen lines).
  • Write a 20-line shell script that basically just builds and executes one command. By the time you get to the bit that executes the command, it looks something like “$DO_CMD $CMD_OPTS $ARGS” (see the sketch after this list), so you need to trace the script logic, add debug code, or sprinkle “set -x” all over the place to figure out what it’s doing. My personal favourite is a 65-line shell script that does an rsync and a Subversion commit inside a for-loop. This violates my “Didactic” principle.
  • Write a 1000-line shell script that is effectively 100 little scripts in one, and sends you blind if you try and maintain it. Because it’s a shell script interpreted top-to-bottom, you have to put all your functions at the top and no one can find where the main loop starts. This violates the “Sanity” principle.
  • Come to your senses, abandon the shell script, and rewrite it in Perl (Python, Ruby, whatever your systems scripting language of choice is).
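
For what it’s worth, here’s a cut-down sketch of that second pattern (the variable names and the rsync are invented, but the shape should look familiar):

#!/bin/sh
# By the time the command actually runs, you can't see what it is
# without tracing the logic above it or scattering "set -x" around.
DO_CMD="rsync"
CMD_OPTS="-a --delete"
ARGS="/etc/ backuphost:/srv/configs/etc/"
[ -n "$DRY_RUN" ] && CMD_OPTS="$CMD_OPTS --dry-run"
$DO_CMD $CMD_OPTS $ARGS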

In practice, shell scripting tends to involve reusing a relatively small set of common idioms over and over. If you’re lucky, you’ll have a set of common libraries. But shell libraries have a tendency to be a bit opaque and non-portable in their own right (for example, useful as some of the things in Red Hat’s /etc/init.d/functions might be, nobody in their right mind is going to use them for portable shell scripts). If you’re less lucky, you’ll have some skeleton scripts that you can plug your specifics into. If you’re less lucky still, you do it from scratch every time, so no two scripts work quite alike (or you take a bigger productivity hit than you should just to automate something). There are a few well-known problems with shell scripts in general, but I’m not necessarily going to attack them head-on just yet.

  • Portability is awful. You have GNU and POSIX variants of utilities, varying directory locations, and no guarantees about the output format (see, for example, what Red Hat aliases “ls” to by default). GNU systems tend to let you get away with Bash-isms even when called as /bin/sh (see the snippet after this list). Linux doesn’t have a “real” Korn shell. Solaris paths can be ugly. The list goes on.
  • Shell quoting rules can get ugly.
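
A tiny example of the Bash-ism trap (the test itself is arbitrary):

#!/bin/sh
# Runs happily under bash even when invoked as /bin/sh, but a strict POSIX
# shell (dash, or the old Solaris /bin/sh) chokes on the [[ ... ]] syntax.
if [[ -f /etc/resolv.conf ]]; then
    echo "resolv.conf is present"
fi
# The portable spelling is: [ -f /etc/resolv.conf ]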

There is another useful idiom for shell scripts, and that’s the “foo.d/” directory that gets executed by a “run-parts” or similar calling mechanism. That’s good, and goes a long way to solving the 1000-line script problem, but doesn’t solve the problem of writing the same 20-line script with minor variations over and over.
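
In miniature, that idiom is just this (the directory name is hypothetical; run-parts or a hand-rolled loop both do the job):

#!/bin/sh
# Run every executable check dropped into a directory, reporting any failures.
for check in /etc/healthcheck.d/*; do
    [ -x "$check" ] || continue
    "$check" || echo "FAILED: $check"
done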

The general idea of putting a bunch of scripts in a directory and pointing a test-runner at it is goodness. Which leads me to another design decision: use the file system as a database. This satisfies the “Ubiquity” principle, and besides, I’m generally in the Databases-Are-Evil camp. That’s not to say that storing tests in a database couldn’t come later, but it’s certainly not necessary.

The second aspect of this vocabulary is how to interpret the results of running some command. The most obvious is exit codes. This is not without its problems (quick, what exit code does /usr/bin/host return for NXDOMAIN replies on your system?), but in a controlled environment it’s a good place to start. It’s also as applicable to more serious systems glue languages as it is to shell. There’s also the presence or absence of output, the contents of the output, whether anything gets written to STDERR, or the reply codes for application protocols. We should be able to deal with all of these.
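
As a rough sketch of what interpreting those signals looks like for a single command (this is the same resolv.conf check used in the example below, leaning on grep’s exit codes: 0 for a match, 1 for no match, 2 for an error):

#!/bin/sh
# Judge one check three ways: exit status, stdout contents, and stderr.
err_file=/tmp/check_err.$$
output=$(grep 'my\.dns\.server' /etc/resolv.conf 2>"$err_file")
case $? in
    0) echo "OK: found the expected nameserver: $output" ;;
    1) echo "FAIL: my.dns.server missing from resolv.conf" ;;
    *) echo "FAIL: grep itself went wrong: $(cat "$err_file")" ;;
esac
rm -f "$err_file"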

So to summarize the “vocabulary”, we have a pretty standard set of building blocks: One-liners, shell/perl/python/whatever scripts, exit codes, command output, and protocol reply codes.
Grammar

By “grammar” in a systems testing language, I really mean the file formats and system APIs we expect to deal with.

If we’re going to move our most common logic idioms back from the test cases into the harness, we need to settle on some sort of format to describe the specifics of a test. An example might be:

command: /bin/grep 'my\.dns\.server' /etc/resolv.conf
return-codes:
  0: OK I have the right nameserver
  1: FAIL my.dns.server missing from resolv.conf
  2: FAIL something went wrong with grep

This example encapsulates pretty much everything we want to check, with no program logic or environment variables getting in the way.

The most likely candidate for the test description format is a data-serialisation format, like YAML, JSON, Perl’s Storable or Data::Dumper, or, God Forbid, XML. Remember, since we’re abstracting all the logic out into the harness, all we need in the test description is the thing to be tested (usually a command to run) and a little bit of metadata, such as how to interpret the results. For concise representation of not-too-deeply-nested data structures, YAML seems like the best fit to me as a starting point:

  • It’s better for human consumption than XML, so it’s a good choice for human-editable inputs.
  • It’s not executable (unlike JSON or the native Perl serialisation formats), which is in line with the “Safety” principle.
  • It’s relatively ubiquitous; there are good-quality YAML libraries for all the popular systems glue languages.

The other thing a test harness needs to do is produce output you can use. Pass/fail counters on STDOUT are fine, but not so useful for examining the test output in more detail, for audit trails, for capturing trends, or for tarting up into web pages. So I want something that can produce different styles of output for the different presentation contexts. The same set of data serialisation formats I mentioned above would be a good start, along with syslog and pretty(-ish) STDOUT output. There are also specific testing protocols like TAP, TET, or others which would be useful to implement.
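
For instance, TAP output is pleasantly simple: a plan line, then one “ok” or “not ok” line per test. A hand-rolled sketch (the two checks are made up):

#!/bin/sh
# Emit TAP: a "1..N" plan, then one "ok"/"not ok" line per test,
# which any TAP consumer (prove and friends) can aggregate.
echo "1..2"
grep -q 'my\.dns\.server' /etc/resolv.conf \
    && echo "ok 1 - resolv.conf points at the right nameserver" \
    || echo "not ok 1 - resolv.conf points at the right nameserver"
pgrep ntpd > /dev/null 2>&1 \
    && echo "ok 2 - ntpd is running" \
    || echo "not ok 2 - ntpd is running"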

Next post: test formats.


St. George and the IT dragon

Filed under: business, musing — Craig Lawton @ 3:08 pm

Over the last few months I’ve read a couple of articles in the AFR relating to IT spend in M&A activity.

It’s amazing to consider that about half of the business integration costs for the proposed merger between Westpac and St. George will be in IT (0.5 * $451,000,000).

Consider that the Commonwealth Bank is planning on spending $580,000,000 to re-engineer its aging platforms (to me this means cleaning out all the legacy crap), and NAB is looking at doing the same.

A merged Westpac/St. George would be $225,500,000 behind the eight-ball, before it could even contemplate a project of this scale.

Also, to make the merger more attractive, or because of uncertainty, either side could be tempted to put off required upgrades, lay off staff (possibly key staff), and run down maintenance.

Accenture recently concluded a survey of 150 CIOs and found that poor IT integration was the leading cause of failure to meet the stated objectives of a merger or acquisition (38%).

It makes you wonder if this whole “IT thing” is going to collapse under the weight, and expense, of its own complexity!

Frustrating in-house systems

Filed under: musing, technology — Craig Lawton @ 1:25 pm

I’m constantly amazed at the crappy performance of in-house applications at the places I’ve worked. Customer-facing applications must perform, or business is lost. In-house applications, it seems, are never tuned for performance, and this makes work that much harder.

This difficulty comes down to the kind of memory you are using for your current task. Very short-term memory is great, and necessary, when you are flying through a well-understood task. But short system interruptions (usually involving the hourglass) force you to fall back on longer-term memory, making the effort that much larger, and less enjoyable.

There are other types of interruptions of course, which have a similar effect, such as people-interruptions (“What are you doing this weekend?”) and self-inflicted-interruptions (such as twitter alerts).

If your system hangs for long enough you may start a new task altogether (so as not to look stoned at your desk) and therefore lose track completely of where you were at.

This forces unnecessary re-work and brain exhaustion!

I see lots of people with “notepad” or “vi” open constantly so they can continually record their work states. This is a good idea but takes practice and is an overhead.

It comes down to this. I want a system which can keep up with me! :-)

And is that unreasonable, with gazillions of hertz and giga-mega-bits of bandwidth available?

May 1, 2008

Going with the cloud

Filed under: management, musing, technology — Craig Lawton @ 5:01 pm

There’s a really interesting article on the Reg’ which should put data centre fretters’ feet firmly back on the ground. It seems the “thought leaders” don’t see data centres disappearing anytime soon because:

  • Security – “… there are data that belongs in the public cloud and data that needs to go behind a firewall. … data that will never be put out there. Period. Not going to happen. Because no matter how you encrypt it, no matter how you secure it, there will be concerns.”
  • Interoperability – “…figure out ways for systems that are … behind the firewall … to interoperate with systems that are in the public cloud”
  • Application licensing complexity.
  • Wrangling code to work over the grid – getting any code written that exploits parallel infrastructure seems to be very difficult.
  • Compliance – “What happens when government auditors come knocking to check the regulatory compliance of an application living in the cloud?”

Also, they didn’t cover jurisdictional issues, such as who you take to court, and in what country, when there is an issue with data misuse “in the cloud”.

It makes you wonder why cloud computing will be any different from grid computing, or thin desktop clients: a great idea, but without enough momentum to overcome ingrained corporate behaviour.
