Server login considered harmful

Why limit continuous improvement to planning or programming? Inspired by the DevOps movement and a new breed of server configuration management tools, Stephan Eggermont (playing Dev) and yours truly (playing Ops) set out to see if automating the hell out of a server build would benefit us. We learnt a few things. One of them: each time we logged in to the server was a system administration smell. Something in our approach to automation had failed. Time to ask why a couple of times…

We constructed a web and mail server for an early stage startup. Stephan wrote the application for it together with Diego Lont, and they were looking for a smart way to deploy it. I had looked into automating systems administration tasks with scripting tools such as puppet and chef before; I had tried puppet a year earlier and failed to earn back the time I put in.

This time our goal was to be able to build a new server quickly, because the startup did not have the funds for more expensive disaster recovery options. We also wanted to not think separately about writing documentation: if left as a separate task, documentation is often just left.

Stephan was somewhat new to linux systems administration and wanted to learn. I have done linux administration on the side for almost ten years now, but there are many things I still don’t really understand (e-mail, firewalls), even though I administer a couple of servers that seem to be running ok.

We hoped that documenting ourselves in the form of scripts would help us understand, and that writing these scripts in a Domain Specific Language designed to describe servers (puppet in this case, but it could have easily been another tool) would force us to be precise in describing our understanding.

So why did we need to log in?

Our web server would not start, but puppet did not pass the output of the web server (lighttpd) on to us.

Why did our web server not start?

We made errors writing its configuration file.

Why did we make errors in writing the configuration file?

We used lighttpd, a web server I hadn’t used in a couple of years. Its configuration is not that hard, but you have to use exactly the right lines. Especially with secure sockets (a bit tricky on most servers) this turned out to be harder than we thought.
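
One way to see such errors without logging in is to have the automation run the server's own syntax check and hand back whatever it prints. Below is a minimal sketch in Python, not what we actually ran: the helper names `check_config` and `check_lighttpd` are made up for illustration, and the config path is assumed to be the usual Debian/Ubuntu location. It relies on lighttpd's `-t` (test the config file, then exit) and `-f` (config file) options.

```python
import subprocess
import sys


def check_config(command):
    """Run a config-check command; return (ok, combined stdout and stderr)."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so error messages are not lost
            text=True,
        )
    except FileNotFoundError as exc:
        # The binary itself is missing; report that instead of crashing.
        return False, str(exc)
    return result.returncode == 0, result.stdout


def check_lighttpd(config_path="/etc/lighttpd/lighttpd.conf"):
    """Hypothetical wrapper; the default path is an assumption."""
    return check_config(["lighttpd", "-t", "-f", config_path])
```

Calling `check_lighttpd()` before restarting the web server, and printing the second element of the result when the first is false, would put the error message in the deployment log instead of on the server.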

Why…

We don’t have automated tests yet for the configuration. Stephan and Diego’s software has automated tests on the inside, but the mail and web server stuff does not. I have a virtual machine that is, with the puppet scripts, roughly a copy of the production machine, but learning puppet and figuring out how to automate tests for the environment was too big a step.
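
A first automated test for the environment does not have to be big. The sketch below, again in Python and again an illustration rather than something we had, only checks from the outside whether the services accept TCP connections. The function names are hypothetical, and which ports to check (for example 25 for mail, 80 and 443 for web) is an assumption about the setup.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def smoke_test(host, ports):
    """Map each port to whether something is listening on it."""
    return {port: port_is_open(host, port) for port in ports}
```

Run against the virtual machine first, something like `smoke_test("testvm.example", [25, 80, 443])` (the hostname is a placeholder) tells you whether the mail and web servers came up at all, which is exactly the question we kept logging in to answer.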

There are many more whys. Sometimes we were in a hurry, so we didn’t sit together to think through the next step, do it, and reflect on it. Sometimes we still did things directly on the server, because we could not figure out how to quickly do it in puppet. The first steps in automation required a lot more patience than I expected. The object-oriented database we used does not come with an Ubuntu package, but needed to be installed with a couple of shell scripts that required manual intervention. We did not have the courage or understanding to rewrite the scripts as (automated) puppet recipes or an APT package.

It didn’t occur to us at first that logging in was a smell. We just started to notice after a while that whenever we logged in to the production server, there was something we had overlooked. We had broken something, or could not see whether something was running properly or not. Sometimes we took action to prevent having to log in the next time, sometimes we thought it was too much work for now, and accepted that our automation was not yet ‘perfect’.

If you made it this far, congratulations! You may want to hear more. If so, join us this Sunday at 15:00 in the configuration management devroom during FOSDEM 2011 in Brussels (no registration required). We’ll be talking about configuration management for developers. Learn from our mistakes, so you can make different ones ;)

Thanks to Stephan for coming up with the first title. And Marc Evers for shaping the second one.

3 Responses to “Server login considered harmful”

  1. Lindsay Holmwood Says:

    Interesting post – wish I could have been at the FOSDEM session!

    Your point on automated test of the infrastructure is interesting. I wonder if configuration management would matter as much if you had some high-level behavioural tests of your system? That way it really doesn’t matter if you configure them by hand or through a config management system, as long as the system does what you expect.

    Have you looked at Cucumber for outside-in testing? It sounds like it fits the paradigm of what you’re trying to achieve quite well.

    (Disclaimer: I work on a Cucumber based project called cucumber-nagios, for plugging Cucumber scenarios into Nagios.)

  2. Willem Says:

    Hi Lindsay,

    thank you for the feedback. I was thinking of cucumber specs when I wrote the post :) . Forgot to mention it. Helps with outside-in thinking as well.
    I was just busy writing a cucumber spec for something else :) (payment gateway integration in Drupal). With cucumber and selenium-webdriver I can test outside my application and back again :) .

    Next time I deploy someone else’s application I’ll add cucumber specs to the acceptance criteria. I still feel it matters to script the installation, as it forces deeper understanding of what it is we are deploying, and how the various bits (including permissions and users, settings I easily forget) fit together.

  3. Patrick Connolly Says:

    Woooooo Drupal!

    Sorry, I still go a bit fan-girl when I stumble across Drupal references in my perusing of the interwebs :)

    Great article, by the way. The idea that “logging into the server is a smell” is something that really hit home for me as I’m working my way through learning Chef. Honestly, it’s the sort of simple observation that reaffirms why config management is worth getting right. Thanks man.