
Operations magic cure: nightly server restarts

November 24, 2009

I hate to admit it, but it’s a well-known fact that some people arrive at high availability by frequently rebooting their servers. As a developer I always abhorred this idea. Good software should be able to stay up for a long time.

At some point early in my two-year tenure as CTO at Angel.com I could no longer fight the obvious: trying to keep the system up for long periods of time simply made us less reliable.

It was at lunch with a friend, the CIO of a local SaaS company, that he shared his dirty little secret: “We restart our servers every night. That’s why we get a lot fewer alerts than you seem to be getting”.

If you think about it, though, this practice is harder than it seems. You need:

* your restarts to be mostly transparent to your users. This probably implies stateless services and horizontal partitioning.
* an automated restart procedure. This probably implies a certain degree of script-based automation.
* a person in charge of the restarts. This implies a staffed 24/7 rotation.
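
To make the list above concrete, here is a minimal sketch of a rolling restart: servers go down one at a time so the rest keep serving users, and the first failure stops the run and pages a human instead of retrying. Everything in it is an assumption for illustration: `rolling_restart`, `RestartReport`, and the `restart`, `is_healthy` and `page_on_call` callables are hypothetical names, not the scripts we actually ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RestartReport:
    """Outcome of one nightly run (hypothetical structure)."""
    restarted: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)


def rolling_restart(
    servers: Sequence[str],
    restart: Callable[[str], bool],
    is_healthy: Callable[[str], bool],
    page_on_call: Callable[[str], None],
) -> RestartReport:
    """Restart servers one at a time; stop and page a human on the first failure."""
    report = RestartReport()
    for index, server in enumerate(servers):
        # Only one server is down at any moment, so the others keep serving.
        if restart(server) and is_healthy(server):
            report.restarted.append(server)
            continue
        # A failed restart needs a person: do not retry, do not touch the rest.
        report.failed.append(server)
        report.skipped.extend(servers[index + 1:])
        page_on_call(server)
        break
    return report


# Toy stand-ins for real restart, health-check and paging hooks (all assumed).
pages: List[str] = []
report = rolling_restart(
    servers=["app1", "app2", "app3"],
    restart=lambda server: True,
    is_healthy=lambda server: server != "app2",
    page_on_call=pages.append,
)
print(report)
print(pages)
```

Stopping at the first failure is a deliberate choice in this sketch: if a restart breaks one box, restarting the next one with the same script is more likely to spread the problem than to fix it.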

So all in all, for my money, this is not a bad line of attack if your goal is to improve uptime: you get procedural improvements along the way, and peacefully sleeping admins as a bonus.


Categories: Operations
  1. D.J. Capelis
    November 25, 2009 at 1:24 am | #1

    Seriously? Staffed restarts?

    Your servers can’t restart themselves? That is definitely a bug.

    • Sam
      November 25, 2009 at 1:40 am | #2

      Hey DJ, thanks for the comment. I guess I meant that a failed restart, for whatever reason, needs an emergency response, as forcing another restart will clearly not solve the issue.

      So it’s not so much that somebody must perform the restart, as somebody must be ready to respond to a restart exception.

  2. November 25, 2009 at 4:58 am | #3

    Let me guess: you are still running Windows, are you not?

    • Sam
      November 25, 2009 at 8:34 am | #4

      When I learnt this, the system we were managing had several hundred servers, and there was a mix of Windows and Linux. In my experience, you may need the restart because of the OS, but it could also come because of leaky server software, leaky DB drivers, etc.

    • GreenDowntime
      November 25, 2009 at 5:01 pm | #5

      Better find out in a controlled manner that the last udev update broke something, than to find it out when you _have to_ restart the server fast. So yes: even restart a Linux server now and then — but agree: might be bad for your uptime-penis script.

      .02

  3. pointernil
    November 25, 2009 at 5:24 am | #6

    One of the major failures of the IT industry, I’d say.

    It’s a downward spiral: you simply restart “to cure” a system, that way the bugs/issues are not problematic, that’s why they are not handled, that’s why there is no learning, that’s why there are more of them over time.

    Operations ppl don’t have to deal with the system too much, as all they need to know is how to restart it, or code those scripts to automate the restarts.

    Java and .Net stacks both provide even automatic scheduled “recycling” (isn’t that a nice word for an app *restart from scratch*) features.

    But hey! Overall, that little more silicon you need plus restarting, is cheaper than to pay those brains to do it right. Right?

    • Sam
      November 25, 2009 at 8:39 am | #7

      I used to think along your lines for a looong time. I simply refused to do it on the basis that restarts == we are bad developers. It took bad uptime numbers to make me realize, though, that I was better off being less dogmatic: focusing my energy on the root causes and living with the restarts in the meantime.

      For me restarts are like pain medication, they make you feel better but they don’t make you healthier. Just because they treat the symptom doesn’t mean they’re not useful.

  4. KCP
    November 25, 2009 at 10:44 am | #8

    I remember back in my days as a CNE, other CNEs were taking snapshots of their servers’ uptime and wearing them as badges of honor. My servers never had an uptime greater than a week, because I bounced them weekly. I never had the same problems that the other engineers complained about either.

    Today, running a company’s IT efforts with a mix of Windows and Linux servers, I still mandate to my system admins that they try to bounce these boxes at least monthly. I can’t think of a good argument NOT to do it. My uptime is very high… I am convinced that regular restarts are a contributing factor in it.

  5. November 25, 2009 at 8:32 pm | #9

    I think it’s one of the things where the journey is the destination. Once you do the infrastructure work you outlined above, so that restarts are transparent to users, you can keep the whole system up for orders of magnitude longer.

    We restart our servers only a few times a year and haven’t had a minute of system downtime this year, but it all depends on what components you’re using. Our app servers go up and down all the time (mostly intentional ;) , but everything’s stateless and load balanced, so users don’t know the difference.

  6. mike may
    December 1, 2009 at 6:50 am | #10

    As application service providers, we have less control over OS-related software bugs that cause systems to “erode”, leading to eventual failure. I used to depend on large, very expensive SMP servers to run apps, and lived through a massive crash that was started by a routine that measures fan speed, which caused the disk subsystem to go down, resulting in a CPU panic after the cache was allowed to go stale. I called this the space shuttle syndrome: very expensive systems that can be brought down by a chunk of ice falling on a piece of foam, creating a 1 inch crack that brings the whole thing down, with deadly results.

    Now all systems are designed to be as stateless as possible, partitioned to run on small servers, so that we can rotate our bounces when the warning indicators tell us that conservative thresholds have been met. Failure never arrives without some warning, but our admins need to know what to look for.
