12 Responses to “Operations magic cure: nightly server restarts”

Comments


  1. D.J. Capelis

    Seriously? Staffed restarts?

    Your servers can’t restart themselves? That is definitely a bug.

    • Sam

      Hey DJ, thanks for the comment. I guess I meant that a failed restart, for whatever reason, needs an emergency response, as forcing another restart will clearly not solve the issue.

      So it’s not so much that somebody must perform the restart as that somebody must be ready to respond when a restart fails.

  2. Let me guess: you are still running Windows, are you not?

    • Sam

      When I learnt this, the system we were managing had several hundred servers, with a mix of Windows and Linux. In my experience, you may need the restart because of the OS, but the need can also come from leaky server software, leaky DB drivers, etc.

    • GreenDowntime

      Better to find out in a controlled manner that the last udev update broke something than to find out when you _have to_ restart the server fast. So yes: restart even a Linux server now and then. But I agree: it might be bad for your uptime-penis script.

      .02

  3. pointernil

    One of the major failures of the IT industry, I’d say.

    It’s a downward spiral: you simply restart “to cure” a system, so the bugs/issues are not problematic, which is why they are not handled, which is why there is no learning, which is why there are more of them over time.

    Operations people don’t have to deal with the system too much, as all they need to know is how to restart it, or how to code the scripts that automate the restarts.

    The Java and .NET stacks both even provide automatic scheduled “recycling” (isn’t that a nice word for an app *restart from scratch*?) features.

    But hey! Overall, that little bit of extra silicon you need, plus restarting, is cheaper than paying those brains to do it right. Right?

    • Sam

      I used to think along your lines for a looong time. I simply refused to do it on the basis that restarts == we are bad developers. It took bad uptime numbers to make me realize, though, that I was better off being less dogmatic, focusing my energy on the root causes, and living with the restarts in the meantime.

      For me, restarts are like pain medication: they make you feel better, but they don’t make you healthier. Just because they only treat the symptom doesn’t mean they’re not useful.

  4. KCP

    I remember back in my days as a CNE, other CNEs were taking snapshots of their servers’ uptime and wearing them as badges of honor. My servers never had an uptime greater than a week, because I bounced them weekly. I never had the problems that the other engineers complained about, either.

    Today, running a company’s IT efforts with a mix of Windows and Linux servers, I still mandate that my system admins try to bounce these boxes at least monthly. I can’t think of a good argument NOT to do it. My uptime is very high, and I am convinced that regular restarts are a contributing factor.

  5. I think it’s one of those things where the journey is the destination. Once you do the infrastructure work you outlined above, so that restarts are transparent to users, you can keep the whole system up for orders of magnitude longer.

    We restart our servers only a few times a year and haven’t had a minute of system downtime this year, but it all depends on what components you’re using. Our app servers go up and down all the time (mostly intentionally ;)), but everything’s stateless and load balanced, so users don’t know the difference.

  6. mike may

    As application service providers, we have less control over the OS-related software bugs that cause systems to “erode”, leading to eventual failure. I used to depend on large, very expensive SMP servers to run apps, and lived through a massive crash: it was started by a routine that measures fan speed, which caused the disk subsystem to go down, resulting in a CPU panic after the cache was allowed to go stale. I called this the space shuttle syndrome: very expensive systems that can be brought down by a chunk of ice falling on a piece of foam, creating a 1-inch crack that brings the whole thing down, with deadly results.

    Now all systems are designed to be as stateless as possible, partitioned to run on small servers, so that we can rotate our bounces when the warning indicators tell us that conservative thresholds have been met. Failure never arrives without some warning, but our admins need to know what to look for.
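
The threshold-driven, rotated bounces described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual tooling: the indicator names (`memory_used_pct`, `open_handles`), the threshold values, and the `plan_bounces` helper are all hypothetical, and real warning indicators will depend on your stack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerStats:
    name: str
    memory_used_pct: float  # hypothetical warning indicator
    open_handles: int       # hypothetical warning indicator


# Hypothetical "conservative" thresholds, set well below the point of failure.
MEMORY_LIMIT_PCT = 70.0
HANDLE_LIMIT = 5000


def needs_bounce(stats: ServerStats) -> bool:
    """A server is due for a restart once any indicator reaches its threshold."""
    return (stats.memory_used_pct >= MEMORY_LIMIT_PCT
            or stats.open_handles >= HANDLE_LIMIT)


def plan_bounces(servers: List[ServerStats],
                 max_concurrent: int = 1) -> List[List[str]]:
    """Group the servers that are due into batches of at most max_concurrent,
    so that only a small part of a stateless pool is restarting at any time."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    due = [s.name for s in servers if needs_bounce(s)]
    return [due[i:i + max_concurrent]
            for i in range(0, len(due), max_concurrent)]


if __name__ == "__main__":
    fleet = [
        ServerStats("app1", 82.0, 1200),
        ServerStats("app2", 35.0, 900),
        ServerStats("app3", 40.0, 6100),
    ]
    print(plan_bounces(fleet))
```

With the sample fleet, `app1` and `app3` are over a threshold and would be restarted one after the other, while `app2` is left alone. Keeping `max_concurrent` small only makes sense when the servers are stateless and sit behind a load balancer, as the commenters describe.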
