The Evolution of Monitoring for IT
Monitoring within IT is a necessity. Let’s just get that out of the way now. IT employees and management need it in order to survive. Without it, the word “proactive” never enters into the vocabulary, and a death spiral ensues as users and customers alike beat on IT for unexpected and unplanned outages.
An old friend of mine and I worked at the same company back in the early ’90s, and we learned a lot through trial by fire while there. It was a great time to be in IT. I liken it to being on the original NCC-1701 Enterprise going “where no man has gone before.” (Ironically, we had a machine called “ncc” which we built to do all of our monitoring; for those outside our department we called it the “network control center” when in fact it was an alias for the host name “ncc1701”. But I digress.) Everything was new and waiting to be molded into shape. While our careers have taken different paths, we have both remained within the technical side of IT to some degree. His recent posting http://everythingsysadmin.com/2013/11/stop-monitoring-if-service-is-up.html was great, and it got me thinking about how to contribute to the picture he started to paint (hence, this article).
In my eyes, he described the ideal of monitoring: starting from a clean slate with a top-down design, understanding what you need to monitor (aka: get the requirements), and then setting your monitoring to look for it. Unfortunately, monitoring is rarely ever built that way, and it’s near impossible to nuke it and start over. Right or wrong, it’s commonly built organically and driven by need, and the requirements are forever evolving, changing, and being chased after.
His article has it right in that too many administrators rely on basic monitoring, the up/down of a service, and call it a day. Instead, administrators should consider that type of monitoring as the foundation, the “catch all”, and build the real monitoring out from there. It’s not an end point, but a starting point.
My career has focused on small to mid-sized companies, typically during that transitional period when a small company is “crossing the chasm” to become a bigger player in its space. The article he wrote is great for established companies where monitoring has been in place for some time. However, from what I have seen over the years, it takes time and the right attention to get there, as I attempt to outline below:
The small “mom-n-pop” company (incubation mode)
- Step 1: IT (or that one computer in the closet that runs the mail server, I think) is managed by someone doing payroll, facilities, development, or some other primary task. Everyone else manages their own machines.
- Step 2: Outages of any and all services happen on a frequent basis, but it is just faster to restart the service and get everyone, including the part-time IT person, back to work than it is to figure out what caused the problem in the first place.
- Step 3: If something happens overnight, it’ll just have to wait until the morning.
- Step 4: Who has time for monitoring anyway?
The true small company (Goodbye “Mom-n-Pop”)
- Step 1: There is a true “IT person” or “persons” responsible for maintaining the corporate equipment and the production equipment.
- Step 2: Issues are brought to the attention of that person(s) either by:
  - Stopping by their desk
    - They are rarely there, or they are on the phone.
  - Stopping them in the hallway (very common)
  - E-mailing them directly (the primary interaction)
    - So much unsorted and unprioritized email leads to a lot of requests being dropped on the floor unattended.
  - Calling them
    - Voicemail box is nearly always full.
- Step 3: Problems are handled as they come up, and they come up frequently enough to:
  - Justify the IT person’s (or persons’) existence in the company
  - Keep that person (or persons) busy every second of the day (even company-sponsored social events are an avenue to report problems)
- Step 4: Production and Development are pretty much the same at this point. Code is pushed directly to production, typically without the IT person’s knowledge, but when it all breaks, the call goes to the IT person. (Can we say “out of left field”?)
- Step 5: The IT person (or persons) is typically seen running around either with their own hair on fire in search of a fire extinguisher, or with an extinguisher in hand putting out random fires throughout the building.
- Step 6: Someone, usually outside of IT, writes a small script to send a daily “ping report” of what’s up and what’s down. Management cheers while the IT folks wonder where the benefit is.
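A report like that can be a single script fired from cron. Below is a minimal sketch in Python of what such a “ping report” might look like; the host list and mail addresses are hypothetical, and the script assumes a local SMTP relay is listening:

```python
#!/usr/bin/env python3
"""Daily "ping report": mail a summary of which hosts answer a ping."""
import smtplib
import subprocess
from email.message import EmailMessage

# Hypothetical host list; a real one would come from a config file.
HOSTS = ["mail.example.com", "www.example.com", "db.example.com"]


def is_up(host):
    """Send one ICMP echo request; any non-zero exit code counts as down."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main():
    lines = ["%s: %s" % (host, "UP" if is_up(host) else "DOWN") for host in HOSTS]
    msg = EmailMessage()
    msg["Subject"] = "Daily ping report"
    msg["From"] = "monitor@example.com"
    msg["To"] = "it@example.com"
    msg.set_content("\n".join(lines))
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)


if __name__ == "__main__":
    main()
```

It answers “what was up this morning?” and nothing more, which is exactly why management cheers and IT shrugs.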
The small, but established, company
- Step 1: There is more than one person responsible for IT. “Jack-of-all-trades” types are required in order to have some overlap (everyone needs a vacation at some point).
- Step 2: The frequent outages in the past have started to be recognized as hitting the bottom line. Management is now asking to “get a handle on things”.
- Step 3: IT folks are perceived as being part of the problem, as they are reactive as opposed to proactive. It’s the beginning of buzzword bingo, but it is still healthy at this point. The perception of a “blame game” ruffles some feathers.
- Step 4: Rudimentary monitoring is put into place with a focus on the most basic question: is the machine up? This effectively alerts the administrators a few seconds before the users, customers, or developers recognize an issue (about the amount of time it takes to craft that email asking “Is the webserver down?”, but enough time to set the perception that IT is ahead of the game). A minimal sketch of this kind of check follows this list.
- Step 5: The administrator fixes the problem as fast as possible in order to get back to watching The Walking Dead.
- Step 6: Uptime is improved, management is a little happier. That same ping report is used to show the better uptime. Life is grand.
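At this stage the “monitoring” is little more than a poll-and-alert loop. A minimal sketch in Python, with hypothetical hosts and an alert that just prints where a real version would send the email or page:

```python
#!/usr/bin/env python3
"""Poll hosts and alert the moment one changes state (up <-> down)."""
import subprocess
import time

HOSTS = ["mail.example.com", "www.example.com"]  # hypothetical


def is_up(host):
    """Single ping with a short timeout; non-zero exit means down."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0


def main():
    last_state = {host: True for host in HOSTS}
    while True:
        for host in HOSTS:
            up = is_up(host)
            if up != last_state[host]:
                # A real script would email or page here instead of printing.
                print("ALERT: %s is now %s" % (host, "UP" if up else "DOWN"))
                last_state[host] = up
        time.sleep(30)  # beats waiting for the "Is the webserver down?" email


if __name__ == "__main__":
    main()
```

It is still purely up/down, which is why Steps 5 and 6 feel like victory right up until the next fire.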
Nearing mid-sized
- Step 1: Management is still hearing grumblings from users and customers alike that the services are not stable. Therefore, management is a bit miffed that IT is still in “react” mode and baffled as to how to get IT into “proactive” mode. Printouts of “Buzzword Bingo” boards start appearing on printers alongside the resumes.
- Step 2: My motto here is that “necessity is the mother of invention”. Administrators should be motivated to NOT get that call at 2am about some service, and the push begins to delve deeper into the monitoring.
- Step 3: It is time to get motivated about monitoring, as it is now recognized that too much time is lost to the inefficiencies of putting out fires. Nagios or Icinga is stood up somewhere because one IT person happened to read an article about it. No one outside of IT, and even some within IT, can correctly pronounce the name of the tool, but they generally understand what it is supposed to do.
- Step 4: The very basics are now configured in Nagios (disk space, uptime, CPU load), and general thresholds, typically the defaults, are accepted for alerting. This is the beginning of understanding the need for trend analysis and being proactive.
- Step 5: Alerts go crazy for a period of time, making everyone numb to the emails and SMS messages. The number of outages spikes, ticking off Management, but is explained away as an “evolving process”, which is somewhat true. The monitoring is adjusted to turn the volume down a bit.
- Step 6: IT pats itself on the back, and even puts up a monitor in a location visible to all to show the dashboard, effectively saying, “See? We’re watching everything!”
- Step 7: Pager rotation duty comes into play if it didn’t exist already. That one person holding the place together breathes a sigh of relief that he can now have Thanksgiving dinner this year without the worry of being on-call, but he still twitches constantly, expecting the phone to vibrate.
- Step 8: Fires still happen and chaos still exists, but it is all under the veneer of a monitoring dashboard (“Yes, we see that’s a problem, too. We are working to resolve it now.”) and having eyes on it in order to react faster than ever before.
- Step 9 (optional): Self-healing scripts, such as event handlers, are crafted to restart troublesome processes automatically (a sketch follows this list).
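In Nagios terms, “self-healing” usually means attaching a script to a service via its event_handler directive. Here is a minimal sketch in Python of such a handler, following the pattern in the Nagios event-handler documentation; the service being restarted (httpd) and the restart command are assumptions for the example:

```python
#!/usr/bin/env python3
"""Nagios event handler sketch: restart httpd on a confirmed failure.

Nagios would invoke this with the $SERVICESTATE$ and $SERVICESTATETYPE$
macros as arguments, e.g.: restart_httpd.py CRITICAL HARD
"""
import subprocess
import sys


def main():
    state, state_type = sys.argv[1], sys.argv[2]
    # Only act once the failure is confirmed (a HARD state), not on the
    # first soft blip, so a transient timeout doesn't bounce the service.
    if state == "CRITICAL" and state_type == "HARD":
        subprocess.run(["service", "httpd", "restart"])


if __name__ == "__main__":
    main()
```

Handlers like this buy back sleep at 2am, but they can also mask the underlying problem rather than fix it.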
Mid-sized
- Step 1: IT is still recognized as being reactive. Not good.
- Step 2: Efforts are really focused on being proactive. IT needs to understand more of the innards of the applications, be it infrastructure-related like Apache and DNS, or production-related like a custom application, an EMR (electronic medical records) application, or an off-the-shelf product like Oracle. IT doesn’t need to know them like the back of its hand, but it needs to reach into the applications far enough to figure out how to monitor them for the signs of an impending failure or problem.
- Step 3: Building on the existing monitoring solution, more event handlers are created to resolve issues when noticed, and new, more customized scripts are built to handle problems when the symptoms are noticed (see the sketch after this list).
- Step 4: Downtime is reduced (but not completely gone). Management recognizes that some R&D is required for IT to keep the company going.
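What does “looking for the signs of an impending failure” mean in practice? Often it is a check that measures a symptom, such as response time, and warns while the service is merely degrading, before it is actually down. A minimal sketch in Python, written as a Nagios-style plugin (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL); the URL and thresholds are hypothetical:

```python
#!/usr/bin/env python3
"""Nagios-plugin-style check: warn when a page slows down, before it dies."""
import sys
import time
import urllib.request

URL = "http://www.example.com/"  # hypothetical endpoint
WARN_SECONDS = 2.0               # getting slow: a sign of trouble brewing
CRIT_SECONDS = 10.0              # effectively down as far as users care


def main():
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=CRIT_SECONDS):
            pass
    except Exception as exc:
        print("CRITICAL: %s failed: %s" % (URL, exc))
        return 2
    elapsed = time.monotonic() - start
    if elapsed >= WARN_SECONDS:
        print("WARNING: %s took %.1fs to respond" % (URL, elapsed))
        return 1
    print("OK: %s responded in %.1fs" % (URL, elapsed))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The WARNING is the proactive part: it gives an administrator a window to act before the up/down check, and the users, ever notice.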
And beyond
In my eyes, any mid-sized company without real, proactive monitoring is standing on the precipice of a catastrophe. At that point, be sure to study The Tao of Backup carefully.
The infrastructure will be both old enough and mature enough to require more sophisticated monitoring than up/down. With IT always looked upon as an overhead expense, it needs to be as efficient as possible. Adding headcount to continue to manage the fires is not sustainable over the long term.
For companies mid-sized and beyond, the hope is that the machinery of the business has folded monitoring into its processes. Similar to backups, it has to be considered at the start of a project, not as an afterthought. It is IT’s job to constantly, and politely, insert itself into the process early in the game in order to be successful.