Systemd, NetworkManager, and dhclient

Systemd definitely has its benefits, starting processes in parallel and handling complex dependencies, but NetworkManager seems to thwart that process a bit.  NetworkManager does exactly what it says: it manages the network connection, setting the IP address, bringing up the interface, starting dhclient, or any number of other things.  It still carries the stigma of being a piece of pretty eye candy wrapped around an already good and solid paradigm (the “network” service), but, alas, NetworkManager is here to stay, so the quest is to make it play nicely with all of the other services in the systemd sandbox.

NetworkManager brings up the interfaces by kicking off other subprocesses, such as dhclient, to complete the initialization of the network interfaces.  This is done for performance reasons: it lets the system boot much faster without everything coming to a grinding halt while waiting for a DHCP server to respond.  An artifact of this design, however, is that systemd believes NetworkManager has completed its work and that the processes dependent upon it can now be started.

The catch is that while NetworkManager has indeed completed, dhclient is still finishing up its tasks.  This creates a race condition for processes that depend on the interface having connectivity, a race that usually results in the IP address being assigned to the interface after the dependent process has started, failed, and exited.

The solution is to enable an additional NetworkManager service as described on freedesktop.org:

systemctl enable NetworkManager-wait-online.service
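
For the services that actually need the network to be fully up, it also helps to order them after systemd’s network-online.target, which NetworkManager-wait-online.service delays until NetworkManager reports the network is actually configured.  A minimal sketch, assuming a hypothetical unit named myapp.service (the drop-in path follows standard systemd conventions):

# Hypothetical drop-in: make myapp.service wait for full network connectivity.
mkdir -p /etc/systemd/system/myapp.service.d
cat > /etc/systemd/system/myapp.service.d/wait-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
systemctl daemon-reload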

Using Puppet to perform “yum update”

Puppet is a great tool for configuration management.  Chef, slightly younger than Puppet, has matured into a very good alternative, but in this particular case the focus is on Puppet.

The Puppet community frowns heavily on using Puppet as the tool for global package management.  At least, that’s the perception. I have personally witnessed comments like “puppet is not the right tool for this” and “there are better tools out there for it” (a good example is found in the comments on Stack Overflow http://stackoverflow.com/questions/18087104/yum-updatesand-work-arounds-using-puppet), but I believe those who respond that way just don’t get it; they don’t see the bigger picture.

Puppet?

Puppet is a configuration management tool for managing your system configurations.  Why should those configurations stop at individual packages and their related configuration files?  The configuration of a host is arguably the entire configuration of the host.  Certainly, one can go far overboard and maintain the specifics of every one of the 800+ installed packages via custom Puppet modules, and we can all admit that would be ludicrous. However, one of the primary goals of a configuration tool like Puppet is to create completely reproducible systems for disaster recovery, and there has to be a more efficient way than writing a module for every package on the system.

Unix/Linux is built upon the foundation of layering: break a large job down into its simplest parts, build tools to handle the individual pieces, then layer on another tool to manage the individual tools, and so on.

Managing updates for your systems via yum (or apt) through Puppet is not a bad idea, despite what others seem to say. For it to be successful, you do need to understand the risks you are taking, what is acceptable, and what your organization’s limits are.

Maintain Your Own Repositories

This cannot be highlighted enough.  Mirror all of the repositories that are used to build systems at your organization, and configure your clients to use your mirrors only.  The reason for this is simple: you want control of your systems, and that includes the software that resides on them.  By mirroring all of the repositories used to build systems at your organization, you control when they get updated and you have direct access to a ‘frozen’ version of those repositories.  “Frozen?”  Yes! Repositories are, in some cases, made up of thousands of packages, all maintained by different people, all being updated at different times.  In order to maintain control of your systems, you have to control the updates.  You don’t want your repository being ever so slightly different with each system build, as that goes against what you are trying to accomplish with Puppet in the first place: repeatable processes.
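
To give a sense of the client side, here is a hypothetical yum repository definition pointing at the internal mirror; the host name repo.my.domain.com and the URL path are illustrative assumptions, not taken from the environment described here:

# Hypothetical client configuration pointing at the internal mirror only.
cat > /etc/yum.repos.d/internal-mirror.repo <<'EOF'
[internal-rhel-6-server]
name=Internal mirror - RHEL 6 Server
baseurl=http://repo.my.domain.com/repo/RHEL/6/Server/
enabled=1
gpgcheck=1
EOF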

The mirrors should be refreshed when you determine it is time; you maintain control.  Therefore, you sync the repositories to your local mirror on your schedule.  To support this, I built a simple bash script to perform a “reposync” (from the yum-utils package) of all of the appropriate repositories, followed by a “createrepo” (from the createrepo package):

...
# Let's start in the top level of our repository tree
cd /var/www/html/repo

# Grab the EPEL tree
reposync --arch=x86_64 --newest-only --download_path=EPEL/6/ --repoid=epel --norepopath --delete

# Make sure our Redhat repo is up to date, both for the Server and the optional
# trees.  There's no need to keep syncing the OS repo as that never changes.
reposync --arch=x86_64 --newest-only --download_path=RHEL/6/Server-Optional/ --repoid=rhel-6-server-optional-rpms --norepopath --delete
reposync --arch=x86_64 --newest-only --download_path=RHEL/6/Server/ --repoid=rhel-6-server-rpms --norepopath --delete

...

# Rebuild the repository data for EPEL.
cd /var/www/html/repo/EPEL/6/
createrepo -d .

# Rebuild the repository data for RHEL.
cd /var/www/html/repo/RHEL/6/Server-Optional/
createrepo -d .
cd /var/www/html/repo/RHEL/6/Server/
createrepo -d .
...
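
How the sync gets kicked off is up to you, but a cron entry is the obvious choice.  A minimal sketch, assuming the script above is saved as /usr/local/bin/sync-repos.sh (a hypothetical path), refreshing the mirrors at 2am on the first of every month:

# Hypothetical cron entry: refresh the local mirrors once a month, on your schedule.
cat > /etc/cron.d/sync-repos <<'EOF'
0 2 1 * * root /usr/local/bin/sync-repos.sh >> /var/log/reposync.log 2>&1
EOF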

Using your own mirrors will also drastically decrease the amount of time it takes to build or update a machine because you are pulling all of the packages over your local LAN (high speed) vs the WAN (low speed).

Do Not Automatically Reboot

This, too, cannot be highlighted enough.  Rebooting a machine must be done with someone at the helm, watching and waiting to react if something goes horribly wrong.  The last thing you want is for all of your servers to reboot after updates, none of them come back up properly, and you not discover it until 8am the following morning.  If that is a risk you are willing to take, please be sure to also have an updated version of your resume handy.

Build a SANE schedule

No one wants to update their systems constantly, and taking downtime for each update is not feasible for a successful business.  If you do not have formal maintenance windows, create informal ones.  Pick a point in time (once a month, every quarter, etc.) that is workable for your organization and team (if you are a team of one, balance what you are willing to manage).  This is your “go-live” date.

Now work backwards to build your T-minus dates:

  • How long do you need to test the updates (do you even test them)?  Let’s say 3 days of testing, including just running a machine with the updates on it to make sure it doesn’t blow up.
  • How long do you need to update your mirrored versions of the repositories?  Between half a day and a full day is probably accurate, including some minor testing.

So, you are looking at roughly four days from start to finish.  This may be excessive for your organization, or it may need to be longer.  However you approach it, keep a sprinkle of sanity in the mix.

Implement in Puppet

The implementation is actually quite simple.  Create a basic class module, let’s call it yum::update, and set the criteria accordingly.  In the example below, the updates will run on the 6th of every month between 11:00am and 11:59am.

class yum::update {
  # Run a yum update on the 6th of every month between 11:00am and 11:59am.
  # Notes: A longer timeout is required for this particular run.
  #        The time check can be overridden if a specific file exists in /var/tmp.
  exec { "monthly-yum-update":
    command  => "yum clean all; yum -q -y update --exclude cvs; rm -rf /var/tmp/forceyum",
    # The shell provider lets the ';', '&&', '||', and backticks be interpreted by /bin/sh.
    provider => shell,
    path     => ["/bin", "/usr/bin", "/usr/sbin"],
    timeout  => 1800,
    onlyif   => "/usr/bin/test `/bin/date +%d` -eq 06 && test `/bin/date +%H` -eq 11 || test -e /var/tmp/forceyum",
  }
}

It probably makes more sense to create a script which does all of the date/time comparison and override functionality for you, manage that script through Puppet, and reference that script in the “command” line above.  Consider that the next step in the evolution.  However you choose to proceed is up to you.
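
A rough sketch of what such a wrapper might look like, assuming it is saved somewhere like /usr/local/bin/run-monthly-update.sh and keeps the same /var/tmp/forceyum override convention (the script name and location are illustrative only):

#!/bin/bash
# Hypothetical wrapper: run the update only on the 6th between 11:00 and 11:59,
# unless the override file exists to force a run.
FORCE_FILE=/var/tmp/forceyum
if { [ "$(date +%d)" = "06" ] && [ "$(date +%H)" = "11" ]; } || [ -e "$FORCE_FILE" ]; then
    yum clean all
    yum -q -y update --exclude cvs
    rm -rf "$FORCE_FILE"
fi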

Now, you can include that class in your manifests/nodes.pp per host, globally, or however you wish, as shown below.

node 'web-dev.my.domain.com' inherits default {
  include postfix
  include mysql
  include syslog::internal
  include openldap::client
  include names
  include pam::client
  include wordpress::dev
  include ssh
  include httpd
  include db::backup
  include yum::update
}

Control, Efficiency, and Security

Updating a system is important to maintaining its security, but it has to be balanced against maintaining the system’s availability and integrity for your end users and customers.  Automation is king here.  Understand, though, that automation also makes your Puppet server that much more critical, and a mistake here can have a much wider impact.  Hence, I stress the testing period and having someone on hand for the final phases.  Once the level of comfort with the process grows, you may want to implement more automation based upon your individual environment (e.g., rebooting development machines might be fine at your organization).

The Evolution of Monitoring for IT

Monitoring within IT is a necessity.  Let’s just get that out of the way now.  IT employees and management need it in order to survive.  Without it, the word “proactive” never enters into the vocabulary, and a death spiral ensues as users and customers alike beat on IT for unexpected and unplanned outages.

An old friend of mine and I worked at the same company back in the early ’90s, and we learned a lot through trial by fire while there.  It was a great time to be in IT.  I liken it to being on the original NCC-1701 Enterprise going “where no man has gone before.” (Ironically, we had a machine called “ncc” which we built to do all of our monitoring; for those not in our department we called it the “network control center” when, in fact, it was an alias to the host name “ncc1701”, but I digress.)  Everything was new and waiting to be molded into shape.  While our careers have gone down different paths, we have both remained within the technical side of IT to some degree.  His recent posting http://everythingsysadmin.com/2013/11/stop-monitoring-if-service-is-up.html was great, and it got me thinking about how to contribute to the picture he started to paint (hence, this article).

In my eyes, he described the ideal of monitoring: starting from a blank slate, approaching it with a top-down design, understanding what you need to monitor (that is, getting the requirements), and then setting your monitoring to look for it.  Unfortunately, monitoring is rarely ever built that way, and it’s near impossible to nuke it and start over.  Right or wrong, it’s commonly built organically and driven by need, and the requirements are forever evolving, changing, and being chased after.

His article has it right in that too many administrators rely on basic monitoring, the up/down of a service, and call it a day.  Instead, administrators should consider that type of monitoring as the foundation, the “catch-all”, and build the real monitoring out from there.  It’s not an end point, but a starting point.

My career has focused on small to mid-sized companies, typically on that transitional period when a small company is “crossing the chasm” to become a bigger player in its space.  The article he wrote is great for established companies where monitoring has been in place for some time.  However, from what I have seen over the years, it takes time and the right attention to get there, as I attempt to outline below:

The small “mom-n-pop” company (incubation mode)

  • Step 1: IT (or that one computer in the closet that runs the mail server, I think) is managed by someone doing payroll, facilities, development, or some other primary task.  Everyone else manages their own machines.
  • Step 2: Outages of any and all services happen on a frequent basis, but it is just faster to restart the service and get everyone, including the part-time IT person, back to work than it is to figure out what caused the problem in the first place.
  • Step 3: If something happens overnight, it’ll just have to wait until the morning.
  • Step 4: Who has time for monitoring anyway?

The true small company (Goodbye “Mom-n-Pop”)

  • Step 1: There is a true “IT person” or “persons” responsible for maintaining the corporate equipment and the production equipment.
  • Step 2: Issues are brought to the attention of that person(s) either by:
    • Stopping by their desk
      • They are rarely there, or they are on the phone.
    • Stopping them in the hallway (very common)
    • E-mailing them directly (the primary interaction)
      • So much unsorted and unprioritized email leads to a lot of stuff falling to the floor unattended.
    • Calling them
      • Voicemail box is nearly always full.
  • Step 3: Problems are handled as they come up, and they come up frequently enough to:
    • Justify the IT person(s)’ existence in the company
    • Keep that person (or persons) busy every second of the day (even company-sponsored social events are an avenue to report problems)
  • Step 4: Production and Development are pretty much the same at this point.  Code is pushed directly to production, typically without the IT person’s knowledge, but when it all breaks, the call goes to the IT person. (Can we say “out of left field”?)
  • Step 5: The IT person (or persons) are typically seen as running around either with their own hair on fire in search of a fire extinguisher, or with an extinguisher in hand putting out random fires throughout the building.
  • Step 6: Someone, usually outside of IT, writes a small script to send a daily “ping report” of what’s up and what’s down (something like the sketch below).  Management cheers while the IT folks wonder where the benefit is.
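
That script is usually something along these lines (a sketch; the host list and the mail recipient are made up for illustration):

#!/bin/bash
# Hypothetical daily "ping report": one line per host, mailed out every morning.
HOSTS="mail web01 db01"
{
    for h in $HOSTS; do
        if ping -c 1 -W 2 "$h" > /dev/null 2>&1; then
            echo "$h is up"
        else
            echo "$h is DOWN"
        fi
    done
} | mail -s "Daily ping report" it-manager@my.domain.com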

The small, but established, company

  • Step 1: There is more than one person responsible for IT.  “Jack-of-all-trades” types are required in order to have some overlap (everyone needs a vacation at some point).
  • Step 2: The frequent outages in the past have started to be recognized as hitting the bottom line.  Management is now asking to “get a handle on things”.
  • Step 3: IT folks are perceived as being part of the problem because they are reactive as opposed to proactive.  It’s the beginning of buzzword bingo, but it is still healthy at this point.  The perception of a “blame game” ruffles some feathers.
  • Step 4: Rudimentary monitoring is put into place with a focus on the most basic checks (is the machine up?).  This effectively alerts the administrators a few seconds before the users, customers, or developers recognize an issue (about the amount of time it takes to craft that email saying “Is the webserver down?”, but enough time to set the perception that IT is ahead of the game).
  • Step 5: The administrator fixes the problem as fast as possible in order to get back to watching The Walking Dead.
  • Step 6: Uptime is improved, management is a little happier. That same ping report is used to show the better uptime. Life is grand.

Nearing mid-sized

  • Step 1: Management is still hearing grumblings from users and customers alike that the services are not stable.  Therefore, management is a bit miffed that IT is still in “react” mode and baffled as to how to get IT into “proactive” mode.  Printouts of “Buzzword Bingo” boards start appearing on printers alongside the resumes.
  • Step 2: My motto here is that “necessity is the mother of invention”.  Administrators should be motivated to NOT get that call at 2am about some service, and so the push begins to delve deeper into the monitoring.
  • Step 3: It is time to get motivated about monitoring as it is now recognized that there is too much time lost on the inefficiencies of putting out fires.  Nagios or Icinga is stood up somewhere because one IT person happened to read an article about it.  No one outside of IT, and even within some IT departments, can correctly pronounce the name of the tool, but they generally understand what it is supposed to do.
  • Step 4: The very basics are now configured in Nagios (disk space, uptime, CPU load) and general constraints, typically the defaults, are accepted for alerting.  This is the beginning of understanding the need for trending analysis and being proactive.
  • Step 5: Alerts go crazy for a period of time, making everyone numb to the emails and SMS messages.  The number of outages spikes, ticking off management, but it is explained away as an “evolving process”, which is somewhat true.  The monitoring is adjusted to turn the volume down a bit.
  • Step 6: IT pats itself on the back, and even puts up a monitor in a location visible to all to show the dashboard, effectively saying “See?  We’re watching everything!”
  • Step 7: Pager rotation duty comes into play if it didn’t exist already.  That one person holding the place together breathes a sigh of relief that he can now have Thanksgiving dinner this year without the worry of being on call, but he still twitches constantly expecting the phone to vibrate.
  • Step 8: Fires still happen, chaos still exists, but it is all under the veneer of a monitoring dashboard (“Yes, we see that’s a problem, too.  We are working to resolve it now.”) and having eyes on it in order to react faster than ever before.
  • Step 9 (optional): Self-healing scripts, like event handlers, are crafted to restart troublesome processes automatically (see the sketch below).
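
A Nagios event handler is just a script the monitoring host runs when a check changes state; something along these lines, with the httpd restart purely as an illustration (Nagios passes the state, state type, and attempt number as arguments in the standard event-handler pattern):

#!/bin/bash
# Hypothetical event handler: restart httpd when its check goes hard-CRITICAL.
STATE=$1        # e.g. OK, WARNING, CRITICAL
STATETYPE=$2    # SOFT or HARD
ATTEMPT=$3      # current check attempt number
if [ "$STATE" = "CRITICAL" ] && [ "$STATETYPE" = "HARD" ]; then
    /sbin/service httpd restart
fi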

Mid-sized

  • Step 1: IT is still recognized as being reactive.  Not good.
  • Step 2: Efforts are really focused on being proactive.  IT needs to understand more of the innards of the applications, be they infrastructure-related like Apache and DNS, or production-related like a custom application, an EMR (electronic medical records) application, or an off-the-shelf product like Oracle.  IT doesn’t need to know them like the back of their hand, but they need to reach into the applications far enough to figure out how to monitor them for the signs of an impending failure or problem.
  • Step 3: Building on the existing monitoring solution, more event handlers are created to resolve issues automatically, and more custom scripts are built to handle problems as soon as the symptoms are noticed.
  • Step 4: Downtime is reduced (but not completely gone).  Management recognizes that some R&D is required for IT to keep the company going.

And beyond

In my eyes, any mid-sized company without real, proactive monitoring is standing on the precipice of a catastrophe.  At that point, be sure to study The Tao of Backup carefully.

The infrastructure will be both old enough and mature enough to require more sophisticated monitoring than up/down.  With IT always looked upon as an overhead expense item on the balance sheet, it needs to be as efficient as possible.  Adding headcount to continue to manage the fires is not something that is sustainable over the long term.

For companies mid-sized and beyond, the hope is that the machinery of the business has started building monitoring into its processes.  Similar to backups, it has to be considered at the start of a project, not as an afterthought.  It is IT’s job to constantly and politely remind and insert itself into the process early in the game in order to be successful.