Embracing SSSD in Linux

When RHEL6 came around and I saw that sssd was a new way to sync up to the LDAP server, I cringed in horror. The PADL tool set that was shoehorned into RHEL5 was painful, and it took a lot to get it working correctly.  The last thing I wanted to do was to re-open that wound.

Caching and why you want to do it
With the advent of centralized services, the IT world took a great step forward. It wasn't perfect, though: the client still had to be able to talk to the service (LDAP, hosts, automount maps, etc.) in real time, so if the centralized service went down, clients broke with it. Introducing redundancy into the environment (active/passive LDAP pairs) partially solved the problem, but not entirely. People taking their laptops off the wire, whether to go home or to wander to a new location, would lose access. Individual services might cache credentials locally, but those eventually have to phone home, fail, and then lock the user out.

The goal, really, is centralized management: make it easier to manage a complex Linux environment, everything from user accounts and permissions to domain name resolution.

Single Sign-On (SSO)
As we get closer to an SSO environment for Linux, funneling all requests through a single intermediary that handles the connections to the various backends and caches the data locally looks like a decent option. The PADL tools were the beginnings of that, but their implementation was clunky and hack-ish. Despite that, the results were worth the headache.

With enough griping, when RHEL5 hit, it was all re-wrapped into nslcd. It was only slightly updated from previous incarnations, so those familiar with the configuration files of old could implement nslcd easily. Combined with nscd, you had a workable solution, though one still filled with headaches of its own.

System Security Services Daemon (SSSD)
RHEL6 arrived and introduced sssd. Aside from the awful name, it looked like yet another beast to fight for control over. Like many others, I had reluctantly embraced nslcd because it was close to the old ldap.conf file, the configuration was largely the same, and it took little time to implement. SSSD was a completely different beast that required some time to learn and understand before diving in fully.

Unlike other people, I didn't hold an angry grudge against the PADL tools. They were a means to an end, and, when all was said and done, they delivered what they promised. I had a pile of OpenLDAP integration through PAM, with caching, via nslcd. Things worked.

Why implement SSSD?
Despite what the administrator in you feels (the "do it once, do it right" credo), the admin/engineer's world is about doing things better. SSSD combines the functionality of nslcd and nscd without the array of bugs, without the odd "third wheel" product support, and it expands the scope of what can be managed easily.

Admit it: the LDAP world is changing, and Single Sign-On is continually evolving. The tools to support it have grown in functionality and scope, and new options are out there. For example, one of the key centralized backends, OpenLDAP, is admittedly unfriendly on a good day. Alternatives such as the Fedora Directory Server (FDS), now known as 389-ds, have come along and matured significantly. Bundled with SSSD and IPA, you have the makings of a Windows Active Directory equivalent for Linux. 389-ds is the Red Hat-supported tool and is actively developed in the Fedora world as well. IPA is a Red Hat-only option, but there is always FreeIPA for the rest of us.

In other words: you have gone as far as you can go with nslcd/nscd, and implementing sssd prepares you for the future.

BEWARE of documentation caveats
SSSD is still growing and evolving. Changes to the content and format of the sssd.conf file happen quite frequently, so here are a few items to be aware of:

  • SSSD fires off a separate daemon handler for each service. If you want to debug a service, you cannot simply turn debugging on in the [sssd] stanza; you must define it within the service's own stanza. For example, "debug_level = 9" goes under "[autofs]" to debug autofs.
  • The Red Hat documentation is good, but it gets confusing with regard to setting autofs attributes. To clarify: those settings must be made in the [domain/default] stanza, otherwise they will be quietly ignored.
  • Debug settings are no longer simple numbers from 0 to 10; they now use a crazy bitmask scheme with values like 0x4000.
  • SSSD's debugging is a bit painful; it doesn't always log what you want where you want it. If you get close to the end of your rope, it is very helpful to run sssd in the foreground in one window while testing in another to watch the output live (the per-service log files shown after this list are also worth tailing):
    sssd -d 9 -c /etc/sssd/sssd.conf -i
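
If running in the foreground is not practical, SSSD also writes one log per responder and per domain under /var/log/sssd, named after the stanza. The file names here assume the [autofs] and [domain/LDAP] stanzas used in the example configuration later in this article:

# Watch only the autofs responder (pairs with "debug_level = 9" under [autofs])
tail -f /var/log/sssd/sssd_autofs.log

# Domain-level activity (LDAP searches, cache refreshes) lands in the domain log
tail -f /var/log/sssd/sssd_LDAP.log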

The actual migration
Migrating from nslcd to sssd can be a bit of a pain, as you likely already have a number of dependencies in place. In my case, the edge hosts only needed to talk back to the LDAP servers for local authentication and automount tables. Documentation from sources such as Red Hat is very complete, yet still lacking in some key areas.

Migrating basic authentication is rather easy:

  1. yum -y install sssd
  2. Build your /etc/sssd/sssd.conf:
    [sssd]
    config_file_version = 2
    reconnection_retries = 3
    services = nss, pam, autofs, sudo
    # SSSD will not start if you do not configure any domains.
    # Add new domain configurations as [domain/<NAME>] sections, and
    # then add the list of domains (in the order you want them to be
    # queried) to the "domains" attribute below and uncomment it.
    domains = LDAP
    #debug_level = 10
    
    [nss]
    filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
    reconnection_retries = 3
    
    [pam]
    reconnection_retries = 3
    
    [autofs]
    
    [sudo]
    
    [domain/LDAP]
    ldap_tls_reqcert = never
    # Note that enabling enumeration will have a moderate performance impact.
    # Consequently, the default value for enumeration is FALSE.
    # Refer to the sssd.conf man page for full details.
    enumerate = true
    auth_provider = ldap
    # ldap_schema can be set to "rfc2307", which stores group member names in the
    # "memberuid" attribute, or to "rfc2307bis", which stores group member DNs in
    # the "member" attribute. If you do not know this value, ask your LDAP
    # administrator.
    #ldap_schema = rfc2307bis
    ldap_schema = rfc2307
    ldap_search_base = dc=ourdomain,dc=com
    ldap_group_member = uniquemember
    id_provider = ldap
    ldap_id_use_start_tls = False
    chpass_provider = ldap
    ldap_uri = ldap://ldap1.ourdomain.com/,ldap://ldap2.ourdomain.com/
    ldap_chpass_uri = ldap://ldap.ourdomain.com/
    # Allow offline logins by locally storing password hashes (default: false).
    cache_credentials = True
    ldap_tls_cacertdir = /etc/openldap/cacerts
    entry_cache_timeout = 600
    ldap_network_timeout = 3
    autofs_provider = ldap
    ldap_autofs_search_base = dc=ourdomain,dc=com
    ldap_autofs_map_object_class = automountMap
    ldap_autofs_map_name = ou
    ldap_autofs_entry_object_class = automount
    ldap_autofs_entry_key = cn
    ldap_autofs_entry_value = automountInformation
    sudo_provider = ldap
    ldap_sudo_search_base = ou=Sudoers,dc=ourdomain,dc=com
    ldap_sudo_full_refresh_interval=86400
    ldap_sudo_smart_refresh_interval=3600
    # Enable group mapping; without these settings only the user's primary
    # group will map correctly and other group memberships won't work.
    ldap_group_object_class = posixGroup
    ldap_group_search_base = ou=group,dc=ourdomain,dc=com
    ldap_group_name = cn
    ldap_group_member = memberUid
    
    # An example Active Directory domain. Please note that this configuration
    # works for AD 2003R2 and AD 2008, because they use pretty much RFC2307bis
    # compliant attribute names. To support UNIX clients with AD 2003 or older,
    # you must install Microsoft Services For Unix and map LDAP attributes onto
    # msSFU30* attribute names.
    ; [domain/AD]
    ; id_provider = ldap
    ; auth_provider = krb5
    ; chpass_provider = krb5
    ;
    ; ldap_uri = ldap://your.ad.example.com
    ; ldap_search_base = dc=example,dc=com
    ; ldap_schema = rfc2307bis
    ; ldap_sasl_mech = GSSAPI
    ; ldap_user_object_class = user
    ; ldap_group_object_class = group
    ; ldap_user_home_directory = unixHomeDirectory
    ; ldap_user_principal = userPrincipalName
    ; ldap_account_expire_policy = ad
    ; ldap_force_upper_case_realm = true
    ;
    ; krb5_server = your.ad.example.com
    ; krb5_realm = EXAMPLE.COM
    
  3. Start sssd
    systemctl enable sssd; systemctl start sssd
  4. Update your nsswitch.conf file by replacing all references to ldap with sss
  5. RESTART your services. I cannot stress this one enough: most services do some form of internal connection caching of their own and will keep trying to talk to the back-end service as it was previously defined in nsswitch.conf.
  6. Stop and disable nscd:
    systemctl stop nscd; systemctl disable nscd
  7. Test each service (a quick command-line sanity check also follows this list)
    user accounts: getent passwd
    user authentication: ssh atestuser@localhost
    POP3/IMAP access via dovecot: fire up a mail client (your phone, a web client) to see if you can access your email
    Sendmail authentication via SASL: make sure you can SEND an email
    Email filtering via MailScanner: this should work because it is not tied to user accounts, but it won't hurt to restart it
    Web authentication via LDAP or tools such as owncloud, roundcube, etc: try logging in
    Filesystem mounts via NFS: cd /home/atestuser
  8. Stop and disable nslcd:
    systemctl stop nslcd; systemctl disable nslcd
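
As a quick command-line sanity check after steps 4 and 5 (this is just an illustrative spot check, not part of the formal procedure; "atestuser" is the same example account used above):

# Confirm the relevant databases in nsswitch.conf now point at sss, not ldap
grep -E '^(passwd|group|automount|sudoers):' /etc/nsswitch.conf

# Resolve a known LDAP account through the sss backend specifically
getent -s sss passwd atestuser

# Expire the local cache and resolve again to prove SSSD can refetch from LDAP
sss_cache -E && getent -s sss passwd atestuser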

Systemd, NetworkManager, and dhclient

Systemd definitely has its benefits, starting processes in parallel and handling complex dependencies, but NetworkManager seems to thwart that a bit. NetworkManager does exactly what it says: it manages the network connection, setting the IP, bringing up the interface, starting dhclient, and any number of other things. It still carries the stigma of being pretty eye candy wrapped around an already good and solid paradigm (the "network" service), but, alas, NetworkManager is here to stay, so the quest is to make it play nicely with all of the other services in the systemd sandbox.

NetworkManager brings up the interfaces by kicking off sub-processes, such as dhclient, to complete the initialization of the network interfaces. This is done for performance reasons: the system can boot much faster without everything grinding to a halt waiting for a DHCP server response. An artifact of this design, however, is that systemd believes NetworkManager has completed and that the processes dependent upon it can now be started.

The catch is that while NetworkManager has indeed completed, dhclient is still finishing its work. This creates a race for processes that depend on the interface having network connectivity, a race that usually ends with the IP address being assigned to the interface after the dependent process has started, failed, and exited.

The solution is to enable an additional NetworkManager service as described on freedesktop.org:

systemctl enable NetworkManager-wait-online.service
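
If an individual unit still races the DHCP lease even with wait-online enabled, you can also order that unit explicitly after network-online.target. A minimal sketch, using a hypothetical myservice.service as a stand-in:

# Add a drop-in so the (hypothetical) myservice.service waits for full connectivity
mkdir -p /etc/systemd/system/myservice.service.d
cat > /etc/systemd/system/myservice.service.d/wait-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
systemctl daemon-reload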

Using Puppet to perform “yum update”

Puppet is a great tool for configuration management. Chef, slightly younger than Puppet, has matured into a very good option as well, but in this particular case the focus is on Puppet.

The Puppet community frowns heavily on using Puppet as the tool for global package management; at least, that's the perception. I have personally witnessed comments like "puppet is not the right tool for this" and "there are better tools out there for it" (a good example is in the comments on Stack Overflow: http://stackoverflow.com/questions/18087104/yum-updatesand-work-arounds-using-puppet), but I believe those who respond that way just don't get it; they don't see the bigger picture.

Puppet?

Puppet is a tool for managing your system configurations. Why should those configurations stop at individual packages and their related configuration files? The configuration of the host is, arguably, the entire configuration of the host. Certainly, one can go far overboard and maintain the specifics of every one of the 800+ installed packages via custom Puppet modules, and we can all admit that would be ludicrous. However, one of the primary goals of a configuration tool like Puppet is to create completely reproducible systems for disaster recovery, and there has to be a more efficient way than writing a module for every package on the system.

Unix/Linux is built upon the foundation of layering: break a large job down into its simplest parts, build tools to handle the individual pieces, then layer on another tool to manage the individual tools, and so on.

Managing updates via yum (or apt) for your systems through Puppet is not a bad idea, despite what others seem to say. For it to be successful, you do need to understand the risks you are taking, what is acceptable, and what your organization's limits are.

Maintain Your Own Repositories

This cannot be highlighted enough: mirror all of the repositories used to build systems at your organization, and configure your clients to use your mirrors only. The reason is simple: you want control of your systems, and that includes the software that resides on them. By mirroring the repositories, you control when they get updated, and you have direct access to a "frozen" version of them. "Frozen?" Yes! Repositories are, in some cases, made up of thousands of packages, all maintained by different people and all updated at different times. To maintain control of your systems, you have to control the updates. You don't want your repository to be ever so slightly different with each system build, as that goes against what you are trying to accomplish with Puppet in the first place: repeatable processes.
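
On the client side, that means every yum repository definition points only at your mirror host. A minimal sketch, assuming the mirror tree built by the script below is published over HTTP at repo.ourdomain.com (both the hostname and the URL layout are assumptions):

# Hypothetical client-side repo definition pointing only at the local mirror
cat > /etc/yum.repos.d/local-rhel.repo <<'EOF'
[local-rhel-6-server]
name=Local RHEL 6 Server mirror
baseurl=http://repo.ourdomain.com/repo/RHEL/6/Server/
enabled=1
gpgcheck=1
EOF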

The mirrors should be refreshed at your discretion: you maintain control, so you sync the repositories to your local mirror on your schedule (a cron entry for that is sketched after the script). To support this, I built a simple bash script that performs a "reposync" (from the yum-utils package) of all of the appropriate repositories, followed by a "createrepo" (from the createrepo package):

...
# Let's start in the top level of our repository tree
cd /var/www/html/repo

# Grab the EPEL tree
reposync --arch=x86_64 --newest-only --download_path=EPEL/6/ --repoid=epel --norepopath --delete

# Make sure our Redhat repo is up to date, both for the Server and the optional
# trees.  There's no need to keep syncing the OS repo as that never changes.
reposync --arch=x86_64 --newest-only --download_path=RHEL/6/Server-Optional/ --repoid=rhel-6-server-optional-rpms --norepopath --delete
reposync --arch=x86_64 --newest-only --download_path=RHEL/6/Server/ --repoid=rhel-6-server-rpms --norepopath --delete

...

# Rebuild the repository data for EPEL.
cd /var/www/html/repo/EPEL/6/
createrepo -d .

# Rebuild the repository data for RHEL.
cd /var/www/html/repo/RHEL/6/Server-Optional/
createrepo -d .
cd /var/www/html/repo/RHEL/6/Server/
createrepo -d .
...
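
To keep the refresh on your schedule rather than the vendor's, drive the script from cron on the mirror host. The path and timing here are assumptions; pick whatever matches your own window:

# Root crontab entry: sync the mirrors at 04:00 on the 1st of every month
# (the script path /usr/local/sbin/reposync-mirrors.sh is hypothetical)
0 4 1 * * /usr/local/sbin/reposync-mirrors.sh >> /var/log/reposync-mirrors.log 2>&1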

Using your own mirrors will also drastically decrease the time it takes to build or update a machine, because you are pulling all of the packages over your local LAN (high speed) rather than the WAN (low speed).

Do not Automatically Reboot

This, too, cannot be highlighted enough. Rebooting a machine must be done with someone at the helm, watching and ready to react if something goes horribly wrong. The last thing you want is for all of your servers to reboot after updates, none of them come back properly, and you not discovering it until 8am the following morning. If that is a risk you are willing to take, please be sure to also have an updated copy of your resume handy.
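
When you do take that supervised reboot window, it helps to know whether a reboot is even warranted. The needs-restarting utility, from the same yum-utils package that provides reposync, lists running processes that are still using files replaced by the update:

# List running processes that still reference libraries updated or removed by yum
needs-restarting

# If core daemons (sshd, dbus, init) show up, plan the supervised reboot;
# otherwise restarting just the listed services may be enough.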

Build a SANE schedule

No one wants to update their systems constantly, and taking downtime for every update is not feasible for a successful business. If you do not have formal maintenance windows, create informal ones. Pick a point in time (once a month, every quarter, etc.) that is workable for your organization and team (if you are a team of one, balance what you are willing to manage). This is your "go-live" date.

Now work backwards to build your T-minus dates:

  • How long do you need to test the updates (do you even test them)? Let's say three days of testing, including simply running a machine with the updates applied to make sure it doesn't blow up.
  • How long do you need to update your mirrored copies of the repositories? Somewhere between half a day and a full day is about right, including some minor testing.

So you are looking at roughly four days from start to finish. That may be excessive for your organization, or it may need to be longer. However you approach it, keep a sprinkle of sanity in the mix.

Implement in Puppet

The implementation itself is quite simple. Create a basic class module, let's call it yum::update, and set the criteria accordingly. In the example below, the updates will run on the 6th of every month between 11:00am and 11:59am.

class yum::update {
  # Run a yum update on the 6th of every month between 11:00am and 11:59am.
  # Notes: a longer timeout is required for this particular run.
  #        The time check can be overridden if a specific file exists in /var/tmp.
  exec { "monthly-yum-update":
    command => "yum clean all; yum -q -y update --exclude cvs; rm -rf /var/tmp/forceyum",
    # path is required because the command and onlyif checks are not fully qualified
    path    => ["/bin", "/usr/bin", "/sbin", "/usr/sbin"],
    timeout => 1800,
    onlyif  => "/usr/bin/test `/bin/date +%d` -eq 06 && test `/bin/date +%H` -eq 11 || test -e /var/tmp/forceyum",
  }
}

It probably makes more sense to create a script that does all of the date/time comparison and override handling for you, manage that script through Puppet, and reference it in the "command" line above; consider that the next step of evolution. However you choose to proceed is up to you.
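
A rough sketch of what such a wrapper might look like (the script name, path, and flag file are illustrative only); Puppet would then deploy it with a file resource, and the exec's command would shrink to the single script path:

#!/bin/bash
# Hypothetical /usr/local/sbin/monthly-yum-update.sh: only run the update inside
# the maintenance window (the 6th, 11:00-11:59am) or when the override flag exists.
FORCE_FLAG=/var/tmp/forceyum

if { [ "$(date +%d)" -eq 06 ] && [ "$(date +%H)" -eq 11 ]; } || [ -e "$FORCE_FLAG" ]; then
    yum clean all
    yum -q -y update --exclude cvs
    rm -f "$FORCE_FLAG"
fi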

Now, you can include that class in your manifests/nodes.pp per host, globally, or however you wish, as shown below.

node 'web-dev.my.domain.com' inherits default {
  include postfix
  include mysql
  include syslog::internal
  include openldap::client
  include names
  include pam::client
  include wordpress::dev
  include ssh
  include httpd
  include db::backup
  include yum::update
}

Control, Efficiency, and Security

Updating a system is important for maintaining its security, but it is a careful balance against maintaining the system's availability and integrity for your end users and customers. Automation is king here. Understand, though, that automation also makes your Puppet server that much more critical, and a mistake there can have a much wider impact; hence the stress on a testing period and on having someone on hand for the final phases. As your comfort with the process grows, you may want to implement more automation based on your individual environment (e.g., rebooting development machines might be fine at your organization).