Consistent security controls and high reliability are common expectations for any systems administrator. How do you deliver both on a network with thousands of servers supporting thousands of engineers? Most off-the-shelf solutions require a compromise in at least one of these areas — and we refused to accept this.
Most systems administrators use the industry-standard Secure Shell (SSH) for accessing systems, and yet many of its special features are not widely leveraged. At Facebook, we take advantage of those features to use SSH in a way that is reliable, secure, and manageable. SSH, more specifically OpenSSH, has a great way to provide both the security and reliability we require: signed certificates with principals.
Signed certificates
When systems administrators install a Linux server, it's common to create a couple of accounts with passwords; eventually, a few users are given sudo access to escalate privileges. Local account management works well with a few servers, but as a company grows, central authentication, like LDAP and/or Kerberos, is often used to avoid manually managing accounts on every server. With continued growth, systems administrators may come to realize that central authentication is a single, and potentially devastating, point of failure. If it goes down, they will lose access to everything unless they allow root logins to gain access to systems. Getting locked out of your own system is one of the worst things that can happen during an incident. For example, if a service outage brought down your authentication infrastructure, you would not be able to log in and fix it.
In addition to the failure risks of central management, opting for public key authentication over passwords for your users means having to manage their public keys across all your servers. If the administrative headache weren't enough, it's not uncommon to compromise your own security by having unknown public keys in the authorized_keys files. Furthermore, authorized_keys requires defining trust by individual key pair, which does not scale.
We recognized that authentication with signed certificates provides a single point of trust with no dependency on any third-party infrastructure: a server needs only the CA's public key to verify a certificate. We configured our SSH servers to trust our certificate authority (CA) and everything it signs. The concept is simple and already widely used across the internet for HTTPS traffic; we adopted it in the SSH world.
When designing a security system, regardless of purpose or protocol, you need to think of authentication and authorization separately. Authentication securely verifies that the client is who it claims to be but does not grant any permissions. After the successful authentication, authorization decides whether or not the client can perform a specific action. In addition, accounting tracks the entire process, in case you need to investigate later.
When you have a large set of users, you need a central directory to manage them. To avoid a single point of failure, however, our production systems only have local accounts in /etc/passwd. The most common account used is root. The SSH servers only accept root logins if the client certificate carries specific principals granted by the CA, and the CA itself determines who can use SSH and where. Using local accounts introduces a problem, though. The standard Unix login accounting (utmp, wtmp) is of little use here, because the last command shows only which local account logged in, not which person. However, OpenSSH can log which certificate was used to authenticate, providing enough information for rich accountability.
We also have a centralized syslog infrastructure that collects logs in real time from the entire fleet. The volume is large, and the infrastructure required to support it can quickly become complex. On the aggregators, parsers interpret the logs, transform them into tabular data, and send that data to long-term retention databases, like Hive. This creates a central place for login accounting. Whether the local account used for login is root or any other local account, the accounting system enables us to easily identify who logged in where and to perform statistical analysis on this data.
Logging in as root may sound a bit counterintuitive at first. However, with a trusted accounting infrastructure, it actually improves security and reliability because you can better enforce accountability for users' actions. You need to have a robust system that enforces access based on established business rules. We do this by defining security domains according to those rules.
Security domains
In typical environments, engineers have direct SSH access to production once they are in some trusted network. In our infrastructure, network firewalls prevent engineers from reaching production systems directly over SSH. Instead, they go through bastion hosts. Engineers can authenticate to those bastion hosts only from trusted networks. These hosts use centralized LDAP and Kerberos installations to share account information, and they require two-factor authentication to protect against password leakage. Once users are properly authenticated and authorized, a background process contacts our internal CA to request a signed SSH certificate. That certificate contains all principals allowed for that specific engineer. From the bastion host, having obtained the appropriate certificate, engineers can SSH into production machines as either root or other, lower-privileged users. This way, we make sure that no engineer has more access than they require.
We do not place SSH certificates on laptops because it is difficult to control everything that individual employees run on them. Even though they are centrally managed, laptops are more prone to vulnerabilities than the bastion servers. Therefore, we do not trust them with SSH private keys.
Implementation
We covered authorization and authentication philosophies. Now let's move on to implementation. What I described above may sound daunting, but it's not that complicated.
Let's walk through a few examples on how to set up a rudimentary certificate authority to sign certificates and hypothetical production servers to trust it, accepting only certain principals.
First, create your own CA, which is essentially just a normal key pair:
$ umask 77 # you really want to protect this :-)
$ mkdir ~/my-ca && cd ~/my-ca
$ ssh-keygen -C CA -f ca
You can decide whether to protect the private key with a passphrase.
At this point, you have two files, ca (the private key) and ca.pub (the public key). You need to distribute ca.pub to your entire fleet. Remember, this is meant to be public, so complete access lockdown isn't the goal. For this example, let's place it in /etc/ssh/ca.pub. You can change the permissions of ca.pub to 0644.
Now configure your SSH servers to trust it with this single line change in /etc/ssh/sshd_config:
TrustedUserCAKeys /etc/ssh/ca.pub
Now that you have a chain of trust, you can start generating certificates. Ideally, your CA should be a very secure server that only the security team can access. In small places, it may be OK to have a human being signing the certificate requests. For larger deployments, a system that does all that automatically and programmatically is preferred. As you can imagine, we have the latter at Facebook. Under the hood, our complex CA infrastructure simply receives a public key, runs all security checks, invokes ssh-keygen to sign it with the CA's private key, and returns the signed certificate back to the client.
One very important security practice is that private keys should never leave the systems where they've been generated, no matter how secure the transport is.
In the user's terminal, not on the CA server, generate a key pair for the user:
$ ssh-keygen -t ecdsa # or -t rsa, up to you
Generating public/private ecdsa key pair.
Enter file in which to save the key (/home/mfdutra/.ssh/id_ecdsa):
Enter passphrase (empty for no passphrase):***
Enter same passphrase again:***
Your identification has been saved in /home/mfdutra/.ssh/id_ecdsa.
Your public key has been saved in /home/mfdutra/.ssh/id_ecdsa.pub.
...
In your .ssh/ directory, you'll see id_ecdsa and id_ecdsa.pub. Copy the latter to the CA server and get it signed. Because this is public information, the transport isn't important. You can copy and paste, or fax it; just don't copy id_ecdsa anywhere.
On the CA server:
$ ssh-keygen -s ca -I mfdutra -n root -V +1w -z 1 id_ecdsa.pub
The ssh-keygen man page has a great explanation for each argument used. Basically, we're signing id_ecdsa.pub with ca. The certificate ID will be mfdutra and the only principal it has will be root. It's valid for one week and has the serial number 1. You should have id_ecdsa-cert.pub now. Copy this back to the user terminal and place it under .ssh/. Remember, this is public information, and this certificate doesn't work without its respective private key.
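The automated flow described earlier (receive a public key, check it, sign it, return the certificate) could be sketched as a small wrapper around the same ssh-keygen -s call. This is only an illustration, not how Facebook's CA works: the function name sign_request, the serial and issued.log files in the current directory, the one-day validity, and the throwaway ed25519 keys are all assumptions, and there is no locking for concurrent requests.

```shell
# Hypothetical helper: sign_request CA_KEY PUBKEY KEY_ID PRINCIPALS
# Checks the submitted public key, signs it with the next unused serial,
# and appends one line per issued certificate to issued.log.
sign_request() {
    ca_key=$1 pubkey=$2 key_id=$3 principals=$4

    # Reject anything ssh-keygen cannot parse as a key.
    if ! ssh-keygen -l -f "$pubkey" >/dev/null 2>&1; then
        echo "invalid public key: $pubkey" >&2
        return 1
    fi

    # Next serial = last recorded serial + 1 (starts at 1).
    serial=$(( $(cat serial 2>/dev/null || echo 0) + 1 ))

    ssh-keygen -q -s "$ca_key" -I "$key_id" -n "$principals" \
        -V +1d -z "$serial" "$pubkey" || return 1

    echo "$serial" > serial
    printf '%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$serial" "$key_id" "$principals" >> issued.log
}

# Usage with a throwaway CA and user key in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
ssh-keygen -q -t ed25519 -N '' -C CA -f ca
ssh-keygen -q -t ed25519 -N '' -C user -f id_user
sign_request ca id_user.pub mfdutra root
sign_request ca id_user.pub mfdutra root
cat issued.log
```

Each call writes the certificate next to the public key (here id_user-cert.pub) and leaves a record of who was issued which serial, which is what you need later if a certificate has to be traced or revoked.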
Since we haven't configured the servers to require a specific set of principals, the default sshd configuration will allow this certificate to log in as any user in its principal list. Since I used -n root to create the certificate, I can log in as root on any system that trusts the CA. If you don't have a special authorization schema in your organization, this is probably enough. However, you may prefer to have people log in as another local user and then use sudo to escalate. To do that, put the other user names in the -n argument, which takes a comma-separated list of principals.
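As a sketch of the multi-principal case, the commands below sign one key for two local accounts instead of root. The account names mfdutra and deploy, the serial number, and the throwaway ed25519 CA and user key in a scratch directory are assumptions for illustration; on a real CA you would run only the signing command against your existing ca key.

```shell
# Throwaway CA and user key in a scratch directory (illustration only).
workdir=$(mktemp -d)
cd "$workdir"
ssh-keygen -q -t ed25519 -N '' -C CA -f ca
ssh-keygen -q -t ed25519 -N '' -C user -f id_user

# -n takes a comma-separated list: this certificate can log in as the
# hypothetical local accounts "mfdutra" or "deploy", but not as root.
ssh-keygen -q -s ca -I mfdutra -n mfdutra,deploy -V +1w -z 3 id_user.pub

# Inspect the result; both names appear under "Principals:".
ssh-keygen -L -f id_user-cert.pub
```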
You can inspect a certificate with this command:
$ ssh-keygen -Lf id_ecdsa-cert.pub
id_ecdsa-cert.pub:
        Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
        Public key: ECDSA-CERT ...
        Signing CA: ECDSA ...
        Key ID: "mfdutra"
        Serial: 1
        Valid: from 2016-01-13T15:26:00 to 2016-01-20T15:27:00
        Principals:
                root
        Critical Options: (none)
        Extensions:
                permit-X11-forwarding
                permit-agent-forwarding
                permit-port-forwarding
                permit-pty
                permit-user-rc
Now that you have the certificate along with the private key, you can SSH anywhere as root.
$ ssh root@any-system-that-trusts-my-ca
Once you're logged in there, take a look at the authentication log. You'll see something like:
Accepted publickey for root from 1.2.3.4 port 123 ssh2: ECDSA-CERT ID mfdutra (serial 1) CA ECDSA fingerprint...
You can see that even when logging in directly as root, the server could still identify the certificate used to authenticate, in this case with ID mfdutra. This means using a correct -I with ssh-keygen is very important because it will identify the certificate. Using a unique serial number is also recommended, so you can identify each individual certificate issued. In fact, if you want to use revocation lists, unique serial numbers are a requirement.
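To show how a log line like this can feed login accounting, here is a minimal sketch that turns certificate logins into tab-separated rows. The function name parse_cert_logins and the column order are assumptions, and the pattern is written against the sample line above; the exact sshd message varies between OpenSSH versions (newer ones also print the key fingerprint before the ID), so treat the expression as a starting point.

```shell
# Print one tab-separated row (local user, source address, certificate ID,
# serial) for every certificate login found on standard input.
parse_cert_logins() {
    tab=$(printf '\t')
    sed -n -E "s/^.*Accepted publickey for ([^ ]+) from ([^ ]+) port [0-9]+ ssh2: [^ ]+-CERT (.* )?ID (.+) \(serial ([0-9]+)\) CA .*\$/\1${tab}\2${tab}\4${tab}\5/p"
}

# Usage with the sample line; on a server you would feed it the
# authentication log instead (its path varies by distribution).
echo 'Accepted publickey for root from 1.2.3.4 port 123 ssh2: ECDSA-CERT ID mfdutra (serial 1) CA ECDSA fingerprint...' | parse_cert_logins
```

Lines that are not certificate logins produce no output, so the function can be pointed at a whole log stream.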
Next, we'll define different security zones. For this example, let's create three zones: zone-webservers, zone-databases and root-everywhere.
Now, we'll configure one of our servers to accept only certain principals. Add this line to /etc/ssh/sshd_config:
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
Populate the principals file:
$ mkdir /etc/ssh/auth_principals
$ printf '%s\n' zone-webservers root-everywhere > /etc/ssh/auth_principals/root
You can control access to any local user by creating a file named after that user under /etc/ssh/auth_principals, since %u expands to the user name.
Reload the SSH server and try to SSH into it using the certificate you generated before. You'll be denied access, and the following line will show up in the authentication log on the server:
error: Certificate does not contain an authorized principal
Why create a zone called root-everywhere? Because it's convenient when certain users need to be able to log in everywhere. It's a better option than putting all possible principals in the user certificate, which is difficult to manage long-term.
You can now go back to the CA server and generate a new certificate for this user, with a different principal:
$ ssh-keygen -s ca -I mfdutra -n zone-webservers -V +1w -z 2 id_ecdsa.pub
With this new certificate back in the user terminal, you can SSH into the system successfully, because the list of principals in your certificate and the list of principals the server accepts now have at least one entry in common.
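The matching rule can be sketched as a small check you might run before issuing a certificate or while debugging a denied login. It is an approximation of what sshd decides, not its actual code: the function name has_common_principal is hypothetical, and the sketch ignores the optional key options that a principals file line may carry.

```shell
# has_common_principal CERT_PRINCIPALS FILE
# Exit 0 if any name in the comma-separated CERT_PRINCIPALS appears as a
# line in FILE. Blank lines and lines starting with '#' are skipped.
has_common_principal() {
    awk -v wanted="$1" '
        BEGIN {
            n = split(wanted, list, ",")
            for (i = 1; i <= n; i++) if (list[i] != "") have[list[i]] = 1
        }
        /^[ \t]*(#|$)/ { next }
        {
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)
            if (line in have) { found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    ' "$2"
}

# Usage against a scratch copy of the principals file from this example.
principals_file=$(mktemp)
printf '%s\n' zone-webservers root-everywhere > "$principals_file"
has_common_principal root "$principals_file" || echo "root: denied"
has_common_principal zone-webservers "$principals_file" && echo "zone-webservers: allowed"
```

The first call mirrors the denied login with the original certificate, and the second mirrors the successful one with the zone-webservers certificate.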
Get another server and configure it in the zone zone-databases. Play with the certificates to see how it goes. Your imagination is the limit!
Next steps
The above covers the approach Facebook has taken to control authorization on production systems. We have invested time in building secure zones and a sophisticated access control management system. Various organizations will need different designs and your design should reflect your organization’s needs; there is no one-size-fits-all model.
Want to learn more? Check out the ssh-keygen man page, especially the section about special options (-O). You can do some pretty cool stuff with certificates, such as having one that only allows running a specific command, or attaching a certificate to a specific host.
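As one possible illustration of those options, the sketch below issues a certificate that can only run a single command and is only accepted from one address range. The key ID backup-job, the command path /usr/local/bin/run-backup, the 10.0.0.0/8 range, and the throwaway ed25519 keys are assumptions made up for the example.

```shell
# Throwaway CA and job key in a scratch directory (illustration only).
workdir=$(mktemp -d)
cd "$workdir"
ssh-keygen -q -t ed25519 -N '' -C CA -f ca
ssh-keygen -q -t ed25519 -N '' -C backup-job -f id_backup

# -O clear drops the default permit-* extensions, force-command pins the
# certificate to one command, and source-address limits the client
# addresses the certificate is accepted from.
ssh-keygen -q -s ca -I backup-job -n root -V +1d -z 4 \
    -O clear \
    -O force-command=/usr/local/bin/run-backup \
    -O source-address=10.0.0.0/8 \
    id_backup.pub

# The restrictions show up under "Critical Options:".
ssh-keygen -L -f id_backup-cert.pub
```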
You can also use certificates related to host keys. With that, the SSH clients can automatically trust all hosts with a host certificate signed by the CA, eliminating the need to manually accept every new host you SSH into.
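A host certificate is signed much like a user certificate, with -h added. The sketch below uses a throwaway CA and a stand-in host key in a scratch directory; the host name web1.example.com, the *.example.com pattern, and the serial number are assumptions. Many sites use a separate CA for hosts and users, which this sketch does not show.

```shell
# Throwaway CA and a stand-in for a server's existing host key
# (on a real server: /etc/ssh/ssh_host_ed25519_key).
workdir=$(mktemp -d)
cd "$workdir"
ssh-keygen -q -t ed25519 -N '' -C CA -f ca
ssh-keygen -q -t ed25519 -N '' -C host -f ssh_host_ed25519_key

# -h makes this a host certificate; for hosts, -n lists the names
# clients use to connect.
ssh-keygen -q -s ca -I web1.example.com -h -n web1.example.com \
    -V +52w -z 100 ssh_host_ed25519_key.pub

# Server side, in sshd_config:
#   HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub
# Client side, one known_hosts line that trusts the CA for matching names:
printf '@cert-authority *.example.com %s\n' "$(cat ca.pub)" > known_hosts_snippet
cat known_hosts_snippet
```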
A few parting words of advice: When you build your CA, be it a small script or a complex system, make sure you keep track of all certificates you issue. If you find yourself in the unfortunate situation of having a compromised certificate (and its respective private key) and you don't know how to revoke it, your last resort is to rotate the entire CA. If you end up having a programmatic CA, consider issuing short-lived certificates, e.g., valid for 24 hours. This shortens the window of opportunity for an attack if you experience a compromise.
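If you do know the serial number of a compromised certificate, OpenSSH key revocation lists (KRLs) offer a way to revoke it without rotating the CA. The sketch below is a minimal example under stated assumptions: the throwaway ed25519 keys, the serial number 7, and the file names revoked.spec and revoked.krl are made up for illustration.

```shell
# Throwaway CA, user key, and a certificate with serial 7.
workdir=$(mktemp -d)
cd "$workdir"
ssh-keygen -q -t ed25519 -N '' -C CA -f ca
ssh-keygen -q -t ed25519 -N '' -C user -f id_user
ssh-keygen -q -s ca -I mfdutra -n root -V +1d -z 7 id_user.pub

# Build a KRL that revokes serial 7 as issued by this CA. The spec file
# syntax is described under KEY REVOCATION LISTS in ssh-keygen(1).
printf 'serial: 7\n' > revoked.spec
ssh-keygen -q -k -f revoked.krl -s ca.pub revoked.spec

# -Q exits non-zero if a tested key or certificate is revoked.
if ssh-keygen -Q -f revoked.krl id_user-cert.pub >/dev/null 2>&1; then
    result="still valid"
else
    result="revoked"
fi
echo "$result"

# Servers would then load the list through sshd_config:
#   RevokedKeys /etc/ssh/revoked.krl
```

The KRL has to be distributed to every server that trusts the CA, so short-lived certificates remain the simpler safeguard when your CA is automated.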
Above all, protect your CA private key and consider rotating it regularly.