Secret Management with Vault

This post is the first of a two-part series on using Vault in production. Both posts are slightly redacted forms of internal documentation. This post will cover why we chose our specific workflow, and the second post will cover day-to-day usage of Vault.

Problems

Sensitive credentials and keys are stored in certain code repositories (Github).

Anyone with access to Github has access to these credentials.
Anyone who has checked out code has these sensitive credentials on their hard drive.
Key rollovers are a very difficult, manual process.

Sensitive credentials and keys are stored in plain text.

Anyone who can see these credentials can use them.

Shared credentials and keys are used in numerous places.

Generating a meaningful audit log is difficult.

Goals

Encrypt sensitive credentials and keys at rest.
Store sensitive credentials and keys in a central, remote, network accessible location.
Gate and audit access to sensitive credentials and keys.
Provide a unique identifier to each user/agent (for auditing purposes).

Solution

Vault

By leveraging Vault, we can meet all of our goals.

1. Encrypt sensitive credentials and keys at rest.

Vault encrypts all stored data at rest.

2. Store sensitive credentials and keys in a central, remote, network accessible location.

Vault is a highly available secret management solution that is network accessible via its HTTP API or via running a local client.

3. Gate and audit access to sensitive credentials and keys.
4. Provide a unique identifier to each user/agent (for auditing purposes).

Vault allows for per-user, per-machine, or per-app credentials, controlling access as granularly as needed or desired. In addition, all requests and key usages are recorded in Vault’s logs or syslog, which can be shipped to a centralized logging solution.

Implementation Strategy

While Vault provides the primitives and tools, we still need to form a process that understands and works with SeatGeek both now and in the future. With encryption and auditing handled, our job is to store and provide access to secrets as well as manage tokens.

NOTE: The following assumes knowledge about specific Vault features, general AWS knowledge, and SeatGeek’s Base AMI.

Storing Secrets

At SeatGeek (and most other software shops), the two most common types of secrets are the following:

Per Environment

This includes secrets that are the same for every machine or application, but differ based on the current environment. It is important to note that they are commonly used, or can be used, by all machines or applications.

Examples: New Relic, PagerDuty

Per Application

This includes secrets that differ between applications, where an application is the combination of itself and the environment in which it runs. This also includes secrets that are not common to every application, regardless of whether a single value is always used.

Examples: Braintree Token, Spreedly Key, Sentry DSN

To address these two use cases, we will be using Vault’s generic secret backend.

The reasons for using this backend are simplicity and flexibility. It allows arbitrary key-value pairs to be stored, encrypted, and retrieved from Vault without relying on third-party services.

The generic secret backend allows for key-value pairs to be written under the namespace secret, and can be associated with various ACL’s. The currently used schema is of the following form:

secret/ENVIRONMENT/APP/KEY value=VALUE

Here, the top level under the secret namespace is ENVIRONMENT, with each APP getting its own bucket per ENVIRONMENT in which KEYs are written. Vault KEYs can themselves contain a dictionary of key-value pairs, so the secret VALUE is written to the field named value.

NOTE: bucket == namespace

The following environments exist:

production
staging
management
test

Each app will have a bucket created when it is configured to launch in a given environment. Additionally, for our per environment secrets there is a common bucket under each ENVIRONMENT namespace.

Examples of secrets in the wild:

Staging New Relic key
secret/staging/common/NEW_RELIC value=THISISTHEKEY

Production API Spreedly Token
secret/production/api/SPREEDLY_TOKEN value=THISISTHETOKEN
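The schema above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the names `secret_path`, `write_command`, and `ENVIRONMENTS` are hypothetical and not part of Vault or of our tooling, and the validation rules are a guess at what would be sensible.

```python
import shlex

# Hypothetical helper (not part of Vault): the environments listed above.
ENVIRONMENTS = {"production", "staging", "management", "test"}


def secret_path(environment: str, app: str, key: str) -> str:
    """Build a path following the secret/ENVIRONMENT/APP/KEY schema."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    for part in (app, key):
        # Assumption: a segment must be non-empty and must not add path levels.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"secret/{environment}/{app}/{key}"


def write_command(environment: str, app: str, key: str, value: str) -> str:
    """Return a `vault write` invocation storing VALUE under the field `value`."""
    path = secret_path(environment, app, key)
    return f"vault write {path} {shlex.quote('value=' + value)}"
```

For example, `secret_path("staging", "common", "NEW_RELIC")` yields the path used for the staging New Relic key above.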

Accessing Secrets

The basic premise here is that a client authenticates and is granted a token. That token, among other things, is associated with a role and a corresponding set of authorizations in the form of policies or permissions.

Authentication

At SeatGeek (for the time being), there are two Vault clients we need to worry about:

Developers

These are people who write code at SeatGeek. Developers should be granted enough access to be able to do their jobs while keeping our sensitive information secure and our applications running.

Machines

These include any servers running within SeatGeek infrastructure. Machines should be able to self-authenticate in order to retrieve necessary secrets for provisioning and running applications.

NOTE: This workflow differs for Admins, who are granted root tokens with no permission restrictions.

To provide these levels of access, two different Vault authentication strategies will be used: github authentication and app-id authentication.

The github authentication strategy was chosen here as we are already using it as a means of authenticating people for internal applications, and so some user grouping has already been done.

The app-id authentication strategy is used for roughly the same reasons as the generic secret backend. It is the simplest and most flexible to implement without relying on other systems.

Successful authentication via either of these methods results in a Vault token, which can be used to retrieve secrets.
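As a rough sketch of what "using a token to retrieve a secret" looks like over Vault's HTTP API, the Python below builds (but does not send) a read request and pulls the value out of a decoded response. The address `vault.example.internal` is a placeholder, and the function names are hypothetical; the plain `http` scheme reflects that TLS is currently disabled on our cluster.

```python
import urllib.request

# Assumption: placeholder address, not our real Vault endpoint.
VAULT_ADDR = "http://vault.example.internal:8200"


def read_secret_request(token: str, path: str, vault_addr: str = VAULT_ADDR) -> urllib.request.Request:
    """Build a GET request for a secret path, authenticated with a Vault token."""
    return urllib.request.Request(
        url=f"{vault_addr}/v1/{path}",
        headers={"X-Vault-Token": token},  # the token travels in this header
        method="GET",
    )


def secret_value(response: dict) -> str:
    """Extract the `value` field from a decoded generic-backend read response."""
    return response["data"]["value"]
```

Sending the request with `urllib.request.urlopen` and decoding the JSON body is left out so the sketch stays free of network calls.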

Our github authentication simply allows anyone in the SeatGeek Github organization on the team-developers team to request and retrieve a Vault token. This is done by making a Vault login request with a Github personal access token. While this does not include everyone who writes code, it handles the majority of users for now.
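A minimal Python sketch of that login request follows. It only constructs the request and shows where the token sits in the response; the Vault address is a placeholder and the function names are hypothetical, so treat it as an illustration rather than our actual client code.

```python
import json
import urllib.request

# Assumption: placeholder address, not our real Vault endpoint.
VAULT_ADDR = "http://vault.example.internal:8200"


def github_login_request(personal_access_token: str, vault_addr: str = VAULT_ADDR) -> urllib.request.Request:
    """Build (but do not send) a login request for Vault's github auth backend."""
    body = json.dumps({"token": personal_access_token}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr}/v1/auth/github/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def client_token(login_response: dict) -> str:
    """Extract the Vault token from a decoded login response."""
    return login_response["auth"]["client_token"]
```

The same result can be obtained from the command line with the Vault client's github auth method; the HTTP form is shown here because it makes the exchanged data explicit.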

Our app-id strategy, reserved for machine authentication, is highly dependent on AWS and our newer infrastructure strategies. When an AWS machine boots up, it can be configured to run with an IAM Role. This role is unique per application per environment, and also includes an id which can be retrieved from an instance’s metadata on the machine itself. Using this information, all SeatGeek IAM roles are whitelisted within Vault against their matching app and associated with an IP range that corresponds to the environment’s VPC IP range. This is our user-id in Vault terms. Machines can then make a Vault login request with the app they are responsible for running (applied during configuration management) and their IAM Role Instance Profile ID (attachment id). Assuming all pieces line up (IP address, app id, IAM Role Instance Profile ID), a Vault token is granted.
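The "all pieces line up" check can be modeled in a few lines of Python. This is a simplified model of the matching logic, not Vault's implementation: `WhitelistEntry`, `login_allowed`, and the sample values in the usage note are hypothetical. The payload helper shows the two fields a machine submits to the app-id login endpoint.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class WhitelistEntry:
    """One whitelisted user-id, tied to an app and an environment's VPC range."""
    app_id: str
    user_id: str     # IAM Role Instance Profile ID
    cidr_block: str  # the environment's VPC IP range


def login_allowed(entry: WhitelistEntry, app_id: str, user_id: str, source_ip: str) -> bool:
    """Return True only if app id, user id, and source IP all match the entry."""
    in_range = ipaddress.ip_address(source_ip) in ipaddress.ip_network(entry.cidr_block)
    return app_id == entry.app_id and user_id == entry.user_id and in_range


def app_id_login_payload(app_id: str, user_id: str) -> dict:
    """Body a machine would POST to /v1/auth/app-id/login."""
    return {"app_id": app_id, "user_id": user_id}
```

With an entry such as `WhitelistEntry("api", "AIPAEXAMPLEPROFILEID", "10.0.0.0/16")`, a request from `10.0.4.7` with matching ids passes, while the same ids presented from `192.168.1.5` do not.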

Additionally, and only for machine authentication, there is an ENVIRONMENT-base-ami role that all machines can authenticate as. This allows all machines on boot to retrieve environment secrets via Vault’s app-id strategy without knowing which app is to be deployed. This is (or would be) primarily used to test the Base AMI in isolation in our environments.

In both of these app-id authentication scenarios, the user-id is the machine IAM Role Id. However, when applications authenticate, the user-id is app-IAM_ROLE_ID. user-ids must be unique, and this allows us to have two user-ids for a given IAM Role along with the appropriate configuration.

In the latest release of Vault, the app-id strategy has been deprecated in favor of a new app-role strategy. Ultimately we will migrate from app-id to app-role with roughly the same implementation, but we are currently held back by the version of Vault (0.6.0) and the vault-ruby (0.6.0) gem we are using.

Authorization

Vault implements authorization via its own ACLs, or policies. These provide a set of permissions which can be scoped to various operations within Vault, typically indicated by namespaces. In the case of obtaining secrets, that namespace is secret. Additionally, these ACLs can be associated with the various authentication strategies. A more generic way to think of it is that a client authenticates and is granted a token. That token, among other things, is associated with a role and a corresponding set of policies (the same as in other authentication/authorization strategies).

The following policies are currently used to control access to Vault secrets:

staging-read-only
testing-read-write
ENVIRONMENT-APP-read-only
ENVIRONMENT-common-read-only

As far as developer authorization, all Github users are granted staging-read-only and testing-read-write. This means that any secret under the staging namespace can be read, and that developers have free rein within the testing namespace. Production read-only access will be granted on a per-application basis to service owners, and will be implemented via Github teams.

As far as machine authorization, machines are granted the ENVIRONMENT-APP-read-only and ENVIRONMENT-common-read-only policies. As such, machines can access the common bucket and their app bucket within their ENVIRONMENT, and nothing else. Cross-ENVIRONMENT and cross-app secret access is currently disabled and discouraged, although this might be revisited in the future.
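The policy naming above follows a predictable pattern, which the Python sketch below makes explicit. The function names are hypothetical, and the rendered policy body is an assumption about what a read-only policy for one bucket might contain, not a copy of our actual policies.

```python
# Policies granted to every developer, as listed above.
DEVELOPER_POLICIES = ["staging-read-only", "testing-read-write"]


def machine_policies(environment: str, app: str) -> list:
    """Policy names granted to a machine running APP in ENVIRONMENT."""
    return [f"{environment}-{app}-read-only", f"{environment}-common-read-only"]


def read_only_policy(environment: str, bucket: str) -> str:
    """Render a policy document granting read access to one bucket.

    Assumption: a single path rule with only the "read" capability.
    """
    return (
        f'path "secret/{environment}/{bucket}/*" {{\n'
        '  capabilities = ["read"]\n'
        "}\n"
    )
```

A production api machine would therefore hold `production-api-read-only` and `production-common-read-only`, which is what keeps it out of every other app's bucket.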

Important to note here is the inability for non-Admins to write or update anything in Vault. These permissions are currently restricted to members of the Operations team, but this will surely be revisited in the future.

Token Management

As of now, Vault tokens last forever once granted. This is a temporary measure that allows for simplicity of use, but additional tooling will allow this to be changed.

Causes for Concern

Admins are granted root tokens
Developer authentication and authorization is reliant on Github
Machine credentials can be used on other machines within an IP Range
Assumptions are made around machines running a single application
Tokens last forever and can be reused if retrieved
Vault is not using TLS
Metrics are not currently sent anywhere
No UI solution for managing secrets
Not possible to easily assume an application’s environment

Strategic Improvements

Admin Tokens

Currently, Admins are granted root tokens without permission restrictions. The latest version of Vault (0.6.2) has changed the ways in which root tokens are created and used, and as such, these could be substituted with Admin tokens, or tokens with equivalent or slightly fewer permissions.

Developer Authentication/Authorization

With a centralized login system, developers would be able to authenticate by means other than Github, potentially making us more flexible and less dependent on a third party. Permission granularity could also be provided on a per-user basis, allowing for trusted production access (ex: service owner access).

Machine Authentication

While we are already leveraging AWS for machine authentication, there are improvements in Vault to make this simpler and more secure. This integration would tie us tighter to AWS infrastructure, but it is doubtful we would run servers elsewhere, and if so we have an existing strategy.

These improvements involve allowing machines to authenticate one time with AWS dynamic metadata, addressing the issue of credential (re)use on different machines. Machines can currently be whitelisted by IAM Role or AMI.

App Authentication

We currently have a decent strategy for machine authentication, but our application authentication lacks flexibility. Specifically we assume that a single machine is running a single application and as such has a single IAM Role with the appropriate permissions for that application. This does not work if multiple applications coexist on a single machine, or if an application is broken up into tiers.

A way to combat this is to have application authentication use a different mechanism than machine authentication. This will require a revisit, but will most likely leverage Vault’s Cubbyhole to deliver multi-application scoped tokens via one-time tokens.

Token Management

Tokens last forever currently, and should have leases and TTL’s. This would involve additional work to renew token leases as necessary.
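To give a sense of the extra work involved, here is a minimal Python sketch of the decision a client would have to make once tokens carry a TTL. Everything in it is an assumption for illustration: the function names, the "renew at half the TTL" rule, and the convention that a TTL of 0 stands for a token that never expires, as ours do today.

```python
def is_expired(issued_at: float, ttl_seconds: int, now: float) -> bool:
    """True once the TTL has fully elapsed. A TTL of 0 is treated as non-expiring."""
    if ttl_seconds <= 0:
        return False
    return now - issued_at >= ttl_seconds


def needs_renewal(issued_at: float, ttl_seconds: int, now: float,
                  renew_fraction: float = 0.5) -> bool:
    """True once more than `renew_fraction` of the TTL has elapsed.

    Assumption: renewing halfway through the TTL leaves room for retries.
    """
    if ttl_seconds <= 0:
        return False  # nothing to renew for a non-expiring token
    return now - issued_at >= ttl_seconds * renew_fraction
```

A long-running process would call `needs_renewal` periodically with timestamps from `time.time()` and renew its token before `is_expired` ever becomes true.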

TLS

TLS is disabled on our Vault cluster as it is addressed only within our internal network. With TLS required in practice for HTTP/2 connections, this will be revisited in the future, most likely with Vault serving as an internal CA.

Metrics and Monitoring

We are still in the early stages of adoption and use, but Vault has support for shipping application stats via a few means including StatsD.

Web UI

Either writing a UI or adopting an existing open source solution would be extremely beneficial, as it would remove the burden of managing secrets from the Operations team while also allowing developers more control over how their applications are configured.

Locally Assuming App Roles

There is currently no way to run a command locally using the credentials in staging/production for a given application. Something like a .env file writer or a foreman-style command runner for our application manifests could go a long way in allowing developers to run services locally while simulating an environment.
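As a sketch of what the .env writer half of that idea might look like, the Python below renders already-fetched secrets into .env format. No such tool exists yet; the function name, the quoting rules, and the choice to let app secrets override common ones are all assumptions.

```python
def render_dotenv(common: dict, app: dict) -> str:
    """Render secrets from the common and app buckets as .env file contents."""
    merged = {**common, **app}  # assumption: app-specific values win on conflict
    lines = []
    for key in sorted(merged):
        # Escape backslashes and double quotes so values survive quoting.
        value = str(merged[key]).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}="{value}"')
    return "\n".join(lines) + "\n"
```

The inputs would be the decoded contents of the `common` bucket and the app's own bucket for the chosen environment; writing the result to disk is deliberately left to the caller, since a file full of production secrets deserves careful handling.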

Vault Configuration

https://www.vaultproject.io/docs/config/index.html

Below is our current Vault configuration, which takes into account the following conditions:

Vault is running within our internal network and is not publicly accessible.
Consul is already being used

backend "consul" {
  address = "127.0.0.1:8500"
  path = "vault"
}

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 1
}

Vault differentiates itself from other secret management services with its high availability option, and we leverage the Consul backend to deliver that. The Consul client is already configured to run on all of our machines (with default port mappings), with our Vault servers being no different. This also means that all data is stored encrypted in Consul, and so the Consul install should also be highly available.

As Vault is run within our internal network (and for other reasons), TLS is disabled. While TLS is desirable, we need to do additional work to make internal TLS usage a reality. Vault is also running on the standard default port of 8200 and listening on all network interfaces.

If you think these kinds of things are interesting, consider working with us as an Infrastructure Engineer at SeatGeek. Or, if infrastructure isn’t your thing, we have other openings in engineering and beyond!

Reference: https://sreeninet.wordpress.com/2016/10/01/vault-use-cases/