This post is the first of a two-part series on using Vault in production. Both posts are slightly redacted forms of internal documentation. This post will cover why we chose our specific workflow, and the second post will cover day-to-day usage of Vault.
Sensitive credentials and keys are stored in certain code repositories (Github).
- Anyone with access to Github has access to these credentials.
- Anyone who has checked out code has these sensitive credentials on their hard drive.
- Key rollovers are a very difficult, manual process.
Sensitive credentials and keys are stored in plain text.
- Anyone who can see these credentials can use them.
Shared credentials and keys are used in numerous places.
- Generating a meaningful audit log is difficult.
- Encrypt sensitive credentials and keys at rest.
- Store sensitive credentials and keys in a central, remote, network accessible location.
- Gate and audit access to sensitive credentials and keys.
- Provide a unique identifier to each user/agent (for auditing purposes).
By leveraging Vault, we can meet all of our goals.
1. Encrypt sensitive credentials and keys at rest.
Vault encrypts all stored data at rest.
2. Store sensitive credentials and keys in a central, remote, network accessible location.
Vault is a highly available secret management solution that is network accessible via its HTTP API or via a locally running client.
3. Gate and audit access to sensitive credentials and keys.
4. Provide a unique identifier to each user/agent (for auditing purposes).
Vault allows for per-user, per-machine, or per-app credentials, controlling access as granularly as needed. In addition, all requests and key usages are recorded in Vault’s logs or syslog, which can be shipped to a centralized logging solution.
While Vault provides the primitives and tools, we still need to form a process that understands and works with SeatGeek both now and in the future. With encryption and auditing handled, our job is to store and provide access to secrets as well as manage tokens.
NOTE: The following assumes knowledge about specific Vault features, general AWS knowledge, and SeatGeek’s Base AMI.
At SeatGeek (and most other software shops), the two most common types of secrets are the following:
Per-environment secrets are the same for every machine or application, but differ based on the current environment. Importantly, they are used, or can be used, by all machines and applications.
Examples: New Relic, PagerDuty
Per-application secrets differ between applications, where an application is the combination of itself and the environment in which it runs. This also includes secrets that are not common to every application, regardless of whether one value is always used.
Examples: Braintree Token, Spreedly Key, Sentry DSN
To address these two use cases, we will be using Vault’s generic secret backend.
The reasons for using this backend are simplicity and flexibility. It allows arbitrary key-value pairs to be stored, encrypted, and retrieved from Vault without relying on third-party services.
The generic secret backend allows key-value pairs to be written under the namespace secret, and these can be associated with various ACLs. The currently used schema is of the following form:
Here, the top level under the secret namespace is ENVIRONMENT, with each APP getting its own bucket per ENVIRONMENT in which KEYs are written. Vault KEYs can contain a dictionary of key-value pairs themselves, and so the secret VALUE is written to a key within that dictionary.
NOTE: bucket == namespace
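The layout described above can be sketched as a small path builder. This is only a minimal sketch: it assumes the segments are joined with slashes under secret, and the function names and example app name are hypothetical, not part of our tooling.

```python
def secret_path(environment, app, key):
    """Build the Vault path for a KEY in an APP bucket of an ENVIRONMENT."""
    parts = [environment, app, key]
    for part in parts:
        # Reject empty segments and embedded slashes so a value cannot
        # point outside its own bucket.
        if not part or "/" in part:
            raise ValueError("invalid path segment: %r" % (part,))
    return "secret/" + "/".join(parts)


def common_path(environment, key):
    """Per-environment secrets live in the common bucket of an ENVIRONMENT."""
    return secret_path(environment, "common", key)
```

For example, secret_path("production", "example-app", "sentry_dsn") yields secret/production/example-app/sentry_dsn.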
The following environments exist:
Each app will have a bucket created when it is configured to launch in a given environment. Additionally, for our per-environment secrets there is a common bucket under each environment.
Examples of secrets in the wild:
The basic premise here is a client authenticates and is granted a token. That token, among other things, is associated with a role and corresponding set of authorizations in the form of policies or permissions.
At SeatGeek (for the time being), there are two Vault clients we need to worry about:
Developers are the people who write code at SeatGeek. They should be granted enough access to be able to do their jobs while keeping our sensitive information secure and our applications running.
Machines include any servers running within SeatGeek infrastructure. They should be able to self-authenticate in order to retrieve the secrets needed for provisioning and running applications.
NOTE: This workflow differs for Admins, who are granted root tokens with no permission restrictions.
To provide these levels of access, two different Vault authentication strategies will be used: github authentication and app-id authentication.
The github authentication strategy was chosen because we already use it to authenticate people for internal applications, and so some user grouping has already been done.
The app-id authentication strategy is used for roughly the same reasons as the generic secret backend. It is the simplest and most flexible to implement without relying on other systems.
Successful authentication via either of these methods results in a Vault token, which can be used to retrieve secrets.
The github authentication strategy simply allows anyone in the SeatGeek Github organization on the team-developers team to request and retrieve a Vault token. This is done by making a Vault login request with a Github personal access token. While this does not include everyone who writes code, it handles the majority of users for now.
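As a rough illustration of the login request, the sketch below builds (but does not send) a call to the login endpoint of Vault’s github auth backend and shows where the token sits in the response. The helper names and the Vault address used in the usage note are hypothetical.

```python
import json
import urllib.request


def github_login_request(vault_addr, github_token):
    """Build (but do not send) a login request for Vault's github auth backend."""
    body = json.dumps({"token": github_token}).encode("utf-8")
    return urllib.request.Request(
        url=vault_addr.rstrip("/") + "/v1/auth/github/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def client_token(login_response):
    """Extract the Vault token from a parsed (JSON-decoded) login response."""
    return login_response["auth"]["client_token"]
```

A caller would pass the request to urllib.request.urlopen, decode the JSON body, and hand the result to client_token. The address would be something like http://vault.example.internal:8200, a placeholder rather than our real hostname.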
The app-id strategy, reserved for machine authentication, is highly dependent on AWS and our newer infrastructure strategies. When an AWS machine boots up, it can be configured to run with an IAM Role. This role is unique per application per environment, and also includes an ID which can be retrieved from an instance’s metadata on the machine itself. Using this information, all SeatGeek IAM roles are whitelisted within Vault against their matching app and associated with an IP range that corresponds to the environment’s VPC IP range. This is our user-id in Vault terms. Machines can then make a Vault login request with the app they are responsible for running (applied during configuration management) and their IAM Role Instance Profile ID (attachment id). Assuming all pieces line up (IP address, app id, IAM Role Instance Profile ID), a Vault token is granted.
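The machine side of this flow might look roughly like the sketch below. It assumes the instance profile ID is read from the EC2 iam/info metadata document and that the app name and profile ID map onto the app_id and user_id fields of the app-id login request. The function names and sample values are hypothetical.

```python
import json

# EC2 instance metadata document describing the attached instance profile.
IAM_INFO_URL = "http://169.254.169.254/latest/meta-data/iam/info"


def instance_profile_id(iam_info_json):
    """Pull the InstanceProfileId out of the iam/info metadata document."""
    return json.loads(iam_info_json)["InstanceProfileId"]


def app_id_login_payload(app, iam_info_json):
    """Body for a POST to /v1/auth/app-id/login (assumed field mapping)."""
    return {"app_id": app, "user_id": instance_profile_id(iam_info_json)}
```

Vault performs the IP range check on its side, so nothing about the VPC needs to appear in the payload.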
Additionally, and only for machine authentication, there is an ENVIRONMENT-base-ami role that all machines can authenticate as. This allows all machines to retrieve environment secrets on boot via Vault’s app-id strategy without knowing which app is to be deployed. It is primarily intended for testing the Base AMI in isolation in our environments.
In both of these app-id authentication scenarios, the user-id is the machine IAM Role ID. However, when applications authenticate, the user-ids must be unique, and this allows us to have two user-ids for a given IAM Role along with the appropriate configuration.
In the latest release of Vault, the app-id strategy has been deprecated in favor of a new app-role strategy. Ultimately we will migrate from app-id to app-role with roughly the same implementation, but we are currently held back by the version of Vault (0.6.0) and the vault-ruby (0.6.0) gem we are using.
Vault implements authorization via its own ACLs, or policies. These provide a set of permissions which can be scoped to various operations within Vault, typically indicated by namespaces. In the case of obtaining secrets, that namespace is secret. Additionally, these ACLs can be associated with the various authentication strategies. A more generic way to think of it is that a client authenticates and is granted a token. That token, among other things, is associated with a role and corresponding set of policies (same as other authentication/authorization strategies).
The following policies are currently used to control access to Vault secrets:
As far as developer authorization, all Github users are granted testing-read-write along with read access to staging. If not obvious, this means that any secret under the staging namespace can be read, and developers have free rein within the testing namespace. production read-only access will be granted on a per-application basis to service owners, and will be implemented via Github teams.
As far as machine authorization, machines are granted the ENVIRONMENT-common-read-only policy. As such, machines can access the common bucket and their app bucket within their ENVIRONMENT, and nothing else. Cross-ENVIRONMENT and cross-app secret access is currently disabled and discouraged, although this might be revisited in the future.
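The machine access rule can be modeled with a small predicate. This is only a sketch of the rule as stated here, not how Vault evaluates policies, and it assumes paths follow the secret/ENVIRONMENT/BUCKET/KEY layout. The function name is hypothetical.

```python
def machine_can_read(path, environment, app):
    """Return True if a machine running `app` in `environment` may read `path`."""
    parts = path.split("/")
    # Expect secret/ENVIRONMENT/BUCKET/KEY; shorter or malformed paths are rejected.
    if len(parts) < 4 or parts[0] != "secret" or "" in parts:
        return False
    path_env, bucket = parts[1], parts[2]
    # Only the common bucket and the machine's own app bucket are readable,
    # and only inside the machine's own environment.
    return path_env == environment and bucket in ("common", app)
```

A production machine running example-app could read secret/production/common/new_relic and secret/production/example-app/sentry_dsn, but nothing under staging or under another app’s bucket.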
Important to note here is the inability for non-Admins to write or update anything in Vault. These permissions are currently restricted to members of the Operations team, but this will surely be revisited in the future.
As of now, Vault tokens last forever once granted. This is a temporary measure that allows for simplicity of use, but additional tooling will allow this to be changed.
Causes for Concern
- Admins are granted root tokens
- Developer authentication and authorization is reliant on Github
- Machine credentials can be used on other machines within an IP Range
- Assumptions are made around machines running a single application
- Tokens last forever and can be reused if retrieved
- Vault is not using TLS
- Metrics are not currently sent anywhere
- No UI solution for managing secrets
- Not possible to easily assume an application’s environment
Currently, Admins are granted root tokens without permission restrictions. The latest version of Vault (0.6.2) has changed the ways in which root tokens are created and used, and as such, these could be substituted with Admin tokens, or with tokens that have equivalent or slightly fewer permissions.
With a centralized login system, developers would be able to authenticate by means other than Github, which could be more flexible and less dependent on a third party. Permission granularity could also be provided on a per-user basis, allowing for trusted production access (e.g. service owner access).
While we are already leveraging AWS for machine authentication, there are improvements in Vault to make this simpler and more secure. This integration would tie us tighter to AWS infrastructure, but it is doubtful we would run servers elsewhere, and if so we have an existing strategy.
These improvements involve allowing machines to authenticate one time with AWS dynamic metadata, addressing the issue of credential (re)use on different machines. Machines can currently be whitelisted by IAM Role or AMI.
We currently have a decent strategy for machine authentication, but our application authentication lacks flexibility. Specifically we assume that a single machine is running a single application and as such has a single IAM Role with the appropriate permissions for that application. This does not work if multiple applications coexist on a single machine, or if an application is broken up into tiers.
A way to combat this is to have application authentication use a different mechanism than machine authentication. This will need to be revisited, but will most likely leverage Vault’s Cubbyhole to provide multi-application scoped tokens via one-time tokens.
Tokens currently last forever, and should have leases and TTLs. This would involve additional work to renew token leases as necessary.
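The renewal work could be as small as the sketch below: a check for when a token is due, plus a request against Vault’s token renew-self endpoint. The halfway threshold, the function names, and the choice to build the request without sending it are all assumptions made for illustration.

```python
import json
import urllib.request


def needs_renewal(issued_at, ttl_seconds, now, threshold=0.5):
    """True once at least `threshold` of the token's TTL has elapsed."""
    return (now - issued_at) >= ttl_seconds * threshold


def renew_self_request(vault_addr, token, increment_seconds):
    """Build (but do not send) a request to renew the calling token."""
    body = json.dumps({"increment": increment_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=vault_addr.rstrip("/") + "/v1/auth/token/renew-self",
        data=body,
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Renewing at the halfway point leaves room for a few retries before the lease actually expires.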
TLS is disabled on our Vault cluster as it is addressed only within our internal network. With the requirement of TLS for all HTTP 2.0 connections, this will be revisited in the future and most likely with Vault serving as an internal CA.
Metrics and Monitoring
We are still in the early stages of adoption and use, but Vault has support for shipping application stats via a few means including StatsD.
Either writing a UI for managing secrets or adopting an existing open source solution would be extremely beneficial, as it would remove the burden of managing secrets from the Operations team while also allowing developers more control over how their applications are configured.
Locally Assuming App Roles
There is currently no way to run a command locally using the credentials in staging/production for a given application. Something like a .env file writer or a foreman-style command runner for our application manifests could go a long way in allowing developers to run services locally while simulating an environment.
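A .env file writer could start from something like the sketch below. It assumes the secrets have already been fetched from Vault into a plain dictionary. Uppercasing the key names and the quoting rules are illustrative choices, and the function name is hypothetical.

```python
def render_dotenv(secrets):
    """Render a dict of secrets as KEY="value" lines, sorted by key."""
    lines = []
    for key in sorted(secrets):
        value = str(secrets[key])
        # Escape backslashes and double quotes so values survive quoting.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('%s="%s"\n' % (key.upper(), escaped))
    return "".join(lines)
```

Writing the returned string to a .env file in the project directory would let a foreman-style runner pick the values up as environment variables.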
Below is our current Vault configuration, which takes into account the following conditions:
- Vault is running within our internal network and is not publicly accessible.
- Consul is already being used.
Vault differentiates itself from other secret management services with its high availability option, and we leverage the Consul backend to deliver that. The Consul client is already configured to run on all of our machines (with default port mappings), with our Vault servers being no different. This also means that all data is stored encrypted in Consul, and so the Consul install should also be highly available.
As Vault is run within our internal network (and for other reasons), TLS is disabled. While TLS is desirable, we need to do additional work to make internal TLS usage a reality. Vault is also running on the standard default port of 8200 and listening on all network interfaces.
If you think these kinds of things are interesting, consider working with us as an Infrastructure Engineer at SeatGeek. Or, if infrastructure isn’t your thing, we have other openings in engineering and beyond!