Code, Design, and Growth at SeatGeek


Chasing Signatures: Verifying ChatGPT Requests in Kubernetes Gateway API

At SeatGeek, security and trust are as critical as speed and scale. When integrating with external systems, especially ones as widely used as AI Agents, we need to ensure that every incoming request is authentic and untampered. That means validating cryptographic signatures at high throughput, without sacrificing latency or reliability.

Why This Matters: Security First

When our API receives requests from the ChatGPT Agent, we are not merely managing application traffic; we are creating a secure trust boundary.

Without signature verification, anyone could impersonate the ChatGPT Agent by crafting requests that look correct but are forged. This opens the door to:

  • Spoofing — malicious actors pretending to be ChatGPT Agent.
  • Replay attacks — reusing valid requests to trigger actions again.
  • Tampering — altering request data in transit.

Signature validation ensures that bad actors can’t impersonate trusted agents, while legitimate requests from ChatGPT pass through reliably. This distinction is what lets us protect our systems from abuse without blocking real usage.

ChatGPT signs its requests with Ed25519, a fast and secure signature scheme. HTTP Message Signatures built on Ed25519 provide:

  • Proof of origin — the request can only be signed by a holder of the private key.
  • Integrity — any modification breaks the signature.
  • Replay protection — timestamps in signatures let you reject stale requests.

By validating these signatures at the gateway:

  1. You block fake requests before they hit your backend.
  2. You reduce the attack surface by enforcing cryptographic trust.
  3. You shift security left — stopping bad traffic earlier in the stack.

This is a critical security control, not a formality: it protects system and data integrity and closes off a class of impersonation and tampering vulnerabilities before they can reach our services.

First Stop: Trying at the Edge (Fastly)

Our first instinct was to verify signatures as early as possible — right at the edge with Fastly — so that invalid requests never even reached our infrastructure. Fastly’s VCL provides cryptographic functions for hashing and HMAC, but it currently doesn’t support Ed25519, the signing algorithm used by ChatGPT Agent.

Supporting Ed25519 at the edge would require moving to Compute@Edge with a custom WebAssembly crypto library. While possible, that path would add operational complexity, and Fastly’s three-restart limit per request is already partially consumed by existing features, leaving little headroom for new logic.

Given these constraints, we shifted the verification to Kubernetes Gateway API (Kong), where Ed25519 is already supported through the bundled OpenSSL. This lets us avoid extra moving parts while keeping verification close to the origin. To make those cryptographic calls from Lua, we used FFI (Foreign Function Interface), a LuaJIT feature that lets us call into C libraries like OpenSSL directly, without writing a native module.

Understanding HTTP Message Signatures (RFC 9421)

HTTP Message Signatures (RFC 9421) defines a standard way to sign requests by building a canonical string from specific components, then verifying it with a public key.

The process works by:

  1. Selecting specific HTTP components (e.g., method, path, and certain headers).
  2. Building a canonical string from those components in a precise format.
  3. Signing that string with a private key on the sender side.
  4. Verifying the signature with the corresponding public key on the receiver side.

Example headers from a ChatGPT request:

Signature: 'sig1=:HMum+uuZ2Xy3H/v+W1V5+YFhT9rOEefm5/MJOPaon2Spib8KQRdSwHpz+rS8jj9A4viGptVWLGFUvFGxgdlwDw==:'
Signature-Input: 'sig1=("@authority" "@method" "@path" "signature-agent");created=1754904031;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754907631;nonce="e4DAqAsy0-wYYNXqJhtW4fWySSbigxcU6nuqaBQnzky8vMtci4g6WYHSdnSWMS4oiOCxx5s9wBq5Q9ipbqdJZg";tag="web-bot-auth";alg="ed25519"'
Signature-Agent: '"https://chatgpt.com"'

The corresponding canonical string would be:

"@authority": 'api.seatgeek.com'
"@method": 'GET'
"@path": '/2/events'
"signature-agent": '"https://chatgpt.com"'
"@signature-params": '("@authority" "@method" "@path" "signature-agent");created=1754904031;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754907631;nonce="e4DAqAsy0-wYYNXqJhtW4fWySSbigxcU6nuqaBQnzky8vMtci4g6WYHSdnSWMS4oiOCxx5s9wBq5Q9ipbqdJZg";tag="web-bot-auth";alg="ed25519"'

Verification succeeds only if this canonical string matches exactly: every newline, every quote, and the component order must agree with the Signature-Input definition.

Once the RFC was understood, the next step was to build a Kong plugin that could parse headers, construct this canonical string, and verify the Ed25519 signature.
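To make the canonicalization concrete, here is a minimal Python sketch (not the Lua plugin itself) that parses Signature-Input and rebuilds the canonical string. It assumes the derived components are limited to @authority, @method, and @path, as in the example above, and that headers are provided in a dict keyed by lowercase name.

import re

def build_canonical_string(signature_input, authority, method, path, headers):
    # Signature-Input looks like: sig1=("@authority" "@method" ...);created=...;keyid="..."
    _label, params = signature_input.split("=", 1)
    component_list = params.split(";", 1)[0]               # ("@authority" "@method" ...)
    component_names = re.findall(r'"([^"]+)"', component_list)

    lines = []
    for name in component_names:
        if name == "@authority":
            value = authority
        elif name == "@method":
            value = method.upper()
        elif name == "@path":
            value = path.split("?", 1)[0]                   # @path excludes the query string
        else:
            value = headers[name]                           # e.g. signature-agent; keeps its own quotes
        lines.append('"%s": %s' % (name, value))

    # "@signature-params" is the component list plus its parameters, copied verbatim
    lines.append('"@signature-params": %s' % params)
    return "\n".join(lines)

Feeding the example request above through a function like this should reproduce the canonical string shown earlier, byte for byte.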

Implementation Challenges and Debugging Process

During development, subtle but critical issues emerged:

  1. Incorrect quoting of component names — must wrap all names in double quotes.
  2. Path handling errors — @path must exclude query parameters.
  3. Double-quoting signature-agent — the header already contains quotes.
  4. Static component assumptions — must parse the order from Signature-Input.

To validate correctness, we used Cloudflare’s web-bot-auth as a reference implementation. By feeding real ChatGPT Agent request data into web-bot-auth and comparing the generated canonical string to the Kong implementation, mismatches could be quickly identified and resolved.

import { verify } from "web-bot-auth";
import { verifierFromJWK } from "web-bot-auth/crypto";

// Public key retrieved from OpenAI's https://chatgpt.com/.well-known/http-message-signatures-directory
const OPEN_AI_KEY = {
    kid: "otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg",
    crv: "Ed25519",
    kty: "OKP",
    x: "7F_3jDlxaquwh291MiACkcS3Opq88NksyHiakzS-Y1g",
    use: "sig",
    nbf: 1735689600,
    exp: 1754656053
};

// Construct a synthetic Request object for the demo
const signedRequest = new Request("https://api.seatgeek.com/2/events?client_id=123&datetime_utc.gt=2025-08-29&geoip=false&lat=56.97205&lon=24.15423&q=Rick%20Feds&range=50mi", {
    method: "GET",
    headers: {
        "Signature": "sig1=:HMum+uuZ2Xy3H/v+W1V5+YFhT9rOEefm5/MJOPaon2Spib8KQRdSwHpz+rS8jj9A4viGptVWLGFUvFGxgdlwDw==:",
        "Signature-Input": 'sig1=("@authority" "@method" "@path" "signature-agent");created=1754904031;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754907631;nonce="e4DAqAsy0-wYYNXqJhtW4fWySSbigxcU6nuqaBQnzky8vMtci4g6WYHSdnSWMS4oiOCxx5s9wBq5Q9ipbqdJZg";tag="web-bot-auth";alg="ed25519"',
        "Signature-Agent": '"https://chatgpt.com"'
    }
});

try {
    await verify(signedRequest, await verifierFromJWK(OPEN_AI_KEY));
    console.log("Signature verification successful");
} catch (err) {
    console.error("Signature verification failed: ", err.message);
}

Once the canonical string matched exactly, the only remaining step was cryptographic verification.

Choosing the Crypto Backend

Three options were considered for Ed25519 verification inside Kong:

Each option was weighed on whether it avoids FFI, requires extra native libraries, works inside Kong, and is safe for production:

  • FFI + libsodium — Fast, portable, and uses libsodium’s own Ed25519 implementation, but requires shipping extra native libs with Kong.
  • FFI + OpenSSL (direct to Kong’s bundled OpenSSL) — Leverages Kong’s bundled, pinned OpenSSL with guaranteed Ed25519 support; no extra dependencies and consistent behavior in all official builds.
  • resty.openssl.pkey (system libcrypto) — Handles the digest parameter inconsistently for Ed25519, causing unpredictable failures.

Decision: FFI + OpenSSL ensures consistent availability. Benchmarks indicate that Ed25519 can perform approximately 70,000 verifications per second per core. This translates to about 20,000 verifications per second per Kong Gateway instance using LuaJIT FFI.
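To make the verification step itself concrete, here is a minimal Python sketch using the cryptography package; the production plugin performs the equivalent calls against Kong’s bundled OpenSSL via LuaJIT FFI. The public key bytes come from the base64url-encoded x field of the published JWK, and the signature is the base64 value between the colons of the Signature header.

import base64
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

def b64url_decode(value: str) -> bytes:
    # JWK fields are base64url without padding; restore it before decoding
    return base64.urlsafe_b64decode(value + "=" * (-len(value) % 4))

def verify_ed25519(jwk_x: str, signature_b64: str, canonical_string: str) -> bool:
    public_key = Ed25519PublicKey.from_public_bytes(b64url_decode(jwk_x))  # 32 bytes
    signature = base64.b64decode(signature_b64)                            # 64 bytes
    try:
        public_key.verify(signature, canonical_string.encode("utf-8"))
        return True
    except InvalidSignature:
        return False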

Code Implementation: Message Construction + Verification

The final implementation has two key steps:

  1. Build the canonical message string from Signature-Input following RFC 9421 rules.
  2. Verify the signature using OpenSSL’s Ed25519 API via FFI.

Rather than embedding full snippets here, we’ve published an open-source version of the Kong plugin: seatgeek/kong-chatgpt-validator.

The repository contains the complete implementation and examples for a production-ready plugin, including configuration options for key IDs and public keys, for anyone who wants to experiment with, validate, or adapt this approach in their own gateway setup.

Production Considerations

Signature verification in production should be considered one component within a comprehensive defense-in-depth strategy, rather than a standalone solution. It’s crucial to maintain your existing security measures, including Web Application Firewall (WAF) rules, rate limits, schema validation, IP/ASN reputation checks, and anomaly detection. Even with valid signatures, a compromised or poorly implemented agent could still issue malicious or fraudulent requests. Therefore, signature verification should enhance, not replace, these established controls.

From a security standpoint, the production implementation adopts a “fail-closed” posture, rejecting requests whose signatures are missing, expired, unverifiable, or bound to mismatched fields like origin, method, or path. Performance is another key consideration: to keep latency predictable, we cache public keys (JWKs), respect cache headers, pin keys by kid, and warm the cache during deployments and key rotations. Crypto operations add precious milliseconds to the hot path, so we run them only when necessary: quick checks such as enforcing key and signature lengths (32-byte Ed25519 public key, 64-byte signature), validating encoding, and capping header sizes filter out obviously spurious requests first. Finally, we slice metrics by kid, partner, and origin and alert on spikes or sustained failure rates, so abuse and integration issues (key rotations, clock skew, header regressions) surface before they become outages.
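As a sketch of those cheap pre-checks (the header-size cap and the parsing of the Signature header are illustrative, not our exact production values):

import base64

MAX_HEADER_BYTES = 4096  # example cap on Signature/Signature-Input size

def looks_plausible(signature_header: str, signature_input: str, jwk_x: str) -> bool:
    if len(signature_header) > MAX_HEADER_BYTES or len(signature_input) > MAX_HEADER_BYTES:
        return False
    try:
        # Signature header format: sig1=:<base64 signature>:
        encoded = signature_header.split(":", 2)[1]
        signature = base64.b64decode(encoded, validate=True)
        public_key = base64.urlsafe_b64decode(jwk_x + "=" * (-len(jwk_x) % 4))
    except (IndexError, ValueError):
        return False
    # Ed25519 uses a 64-byte signature and a 32-byte public key
    return len(signature) == 64 and len(public_key) == 32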

Impact

This work made it possible to reliably identify and verify traffic from the ChatGPT agent without blocking or discarding requests prematurely. In production, this gave our stakeholders the ability to analyze how fans are actually using the agent to interact with our platform, creating new visibility into adoption and engagement patterns. Just as importantly, signature verification did not replace our existing defenses; it complemented them. By keeping our shield stack in place, we ensured that even correctly signed requests could still be stopped if they looked malicious or abusive.

From a performance standpoint, signature verification added about 600 - 900 μs per ChatGPT request, which is roughly 3% of average gateway latency. Regular fan traffic was untouched — verification only runs when a request carries both Signature and Signature-Agent headers. In production, the extra step stayed well below our service-level budgets. This allowed us to strengthen trust in ChatGPT agent traffic without any measurable impact on the fan experience.

Final Thoughts

Validating ChatGPT’s HTTP Message Signatures wasn’t just an exercise in cryptography; it was about reinforcing SeatGeek’s commitment to trust, security, and reliability. AI agents are becoming first-class citizens in our platform, with behaviors, traffic patterns, and security requirements that differ significantly from fans.

To support this securely, we built an implementation of RFC 9421 verification in Gateway API, ensuring:

  • Authenticity — every request claiming to be from ChatGPT truly is.
  • Integrity — data hasn’t been altered in transit.
  • Abuse prevention — spoofed or replayed traffic is blocked before hitting core services.

Because this challenge is not unique to SeatGeek, we’ve open-sourced the core implementation as seatgeek/kong-chatgpt-validator. This allows other teams experimenting with ChatGPT agent to reuse, adapt, and improve the validator rather than reinventing it.

This capability isn’t a one-off patch; it’s part of a long-term security posture to adapt our infrastructure for new ways fans — from traditional browser navigation to AI-driven tools — use SeatGeek, without compromising on performance or user experience.

Resilient Container Builds in CI with Buildkit-Operator

TL;DR

We’ve solved the reliability challenges with our Kubernetes-based container builds by moving to ephemeral, fully-isolated BuildKit instances - and we’ve packaged our solution into an open-source Kubernetes operator called buildkit-operator! In this post, we’ll revisit the pain points from our previous architecture, walk through our new design, highlight the results, and share how you can use it too.

Recap: The Original Problem

In our previous post on building containers with BuildKit in Kubernetes, we shared how moving off EC2 and onto Kubernetes gave us faster builds and a simpler operational model, but also introduced new challenges.

While BuildKit itself worked well enough, our shared, long-lived BuildKit deployments created issues:

  • Unpredictable Failures – Up to 4% of builds failed for reasons outside the engineer’s control: cryptic errors, jobs freezing mid-build, or builds getting killed mid-execution.
  • Resource Contention – Multiple CI jobs could land on the same builder, leading to “noisy neighbor” slowdowns or crashes.
  • Aging Instances – Long-lived builders could get “wedged” into bad states, only fixed by manual restarts.
  • Autoscaling Limits – We struggled to scale builders quickly and accurately, leading to either bottlenecks or wasted resources.
  • Limited Observability – We couldn’t easily identify which CI jobs landed on which builder instance (for debugging).

This led to flaky pipelines, slower deploys, and frustrated engineers who needed frequent support. And the problems only worsened as teams leaned further into automation and AI to iterate faster.

Visualization of the delayed scaling cycle that leads to dogpiling and resource contention
Shared BuildKit deployments created noisy neighbor issues and instability in CI

Ephemeral BuildKit via an Operator

From the start, we knew our ideal future state: each CI job should get its own dedicated BuildKit instance, created on-demand and torn down automatically. No resource sharing, no manual cleanup, no lingering bad state.

Why an Operator?

We considered a few approaches first - better autoscaling, a custom pool manager, even Docker’s Kubernetes driver - but each had significant tradeoffs that wouldn’t work for us:

  • Better Autoscaling: Doesn’t solve the noisy neighbor problem or guarantee isolation.
  • Custom Pool Manager: Managing pools of builders requires extra complexity, and we’d have to build all of that (including the API, lifecycle management, and cleanup) ourselves.
  • Docker Kubernetes Driver: Requires permissions to launch arbitrary pods; no way to guarantee cleanup if the CI job fails or is canceled.

We ultimately landed on building a custom Kubernetes operator because it let us:

  • Avoid Overly Broad Permissions – CI jobs don’t need rights to launch arbitrary pods, just the ability to create Buildkit CRs.
  • Guarantee Cleanup – Using Kubernetes ownerReferences, each BuildKit pod is automatically deleted when the owning CI job terminates for any reason.
  • Enforce Defaults & Policies – The operator can inject the right configuration (including our pre-stop script for graceful shutdowns!) into every instance without CI pipelines having to manage it.

Here’s a simplified version of what a CI job creates:

apiVersion: buildkit.seatgeek.io/v1alpha1
kind: Buildkit
metadata:
  name: ci-job-123456-arm64
  namespace: gitlab-ci
  ownerReferences:
    - kind: Pod
      name: runner-job-pod-name
spec:
  template: buildkit-arm64 # The name of the BuildkitTemplate to use

The operator watches for these resources, spins up a pod based on the specified template, and updates the resource’s status.endpoint so the CI job can connect. When the CI job ends, the ephemeral BuildKit pod disappears automatically.
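For illustration, here is roughly what that flow looks like from a CI job’s point of view, sketched with the Kubernetes Python client. The CRD plural (buildkits) and the simple polling loop are assumptions made for this sketch, not details of the operator’s API.

import time
from kubernetes import client, config

def request_buildkit(job_pod_name, job_pod_uid, namespace="gitlab-ci"):
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    name = f"{job_pod_name}-arm64"

    body = {
        "apiVersion": "buildkit.seatgeek.io/v1alpha1",
        "kind": "Buildkit",
        "metadata": {
            "name": name,
            # The ownerReference ties this resource (and its pod) to the CI job pod,
            # so Kubernetes garbage-collects it when the job terminates for any reason.
            "ownerReferences": [{
                "apiVersion": "v1", "kind": "Pod",
                "name": job_pod_name, "uid": job_pod_uid,
            }],
        },
        "spec": {"template": "buildkit-arm64"},
    }
    api.create_namespaced_custom_object(
        "buildkit.seatgeek.io", "v1alpha1", namespace, "buildkits", body)

    # Wait until the operator reports an endpoint the build can connect to
    while True:
        obj = api.get_namespaced_custom_object(
            "buildkit.seatgeek.io", "v1alpha1", namespace, "buildkits", name)
        endpoint = obj.get("status", {}).get("endpoint")
        if endpoint:
            return endpoint  # handed to the build tool as the remote BuildKit address
        time.sleep(2)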

Solving Our Key Problems

  1. Isolation by Design – Every job gets its own BuildKit pod (per CPU architecture) - no shared state, no noisy neighbors.

  2. Automatic Teardown – ownerReferences ensure BuildKit pods are always cleaned up when jobs finish.

  3. Graceful Shutdowns – Our pre-stop hook is automatically injected into the BuildKit container, so builds finish cleanly even during pod termination events.

  4. Security Controls – CI jobs can only request a BuildKit instance via CRDs - no blanket permissions to deploy arbitrary, long-lived workloads to the cluster.

  5. Observability – The operator can annotate BuildKit pods with pipeline metadata, giving us better traceability for debugging and cost attribution.

Results

Build reliability has remained consistently high since deploying this in early August!

Since rolling out the operator, we’ve seen significant improvements:

  • Higher Reliability – Over 99.97% of builds complete successfully without unexpected failures.
  • Faster Builds – No contention means more predictable performance - even when accounting for the overhead of creating new pods on-the-fly.
  • Fewer Wasted Resources – BuildKit pods are only running when needed, and are tuned to the resource requirements of the build, thus reducing idle resource costs.
  • Happier Engineers – Fewer support requests and less time spent on flaky builds.

Try It Yourself

We’ve open-sourced our buildkit-operator so other teams can benefit from our approach. You can install it into your own cluster via our Helm chart, define your templates, and start provisioning ephemeral BuildKit instances for your CI workloads in minutes!

Full documentation and examples are available in the GitHub repo.

Final Thoughts

This journey started with the simple desire for reliable container builds in CI - and ended with a lightweight, reusable operator that solves the problem for us and (hopefully) for you too.

If your team is struggling with BuildKit reliability on Kubernetes, give our operator a try. We’d love to hear how it works for you and what ideas you have for making it even better.

Checkpoint: When IAM Breaks Developer Experience

This is part 1 of a three-part series on Checkpoint, our internal IAM automation platform. In this post, we explore the problem we faced with traditional IAM approaches and how we reframed it to build a better developer experience. In part 2, we will dive into the technical details of Checkpoint’s architecture and features. Part 3 will cover the impact it had on our teams and processes.

Access control defines how a company works. We only noticed ours when it stopped working.

At SeatGeek, we quietly hit a wall. Our company had grown past 1,200 employees. We were operating across dozens of AWS accounts. Teams shifted often. Roles were fluid. Ownership was shared. But our approach to Identity and Access Management (IAM) remained stagnant and could no longer keep up with the pace of change.

Engineers were getting blocked daily. New hires spent days waiting for the right permissions. Requesting access was unpredictable. Sometimes it went through IT. Sometimes through Slack. Other times it landed in an endless Jira backlog, and sometimes, unfortunately, nowhere at all. Support teams could not reach the tools they needed. Managers had no visibility into who had access to what. The system was unclear, and it was breaking in subtle but painful ways.

“ClickOps” became the norm. Teams made manual changes in the AWS console. There were no approvals, no expiry dates, and no reliable records. Risk increased. Accountability decreased. And auditing it all was a nightmare.

We realized this was not only a Security issue, it was a broken developer experience that was impacting the entire company.

The Hidden Cost of “Getting Unblocked”

Most IAM systems rely on control and enforcement. Engineers get blocked. They file a ticket. Someone else makes a decision. Eventually, access is granted. Sometimes it is too narrow. Sometimes it is too broad. It depends on who asks, who approves, and how well they understand each other.

When we audited existing AWS access across the company, we found something startling: there were more users with AWS administrator privileges than we ever would have guessed. This was not a one-off. It was the standard.

Not because people were reckless, but because getting access was so hard. People guessed what they needed. They copied roles from teammates. They granted broad access because narrow, precise access took too long to figure out.

Onboarding was one of the clearest signs something was wrong. New engineers sometimes waited a full week to get into the tools they needed. A single missing permission could block progress. There was no visibility into what was missing or how to request updates cleanly.

When access is slow or unpredictable, people work around it. They don’t mean to be unsafe, they’re just trying to get their work done. Users ask for admin access not to gain power, but to stop being stuck.

That’s where the real risk starts. Not with a single permission, but with a culture of “let’s make it work”.

The Root of the Problem

We wanted to understand the full scope of the problem so we created a feedback wall and started interviewing folks across the company about their experience getting new access. We were shocked to learn that even folks who had been at the company for 5+ years were still struggling with the process. How could a newcomer expect to get access quickly when even veterans were confused?

Feedback Wall

The responses came quickly:

It took four days in one instance, seven days in another, just to get the correct access.

I had to create a support ticket to reach an AWS account. I was coming from Azure and had no idea where anything was.

There is no clear way to request access. Sometimes it goes through IT, sometimes Slack, sometimes it’s a mystery.

As a manager I can’t even tell what my team members have access to.

I had full production access, I wasn’t even sure why.

The themes were consistent: IAM had finally stopped being invisible and had become a real source of friction.

Checkpoint: Designing for Experience, Not Enforcement

What if IAM worked like a developer-facing product with a delightful experience? Not something hidden in the background, but something people could see, use, and trust.

We decided to build a system to fix our biggest problems: introducing Checkpoint!

The wall of sticky notes became our blueprint. We no longer wanted to patch the old model, instead we set out to rebuild it into something that supported engineers rather than slowed them down.

Principles of Checkpoint

These principles combine our design philosophy and must-have requirements. They guided every decision we made.

Principle 1: Make the Secure Path the Easiest Path

Developers gravitate towards the easiest way to get their work done. If the secure path has too much friction, they will find a workaround even if they know it is unsafe. Checkpoint is embedded directly in Slack so it meets people where they already work, with an experience that is fast, self-service, and easy to understand.

Principle 2: Just in Time, Not Just in Case

Most tasks need hours of access, not weeks. Traditional IAM access is permanent and assigned at the team level, creating long-term exposure. In Checkpoint, access is temporary by default, automatically removed when it expires, and easy to renew if necessary.

Principle 3: Self-Service by Default

No one likes waiting for access. Anyone in the company can request permissions without the Platform or Security teams constantly gatekeeping. Low-risk requests are auto-approved, medium-risk requests go to the service owner, and high-risk requests escalate to Security. This keeps the Platform team out of the critical path while still maintaining oversight.

Principle 4: Services are durable, teams are fungible

Permissions are tied to durable microservices instead of short-lived internal teams. We bundle all access for a service once and reuse it for anyone who needs it. This scales across teams and accounts while reducing repetitive policy management.

Principle 5: Break Silos, Respect Boundaries

Owning teams define bundles, lifespan, and approval rules, not the Platform team. Checkpoint is auditable and compliant by design, while still allowing engineers to request scoped, temporary access without escalation. This enables faster cross-team collaboration while respecting service owner boundaries.

Coming Up Next

In Part Two, we’ll explore how Checkpoint works under the hood. From Slack workflows and just-in-time access to permission bundles and incident response, we’ll show how we built a system that is fast, safe, and scalable across every department.

We’ll share what we learned after Checkpoint launched, and how we are iterating to make it better!

Spoiler:

  • Access requests now complete in seconds
  • Persistent admin access dropped to zero
  • Everyone from engineering to finance uses it on a daily basis

But first, we had to reframe the problem. To make it real, we connected Okta for identity, AWS permission sets for per-user access, and simple interfaces like a website and a slackbot that handles approvals where our customers already work. In Part Two, we will show how we bundled permissions around the services themselves, built a risk classification model to automatically approve low risk requests instantly, and routed higher risk requests directly to the service owners who know their products best so that approvals are fast, distributed, and safe.

Open Sourcing Fastly TLS Operator: Automate Fastly TLS Certificate Management

Announcing Fastly TLS Operator

We’re excited to announce our latest open source project: Fastly TLS Operator!

Fastly TLS Operator is a Kubernetes operator that makes it easier to use Custom TLS Certificates within Fastly. It acts as the glue between cert-manager and Fastly, allowing you to define your TLS certificates in Kubernetes and keep them synchronized to Fastly.

To start, the implementation is limited to managing Custom TLS Certificates. We may add additional custom resources and controllers in the future if they align with the project’s overall goal of bridging the gap between Kubernetes and Fastly.

If you are a Fastly customer, you’re able to use Fastly TLS Operator!

Custom TLS Certificates

Unlike Fastly TLS Subscriptions, where you delegate the certificate details and control to Fastly, Custom TLS Certificates let you own the certificate details and push them to Fastly.

The API for Custom TLS Certificates helps you manage:

  • private keys
  • certificate details
  • TLS activations

The challenge is that any automation to leverage Custom TLS Certificates is left as an exercise for the reader.

Our solution takes care of all this for you so that you don’t need to implement any of the fairly complex logic needed to keep multiple certificates synchronized to Fastly.

Architecture

Fastly TLS Operator aims to provide value without reinventing any wheels.

Since cert-manager is already the standard approach for managing certificates on Kubernetes clusters, we chose to build on top of it.

Depicted below is the high level architecture of where our solution sits in the overall stack:

Fastly TLS Operator architecture diagram
Fastly TLS Operator coordinates cert-manager certificates

Custom Resource

The operator defines and reconciles a new custom resource: FastlyCertificateSync. Instances of FastlyCertificateSync point to existing TLS certificates and provide the instructions for how to sync to Fastly.

Here is an example resource:

apiVersion: platform.seatgeek.io/v1alpha1
kind: FastlyCertificateSync
metadata:
  name: example-cert-sync
  namespace: default
spec:
  certificateName: example-cert
  tlsConfigurationIds:
  - "your-fastly-tls-config-id-1"  # Replace with your actual TLS configuration ID
  - "your-fastly-tls-config-id-2"  # Optional: activate against multiple TLS configurations

Flow

In the above example, we will look for the Certificate named example-cert in the same namespace as the FastlyCertificateSync.

The following actions will be performed:

  • upload private key from the certificate’s secret
  • upload certificate details
  • activate each of the certificate’s domains against the provided TLS configurations
  • clear unused private keys

When certificates renew during their renewal window, we’ll pick up on the updated details and ensure that the latest values are pushed to Fastly. This operation is seamless and involves zero downtime on Fastly’s end.
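Conceptually, the reconcile loop boils down to reading the cert-manager Secret and pushing its contents to Fastly. The sketch below (in Python, although the operator itself is a Go controller) shows where the data comes from; the Fastly side is reduced to a placeholder function because the exact API calls are beyond the scope of this sketch.

import base64
from kubernetes import client, config

def push_to_fastly(private_key_pem, certificate_pem, tls_configuration_ids):
    # Placeholder: upload the key and certificate, activate each domain against
    # the given TLS configurations, and clean up unused private keys.
    raise NotImplementedError

def reconcile(certificate_secret_name, namespace, tls_configuration_ids):
    config.load_kube_config()
    core = client.CoreV1Api()

    # cert-manager stores the issued certificate in a kubernetes.io/tls Secret
    # (the Certificate's spec.secretName), with tls.crt and tls.key entries.
    secret = core.read_namespaced_secret(certificate_secret_name, namespace)
    private_key_pem = base64.b64decode(secret.data["tls.key"]).decode()
    certificate_pem = base64.b64decode(secret.data["tls.crt"]).decode()

    push_to_fastly(private_key_pem, certificate_pem, tls_configuration_ids)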

Status Updates

When reconciling FastlyCertificateSync resources, we’ll reflect state changes via the status field on the resource. The status.conditions will include information reflecting each state transition:

status:
  conditions:
  - lastTransitionTime: "2025-07-29T12:52:26Z"
    message: Private key has been successfully uploaded to Fastly
    reason: PrivateKeyUploaded
    status: "True"
    type: PrivateKeyReady
  - lastTransitionTime: "2025-07-29T12:52:26Z"
    message: Certificate is up-to-date and synced with Fastly
    reason: CertificateSynced
    status: "True"
    type: CertificateReady
  - lastTransitionTime: "2025-07-29T12:52:26Z"
    message: All TLS activations are properly configured
    reason: TLSActivationsSynced
    status: "True"
    type: TLSActivationReady
  - lastTransitionTime: "2025-07-29T12:52:26Z"
    message: No unused private keys found
    reason: NoCleanupNeeded
    status: "False"
    type: CleanupRequired
  - lastTransitionTime: "2025-07-29T12:52:26Z"
    message: FastlyCertificateSync is ready and all components are synchronized
    reason: FastlySyncComplete
    status: "True"
    type: Ready

Why we built Fastly TLS Operator

ACME Contention

When creating certificates, the Certificate Authority (also referred to as Issuer in cert-manager) will need to perform what is called ACME validation.

This is a process that proves to the CA that you own the domain that the certificate is being created for.

Some systems like cert-manager will integrate with your DNS provider and create the record for you. Other systems like Fastly will tell you about the record that needs to be created and expect you to create it yourself. This can be problematic if you are creating certificates for a single domain in multiple systems. For us at SeatGeek, we use cert-manager to create TLS certificates for termination in our API Gateway layer. Additionally, we were asking Fastly to create certificates in their system so that we could terminate TLS at the edge.

When certificates are renewed (typically every 2 months), the ACME challenge process takes place. Each system requires a DNS record at the same location, but with different values. As a result, we had to manually update DNS records for each verified domain every time either system needed to solve the ACME challenge. Not only was this unautomated, but in the absence of proper observability it was all too easy to accidentally let a certificate expire.

Below is a rough diagram of cert-manager and Fastly competing for the _acme-challenge.seatgeek.com record:

ACME Challenge contention
ACME challenge contention between cert-manager and Fastly

Consolidating certificate management

The idea here is simple - if we were to consolidate all of our certificate management under one system, we would be able to categorically remove the issue described above.

For us, this meant continuing to use cert-manager to manage all certificates within Kubernetes. We’d then shift from using Fastly’s TLS Subscriptions over to Custom TLS Certificates. We then created automation to take cert-manager certificates and upload them to Fastly.

Initially, this solution was built internally using proprietary closed-source systems. After running without issue for ~6 months, we demo’d this solution to Fastly and mutually agreed that other Fastly customers might be able to take advantage of this automation.

Fastly TLS Operator was then born! We rebuilt our automation from the ground up as an open source project, and then validated the new solution internally. SeatGeek now uses Fastly TLS Operator to manage all of our TLS certificates in Fastly.

Open Source

We are hoping that others are able to benefit from this work!

We welcome suggestions, improvements, and fixes to the solution and are interested to see how this project evolves over time.

Please don’t hesitate to create an issue or pull request on the Fastly TLS Operator repository!

Shielding the Core: Scaling Resilience with a Multi-Layered Approach

At SeatGeek, we are obsessed with delivering a fast, reliable, and scalable ticketing experience. Our platform handles millions of users searching, interacting with listings, and making purchases every day, so it must be resilient, particularly during extreme traffic spikes. This post will cover our resilience strategy, including how we utilize Fastly for CDN caching and shielding, how Kong API Gateway rate limits protect our upstream services, and how we validate this strategy with k6 load tests.

CDN Caching with Fastly: More Than Just the Edge

Caching is one of the most effective ways to improve performance and scalability. By serving responses from the edge instead of going all the way to our infrastructure, we reduce latency, save compute cycles, and increase reliability.

Shielding Explained

Fastly’s caching architecture is hierarchical. When a user makes a request, it hits the closest Point of Presence (POP). If the response is not already cached there, Fastly does not immediately reach upstream; it first checks a shield POP.

Think of the shield POP as a designated regional cache layer between the edge and your origin. We configure a specific POP (e.g., IAD in Ashburn) to act as the shield for all other POPs. Here is how it works:

  1. User Request → Hits local POP (e.g., LHR in London).
  2. Cache Miss → Instead of contacting our backend, LHR POP forwards the request to the shield POP (IAD).
  3. Shield POP Check:
    • If IAD has the response, it sends it back to LHR.
    • If it does not, IAD fetches it from the origin, caches it, and then returns it to LHR.

This response is then cached both at the shield POP (IAD) and the original edge POP (LHR), reducing future latency and origin load.

One key advantage of shielding is that it reduces origin traffic and protects your infrastructure from redundant requests. Even if multiple edge POPs experience simultaneous cache misses, only the shield POP will contact the origin.

Another significant benefit is that traffic between POPs, including between the edge and the shield, is routed over Fastly’s private backbone, rather than the public internet. This backbone is optimized for speed and reliability, offering lower latency and consistent regional performance.

Cache Policy

To maximize efficiency, we define cache policies based on content volatility and sensitivity. Our current strategy includes:

  • Static assets (images, CSS, JS): These utilize a long Time-To-Live (TTL) and do not require revalidation.
  • Dynamic content with stable responses (e.g., images of individual rows in an arena): These are assigned a short TTL and employ a stale-while-revalidate strategy for optimal balance.
  • API responses with cacheable payloads: These are selectively cached using surrogate keys and Fastly’s custom VCL.

Soft purging is also employed to update data, ensuring cache continuity. We leverage Fastly’s capabilities to cache content based on headers, query parameters, and cookies (though the latter is used with caution).
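As a rough illustration of those policies (TTL values here are examples, not our production settings), the response headers might look like the following, with Surrogate-Control and Surrogate-Key steering Fastly while Cache-Control governs browsers and other downstream caches:

def cache_headers_for(content_type: str) -> dict:
    if content_type == "static_asset":
        # Long TTL, no revalidation needed
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if content_type == "row_image":
        # Short TTL plus stale-while-revalidate to smooth over refreshes
        return {
            "Surrogate-Control": "max-age=300, stale-while-revalidate=60",
            "Cache-Control": "public, max-age=60",
            "Surrogate-Key": "row-images",  # enables targeted soft purges
        }
    if content_type == "api_response":
        # Cached at the CDN only, keyed for soft purge by surrogate key
        return {
            "Surrogate-Control": "max-age=30",
            "Cache-Control": "no-store",
            "Surrogate-Key": "events-api",
        }
    return {"Cache-Control": "no-store"}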

Protecting Upstream Systems with Rate-Limiting

Caching is powerful, but only if your origin services stay healthy. A service that fails under pressure is one of the biggest threats to cache efficiency. In most cases, failed responses (non-2XX) are not cached, which leads to a dangerous feedback loop:

  • More failures → fewer cacheable responses.
  • Fewer cache hits → more origin traffic.
  • More traffic → more failures.

Here is what that feedback loop looks like in practice:

Request volume spike without rate limiting Response breakdown showing high error rate without rate limiting

At SeatGeek, this pattern emerges with services like venue maps. When demand surges, for example during a high-profile onsale, if the venue maps service starts to fail, those failed responses are not cached. As a result, every new user request bypasses the cache and hits the already-overloaded service again. The result is a degraded experience: users cannot view the venue layout to choose their preferred section, increasing frustration and potentially hurting conversion during a critical moment in the purchase journey.

This is where rate limiting becomes essential, and where the API Gateway plays a critical role in the architecture. As the Ingress point for public traffic, the Gateway (in our case, Kong) sits between Fastly and our backend services. It acts as a safeguard, enforcing traffic policies and rate limits to protect sensitive systems.

The request flow diagram below illustrates this layered architecture—from the edge POP to the shield POP, through Kong, and finally to the origin. Each component plays a role in preserving service health and maximizing cache efficiency.

Request flow from edge to origin via shielding and Kong

Kong API Gateway in Action

We use Kong as our API Gateway to handle ingress traffic. Kong allows us to define rate-limiting policies per service or route, protecting sensitive APIs and stabilizing behavior under load.

We typically apply:

  • Token bucket rate limits for general APIs.
  • Per-consumer limits for apps, bots, and partners.
  • Circuit-breaking thresholds for vulnerable services.

Kong acts as a gatekeeper: when requests exceed the configured thresholds, they are throttled (e.g., returning 429 responses). This helps ensure upstream systems stay responsive for legitimate, sustainable traffic.
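Kong’s bundled plugins handle the enforcement for us; the snippet below is only a compact sketch of the token-bucket idea those limits are built on.

import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # steady-state refill rate
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the gateway would answer with HTTP 429

bucket = TokenBucket(rate_per_sec=100, burst=200)  # e.g. 100 rps with bursts up to 200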

The result? More successful 2XX responses, which are cacheable, improve the cache hit ratio and reduce load on the origin. The cache naturally warms up as the system remains healthy, and rate limiting prevents overload.

Observing the Effect

Graph 1 illustrates a high initial percentage of 429 errors during a traffic surge, indicating that Kong’s rate-limiting effectively shielded the upstream system from excessive load. The subsequent decline in 429s shows system stabilization as clients adjusted or backed off.

Graph 2 presents an inverse pattern: the percentage of 200 responses begins low but progressively rises. This trend signifies enhanced efficiency due to cache warming, rather than recovery from a failure. As rate limiting manages the load and upstream responses stabilize, more requests are served from the cache. This reduces the need to access the upstream, leading to a greater proportion of successful, fast, and cacheable 200 responses.

Graph 1, 429 responses spike then drop Graph 2, 200 responses rise over time

Graph 3 demonstrates the cache hit rate, a key indicator of this shift. As the cache populates, the hit rate increases, diminishing upstream dependency, enhancing latency, and elevating the success rate seen in Graph 2.

Graph 3, rising cache hit rate over time

Load Testing with k6: Validating Cache Behavior Under Pressure

We use k6 to test our caching and rate-limiting strategies under realistic conditions. While synthetic benchmarks have their place, we prefer tests that replay production-like traffic in staging environments.

Simulating Real Requests

We simulate real production-like traffic patterns using this approach:

  1. Capture a pool of production requests (method, path, headers, etc.).
  2. Sanitize production requests to ensure we are not moving sensitive data between environments.
  3. Add a randomized cacheBuster query param to each request to force an initial miss.
    • e.g., /events/123?cacheBuster=abc123
  4. Replay the traffic using k6 at a controlled RPS (Requests Per Second).
import http from 'k6/http';
import { check } from 'k6';
import { SharedArray } from 'k6/data';
import { parse } from 'https://jslib.k6.io/papaparse/5.1.1/index.js';

// Load and parse a CSV file with request paths (e.g., /events/123)
// SharedArray ensures the data is loaded once during init, not per VU
const requests = new SharedArray("GET URLs", () => {
 const csvData = open('./requests.csv');
 const rawData = parse(csvData, { header: false }).data;
 const baseUrl = 'https://staging.com';

 return rawData.map((row) => {
   const path = row[1];
   // Append a unique query parameter to force a cache miss on the first request
   const cacheBuster = `cacheBuster=${Date.now()}-${Math.floor(Math.random() * 10000)}`;
   const separator = path.includes('?') ? '&' : '?';
   const cacheBusterUrl = `${baseUrl}${path}${separator}${cacheBuster}`;

   return {
     path,
     cacheBusterUrl,
   };
 });
});

export const options = {
 stages: [
   { duration: '30s', target: 500 }, // Ramp up to 500 VUs
   { duration: '5m', target: 500 },  // Sustained load
   { duration: '30s', target: 0 },                    // Ramp down
 ],
};

export default function () {
 // Pick a random request from the pool to simulate diverse traffic
 const index = Math.floor(Math.random() * requests.length);
 const request = requests[index];

 const response = http.get(request.cacheBusterUrl);

 // Basic check to verify response status is within the 2XX range
 check(response, {
   'status is 2XX': r => r.status >= 200 && r.status < 300,
 });
}

Each request has a unique cacheBuster query parameter appended to it to force a cache miss on the first run. This simulates a cold cache scenario where the CDN must fetch responses from the origin. As the system returns successful 2XX responses, Fastly’s shield POPs begin caching them, followed by the edge POPs. Over time, this leads to fewer origin requests and a higher cache hit ratio.

By using SharedArray, the CSV is loaded and transformed once during test initialization, ensuring efficient memory usage across virtual users. This setup allows us to simulate realistic traffic patterns and observe the system’s behavior under load:

  • The initial origin load is high (all cache misses).
  • Shield POPs begin caching.
  • Origin traffic decreases over time.
  • The cache hit ratio rises.
  • System stays within limits; no failure cascades.

It also gives us the ability to validate key behaviors:

  • Cache fill timeline and efficiency.
  • Kong’s rate-limiting performance in protecting upstream services.
  • Overall system stability under pressure.

This method has proven particularly effective for short-lived caching of dynamic images, geo-personalized content, and non-volatile API responses.

Final Thoughts

Our caching and rate-limiting strategy is built on a simple principle: successful requests today become fast responses tomorrow. By combining Fastly’s shielding architecture with well-defined cache policies and Kong’s rate-limiting controls, we create a self-reinforcing loop that reduces load, improves reliability, and scales with demand.

One of the key benefits of this approach is that the cache naturally warms up over time. Rate limiting plays a critical role here: by protecting the system from overload, it ensures a steady stream of successful responses that can be cached and distributed across Fastly’s edge and shield POPs. The more the system stays within safe limits, the faster and more cache-efficient it becomes.

This is not just about infrastructure efficiency, it directly impacts the fan experience. During high-traffic moments, rate limiting helps avoid slowdowns, errors, or degraded service. Instead of risking outages or broken flows, we ensure that as many fans as possible receive fast, reliable access. In that sense, it is a strategy designed not just for system health, but to deliver the best possible experience at scale.

By contrast, when a system fails under load, the impact is immediate and compounding: fans experience delays or errors, and the missed opportunity to populate the cache puts further pressure on downstream systems. Resilience is not just about surviving a spike; it is about staying healthy long enough to let the cache take over.

We validate this entire approach through load testing with k6, ensuring we are not just hoping our systems perform under pressure — we are proving it, under production-like conditions.

At SeatGeek, we are redefining live event engagement through innovative technology, personalized services, and a fan-first mindset. From discovery to post-event, we aim to create seamless, memorable, and immersive experiences for every attendee.

The Transactional Outbox Pattern: Transforming Real-Time Data Distribution at SeatGeek

On the Data Platform team at SeatGeek, our goal is to make producing and consuming data products a delightful experience. To achieve this, we transformed part of our data stack to be more real-time, performant, and developer-centric.

This post explores a modern approach to the transactional outbox pattern, highlighting key design decisions, challenges we addressed, and how Apache Kafka®, Postgres, Debezium, and schemas enabled seamless real-time data distribution.

Why real-time data matters to SeatGeek

Leveraging real-time data is crucial in the live events industry. From schedule announcements to last-minute ticket on-sales, things move quickly, and agility is key. Companies that react swiftly can deliver a superior fan experience. Real-time data enables SeatGeek to personalize the fan journey, provide live, actionable insights to stakeholders, and optimize operations by dynamically responding to demand signals.

A new approach to data distribution

Our vision for data distribution is to establish a continuous loop – from production to consumption, enrichment, and feedback to the source. This article focuses on one part of that loop: data production.

data distribution as a continuous loop

Goals

To guide our efforts and stay aligned with the overall vision, we set the following goals.

  1. Guarantee data integrity and consistency.
  2. Achieve scalable, predictable performance.
  3. Encourage domain-driven, reusable data that supports diverse uses.
  4. Foster shared ownership through domain-driven design and data contracts.
  5. Prioritize developer experience.

Existing approaches for producing data

Before introducing a new approach, we evaluated the existing approaches to producing data to ensure we weren’t duplicating work.

1. Polling Consumers

In this approach, consumers periodically check for updates in the data, through an API or direct database access. This pattern is common in ETL (Extract, Transform, Load) workflows. A notable example is Druzhba, our open-source tool for data warehouse ETL.

Key Limitations

  • Introduces inherent latency, especially when updates are frequent.
  • Places unpredictable load on source systems.
  • Direct database access violates service boundaries.
  • Reintegration overhead for each new consumer.

2. Direct Publish

The next strategy involves publishing data directly to consumers. A classic example of this is publishing user activity data, also known as clickstreams, to Kafka. In this case, the publish step does not need to be transactional; however, there are instances where it must be. For example, when an order is created and the inventory needs to be updated, ensuring that both actions occur together is crucial, so you want the updates to be all-or-nothing. In such cases, partial failures can lead to inconsistencies between systems, undermining trust in the data.

the dual-write problem

Key Limitations

  • Dual-write problem: Lack of transactional guarantees introduces data inconsistencies.
  • Ensuring consistency across systems is hard, and techniques like distributed transactions/event sourcing add unnecessary complexity.
  • Increased operational complexity to address consistency between systems.

3. Change Data Capture

Lastly, Change Data Capture allows us to subscribe to all database updates and propagate those changes downstream. Because changes are automatically captured at the database level, we avoid inconsistencies that can arise from dual writes.

Key limitations

  • Shifts complexity to consumers.
  • Varying implementations across consumers lead to inconsistencies.
  • Data format changes are not detected, increasing the likelihood of downstream breakages.
  • This is a variation of direct database access and hence violates service boundaries.

In the end, we concluded that none of the existing approaches would help us achieve our goals, so we sought a new solution.


Picking a solution

We began by drafting an RFC (Request for Comments) document to outline the proposed approach and evaluate alternatives – a standard practice at SeatGeek that enables us to gather stakeholder feedback and make informed decisions. For our real-time data distribution needs, we focused on four key areas as shown in the image below.

four key areas

After reviewing the alternatives for each area, we settled on the transactional outbox pattern for its simplicity and effectiveness in addressing the dual-write problem while ensuring data integrity. We opted to have applications drive the event publishing to take advantage of domain events, which are best defined at the source. For relaying messages, we chose transaction log-tailing with Debezium, an established tool that efficiently captures changes from the database. Finally, we selected Kafka as our message broker primarily for its reliability as a log store, which enables us to reuse data effectively. Additionally, since we were already using Kafka in our infrastructure, it made sense to leverage it. We also decided to enforce the usage of schemas to promote shared ownership of the data.

A modern twist on the transactional outbox pattern

Before diving into the twist, let’s briefly recap the traditional transactional outbox pattern with an example. Imagine an application needs to publish an event related to a customer order. This process might involve updates to the sales, inventory, and customers tables. We’ll also assume the use of Postgres, Debezium, and Kafka. The following steps occur, as illustrated in the diagram below.

  1. Construct a domain event: The application creates a domain event that includes relevant information about the order, such as the number of items purchased, and customer details.
  2. Insert the event into the outbox table: The domain event is written to a dedicated “outbox” table, a temporary storage location for events. This step occurs within the same database transaction as the updates to the sales, inventory, and customers tables.
  3. Commit the transaction: Once the transaction is committed successfully, the changes are recorded in the Postgres write-ahead log (WAL).
  4. Relay and publish: Debezium captures all changes to the outbox table, so when it detects a new entry in the outbox table, it relays that event to Kafka for downstream consumers to process.

outbox flow

Challenges with the single outbox table

Early during the RFC process, we identified potential challenges with using a single outbox table, prompting us to explore alternatives to improve performance and scalability.

Performance impact

  • A single outbox table for an entire database can become the bottleneck, especially with high write throughput.
  • Lock contention is also a significant risk, as multiple concurrent writes compete for access to the table.

Complexity with table clean-up

  • Managing the size of the outbox table over time is crucial and requires a separate, external cleanup process.
  • Having an aggressive cleanup process risks lock contention between inserts and deletes.
  • Conservative cleanup could lead to an ever-growing outbox table, which increases the likelihood of performance degradation.

Bypassing the outbox table

Fortunately, we discovered that Postgres has a really neat feature: skip the outbox table and write to the write-ahead-log (WAL) directly! The WAL is central to Postgres’s durability and performance. It logs all changes before they are applied to data files, ensuring that transactions are committed reliably. WAL entries are sequential and optimized for high write throughput.

Writing directly to the WAL is made possible through logical decoding, a mechanism that allows Postgres to stream row-level changes to external consumers; it is also the mechanism Postgres uses to replicate changes from the primary to replicas.

Postgres provides a built-in function, pg_logical_emit_message() (documented here), for writing custom data to the WAL as part of a transaction. We will later see how we leverage this functionality to emit domain events, along with metadata, from applications to Kafka.
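At its simplest, emitting a message this way is one extra SQL call inside the same transaction as the business writes. The sketch below uses SQLAlchemy with an illustrative table, prefix, and payload; in practice the prefix format, serialization, and schema validation are handled by the library described in the next section.

import json
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/db")  # placeholder DSN

def publish_order_created(order_id, items):
    payload = json.dumps({"order_id": order_id, "items": items})
    with engine.begin() as conn:  # one transaction: business writes + WAL message
        conn.execute(
            text("UPDATE inventory SET reserved = reserved + :n WHERE sku = :sku"),
            {"n": items, "sku": "GA-FLOOR"},  # hypothetical business write
        )
        # pg_logical_emit_message(transactional, prefix, content)
        conn.execute(
            text("SELECT pg_logical_emit_message(true, :prefix, :content)"),
            {"prefix": "outbox:orders.created", "content": payload},
        )
    # On commit, Debezium picks the message up from the WAL and relays it to Kafka.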


Implementation: From Skateboard to Car

We adopted an incremental approach, starting with a simple, low-risk solution and gradually building toward a more full-fledged, production-ready system. Following this “Skateboard to Car” philosophy allowed us to experiment and validate our assumptions early on, and move with a higher velocity. This also allowed us to incorporate feedback from users, and minimize risk.

Proof of concept (aka the “Skateboard”)

Before rolling out a full-scale implementation, we wanted to uncover the unknowns and build confidence in our approach. To do this, we:

  • Enabled logical replication on one of our most active databases.
  • Set up Debezium Server.
  • Integrated pg_logical_emit_message() in a high-traffic, critical request path.
  • Simulated an on-sale scenario to load-test this setup at a much higher scale than normal.

Results

The initial results were promising:

  • Excellent performance: Debezium delivered exceptional throughput under heavy load.
  • Minimal database impact: Writing to the WAL directly introduced negligible overhead, even during high write throughput.

Challenges

However, we also uncovered some key challenges:

  • Debezium’s Outbox Event Router

    • Debezium has an Outbox Event Router that routes events from the database to specific Kafka topics.
    • It assumes the existence of an outbox table, which we were bypassing with our direct write-to-WAL approach.
    • It does not integrate with the schema registry, necessitating a custom solution to handle message routing.
  • Debezium Server JSONSchema support

    • Debezium Server did not support JSON format with schemas enabled.
    • In its absence, the default behavior was base64 encoding entire records, which exposed internal structures and added unnecessary complexity for consumers.
    • This limitation made Debezium Server unsuitable for our use case, leading us to use Kafka Connect® instead.

Rolling out a full-scale implementation (aka the “Bicycle”)

The implementation involves 4 core components:

  1. The Kafka Connect cluster(s) for Debezium
  2. Data contracts and schemas
  3. A library for applications to produce data
  4. A custom Single Message Transformation (SMT) to route messages to their appropriate destination

1. Kafka Connect and Debezium

We run Debezium as a source connector on Kafka Connect rather than using Debezium Server, because of the previously noted limitations with schemas. The Kafka Connect clusters run distributed workers on Kubernetes, to ensure high availability and scalability. Additionally, the clusters for transactional outbox are completely isolated from the rest of our Kafka Connect clusters.

2. Data contracts and schemas

Schemas play a foundational role in our data infrastructure, serving as contracts between producers and consumers. They define the structure of the data being exchanged, ensuring data integrity, compatibility, and decoupling.

We use Confluent Schema Registry for managing Kafka schemas across the company. It validates data at the source, ensuring only well-structured data is published. Schema management tools in our CI/CD workflows automate the detection, validation, and migration of schema changes. This eliminates the risk of breaking downstream consumers while maintaining seamless integration across teams.

Schema ownership has cultivated a forward-thinking culture at SeatGeek, where developers consider the evolution of data contracts when designing applications. This shared responsibility enables us to scale our systems confidently while maintaining data quality. While developers maintain responsibility for schema compatibility, the Data Platform team supports this with tools and guidance to simplify the process.

Schemas are a requirement for using the transactional outbox workflow, and are tightly integrated into the library. Messages are validated using the schema registry before writing to the WAL. Downstream consumers use the same schema for deserialization, ensuring consistency throughout the pipeline.

3. Producing data from applications

Our approach here was to create a library that applications could integrate. We started with a library for Python – the most popular language for services at SeatGeek. Since then, we’ve added support for C#, and plan to support Golang as well. The key features of the library are:

  • Schema validation and serialization using the Confluent Schema Registry.
  • Writing directly to the WAL using pg_logical_emit_message().
  • Custom metadata injection, such as tracing context and message headers, as part of the message prefix.

The centralized nature of the library ensures that schema validation and WAL writes are standardized across the organization. Developers cannot bypass the schema registry, ensuring consistency and reliability.

Examples and code snippets

Breakdown of pg_logical_emit_message() components

Example: Using the outbox library in an application

      
from transactional_outbox import Outbox

outbox = Outbox()

@app.post("/some/api/route")
async def handler(req):
  # -- Handle the request --
  #   - open a database connection
  #   - execute business logic

  # -- Publish data --
  #   - write to the WAL
  await outbox.write_to_outbox(
    session=session,              # the database connection
    kafka_topic="...",            # the Kafka topic to write to
    kafka_key="...",              # an optional key for the message
    message={ ... },              # the message to write to the WAL (a dict matching the schema)
    kafka_headers={ ... },        # custom Kafka headers
  )

  # -- Flush the data --
  session.commit()

  # -- End --
      
    

Snippet: Write to outbox function

      
async def write_to_outbox(
  self,
  session: Union[Session, AsyncSession],
  message: dict[str, Any],
  kafka_topic: str,
  kafka_key: Optional[str] = None,
  kafka_headers: Optional[dict[str, str]] = None,
  span_context: Optional[dict[str, str]] = None,
  commit: bool = False,
) -> None:
  # Validate that the DB session is active...
  # Serialize the message using the schema registry
  serialized_message: bytes = serialize_msg_schema_registry(
    topic_name=kafka_topic,
    message=message,
    sr_client=self.registry_client,
  )

  # Build the prefix metadata (Kafka topic, key, headers, tracing context, etc.)
  prefix_obj = MessagePrefix(
    kafka_topic=kafka_topic,
    kafka_key=kafka_key,
    kafka_headers=kafka_headers or {},
    span_context=span_context or {},  # tracing context or other metadata, if provided
  )
  serialized_prefix: str = prefix_obj.serialize()

  # Use pg_logical_emit_message() to write to the WAL
  statement = select(
    func.pg_logical_emit_message(
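      # transactional flag: when True, the message is only decoded downstream if the enclosing transaction commits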
      True,
      serialized_prefix,
      serialized_message,
    )
  )
  await self.execute_session(
    session=session,
    prepared_statement=statement,
  )
      
    

4. Routing messages using Single Message Transforms (SMTs)

SMTs are a feature of Kafka Connect that enables users to modify or transform messages as they pass through the pipeline. We built an SMT to replace Debezium’s built-in outbox event router, with support for the schema registry.

How it works

  • The SMT distinguishes between heartbeat records and outbox records based on the Connect schema name (io.debezium.connector.common.Heartbeat and io.debezium.connector.postgresql.MessageValue, respectively).
  • Heartbeat records are passed along with no modification.
  • For outbox records, the prefix field embedded in each message contains metadata like which topic the message should be routed to.
  • The span context and headers from the metadata are moved into the output record’s headers.
  • Additional telemetry data such as end-to-end latency is emitted by the SMT based on source metadata that Debezium includes by default.
  • The content of the outbox record is preserved and emitted as raw bytes (Note: This requires the use of Kafka Connect’s ByteArrayConverter).

flowchart of the single message transform


Building a robust system (aka the “Car”)

The next step in our journey was to make the whole system fault-tolerant and easy to manage.

Adding heartbeat events

As Debezium processes logical decoding messages, it reports the last successfully consumed position back to Postgres, which tracks it in the pg_replication_slots view. WAL segments containing unprocessed messages are retained on disk, so Debezium must keep consuming to prevent disk bloat.

Additionally, the number of retained WAL segments depends on overall database activity, not just the volume of outbox messages. If other database activity is significantly higher than outbox traffic, Postgres retains far more WAL than the outbox workflow actually needs, causing unnecessary disk usage.

To address this, we set up Debezium to periodically emit heartbeat events by updating a dedicated heartbeat table and consuming the resulting changes. These events are also published to a separate Kafka topic, allowing us to monitor Debezium’s progress and connectivity.

heartbeat events

Heartbeat events serve two purposes:

1. Advancing the replication slot position for WAL cleanup:

  • Heartbeats are periodically committed to the WAL alongside regular database updates.
  • When Debezium processes these heartbeats, it advances its replication slot position to reflect the latest WAL segment it has consumed.
  • This update signals Postgres that all WAL segments before this point are no longer needed, marking them safe for removal from disk.

2. Monitoring Debezium connectivity:

  • Debezium is configured to periodically execute a query on the source database, ensuring it remains active.
  • Debezium’s health can be monitored by checking the recency of heartbeat messages in the dedicated Kafka heartbeats topic.
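Concretely, this comes down to a couple of Debezium connector properties. The sketch below is illustrative rather than our exact configuration (shown in YAML form for readability; the interval, heartbeat table name, and query are placeholders):

# Illustrative Debezium (Postgres connector) heartbeat settings; values are examples, not our production config.
heartbeat.interval.ms: 60000        # emit a heartbeat roughly once a minute
heartbeat.action.query: >           # the query Debezium runs against the source database on each heartbeat
  INSERT INTO debezium_heartbeat (id, last_heartbeat)
  VALUES (1, now())
  ON CONFLICT (id) DO UPDATE SET last_heartbeat = now()

Debezium also publishes each heartbeat to a dedicated heartbeat topic, which is what we watch to monitor progress and connectivity.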

Automatic connector restarts

One thing we noticed was that connectors occasionally failed, either due to a network partition or expired database credentials, and had to be restarted.

To reduce the need for human intervention, we wrote a script that uses the Connect API to check the status of connectors on the pod. The script is executed as a livenessProbe on Kubernetes, and anytime the probe fails, Kubernetes restarts the container, which also restarts the connector.

Note: This had to be a livenessProbe and not a readinessProbe: a failing readinessProbe only removes the pod from service, whereas a failing livenessProbe triggers the container restart we actually want (and the Connect API doesn’t become available until the readinessProbe succeeds anyway).
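For illustration, such a probe might look roughly like the following. This is a simplified sketch, not our exact script; it assumes the worker exposes the standard Kafka Connect REST API on port 8083, that curl is available in the image, and that the connector name is provided via a hypothetical CONNECTOR_NAME environment variable:

# Sketch of a livenessProbe that restarts the container when the connector or any task reports FAILED.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Query the Kafka Connect REST API for the connector status.
        status=$(curl -sf "http://localhost:8083/connectors/${CONNECTOR_NAME}/status") || exit 1
        # Fail the probe if the connector or any of its tasks is in the FAILED state.
        echo "$status" | grep -q '"state":"FAILED"' && exit 1
        exit 0
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 3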

automatic restarts using Kubernetes probes

Observability

Adding observability through distributed tracing was a key part of empowering users to visualize and inspect the flow of data throughout the system. Each stage emits telemetry data that is all tied together within Datadog, the observability platform we use at SeatGeek.

At the application level, the library adds a new span to the current trace context. This context is injected as metadata into the prefix object of the WAL message. The SMT then relocates that data to the Kafka message headers. Any consumers of the topic will also inherit the trace context. When this is all put together in Datadog, we’re able to visualize the flow of data from its origin (for example, an HTTP request) all the way down to its final destination (for example, a Flink job).

distributed tracing


Learnings and Looking Ahead

As we reflect on this journey, here are some key insights we’ve gathered and areas for future improvement.

Key Learnings

  • Development Effort: This approach takes real time and effort to develop, set up, and maintain.
  • Dependence on Observability: We rely heavily on observability tooling to troubleshoot issues and confirm the pipeline is behaving correctly.
  • Ease of Adoption: The pattern is easier to adopt in new projects; retrofitting it into existing systems is much harder, especially when working with snapshots.
  • Cross-Team Collaboration: Close collaboration between application teams and the Data Platform team is crucial to success.

Future Areas of Focus

  • Snapshot Handling: Enhancing support for snapshots (or backfills) will be a priority, as they are essential for data recovery and completeness.
  • Schema registry developer experience: We aim to simplify the creation and maintenance of schemas.
  • Consumption Experience: Improving the data consumption side is critical. We’re focusing on stream processing with Flink to unlock advanced use cases and make working with real-time data more seamless.
  • Automation and Tooling: Investing in automation and developer tools to simplify setup and maintenance will help reduce friction and increase adoption.

Optimizing GitLab CI Runners on Kubernetes: Caching, Autoscaling, and Beyond

Introduction: Getting the Most Out of CI

Moving CI runners to Kubernetes was a game changer for our engineering teams. By adopting the GitLab Kubernetes Executor and Buildkit on Kubernetes, we resolved many painful issues we had with our old system, such as resource waste, state pollution, and job queue bottlenecks. But we weren’t content to stop there. Running on Kubernetes opened the door to deeper optimizations.

In this final installment of our CI series, we explore five critical areas of improvement: caching, autoscaling, bin packing, NVMe disk usage, and capacity reservation. These changes not only enhanced the developer experience but also kept operational costs in check.

Let’s dive into the details!

Caching: Reducing Time to Pipeline Success

Caching (when it works well) is a significant performance and productivity booster. In a Kubernetes setup, where CI jobs run in ephemeral pods, effective caching is essential to avoid repetitive, time-consuming tasks like downloading dependencies or rebuilding assets. This keeps pipelines fast and feedback loops tight.

We use S3 and ECR to provide distributed caching for our runners. S3 stores artifacts across all jobs, with a lifecycle policy that enforces a 30-day expiration. ECR stores container image build caches, with a lifecycle policy that auto-prunes old caches to keep us within the images-per-repo limits.

These are both used by default to significantly reduce job times while maintaining high overall reliability.
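For the S3 side, the cache is wired up in the runner configuration. A rough sketch is shown below, embedded in gitlab-runner Helm values; the bucket name and region are placeholders, and the 30-day expiration itself is an S3 lifecycle rule configured outside the runner:

# Illustrative gitlab-runner Helm values: a shared S3 cache used by all jobs.
runners:
  config: |
    [[runners]]
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "example-ci-cache"   # placeholder bucket
          BucketLocation = "us-east-1"      # placeholder region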

Caching
Artifact and Image Caching

Why are Builds so Slow?

One interesting issue we ran into with build caching is that, when performing multi-architecture builds, our caches would alternate between architectures. For example:

  • Pipeline 1 = amd64 (cached) arm64 (no cache)
  • Pipeline 2 = amd64 (no cache) arm64 (cached)
  • Pipeline 3 = amd64 (cached) arm64 (no cache)
  • … and so on

Sometimes builds would benefit from local layer caching if they landed on the right pod, in which case both architectures would build quickly, making this a tricky problem to track down.

This behavior is likely due to how we build each architecture natively on separate nodes for performance reasons (avoiding emulation entirely). There’s an open issue for buildx that explains how, for multi-platform builds, buildx only uploads the cache for one platform. The --cache-to target isn’t architecture-specific, so each run overwrites the previous architecture’s cache.

Our current workaround is to perform two separate docker buildx build calls, so that each architecture pushes its own cache, then use docker manifest create && docker manifest push to stitch the results together.
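In CI terms, the workaround looks roughly like the following. This is an illustrative sketch with placeholder tags, stripped down to the relevant commands (each build job is assumed to run on a runner of the matching architecture); the essential part is that each architecture gets its own --cache-to/--cache-from ref:

# Illustrative .gitlab-ci.yml fragment: per-architecture builds and caches, stitched into one manifest.
build-amd64:
  script:
    - >
      docker buildx build --platform linux/amd64
      --cache-from type=registry,ref=$CI_REGISTRY_IMAGE:cache-amd64
      --cache-to type=registry,ref=$CI_REGISTRY_IMAGE:cache-amd64,mode=max
      --tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA-amd64 --push .

build-arm64:
  script:
    - >
      docker buildx build --platform linux/arm64
      --cache-from type=registry,ref=$CI_REGISTRY_IMAGE:cache-arm64
      --cache-to type=registry,ref=$CI_REGISTRY_IMAGE:cache-arm64,mode=max
      --tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA-arm64 --push .

create-manifest:
  needs: [build-amd64, build-arm64]
  script:
    - >
      docker manifest create $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
      $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA-amd64
      $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA-arm64
    - docker manifest push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA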

With this in place we’re seeing up to 10x faster builds now!

Autoscaling: Working Smarter, Not Harder

One of Kubernetes’ standout features is its ability to scale dynamically. However, getting autoscaling right is a nuanced challenge. Scale too slowly, and jobs queue up. Scale too aggressively, and you burn through resources unnecessarily.

Scaling CI Runners

We used the Kubernetes Horizontal Pod Autoscaler (HPA) to scale our runners based on saturation: the ratio of pending and running jobs to the total number of available job slots. As the saturation ratio changes, we scale the number of runners up or down to meet demand:

Runner Autoscaling
The total capacity autoscales based on the number of concurrent CI jobs

But this wasn’t as simple as turning it on and walking away - we had to fine-tune the scaling behavior to avoid common pitfalls (see the sketch after this list):

  • Scale-Up Latency: If many jobs come in around the same time, it can take a bit for the runners to scale up enough to meet that demand. We’re currently targeting a saturation ratio of 70%. When exceeded, the system is allowed to double its capacity every 30 seconds if needed, with a stabilization window of 15 seconds.
  • Over-Aggressive Scale-Downs: To avoid thrashing from scaling down too much (and/or too fast), we scale down cautiously - removing up to 30% of available slots every 60 seconds, and waiting for a 5-minute stabilization window before taking action.

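As a rough sketch (not our exact manifest), the tuned behavior of such an HPA might look like the following. The Deployment name and replica bounds are placeholders, and the metric that feeds the saturation ratio is omitted since it depends on how that ratio is exposed:

# Hedged sketch of the scaling behavior described above, using the autoscaling/v2 API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-runner
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-runner
  minReplicas: 2
  maxReplicas: 100
  # metrics: omitted here; in practice this targets the job-slot saturation ratio
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 15
      policies:
        - type: Percent
          value: 100        # allow capacity to double...
          periodSeconds: 30  # ...every 30 seconds when saturation exceeds the target
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before acting on a scale-down
      policies:
        - type: Percent
          value: 30          # remove at most 30% of slots...
          periodSeconds: 60  # ...per minute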
The result? Our CI runners now scale seamlessly to handle peak workloads while staying cost-efficient during quieter times.

Job queue times
Can you guess when we migrated CI to Kubernetes?

Scaling Buildkit

In our previous post, we shared how we run Buildkit deployments on Kubernetes to build container images. We also leverage an HPA to scale Buildkit deployments to try and match real-time demand.

Unfortunately, Buildkit doesn’t expose any metrics for the number of active builds, and our cluster doesn’t yet support auto-scaling on custom metrics, so we had to get creative. We ended up autoscaling based on CPU usage as a rough proxy for demand. This hasn’t been perfect, and we’ve had to tune the scaling to over-provision more than we’d like to ensure we can handle spikes in demand.

Buildkit Autoscaling
Ideally we’d have a perfect 1:1 ratio of jobs to pods, but scaling based on CPU gets us close enough

We’d eventually like to shift our strategy to use ephemeral Buildkit pods that are created on-demand for each build and then discarded when the build is complete. This would allow us to scale more accurately and avoid over-provisioning, but at the cost of some additional latency. This would also help solve some issues we’ve been having with flaky builds and dropped connections that may be due to resource contention or state pollution.

Bin Packing: Maximizing Node Utilization

Kubernetes’ scheduling capabilities gave us the tools we needed to improve how jobs were placed on nodes, making our cluster more efficient. This is where bin packing came into play.

We defined dedicated node pools for CI workloads with sufficient resources to handle multiple concurrent jobs with ease. With this, we gave CI jobs dedicated access to fast hardware with NVMe disks and opted out of using spot instances to guarantee high reliability for the pipelines:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: linux-nvme-amd64-ci
spec:
  disruption:
    consolidationPolicy: "WhenEmpty"
    consolidateAfter: "10m"
    expireAfter: "Never"
    budgets: { nodes: "1" }
  template:
    metadata: {}
    spec:
      nodeClassRef:
        name: "linux-nvme"
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.k8s.aws/instance-local-nvme
          operator: Gt
          values:
            - "1000"
      taints:
        - key: dedicated
          value: "ci"
          effect: NoSchedule

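For CI job pods to land on this pool, they also need matching scheduling constraints. A minimal sketch is below; the karpenter.sh/nodepool label is the one Karpenter applies to nodes it provisions, and the exact selector you use may differ:

# Sketch of the pod-level scheduling constraints a CI job pod would carry
# to run on the tainted, dedicated CI node pool above.
spec:
  nodeSelector:
    karpenter.sh/nodepool: linux-nvme-amd64-ci
  tolerations:
    - key: dedicated
      operator: Equal
      value: "ci"
      effect: NoSchedule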
To ensure that Karpenter wouldn’t disrupt CI jobs before they finished, we configured pod-level disruption controls to ensure the jobs (and their underlying nodes) wouldn’t get rug-pulled.

spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

We then set informed defaults for resource requests, along with reasonable limits, to efficiently pack CI jobs onto nodes without them becoming “noisy neighbors”. We also allowed developers to set their own elevated requests and limits for resource-intensive jobs, ensuring fast execution. This fine-tuning reduced fragmentation and avoided over-provisioning resources.

The payoff was significant. We saw higher utilization across our CI node pools, using fewer hosts, and without the instability that can come from overloading nodes.

EC2 Hosts
This compares the host count during a typical week. We’re able to run more workloads on Kubernetes with fewer hosts for significant cost savings!

NVMe Disk Usage: Turbocharging I/O

Disk I/O often becomes a bottleneck for CI workloads. Leveraging NVMe storage improved our build times by reducing disk read/write latency.

Unfortunately, Bottlerocket doesn’t support using the local NVMe drive for ephemeral storage out-of-the-box, so we adapted this solution to use a bootstrap container to configure that ephemeral storage on node startup.

Capacity Reservation: Ensuring Spare Nodes for CI Workloads

Autoscaling is powerful, but waiting for nodes to spin up during usage spikes can cause frustrating delays. That’s why we implemented capacity reservation to keep spare nodes ready for CI jobs, even during off-peak hours.

We did this by over-provisioning the cluster with a few idle pods with high resource requests in the CI namespace. If Kubernetes needs to schedule a CI job but lacks available nodes, the higher-priority CI job causes the lower-priority “idle” pod to be preempted (evicted) immediately, making room for the job to start right away. Kubernetes then spins up a new node for that idle pod, ensuring the cluster has spare capacity for any additional jobs.

These pods also have init containers that simply pre-pull frequently used container images. This ensures that new nodes can start running CI jobs immediately without waiting for those images to download.
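A hedged sketch of what this can look like is below; the priority value, replica count, resource sizes, and image names are all illustrative rather than our actual configuration:

# Low-priority "balloon" pods that reserve spare capacity for CI.
# When a real CI job needs the space, these pods are preempted immediately.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ci-overprovisioning
value: -10                  # lower than the default (0), so CI jobs preempt these pods
globalDefault: false
preemptionPolicy: Never     # idle pods should never preempt anything themselves
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ci-capacity-reservation
  namespace: ci
spec:
  replicas: 2               # how much headroom to keep warm (placeholder)
  selector:
    matchLabels: { app: ci-capacity-reservation }
  template:
    metadata:
      labels: { app: ci-capacity-reservation }
    spec:
      priorityClassName: ci-overprovisioning
      # These pods would also carry the CI node pool toleration/nodeSelector shown earlier,
      # so the reserved headroom sits on the dedicated CI nodes.
      initContainers:
        - name: prepull-helper                      # pre-pulls a frequently used CI image (placeholder name)
          image: registry.example.com/ci/helper:latest
          command: ["true"]                         # exit immediately; pulling the image is the point
      containers:
        - name: idle
          image: registry.k8s.io/pause:3.9          # does nothing; just holds the resource reservation
          resources:
            requests:
              cpu: "6"        # placeholder sizing
              memory: 24Gi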

The result? CI jobs start immediately, with no waiting around for new nodes to spin up. Developers are happy, and our cluster stays responsive, even during peak hours.

Conclusion: Fine-Tuning CI for Developer Happiness

By leveraging caching, autoscaling, bin packing, NVMe disks, and capacity reservation, we’ve significantly improved both developer experience and operational efficiency. The outcome of this, along with the overall migration of runners to Kubernetes, can be summarized with the following metrics:

  • Job Queue Time: The average time for a pending job to be picked up by a runner has dropped from 16 seconds to just 2 seconds. Even more impressive, the p98 queue time has gone from over 3 minutes to under 4 seconds. Developers get faster feedback loops so they can focus on getting shit done.

  • Cost Per Job: By optimizing resource utilization and scaling intelligently, we’re now spending 40% less per job compared to our previous setup. That’s a huge win for keeping our CI pipelines cost-effective as we continue to scale.

The journey to perfecting CI is iterative, and every improvement brings us closer to a system that’s faster, more reliable, and more cost-efficient. These improvements showcase how a well-architected and finely tuned CI system can deliver substantial value - not just in raw performance metrics, but in real, measurable developer productivity.

Building Containers on Kubernetes with Buildkit

In our last post, we discussed why we decided to migrate our continuous integration (CI) infrastructure to Kubernetes. Now, we’re going to tackle the next major challenge: enabling engineers to build and push Docker container images with a solution that is 100% Kubernetes-native. Our solution needed to work seamlessly on Kubernetes, be reliable, and integrate easily with our existing CI/CD workflows. This shift brought unique challenges, unexpected obstacles, and opportunities to explore new ways to improve the developer experience.

Before diving into the technical details, let’s cover how we used to facilitate Docker builds on the previous infrastructure and why we chose to change our approach.

The Past: Building Images on EC2

Before Kubernetes, each CI job was scheduled onto a dedicated EC2 instance and could fully utilize any/all of that node’s resources for its needs. Besides the usual CPU, memory, and disk resources, these nodes also had a local Docker daemon that could be used for building images with simple docker build or docker buildx commands.

Visualization of the Nomad host running CI jobs alongside an exposed Docker socket
On Nomad, each host exposed its own Docker daemon to CI jobs

While this setup worked fine most of the time, it had some major limitations:

  • State Pollution: One of the biggest issues was that Docker’s settings could be affected by previous builds. For instance, if a build changed certain global settings (perhaps by running docker login commands), it could impact subsequent builds. The absence of true isolation meant that any state left behind by one build could influence the next, which created unpredictability and made troubleshooting difficult.

  • Difficult to Maintain: Every aspect of Docker image building was tightly coupled to CI, Nomad, and these EC2 instances. This made it difficult to maintain, troubleshoot, and modify as our needs evolved. Changes were risky, which led to a reluctance to make improvements, and so the system became increasingly fragile over time.

  • Lack of Multi-Architecture Support: As we pivoted production workloads to Kubernetes and spot instances, we wanted to build multi-arch images for both x86 and ARM CPU architectures. Unfortunately, our old setup only supported this via emulation, which caused build times to explode. (We did have a handful of remote builders to support native builds, but these ended up being even more fragile and difficult to manage.)

These challenges led us to ask: how can we modernize our CI/CD pipeline to leverage Kubernetes’ strengths and address these persistent issues? We needed a solution that would provide isolation between builds, multi-architecture support, and reduced operational overhead - all within a more dynamic and scalable environment.

Evaluating Options

Knowing we wanted to leverage Kubernetes, we evaluated several options for building container images in our CI environment. We avoided options like Docker-in-Docker (DinD) and exposing the host’s containerd socket due to concerns around security, reliability, and performance. It was important to have build resources managed by Kubernetes in an isolated fashion. We therefore narrowed our choices down to three tools: Buildkit, Podman, and Kaniko.

All three offered the functionality we needed, but after extensive testing, Buildkit emerged as the clear winner for two key reasons:

  • Performance: Buildkit was significantly faster. In our benchmarks, it built images approximately two to three times faster. This speed improvement was crucial - time saved during CI builds translates directly into increased developer productivity. The faster builds enabled our developers to receive feedback more quickly, which improved the overall development workflow. Buildkit’s ability to parallelize tasks and effectively use caching made a substantial difference in our CI times.

  • Compatibility: Our CI jobs were already using Docker’s buildx command, and remote Buildkit worked seamlessly as a drop-in replacement. This made the migration easier, as we didn’t need to rewrite CI jobs and build definitions. The familiar interface also reduced the learning curve for our engineers, making the transition smoother.

Buildkit Architecture: Kubernetes Driver vs. Remote Driver

After selecting Buildkit, the next decision was how to run it in Kubernetes. There were two main options - a Kubernetes driver that creates builders on-the-fly, and a remote driver that connects to an already-running Buildkit instance.

We ultimately opted to manage Buildkit Deployments ourselves and connect to them with the remote driver. Here’s why:

  • Direct Control: Using the remote driver allowed us to have more direct control over the Buildkit instances. We could fine-tune resource allocations, manage scaling, and monitor performance more effectively.

  • Security: The Kubernetes driver needs the ability to create and manage arbitrary pods in the cluster, a privilege we wanted to avoid granting to the whole CI system. Using the remote driver avoids this because the Buildkit instances are not managed by the docker CLI.

  • Prior Art: We knew some other organizations were leveraging the remote driver in Kubernetes, which gave us confidence that it was a viable approach. We were able to learn from their experiences and best practices which helped us avoid some pitfalls.
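In a CI job, wiring up the remote driver is roughly a one-liner. The snippet below is only a sketch: the Service name, namespace, port, and image tag are placeholders, and the real jobs carry more configuration:

# Illustrative .gitlab-ci.yml snippet: point buildx at a shared Buildkit Service
# via the remote driver, then build and push as usual.
build-image:
  image: docker:27-cli
  script:
    - docker buildx create --name remote-builder --driver remote tcp://buildkitd.buildkit.svc.cluster.local:1234
    - docker buildx build --builder remote-builder --tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" --push .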

The Pre-Stop Script: Handling Graceful Shutdowns

One unexpected problem we encountered relates to how Buildkit handles shutdowns. By default, Buildkit terminates immediately upon receiving a SIGTERM signal from Kubernetes instead of waiting for ongoing builds to finish. This behavior caused issues when Kubernetes scaled down pods during deployments or when autoscaling - sometimes terminating builds in the middle of execution and leaving them incomplete! This was not acceptable for our CI/CD pipelines, as incomplete builds lead to wasted time and developer frustration.

Visualization of pod termination in Kubernetes without a pre-stop script
Kubernetes sends SIGTERM immediately, causing active builds to get killed

To address this, we implemented this Kubernetes pre-stop hook. The pre-stop script waits for active network connections to drain before allowing Kubernetes to send the SIGTERM signal. This change significantly reduced the number of failed builds caused by premature termination, making our system more reliable.
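Conceptually, the hook looks something like the following. This is a simplified sketch rather than our exact script; it assumes buildkitd listens on TCP port 1234 and that the image ships a shell and the ss utility:

# Sketch of a preStop hook that delays SIGTERM until in-flight builds finish.
# terminationGracePeriodSeconds must be raised to cover the longest expected build.
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Wait (up to ~30 minutes) for established client connections to drain.
          for i in $(seq 1 1800); do
            conns=$(ss -Htn state established '( sport = :1234 )' | wc -l)
            [ "$conns" -eq 0 ] && exit 0
            sleep 1
          done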

Visualization of pod termination in Kubernetes
Kubernetes waits until the preStop script completes before sending the SIGTERM

Implementing the pre-stop hook involved some trial and error to determine the appropriate waiting period, and it’s still not perfect, but it ultimately provided a significant boost to build stability. This solution allows Buildkit to complete its work gracefully, ensuring that we maintained the integrity of our build process even during pod terminations.

Reflecting on the New System: Wins, Challenges, and Lessons Learned

Reflecting on our journey, there are several clear wins and other important lessons we learned along the way.

What Went Well

Moving to Buildkit has been a major success in terms of performance! Builds are as fast as ever, and using Kubernetes has allowed us to simplify our infrastructure by eliminating the need for dedicated EC2 hosts. Kubernetes provided us with the scalability we needed, enabling us to add or remove capacity as demand fluctuated. And Buildkit’s support for remote registry-based caching further optimizes our CI build times.

Challenges with Autoscaling

One area where we’re still refining our approach is autoscaling. We’d ideally like to autoscale in real-time based on connection count so that each Buildkit instance is handling exactly one build. Unfortunately the cluster we’re using doesn’t support custom in-cluster metrics just yet, so we’re using CPU usage as a rough proxy. We’re currently erring on the side of having too many (unused) instances to prevent bottlenecks but this is not very cost-efficient. Even if we get autoscaling perfect, there’s still a risk that two builds might schedule to the same Buildkit instance - see the next section for more on this.

Furthermore, we’ve noticed that putting Buildkit behind a ClusterIP Service causes kube-proxy to sometimes prematurely reset TCP connections to pods that are being scaled down - even when the preStop hook hasn’t run yet. We haven’t yet figured out why this happens, but switching to a “headless” Service has allowed us to avoid this problem for now.

What We’d Do Differently

Our decision to run Buildkit as a Kubernetes Deployment + Service has been a mixed bag. Management has been easy, but high reliability has proven elusive. If we could start over, we’d start with a solution that guarantees (by design) that each build gets its own dedicated, ephemeral Buildkit instance that reliably tears down at the end of the build.

The Kubernetes driver for Buildkit partially satisfies this requirement, but it’s not a perfect fit for our needs. We’ll likely need some kind of proxy that intercepts the gRPC connection, spins up an ephemeral Buildkit pod, proxies the request through to that new pod, and then terminates the pod when the build is complete. (There are some other approaches we’ve been considering, but so far this seems like the most promising).

Regardless of how we get there, pivoting to ephemeral instances will finally give us true isolation and even better reliability, which will be a huge win for the engineers who rely on our CI/CD system.

Conclusion

Migrating our Docker image builds from EC2 to Kubernetes has been both challenging and rewarding. We’ve gained speed, flexibility, and a more maintainable CI/CD infrastructure.

Metrics showing image build duration and job counts over time
Builds are faster on Kubernetes even as the number of build jobs increases

However, it has also been a valuable learning experience - autoscaling, graceful shutdowns, and resource management all required more thought and iteration than we initially anticipated. We found that Kubernetes offered new possibilities for optimizing our builds, but these benefits required a deep understanding of both Kubernetes and our workloads.

We hope that by sharing our experience, we can help others who are on a similar path. If you’re considering moving your CI builds to Kubernetes, our advice is to go for it - but be prepared for some unexpected challenges along the way. The benefits are real, but they come with complexities that require careful planning and an ongoing commitment to refinement.

What’s Next?

Stay tuned for the next post in this series, where we’ll explore how we tackled artifact storage and caching in our new Kubernetes-based CI/CD system. We’ll dive into the strategies we used to optimize artifact retrieval and share some insights into how we managed to further improve the efficiency and reliability of our CI/CD workflows.

Introducing Mailroom: An Open-Source Internal Notification Framework

Mailroom logo

We’re excited to introduce our latest open-source project: Mailroom! It’s a flexible and extensible framework designed to simplify the creation, routing, and delivery of developer notifications based on events from platform systems (like Argo CD, GitHub/GitLab, etc.). In this post, we’ll share how Mailroom helps to streamline developer workflows by delivering concise, timely, and actionable notifications.

Crafting Delightful Notifications

On the Developer Experience team we believe strongly in the importance of timely, actionable, and concise notifications. Poorly crafted or overly spammy notifications can easily lead to frustration - nobody wants to have their focus disrupted by useless noise. Instead, we believe that the right notification at the right time can be incredibly powerful and even delightful, but only if done thoughtfully.

Notifications must be immediately useful, providing users with just the right amount of information they need to understand what is happening without overwhelming them. At SeatGeek, we carefully considered how to make notifications effective and meaningful, rather than intrusive or overwhelming.

For example, when we adopted ArgoCD, we could have easily implemented basic Slack notifications through Argo’s notification system, posted to a shared channel - perhaps something like this:

Example of a basic ArgoCD Slack notification
What was deployed? Was it successful? There’s not enough context here.

Sure, we could have gotten more clever with the templated JSON to include more information, but the basic template-based approach with a static recipient list would only take us so far. We wanted more control, like the ability to use complex logic for formatting, or sending different notifications to dynamic recipients custom-tailored to their role in the deployment (merger vs committer). This would enable us to provide a better experience like this:

Example of a custom ArgoCD Slack notification
The perfect amount of information! :chefkiss:

This requires creating custom notifications from scratch with code, being deliberate about what information we include or exclude and how it gets presented. Our goal was to make 99% of notifications immediately useful without requiring further action from the user - if a notification disrupts or confuses rather than informs, it isn’t doing its job.

Why We Built Mailroom

The idea for Mailroom arose from a common pattern we observed. To build these delightful notifications, our Platform teams had a repeated need for specialized Slack bots to handle notifications from different external systems. Each bot basically did the same thing: transforming incoming webhooks into user-targeted notifications, looking up Slack IDs, and sending the messages. But building and maintaining separate bots meant setting up new repositories, implementing telemetry, managing CI/CD pipelines, and more. Creating new bots each time meant repeating a lot of boilerplate work.

Mailroom emerged from the desire to solve this once and for all. Instead of building standalone bots, we created a reusable framework that handles all the internal plumbing - allowing us to focus directly on delivering the value that our users craved.

With Mailroom, creating new notifications is straightforward. Developers simply define a Handler to transform incoming webhooks into Notification objects, and Mailroom takes care of the rest - from dispatching notifications to users based on their preferences, to managing retries, logging, and delivery failures.

How Mailroom’s Architecture Enables Platform Teams

Mailroom provides all the scaffolding and plumbing needed - simply plug in your Handlers and Transports, and you’re ready to go.

Diagram showing the architecture of Mailroom
Mailroom’s architecture

Handlers process incoming events, such as “PR created” or “deploy finished,” transforming them into actionable notifications.

Transports send the notifications to users over their preferred channels - whether that’s Slack, email, or carrier pigeon.

The core concepts are simple, yet powerful, enabling flexibility for whatever your notification needs are. Want to send a GitHub PR notification via email and a failed deployment via Slack? Just write your Handler and let Mailroom’s Notifier do the rest!

By making it easy for developers to craft custom notifications, Mailroom helps our Platform team iterate quickly, ensuring that notifications remain targeted, relevant, and useful. By removing the boilerplate work, developers can focus on delivering real value without worrying about the underlying infrastructure.

Open Sourcing Mailroom

Today, we are announcing the availability of Mailroom as an open-source project to help other teams who face similar challenges with internal notifications. Whether you’re looking to build a quick Slack bot or need a scalable notification system across multiple services, Mailroom has you covered.

Mailroom allows Platform teams to focus on what really matters: delivering valuable information to users at the right time and in the right format - without needing to build out the underlying plumbing. We’ve provided some built-in integrations to help you get started faster, including Slack as a transport and both in-memory and PostgreSQL options for user stores. And we’re looking forward to expanding Mailroom’s capabilities with new features like native CloudEvents support in upcoming versions.

Get Started

Getting started with Mailroom is easy! You can find all the information in our GitHub repository. There’s a Getting Started guide that helps you set up a basic project using Mailroom, as well as more in-depth documentation for core concepts and advanced topics.

We welcome contributions from the community! Feel free to open issues, suggest features, or submit pull requests to help make Mailroom even better.

Check out Mailroom today on GitHub and let us know what you think! We can’t wait to see how you use it.

GitLab CI Runners Rearchitecture From Nomad to Kubernetes

Introduction: Chasing Better CI at Scale

Continuous Integration (CI) pipelines are one of the foundations of a good developer experience. At SeatGeek, we’ve leaned heavily on CI to keep developers productive and shipping code safely.

Earlier this year, we made a big push to modernize our stack by moving all of our workloads at SeatGeek from Nomad orchestration to Kubernetes. When we started migrating our workloads to Kubernetes, we saw the perfect opportunity to reimagine how our CI runners worked. This post dives into our journey of modernizing CI at SeatGeek: the problems we faced, the architecture we landed on, and how we navigated the migration of 600+ repositories without slowing down development.

A Bit of History: From Nomad to Kubernetes

For years, we used Nomad to orchestrate runners at SeatGeek. We used fixed scaling to run ~80 hosts on weekdays and ~10 hosts on weekends. Each host would run one job at a time, so hosts were either running a job or waiting on a job.

Nomad architecture

This architecture got the job done but not without its quirks. Over time, we started to really feel some pain points in our CI setup:

  • Wasted Resources (and Money): Idle runners sat around, waiting for work, not adjusting to demand in real-time.
  • Long Queue Times: During peak working hours, developers waited… and sometimes waited a bit more. Productivity suffered.
  • State Pollution: Jobs stepped on each other’s toes, corrupting shared resources, and generally causing instability.

We decided to address these pain points head-on. What we wanted was simple: a CI architecture that was fast, efficient, and resilient. We needed a platform that could keep up with our engineering needs as we scaled.

New Architecture: Kubernetes-Powered Runners

After evaluating a few options, we landed on using the GitLab Kubernetes Executor to dynamically spin up ephemeral pods for each CI job. Here’s how it works at a high level:

GitLab Kubernetes Executor

Each CI job gets its own pod, with containers for:

  • Build: Runs the job script.
  • Helper: Handles git operations, caching, and artifacts.
  • Svc-x: Any service containers defined in the CI config (e.g., databases, queues, etc.).

Using ephemeral pods eliminated resource waste and state pollution issues in one fell swoop. When a job is done, the pod is gone, taking any misconfigurations or leftover junk with it.

To set this up, we leaned on Terraform and the gitlab-runner Helm chart. These tools made it straightforward to configure our runners, permissions, cache buckets, and everything else needed to manage the system.
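As an illustration (not our exact values), the core of the Kubernetes executor setup in the Helm chart looks something like this; the GitLab URL, namespace, resource defaults, and concurrency are placeholders:

# Hedged sketch of gitlab-runner Helm values for the Kubernetes executor.
concurrent: 100                    # total job slots this runner manager offers
gitlabUrl: https://gitlab.example.com/
runners:
  config: |
    [[runners]]
      executor = "kubernetes"
      [runners.kubernetes]
        namespace = "ci"
        cpu_request = "1"          # default per-job resources; jobs can request more
        memory_request = "2Gi"
        cpu_limit = "4"
        memory_limit = "8Gi"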

Breaking Up with the Docker Daemon

Historically, our CI relied heavily on the host’s Docker daemon via docker compose to spin up services for tests. While convenient for users, jobs sometimes poisoned hosts by failing to clean up after themselves, or they modified shared Docker configs in ways that broke subsequent pipelines running on that host.

Another big problem here was wasted resources and a lack of control of those resources. If a docker compose file spun up multiple containers, they would all share the same resource pool (the host).

Kubernetes gave us the perfect opportunity to cut ties with the Docker daemon. Instead, we fully embraced GitLab Services, which let us define service containers in CI jobs, all managed by Kubernetes, with resources sized to each job’s individual needs. Another upside was that GitLab Services worked seamlessly across both Nomad and Kubernetes, letting us migrate this piece in parallel with the larger Kubernetes runner migration.
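For example, a job that used to docker compose up a database can instead declare it as a GitLab service. The snippet below is illustrative; the images, variables, and script are placeholders:

# Illustrative .gitlab-ci.yml job: the database runs as a service container
# in the same job pod, with its resources managed by Kubernetes.
integration-tests:
  image: python:3.12
  services:
    - name: postgres:16
      alias: db
  variables:
    POSTGRES_PASSWORD: example   # consumed by the postgres image on startup
    DATABASE_URL: postgresql://postgres:example@db:5432/postgres
  script:
    - ./run-tests.sh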

Docker daemon

The Migration Playbook: Moving 600+ Repos Without Chaos

Migrating CI runners is a high-stakes operation. Do it wrong, and you risk breaking pipelines across the company. Here’s the phased approach we took to keep the risks manageable:

  1. Start Small: We began with a few repositories owned by our team, tagging pipelines to use the new Kubernetes runners while continuing to operate the old Nomad runners. This let us iron out issues in a controlled environment.

  2. Expand to Platform-Owned Repos: Next, we migrated all repos owned by the Platform team. This phase surfaced edge cases and gave us confidence in the runner architecture and performance.

  3. Shared Jobs Migration: Then we updated shared jobs (like linting and deployment steps) to use the new runners. This phase alone shifted a significant portion of CI workloads to Kubernetes (saving a ton of money in the process).

  4. Mass Migration with Automation: Finally, using multi-gitter, we generated migration MRs to update CI tags across hundreds of repositories. Successful pipelines were merged after a few basic safety checks, while failing pipelines flagged teams for manual intervention.

Was it easy? No. Migrating 600+ repositories across dozens of teams was a bit like rebuilding a plane while it’s in the air. We automated where we could but still had to dig in and fix edge cases manually.

Some of the issues we encountered were:

  • The usual hardcoded items that needed to be updated, such as ingress URLs and other CI-dependent services
  • We ran into a ton of false positives when grepping for docker compose usage in CI, since it was often nested in underlying scripts
  • Images built on the fly using docker compose instead had to be pushed and tagged in a new step, and also rewritten to work as a CI job entrypoint
    • We also set up shorter lifecycle policies for these CI-specific images to avoid ballooning costs
  • Some pods were OOMing now that we had more granular requests/limits; these needed to be tuned for each job’s needs above our default resourcing
  • We also took this as a chance to refactor how we authenticate to AWS in order to use more granular permissions; the downside was that we had to manually update each job that relies on IAM roles

What About Building Container Images?

One of the thorniest challenges was handling Docker image builds in Kubernetes. Our old approach relied heavily on the Docker daemon, which obviously doesn’t translate well to a Kubernetes-native model. Solving this was a significant project on its own, so much so that we’re dedicating an entire blog post to it.

Check out our next post, where we dive deep into our builder architecture and the lessons we learned there.

Closing Thoughts: A Better CI for a Better Dev Experience

This migration has measurably increased velocity while bringing costs down significantly (we’ll talk more about cost optimization in the third post in this series).

Through this we’ve doubled the number of concurrent jobs we’re running

Concurrent Jobs

and reduced the average job queue time from 16 seconds down to 2 seconds and the p98 queue time from over 3 minutes to less than 4 seconds!

Job Queue Time

Modernizing CI runners wasn’t just about cutting costs or improving queue times (though those were nice bonuses). It was about building a system that scaled with our engineering needs, reduced toil, increased velocity, and made CI pipelines something developers could trust and rely on.

If your team is looking to overhaul its CI system, we hope this series of posts can provide ideas and learnings to guide you on your journey. Have fun!