# Introducing Sixpack: A New A/B Testing Framework

Today we’re publicly launching Sixpack, a language-agnostic A/B testing framework with an easy-to-use API and built-in dashboard.

Sixpack has two main components: the sixpack server, which collects experiment data and makes decisions about which alternatives to show to which users, and sixpack-web, the web-based dashboard. Within sixpack-web you can update experiment descriptions, set variations as winners, archive experiments, and view graphs of an experiment’s success across multiple KPIs.

## Why did we do this?

We try to A/B test as much as possible, and have found that the key to running frequent, high-quality tests is to make it trivial to set up the scaffolding for a new test. After some discussion about how to make the creation of tests as simple as possible, we settled on the idea of porting Andrew Nesbitt’s fantastic Ruby Gem ‘Split’ to PHP, as the templating layer of the SeatGeek application is written in PHP. This worked for a bit, but we soon realized that only being able to start and finish tests in PHP was a big limitation.

SeatGeek is a service-oriented web application with PHP only in the templating/routing layer. We’ve also got a variety of Python and Ruby services, and plenty of complex JavaScript in the browser. In addition, we have a WordPress blog that doesn’t play nicely with Symfony (our PHP MVC) sessions and cookies. A/B testing across these platforms with our PHP port of Split was a hassle that involved manually passing around user tokens and alternative names.

If, for example, we wanted to figure out which variation of content in a modal window on our blog (implemented in JavaScript) led to the highest rate of clicks on tickets in our app (implemented in PHP), we’d need to create a one-off ajax endpoint to register participation and pass along a user token of some sort into the Symfony world. This kind of complexity was stopping us from running frequent, high-quality tests; they just took too long to set up.

Ideally we wanted to be able to start a test with a single line of JavaScript, and then to finish it with a single line of PHP. Since there was no tool that enabled us to do this, we wrote Sixpack.

## How does it work?

Once you install the service, make a request to participate in an experiment like so:

$ curl "http://localhost:5000/participate?experiment=bold-header&alternatives=yes&alternatives=no&client_id=867905675c2e8d54b6497ea5635ea94dca9fb415"

You’ll get a response like this:

{
  status: "ok",
  alternative: {
    name: "no"
  },
  experiment: {
    version: 0,
    name: "bold-header"
  },
  client_id: "867905675c2e8d54b6497ea5635ea94dca9fb415"
}

The alternative is first chosen at random, but subsequent requests choose the alternative based on the client_id query parameter. The client library is responsible for generating and storing unique client ids. All of the official SeatGeek client libraries use some version of UUID. Client ids are unique to each user and can be stored in MySQL, Redis, cookies, sessions, or anything else you prefer.

Converting a user is just as simple. The request looks like:

$ curl "http://localhost:5000/convert?experiment=bold-header&client_id=867905675c2e8d54b6497ea5635ea94dca9fb415&kpi=goal-1"


You don’t need to pass along the alternative that converted, as this is handled by the sixpack server. The relevant response looks like this:

{
  status: "ok",
  alternative: {
    name: "no"
  },
  experiment: {
    version: 0,
    name: "bold-header"
  },
  conversion: {
    value: null,
    kpi: "goal-1"
  },
  client_id: "867905675c2e8d54b6497ea5635ea94dca9fb415"
}
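For readers who would rather see the two requests as code than as curl commands, here is a minimal Python sketch that only builds the URLs. The helper names (`new_client_id`, `participate_url`, `convert_url`) are hypothetical and not part of any official client, the server address is the local one used in the examples, and `uuid4` is just one way of producing a UUID-style client id.

```python
import uuid
from urllib.parse import urlencode

# Assumption: a Sixpack server running locally, as in the curl examples.
SIXPACK_URL = "http://localhost:5000"


def new_client_id():
    """Generate a random client id; the caller must store it per user."""
    return uuid.uuid4().hex


def participate_url(experiment, alternatives, client_id):
    """Build the participate URL, repeating `alternatives` once per value."""
    query = urlencode(
        {
            "experiment": experiment,
            "alternatives": alternatives,
            "client_id": client_id,
        },
        doseq=True,  # expands the list into alternatives=a&alternatives=b
    )
    return f"{SIXPACK_URL}/participate?{query}"


def convert_url(experiment, client_id, kpi=None):
    """Build the convert URL; the kpi parameter is only added when given."""
    params = {"experiment": experiment, "client_id": client_id}
    if kpi is not None:
        params["kpi"] = kpi
    return f"{SIXPACK_URL}/convert?{urlencode(params)}"
```

Note that the alternative is never sent with the conversion: the same client id is all the server needs to look it up.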


As a company we aren’t only interested in absolute conversions; we’re interested in revenue too. Thus, the next Sixpack release will allow you to pass a revenue value with each conversion which sixpack-web will use to determine a revenue-optimized winner of the experiment.

## Clients

We’ve written clients for Sixpack in PHP, Ruby, Python and JavaScript which make it easy to integrate your application with Sixpack. Here’s an example using our Ruby client:

require 'sixpack'
session = Sixpack::Session.new

# Participate in a test (creates the test if necessary)
session.participate("new-test", ["alternative-1", "alternative-2"])

# Convert
session.convert("new-test")


Note that while we must wait for a response from the participate endpoint to get the alternative we need to render the page, we do not have to wait for the conversion action. By backgrounding the call to convert we can save a blocking web request.
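One way to background the conversion, sketched in Python rather than Ruby, is to hand the call to a small thread pool. This is an illustration under assumptions, not how any official client does it: `send_convert` is a hypothetical callable that performs the blocking HTTP request, and the pool size is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# A small shared pool so that conversions never block the request thread.
_pool = ThreadPoolExecutor(max_workers=2)


def convert_in_background(send_convert, experiment, client_id):
    """Schedule send_convert(experiment, client_id) and return immediately.

    `send_convert` is a hypothetical callable that makes the blocking
    request to the convert endpoint. The returned Future can be ignored,
    or inspected later for errors.
    """
    return _pool.submit(send_convert, experiment, client_id)
```

The page can be rendered as soon as `convert_in_background` returns; the request to the server completes on a worker thread.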

## What did we use to build this thing?

Sixpack is built with Python, Redis, and Lua.

### The core

At the heart of Sixpack is a set of libraries that are shared between sixpack-server and sixpack-web. To keep things fast and efficient, Sixpack uses Redis as its only datastore. Redis’s built-in Lua scripting also gives us the ability to do some pretty cool things. For example, we borrowed ‘monotonic_zadd’ from Crashlytics for generating internal sequential user ids from UUIDs provided by the client libraries.
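To make the idea of sequential ids concrete, here is an in-memory Python stand-in. It is not the Lua script and does not talk to Redis; it only illustrates the behaviour described above, where each previously unseen UUID gets the next integer and a known UUID always gets its existing one. The class name and the choice to start counting at 1 are assumptions.

```python
import threading


class SequentialIds:
    """In-memory illustration: map each new client id to the next integer."""

    def __init__(self):
        self._ids = {}
        # The lock stands in for the atomicity a Redis Lua script provides.
        self._lock = threading.Lock()

    def get_or_assign(self, client_id):
        """Return the existing id for client_id, or assign the next one."""
        with self._lock:
            if client_id not in self._ids:
                self._ids[client_id] = len(self._ids) + 1
            return self._ids[client_id]
```

The important property is that the check and the assignment happen as one atomic step, so two concurrent requests for the same new UUID cannot end up with different ids.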

### The Sixpack Server

We wanted to keep the server as lightweight as possible since making additional web requests for each experiment on each page load could quickly become expensive. We had originally thought to write Sixpack as a pure WSGI application, but decided that the benefits of using Werkzeug outweighed the cost of an additional dependency. In addition, Werkzeug plays very nicely with gunicorn, which we had already planned to use with Sixpack in our production environment.

### Sixpack-web

Sixpack-web is slightly heavier, and uses Flask because of its ease of use and templating. The UI is built with Twitter Bootstrap, and the charts are drawn with d3.js.

## How to get it and contribute

You can check out Sixpack here.

We’ve been using Sixpack internally at SeatGeek for over six months with great success. But Sixpack is young, and as such is still under active development. If you notice any bugs, please open a GitHub issue (http://github.com/seatgeek/sixpack), or fork the repo and make a pull request.

# OpenCV Face Detection for Cropping Faces

Part of what makes SeatGeek amazing are the performer images, which help turn it from something I would make as a developer (a database frontend) into a beautiful site. These previously required a lot of manpower and a lot of time to collect and resize, so we’ve recently created a new process using OpenCV face detection to automatically crop our images.

We use these images in our iPhone App, Explore pages as well as our performer pages throughout the site:

## Old and Busted Way

Collection of images would take up to a month any time we wanted to add a new size. Outsourcers would be commissioned to collect 5000 images of a certain size. A simple step but expensive, inflexible, and time consuming.

## New Hotness OpenCV!

OpenCV does the hard work of finding shapes that resemble a human face and returning the coordinates. It includes a Python version, but there are other libraries that wrap it up and make it a bit easier to work with.

Since we don’t have to do the hard work of finding faces why don’t we just get really big images and automate the process of making smaller ones!

The haarcascade_frontalface_alt.xml file contains the results of training the classifier, which you can find as part of the OpenCV library or with a quick search online.

Starting with this picture of Eisley:

we can use PIL to draw rectangles around the faces that OpenCV found:
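A minimal sketch of that detection step, assuming the current `cv2` Python bindings (the `opencv-python` package) and a local copy of the cascade file. The function names `detect_faces` and `to_boxes` are made up for this example, and the `scaleFactor` and `minNeighbors` values are common starting points rather than tuned settings.

```python
def detect_faces(image_path, cascade_path="haarcascade_frontalface_alt.xml"):
    """Return a list of (x, y, w, h) face rectangles found by OpenCV."""
    import cv2  # imported lazily: requires the opencv-python package

    classifier = cv2.CascadeClassifier(cascade_path)
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"could not read image: {image_path}")

    # Haar cascades operate on a single-channel (grayscale) image.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    return [tuple(int(v) for v in face) for face in faces]


def to_boxes(faces):
    """Convert (x, y, w, h) rectangles to (left, top, right, bottom) boxes,
    the form that PIL's ImageDraw.rectangle accepts."""
    return [(x, y, x + w, y + h) for (x, y, w, h) in faces]
```

Each box from `to_boxes` can then be passed to `ImageDraw.Draw(image).rectangle(box, outline="red")` to draw the rectangles with PIL.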

### How to resize once we can find faces

Once we have a face we need to resize the image. In our case the images we collected are landscape format, and we use landscape images for our larger sizes. Staying in the same format makes resizing a bit easier. We mostly make thinner images (a higher aspect ratio), so we can just resize to the correct width and then crop to a rectangle with the height that we want.

The face_buffer is the amount of space we want to leave above the top-most face. We also find the height from the top of the top-most face to the bottom of the bottom-most face to make sure we aren’t cropping anyone out of the photo.

Generally we want to include as much of the image as possible without cropping out anyone’s face, so this works reasonably well for images where the final size has a higher aspect ratio than the one you started with. Now that you have the faces, though, you can use any sort of cropping that you need.
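The width-first resize and height crop described above can be sketched as a pure calculation. This is an illustration under assumptions, not the code we run: the function name `plan_crop`, the default buffer of 20 pixels, and the centered fallback when no faces are found are all choices made for the example.

```python
def plan_crop(image_size, faces, target_size, face_buffer=20):
    """Return (resized_size, crop_box) for a width-first resize and height crop.

    image_size:  (width, height) of the source image
    faces:       list of (x, y, w, h) rectangles in source pixels
    target_size: (width, height) of the output, assumed to have a higher
                 aspect ratio than the source
    face_buffer: pixels to keep above the top-most face, measured in the
                 resized image (hypothetical default)
    """
    src_w, src_h = image_size
    dst_w, dst_h = target_size
    scale = dst_w / src_w
    resized_h = round(src_h * scale)

    if faces:
        # Top edge of the top-most face, in resized coordinates.
        top_face = min(y for (_, y, _, _) in faces) * scale
        top = int(top_face - face_buffer)
    else:
        # No faces found: fall back to a centered crop.
        top = (resized_h - dst_h) // 2

    # Keep the crop window inside the resized image.
    top = max(0, min(top, max(0, resized_h - dst_h)))
    return (dst_w, resized_h), (0, top, dst_w, top + dst_h)
```

The sketch only anchors on the top-most face. It does not check that the bottom-most face still fits inside the crop, which a fuller version would need to do.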

## Installing into a Virtualenv

If you are installing this on your Ubuntu machine then the system packages should include everything you need. If you use virtualenvs you are going to run into some issues. With virtualenvs I’ve found the steps for installing SimpleCV to be incredibly helpful as a starting point.

## Learnings

This was originally set up for some small images used in a few places around the site and would resize on the fly if the CDN didn’t have the image cached. Live resizing works reasonably well for smaller images: sometimes it would take a couple of seconds to load an image, which is not ideal but not horrible. As the images grew in size, the face detection and resizing would take up to 20 seconds, which was definitely a deal-breaker.

Resizing the width first and only cropping the height is an easy first step if the final aspect ratio will be greater. That will likely become an issue when other people find out they can get all of the images in whatever size they want. If you have to make images small enough that you can’t fit all of the faces into the image then you will really need to make something more intelligent.

We actually use ImageMagick instead of PIL since the service this is part of was already using it. ImageMagick is rather quirky and can sometimes ignore your commands without any mention of why.

3rd party services exist that can do this for you as well. With a little development work to integrate the service it is still cheaper than hiring someone to resize all of the images and still significantly faster. If you don’t want to pay for external hosting you can easily store them on your servers or S3.

A full example can be seen as a gist. If you want to use these images and more to code some great interfaces, we’re hiring frontend developers and more!

# Introducing the SeatGeek Event Recommendations API

We’ve made it our mission to become America’s gateway to live entertainment. We even put it on our wall, right by the front entrance of our office.

So one area we’ve focused on is live event recommendations. After all, you can’t go see your favorite band if you don’t know it’s in town.

For the past year, we’ve been improving the recommendation service that powers our recommendations calendar on seatgeek.com…

…as well as our concert recommendation app on Spotify:

After much work, we’ve finally advanced it to the point where we’re comfortable integrating it with our public events API and releasing it to the world.

We believe our recommendations are far more advanced than anything you can find on the Web right now, and we’re excited to see developers start to use it.

## How Most Music Recommendation APIs Work

Recommendation engines operate on a pretty simple principle. You take a whole bunch of users and find out what they like. Then you build a whole bunch of correlations of the form, “People who like X like Y.” From that, you assume that X is similar to Y. When your next user comes along who likes X, you say to him or her, “We think you might like Y as well.”

There are quite a few publicly available APIs that support “similar artist” type queries. Last.fm has a good one. It’s a simple model to implement. The results are easy to cache. The problem is you get some pretty mediocre results when you try to do anything interesting.
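The “people who like X like Y” model can be sketched in a few lines of Python. This is a toy version under assumptions: real systems normalize for popularity and use far more data, and the function name and input shape (one list of liked artists per user) are made up for the example.

```python
from collections import Counter


def similar_artists(user_likes, artist, top_n=5):
    """Rank artists by how many users like them together with `artist`.

    user_likes: iterable of per-user lists of liked artists.
    """
    co_counts = Counter()
    for likes in user_likes:
        liked = set(likes)
        if artist in liked:
            # Every other artist this user likes co-occurs with `artist`.
            for other in liked - {artist}:
                co_counts[other] += 1
    return [name for name, _ in co_counts.most_common(top_n)]
```

Because the result depends only on a single artist, it can be computed once and cached, which is a large part of why this model is so easy to deploy.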

## A Motivating Example

Let’s say we have a user, Bob. Bob lists his two favorite musicians as Taylor Swift and Kenny Chesney. If you were to hit the SeatGeek API and ask for similar artists, you might get something that looks like this:

Artists Similar to Taylor Swift
1. Carrie Underwood
2. Justin Bieber
3. One Direction
4. Katy Perry
5. Ed Sheeran

Artists Similar to Kenny Chesney
1. Tim McGraw
3. Zac Brown Band
4. Jason Aldean
5. Keith Urban


Taylor Swift is a pop star with country influences and teen appeal. Unsurprisingly, she is most similar (on a 1:1 basis) with Carrie Underwood, another pop star with country influences and teen appeal. But she is also similar to some teen pop sensations (Justin Bieber) and some ordinary pop stars (Katy Perry).

Kenny Chesney, on the other hand, is pretty much just a country singer.

You can probably guess where I’m going with this. If Bob likes Taylor Swift and Kenny Chesney because he’s a country music fan and we start encouraging him to go see One Direction shows, he’s gonna be none too pleased.

And yet, unless you want to go through the trouble of building out your own recommendation system from scratch, that’s about the best you can do in terms of public APIs on the Net.

## How the SeatGeek Recommendation API Works

The proper way to recommend music for Bob is to find other users like Bob and figure out what they like. In other words, to say, “People who like X and Y like __.” If Bob were to give us a third preference, the question becomes, “People who like X, Y and Z like __.” If he gives us a fourth preference, we use that as well.

Because the space of possible combinations grows exponentially, we can’t just compute all of these similarities and cache them. Instead, we use some clever math and compute affinity scores in real time. That allows us to support extremely flexible recommendation queries internally that we can use to build interesting experiences for our users.

Let’s go back to Taylor and Kenny. What happens if we try combining their preferences?

Artists Similar to Taylor Swift + Kenny Chesney (Jointly)
1. Tim McGraw
2. Jason Aldean
3. Carrie Underwood
5. Zac Brown Band

...

16. Katy Perry
23. Justin Bieber
31. One Direction


As you can see, the country music rises to the surface, and the teen-pop sensations fall out of the way.
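A toy Python sketch shows why combining preferences jointly behaves so differently from looking each one up separately. The math here is not our engine’s: the similarity scores are made up, and `max` and `min` are simple stand-ins for “similar to either seed” and “similar to both seeds”.

```python
def rank(similarity, seeds, combine):
    """Rank candidates by combining their per-seed similarity scores."""
    candidates = set().union(*(similarity[s] for s in seeds)) - set(seeds)
    scores = {
        c: combine([similarity[s].get(c, 0.0) for s in seeds])
        for c in candidates
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Made-up similarity scores, for illustration only.
SIMILARITY = {
    "Taylor Swift": {
        "Carrie Underwood": 0.9,
        "Justin Bieber": 0.85,
        "Tim McGraw": 0.6,
        "Jason Aldean": 0.5,
    },
    "Kenny Chesney": {
        "Tim McGraw": 0.9,
        "Jason Aldean": 0.8,
        "Carrie Underwood": 0.45,
        "Justin Bieber": 0.1,
    },
}

seeds = ["Taylor Swift", "Kenny Chesney"]
union_ranking = rank(SIMILARITY, seeds, max)  # similar to either seed
joint_ranking = rank(SIMILARITY, seeds, min)  # similar to both seeds
```

With these numbers the union-style ranking keeps Justin Bieber in third place, while the joint ranking pushes him to the bottom, because he scores well against only one of the two seeds.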

Now let’s see what happens if we find a second user, Alice, who identifies her favorite bands as Taylor Swift and Katy Perry. Well, we might suspect she’s a fan of female pop stars, and our recommendations bear that out:

Artists Similar to Taylor Swift + Katy Perry
1. Ke$ha
2. Justin Bieber
3. Pink
4. Carrie Underwood
5. Kelly Clarkson

...

44. Zac Brown Band
49. Kenny Chesney


As we go deeper into the rabbit hole with more preferences, the recommendations become more and more advanced.

## Pictures!

What follows is an example, simplified preference space.

Green bands are ‘similar’ to X. Red bands are ‘similar’ to Y. Blue bands are ‘similar’ to Z.

A user likes X and Z. What should we recommend?

Most recommenders combine preference through what is essentially a union operation. If a user likes X and Z, he will be shown events which are similar to X and events which are similar to Z.

SeatGeek’s recommendation engine (code-named Santamaria) computes the joint recommendation set of X and Z. In effect, it extracts the similar characteristics of X and Z and recommends other performers that share those specific traits. This leads to a much more accurate set of recommendations for the user.

As the number of seeds grows, the composition of preferences becomes more and more specific, and we can accurately recommend shows to people with fairly idiosyncratic tastes.

## Using Our API

We’re very excited to be finally opening up our recommendations API to the public.

The full documentation for our API can be found here: http://platform.seatgeek.com/

You’ll need a SeatGeek account and an API key to get started:

Step 1: Request an API key here: http://seatgeek.com/account/develop

Step 2: Find some SeatGeek performers

http://api.seatgeek.com/2/performers?q=taylor%20swift
http://api.seatgeek.com/2/performers?q=kenny%20chesney

Step 3: Make a request using SG performer IDs

http://api.seatgeek.com/2/recommendations?performers.id=35&performers.id=87&postal_code=10014&per_page=10&client_id=API_KEY

The API takes a geolocation parameter, an arbitrary list of performers, and a wide array of filtering parameters. Check it out and let us know what you think. You can email us at hi@seatgeek.com or post a message in our support forum.

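The Step 3 request repeats `performers.id` once per performer, which is easy to get wrong when building the URL by hand. Here is a small Python sketch that builds it. The helper name is hypothetical, only the parameters shown in the example URL are used, and `API_KEY` stands for your own key.

```python
from urllib.parse import urlencode

API_BASE = "http://api.seatgeek.com/2"


def recommendations_url(performer_ids, postal_code, api_key, per_page=10):
    """Build a recommendations URL with one performers.id per performer."""
    params = [("performers.id", pid) for pid in performer_ids]
    params += [
        ("postal_code", postal_code),
        ("per_page", per_page),
        ("client_id", api_key),
    ]
    return f"{API_BASE}/recommendations?{urlencode(params)}"
```

Calling `recommendations_url([35, 87], "10014", "API_KEY")` reproduces the example URL from Step 3.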
# Yak Shaving: Adding OAuth Support to Nginx via Lua

**TL;DR:** We built an OAuth2 authentication and authorization layer via nginx middleware using Lua. If you intend on performing this, read the docs, automate what you can, and carry rations.

As SeatGeek has grown over the years, we’ve amassed quite a few different administrative interfaces for various tasks. We regularly build new modules to export data for news outlets, our own blog posts, infographics, etc. We also regularly build internal dev tools to handle things such as deployment, operations visualization, event curation etc. In the course of doing that, we’ve also used and created a few different interfaces for authentication:

• Github/Google Oauth
• Our internal SeatGeek User System
• Basic Auth
• Hardcoded logins

Obviously, this is subpar. The myriad of authentication systems makes it difficult to abstract features such as access levels and general permissioning for various datastores.

## One System to auth them all

We did a bit of research about what sort of setup would solve our problems. This turned up Odin, which works well for authenticating users against Google Apps. Unfortunately, it would require us to use Apache, and we are pretty married to Nginx as a frontend for our backend applications. As luck would have it, I came across a post by mixlr referencing their usage of Lua at the Nginx level for:

• Modifying response headers
• Rewriting requests internally
• Selectively denying access to hosts based on IP

The last one in that set seemed interesting. Thus began the journey in package management hell.

## Building Nginx with Lua Support

Lua support for Nginx is not distributed with the core Nginx source, and as such any testing would require us to build packages for both OS X (for testing purposes) and Linux (for deployment).

### Custom Nginx for OS X

For Mac OS X, we promote the usage of Homebrew for package management. Nginx does not come with many modules enabled in the base formula for one very good reason:

“The problem is that NGINX has so many options that adding them all to a formula would be batshit insane and adding some of them to a formula opens the door to adding all of them and associated insanity.” - Charlie Sharpsteen, @sharpie

So we needed to build our own. Preferably in a manner that would allow further customization in case we need more features in the future. Fortunately, modifying homebrew packages is quite straightforward.

We want to have a workspace for working on the recipe:

cd ~
mkdir -p src
cd src

Next we need the formula itself. You can do one of the following to retrieve it:

• Go spelunking in your HOMEBREW_PREFIX directory - usually /usr/local - for the nginx.rb
• Have the github url memorized as an api and wget https://raw.github.com/mxcl/homebrew/master/Library/Formula/nginx.rb
• Simply output your formula using brew cat nginx > nginx.rb

If we brew install ./nginx.rb, that will install the recipe contained within that file. Since this is a completely custom nginx installation, we’ll want to rename the formula so that future brew upgrade calls do not nix our customizations.

mv nginx.rb nginx-custom.rb
cat nginx-custom.rb | sed 's/class Nginx/class NginxCustom/' >> tmp
rm nginx-custom.rb
mv tmp nginx-custom.rb

We’re now ready to add new modules to our compilation step. Thankfully this is easy: we just need to collect all the custom modules from the arguments passed to the brew install command.
The following bit of ruby takes care of this:

# Collects arguments from ARGV
def collect_modules regex=nil
  ARGV.select { |arg| arg.match(regex) != nil }.collect { |arg| arg.gsub(regex, '') }
end

# Get nginx modules that are not compiled in by default specified in ARGV
def nginx_modules; collect_modules(/^--include-module-/); end

# Get nginx modules that are available on github specified in ARGV
def add_from_github; collect_modules(/^--add-github-module=/); end

# Get nginx modules from mdounin's hg repository specified in ARGV
def add_from_mdounin; collect_modules(/^--add-mdounin-module=/); end

# Retrieve a repository from github
def fetch_from_github name
  name, repository = name.split('/')
  raise "You must specify a repository name for github modules" if repository.nil?

  puts "- adding #{repository} from github..."
  # Backticks shell out to git
  `git clone -q git://github.com/#{name}/#{repository} modules/#{name}/#{repository}`
  path = Dir.pwd + '/modules/' + name + '/' + repository
end

# Retrieve a tar of a package from mdounin
def fetch_from_mdounin name
  name, hash = name.split('#')
  raise "You must specify a commit sha for mdounin modules" if hash.nil?

  puts "- adding #{name} from mdounin..."
  # Backticks shell out to download and unpack the archive
  `mkdir -p modules/mdounin && cd $_ ; curl -s -O http://mdounin.ru/hg/#{name}/archive/#{hash}.tar.gz; tar -zxf #{hash}.tar.gz`
  path = Dir.pwd + '/modules/mdounin/' + name + '-' + hash
end


The above helper methods allow us to specify new modules to include on the command line and retrieve the modules from their respective locations. At this point, we’ll need to modify the nginx-custom.rb recipe to include the flags and retrieve the packages, around line 58:

nginx_modules.each { |name| args << "--with-#{name}"; puts "- adding #{name} module" }


At this point, we can compile a custom version of nginx with our own modules.

brew install ./nginx-custom.rb \
--include-module-http_gzip_static_module \


We’ve provided this formula as a tap for your convenience at seatgeek/homebrew-formulae.

### Custom Nginx for Debian

We typically deploy to some flavor of Debian–usually Ubuntu–for our production servers. As such, it would be nice to simply run dpkg -i nginx-custom to have our customized package installed. The steps to doing so are relatively simple once you’ve gone through them.

Some notes for those researching custom debian/ubuntu packaging:

• It is possible to get the debian package source using apt-get source PACKAGE_NAME
• Debian package building is generally governed by a rules file, which you’ll need some sed-fu to manipulate
• You can update deb dependencies by modifying the control file. Note that there are some meta-dependencies specified herein that you’ll not want to remove, but these are easy to identify.
• New releases must always have a section in the changelog, otherwise the package may not be upgraded to because it may have already been installed. You should use tags in the form +tag_name to identify changes from the baseline package with your own additions. I also personally append a number - starting from 0 - signifying the release number of the package.
• Most of these changes can be automated in some fashion, but it appears as though there are no simple command line tools for creating custom releases of packages. That’s definitely something we’re interested in, so feel free to link to tooling to do so if you know of anything.

While running this process is great, I have built a small bash script that should automate the majority of the process. It is available as a gist on github.

It only took 90 nginx package builds before I realized the process was scriptable.

## OAuth ALL the things

Now that it is possible to test and deploy a Lua script embedded within Nginx, we can move on to actually writing some Lua.

The nginx-lua module provides quite a few helper functions and variables for accessing most of Nginx’s abilities, so it is quite possible to force OAuth authentication via the access_by_lua directive provided by the module.

When using the *_by_lua_file directives, nginx must be reloaded for code changes to take effect.

I built a simple OAuth2 provider for SeatGeek in NodeJS. This part is simple, and you can likely find something off the shelf in your language of choice.

Next, our OAuth API uses JSON for handling token, access level, and re-authentication responses, so we needed to install the lua-cjson module.

# install lua-cjson
if [ ! -d lua-cjson-2.1.0 ]; then
tar zxf lua-cjson-2.1.0.tar.gz
fi
cd lua-cjson-2.1.0
sed 's/i686/x86_64/' /usr/share/lua/5.1/luarocks/config.lua > /usr/share/lua/5.1/luarocks/config.lua-tmp
rm /usr/share/lua/5.1/luarocks/config.lua
mv /usr/share/lua/5.1/luarocks/config.lua-tmp /usr/share/lua/5.1/luarocks/config.lua
luarocks make


My OAuth provider uses the query-string for sending error messages on authentication, so I needed to support that in my Lua script:

local args = ngx.req.get_uri_args()
if args.error and args.error == "access_denied" then
ngx.status = ngx.HTTP_UNAUTHORIZED
ngx.say("{\"status\": 401, \"message\": \""..args.error_description.."\"}")
return ngx.exit(ngx.HTTP_OK)
end


local access_token = ngx.var.cookie_SGAccessToken
if access_token then
end


At this point, we’ve handled error responses from the api, and stored the access_token away for later retrieval. We now need to ensure the oauth process actually kicks off. In this block, we’ll want to:

• Start the oauth process if there is no access_token stored and we are not in the middle of it
• Retrieve the user access_token from the oauth api if the oauth access code is present in the query string arguments
• Deny users with invalid access codes

Reading the docs on available nginx-lua functions and variables can clear up some issues, and perhaps show you various ways in which you can access certain request/response information.

At this point we need to call our api to retrieve an access token. Nginx-lua provides the ngx.location.capture method, which can be used to retrieve the response from any internal endpoint within nginx. This means we cannot call something like http://seatgeek.com/ncaa-football-tickets directly, but would need to use proxy_pass in order to wrap the external url in an internal endpoint.

My convention for these endpoints is to prefix them with an _ (underscore), and they are normally blocked against direct access.

-- first lets check for a code where we retrieve
-- credentials from the api
if not access_token or args.code then
if args.code then
-- internal-oauth:1337/access_token
local res = ngx.location.capture("/_access_token?client_id="..app_id.."&client_secret="..app_secret.."&code="..args.code)

-- kill all invalid responses immediately
if res.status ~= 200 then
ngx.status = res.status
ngx.say(res.body)
ngx.exit(ngx.HTTP_OK)
end

-- decode the token
local text = res.body
local json = cjson.decode(text)
access_token = json.access_token
end

-- both the cookie and proxy_pass token retrieval failed
if not access_token then
-- Track the endpoint they wanted access to so we can transparently redirect them back

return ngx.redirect("internal-oauth:1337/oauth?client_id="..app_id.."&scope=all")
end
end


At this point in the Lua script, you should have a - hopefully! - valid access_token. We can use this against whatever endpoint you have set up to provide user information. In my endpoint, I respond with a 401 status code if the user has zero access, 403 if their token is expired, and access_level information via a simple integer in the json response.

-- ensure we have a user with the proper access app-level
-- internal-oauth:1337/accessible
local res = ngx.location.capture("/_user", {args = { access_token = access_token } } )
if res.status ~= 200 then

-- Redirect 403 forbidden back to the oauth endpoint, as their stored token was somehow bad
if res.status == 403 then
return ngx.redirect("https://seatgeek.com/oauth?client_id="..app_id.."&scope=all")
end

-- Disallow access
ngx.status = res.status
ngx.say("{\"status\": 503, \"message\": \"Error accessing api/me for credentials\"}")
return ngx.exit(ngx.HTTP_OK)
end


Now that we’ve verified that the user is indeed authenticated and has some level of access, we can check their access level against whatever we define is the access level for the current endpoint. I personally delete the SGAccessToken at this step so that the user has the ability to log into a different user, but that is up to you.

local json = cjson.decode(res.body)
-- Ensure we have the minimum for access_level to this resource
if json.access_level < 255 then
-- Expire their stored token

-- Disallow access
ngx.status = ngx.HTTP_UNAUTHORIZED
return ngx.exit(ngx.HTTP_OK)
end

-- Store the access_token within a cookie

-- Support redirection back to your request if necessary
if redirect_back then
return ngx.redirect(redirect_back)
end
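The checks in the two Lua snippets above reduce to a small decision table, which is easier to reason about (and test) as a pure function. The Python sketch below is a model of that logic only: the function name and the action strings are made up, while the status codes and the 255 threshold are the ones used in this post.

```python
def access_decision(status, access_level=0, required_level=255):
    """Map the user endpoint's response to an action.

    status:         HTTP status returned by the internal user endpoint
    access_level:   integer from the JSON body, only meaningful on a 200
    required_level: minimum level for the protected resource
    """
    if status == 403:
        # Stored token was expired or otherwise bad: restart the OAuth flow.
        return "reauthenticate"
    if status != 200:
        # Includes 401, where the user has no access at all.
        return "deny"
    if access_level < required_level:
        # Authenticated, but without enough access for this resource.
        return "deny"
    return "allow"
```

Keeping the decision separate from the nginx calls makes it possible to check each branch without running a server.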


Now we just need to tell our current app who is logged in via some headers. You can reuse REMOTE_USER if you have some requirement that this replace basic auth, but otherwise anything is fair game.

-- Set some headers for use within the protected endpoint


I can now access these http headers like any others within my applications, replacing hundreds of lines of code and hours of work reimplementing authentication yet again.
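On the application side, reading those headers takes only a few lines. Here is a minimal WSGI sketch in Python. The header names `X-User-Email` and `X-User-Access-Level` are hypothetical, so substitute whatever your Lua script sets.

```python
def app(environ, start_response):
    """Tiny WSGI app that greets the user identified by the proxy headers."""
    # WSGI exposes the request header X-User-Email as HTTP_X_USER_EMAIL.
    # Both header names are hypothetical; use whatever your Lua script sets.
    email = environ.get("HTTP_X_USER_EMAIL", "anonymous")
    level = int(environ.get("HTTP_X_USER_ACCESS_LEVEL", "0"))

    body = f"hello {email} (access level {level})\n".encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

Headers like these are only trustworthy if the app is reachable solely through the nginx proxy. Otherwise a client could send them directly.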

## Nginx and Lua, sitting in a tree

At this point, we should have a working lua script that we can use to block/deny access. We can place this into a file on disk and then use access_by_lua_file to use it within our nginx site. At SeatGeek, we use Chef to template out config files, though you can use Puppet, Fabric, or whatever else you’d like to do so.

Below is the simplest nginx site you can use to get this entire thing running. You’ll also want to check out the access.lua - available here - which is the compilation of the above lua script.

# The app we are proxying to
upstream production-app {
server localhost:8080;
}

# The internal oauth provider
upstream internal-oauth {
server localhost:1337;
}

server {
listen       80;
server_name  private.example.com;
root         /apps;
charset      utf-8;

# This will run for everything but subrequests
access_by_lua_file "/etc/nginx/access.lua";

# Used in a subrequest
location /_access_token { proxy_pass http://internal-oauth/oauth/access_token; }
location /_user { proxy_pass http://internal-oauth/user; }

location / {
proxy_set_header  X-Real-IP  $remote_addr;
proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header  Host $http_host;
proxy_redirect    off;
proxy_max_temp_file_size 0;

if (!-f $request_filename) {
proxy_pass http://production-app;
break;
}
}

}


## Further Considerations

While this setup has worked really well for us, I’d like to point out some shortcomings:

• The above code is a simplification of our access_by_lua script. We also handle POST request saving, inject JS into pages to renew the session automatically, handle token renewal etc. You may not need these features, and in fact, I didn’t think I’d need them until we started testing this system on our internal systems.
• We had some endpoints which were available via basic auth for certain background tasks. These had to be reworked so that the data was retrieved from an external store, such as S3. Be aware that this may not always be possible, so oauth may not be the answer in your case.
• Oauth2 was simply the standard I chose. In theory, you could use Facebook Auth to achieve similar results. You may also combine this approach with rate-limiting, or storing various access levels in a datastore such as redis for easy manipulation and retrieval within your Lua script. If you were really bored, you could reimplement Basic Auth within Lua, it’s just up to you.
• There are no test harnesses for systems such as these. Test-junkies will cringe when they realize it’s going to be integration testing for a while. You can likely rerun the above by injecting variable mocks into the global scope and then executing scripts, but it’s not the ideal setup.
• You still need to modify apps to recognize your new access headers. Internal tools will be easiest, but you may need to make certain concessions for vendor software.

The above blog post combined nginx-lua and our internal oauth provider to enable using OAuth for access control to our infrastructure.

## Also

SeatGeek is hiring UI Developers and Web Engineers. Your first tasks will be to make my OAuth application pretty and write some tests for a bit of Lua code… (kidding)

If you have any questions about the project just let us know in the comments!

# Putting Venue Maps in a Terminal: Introducing SGCLI

At SeatGeek we have regularly scheduled Hackathons–opportunities for all of us to drop what we’re doing for two days and work on whatever interesting, creative, or experimental projects we dream up. We had a Hackathon last week, and I decided to write a command-line client for SeatGeek, which I called SGCLI.

Back in 2005 I spent a couple of months using a box running FreeBSD without X as my main computer. I’m not sure exactly what the thought process was that led to that setup, but using it taught me that in some instances command-line applications can be much faster to work with than their graphical equivalents. With that in mind, I set out to try and replicate (and if possible, improve) the experience of searching for and buying tickets on SeatGeek.

The first step to building the project was to learn some curses. Luckily, there’s a great tutorial on curses programming with Python on python.org. It didn’t take long to have a basic application running (complete with an ASCII-art version of the SeatGeek logo):

The rest of the first day was spent building out the main parts of the SeatGeek experience: the search page, selecting an event, browsing ticket listings and viewing an individual listing. There were some intense moments, but by the time I quit for the night (probably around 11:30 or so - and I was early!) it seemed like at least the basics would be working in time for Friday’s demo.

On Friday I focused mostly on the features that I thought would give some added spark to the demo, namely rendering SeatGeek’s beautiful maps into slightly-less-beautiful ASCII art, and meme integration. Rendering the maps proved to be a bit tricky, but in the end it worked out pretty well:

Map rendering works using the map infrastructure we built for the SeatGeek mobile website. That allows SGCLI to get a single .png for an event that represents the map. Then, SGCLI uses PIL to scale the map down such that the size in pixels is the same as the size in characters of the final map we want to render (that final size depends on the size of the terminal window). An important note here is that the aspect ratio has to change during this operation: almost all fonts are much taller than they are wide and we need to correct for that. So, if we decide that our final map should be 20 characters high and 36 characters wide, we use PIL to scale the 320x320 .png down to 36x20.
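As a rough illustration of the resize step, here is a nearest-neighbor downscale written in plain Python rather than PIL, so it runs without dependencies. The helper name `downscale` and the list-of-rows image representation are assumptions made for this sketch, not SGCLI's actual code.

```python
def downscale(pixels, out_w, out_h):
    """Nearest-neighbor resize of a list-of-rows image to out_w x out_h.

    The output aspect ratio is allowed to differ from the input: terminal
    characters are taller than they are wide, so a square image is mapped
    onto a grid with more columns than rows.
    """
    in_h = len(pixels)
    in_w = len(pixels[0])
    return [
        [pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# A 320x320 stand-in "map" whose pixel value records its own coordinates.
image = [[(x, y) for x in range(320)] for y in range(320)]
small = downscale(image, 36, 20)
print(len(small), len(small[0]))  # 20 36
```

Each output cell simply samples the source pixel at the matching relative position; a real resize filter would average neighboring pixels instead.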

After scaling down we use PIL to create two copies of the same image. One of the copies gets converted to grayscale, and the other copy gets “posterized” down to a single bit per color channel (that matches the color options I have available in my terminal). The final step is to actually output the ASCII art. For each pixel we use the grayscale image to select a character to display based on luminosity (darker pixels get characters like #, while brighter pixels become characters like . or -). We use the posterized image to figure out what color to draw the character in (out of the 7 options available to us). After rendering the map using curses, we draw a marker over the top to indicate where the selected tickets are located.

This method certainly could be smarter (for example, it doesn’t take into account the ‘shape’ of a character when deciding what character to use), but it worked pretty well and was easy to implement. Check it out in the source code for SGCLI to see it in all of its glory.
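The character and color selection can be sketched in a few lines. This is a dependency-free approximation, not the SGCLI source: the character ramp, the luminosity weights, and the function names are assumptions, and the posterize step is modeled as keeping only the top bit of each color channel.

```python
RAMP = "#%*-."  # dark -> bright; the exact characters are an assumption

COLOR_NAMES = {
    (0, 0, 0): "black", (1, 0, 0): "red", (0, 1, 0): "green",
    (1, 1, 0): "yellow", (0, 0, 1): "blue", (1, 0, 1): "magenta",
    (0, 1, 1): "cyan", (1, 1, 1): "white",
}


def luminosity(rgb):
    """Approximate perceived brightness (0-255) using ITU-R 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_cell(rgb):
    """Return (character, color name) for one pixel."""
    index = int(luminosity(rgb) / 256 * len(RAMP))
    # One bit per channel: a channel is either "on" (>= 128) or "off".
    bits = tuple(1 if channel >= 128 else 0 for channel in rgb)
    return RAMP[index], COLOR_NAMES[bits]


def render(pixels):
    """Convert a list-of-rows RGB image into rows of (char, color) cells."""
    return [[to_cell(rgb) for rgb in row] for row in pixels]


tiny_map = [
    [(0, 0, 0), (200, 30, 30)],
    [(30, 200, 30), (255, 255, 255)],
]
cells = render(tiny_map)
print("\n".join("".join(ch for ch, _ in row) for row in cells))
```

In a curses application each cell's color name would be translated into a color pair before drawing; that part is omitted here.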

Check out the final project–it’s open source and easy to install and run. Just do:

pip install sgcli # or easy_install sgcli
sgcli


There are plenty of neat things that I didn’t mention here. Be sure to check out the support for autocompletion while searching, and make sure to hit “b” on a ticket listings page (even if you don’t follow through and buy the tickets) - you’ll get a nice surprise when you return to the app. There are some notes on keyboard shortcuts in the README.

If you have any questions about the project just let us know in the comments!

# Introducing Absolute Deal Score

Among all of SeatGeek’s features, we have always been proudest of Deal Score, our metric that enables users to seamlessly pick out the best deal for an event from within thousands of ticket listings.  But today that feature is getting a whole lot cooler. We’re launching Absolute Deal Score, an upgrade that has been under development for many months here at SeatGeek.

For the uninitiated, the premise behind Deal Score is simple: Deal Score is a rating of whether a ticket is a bargain or a rip-off, which facilitates apples-to-apples comparisons among ticket listings.  A metric like Deal Score is particularly useful for live event tickets because every seat in a venue is different.  If you’re shopping online for batteries, you can sort your options by price and expect the cheapest options will include the good bargains.  But if you’re shopping for Yankees tickets and you sort by price, the cheapest options tend to be a bunch of nosebleed seats.  Deal Score offers a better way to identify good buys.

As initially designed, Deal Score compared the relative value of a ticket listing to that of all others listed in the venue for that one given event. The worst deal for every event was given a Deal Score of 0, the best deal was given a Score of 100, and everything else was filled in between those two numbers.  But as we thought about how best to surface values in the ticket marketplace, we came to the realization that anchoring value against just a single game wasn’t going far enough.

Today, we’re excited to announce a major overhaul of our Deal Score algorithm–one that identifies not only the best ticket deals within a given event, but also how those deals stack up against all tickets for similar events (for example, how a listing for a single Yankees ticket stacks up against those for all other Yankees games this season). Listings are no longer anchored at 0 and 100 for every event; rather, 0 is anchored to the absolute worst deal for all events on SeatGeek, and 100 is anchored to the best deal among all events.
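To make the difference between the two anchoring schemes concrete, here is a toy sketch. It assumes each listing already has a single hypothetical "value" number (higher is better) and that scores are spread linearly between the anchors; the real Deal Score calculation is more involved, and none of these function names come from SeatGeek's code.

```python
def scale_to_100(value, worst, best):
    """Linearly map value from [worst, best] onto [0, 100]."""
    if best == worst:
        return 100.0
    return 100.0 * (value - worst) / (best - worst)


def relative_scores(values):
    """Old scheme: anchor 0 and 100 to the worst and best deal in one event."""
    worst, best = min(values), max(values)
    return [scale_to_100(v, worst, best) for v in values]


def absolute_scores(values, global_worst, global_best):
    """New scheme: anchor 0 and 100 to the worst and best deal across events."""
    return [scale_to_100(v, global_worst, global_best) for v in values]


# Hypothetical value numbers for the listings of two events.
event_a = [50, 100, 150]
event_b = [80, 90, 100]
everything = event_a + event_b

print(relative_scores(event_b))                                    # [0.0, 50.0, 100.0]
print(absolute_scores(event_b, min(everything), max(everything)))  # [30.0, 40.0, 50.0]
```

Under the relative scheme the best listing in event B always gets 100, even though every listing in event A's top tier is a better deal; under the absolute scheme the same listing scores 50, which is directly comparable with scores from event A.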

SeatGeek’s head of R&D, Steve Ritter, spoke in great detail about the math and approach behind Deal Score in a two-part blog post series back in May but, in brief, the algorithm assesses the listed price of a ticket against our estimated market value of that ticket (based on historical prices, row/section position and other factors). As we thought about ways to build on the solid foundation of Deal Score, we realized that our Deal Score methodology could be applied across a broad series of similar events, such as a full season of NBA games or a multi-month run of a Broadway show. We’ve been testing this live on SeatGeek for the past few weeks, so you may have already noticed some changes. For users, this update has some meaningful benefits:

• Deal Scores are now comparable across all events, not just against ticket listings within a single game or concert, as was the case previously. For example, if you compare a 93 Deal Score for a November 2012 Knicks game against an 85 Deal Score for a different game at MSG 3 months later, you’ll know that the November ticket is without question a better value–a distinction that couldn’t previously be made.
• Absolute Deal Score still surfaces the best ticket deals within a single event, just as the previous iteration of Deal Score did. The only difference is that scores aren’t anchored to a relative distribution as before. The top 10 deals you see on any event page are still the best ticket deals available that evening, just as was the case previously.
• You may see fewer “100” Deal Scores on event pages, but the highest Deal Scores now truly represent exceptional values for that event type. This doesn’t mean that a ticket labeled with even a 50 or 60 Deal Score is a poor value–indeed, any ticket with a Deal Score above 50 ranks in the upper 20% of all deals on the site. But we felt it important to define far better gradations between above average deals and purely outstanding deals, and to do so across the widest body of comparable ticket listings.

We’re excited about this update and how it will change the experience of shopping on SeatGeek. We’d love to hear what you think! As always, drop us a line at hi@seatgeek.com.

# Using a Kalman Filter to Predict Ticket Prices

Welcome back! In case you missed part one of this series, we’re opening up the hood on Deal Score, one of SeatGeek’s most popular features.

In part one we gave a brief overview of why we sort ticket listings by Deal Score rather than by price. We gave you our two main assumptions:

• Seat quality, within a given venue, has a consistent ordering.
• The relationship of seat price to seat quality follows a similar pattern across all events at a given venue.

In the last post we discussed how we take advantage of the first assumption. Today, we’ll explain the second assumption and how it leads to the accurate valuations of thousands of tickets on a daily basis.

This second assumption means that all we need to do in order to predict prices is find a function, unique to each event, that maps our seat score vector to a vector of expected prices. Since we’re working with limited data here and surely do not have enough data to induce the structure of this curve for each event, we make a further simplifying assumption: for each venue, the curve will look similar for every event at that venue, whether that curve is a straight line, polynomial, or otherwise.

This does not mean that we can assume that a premium seat will carry the same relative premium over a nosebleed for each game. For a game that will not sell out, for example, the value of a nosebleed should be negative–this ticket will not sell, and given the additional expenses involved in attending an event (gas, parking, food, etc.), someone would have to be paid in order to sit in that seat. In such a situation, box seats are infinitely more valuable than bleachers. For a premium game, on the other hand, such as a playoff game, opening day, or any game at Fenway Park, all seats will sell out, as there is significant value just getting through the gates to see the event. Perhaps the box seat is only worth two or three times as much as the nosebleed in these cases.

At this point, we break out a terrific tool for processing small amounts of noisy data, the Kalman filter. Heavily used in the guidance and control of spacecraft and aircraft as well as with time-series data in economic and financial spheres, the Kalman filter is an algorithm that uses state estimates of model parameters combined with estimates of their variance to make predictions about the output of a linear dynamic system. I’ll spare you the obligatory rocket science joke and jump straight into a tutorial on how to use a Kalman filter to make predictions in the face of noisy and limited ticket data.

Since every observed price is going to be the output of a noisy system at a point in time, we are most interested in the likely state of the system as of the last observation, making a recursive estimator such as the Kalman filter an excellent choice.

## Step one: model the system

In SeatGeek’s case, our assumption is that the underlying structure of price dynamics is similar across events, and we can therefore generate a curve mapping seat quality to expected prices using the same parameters on the curve each time. As an example, here is a plot with seat scores for Knicks games at Madison Square Garden on the x-axis and historical average sale price for those seats on the y-axis.

I’ve added a best-fit line, which shows that sale prices tend to grow exponentially with seat quality at this particular venue. As such, we will model our price predictions as log-linear with respect to seat quality. We’re about to do a lot of math here, so feel free to skip ahead.

The Kalman filter maintains the state of the filter at step k with two variables:

• $\mathbf{\hat{x}}_{k}$: the parameters of the model given observations up to and including step k
• $\mathbf{P}_{k}$: the covariance matrix of parameter errors, a measure of the confidence the model has in its parameters

In our simple case, $\mathbf{\hat{x}}$ represents the intercept $\hat{x_1}$ and slope $\hat{x_2}$ of our line. $\mathbf{P}$ represents the covariance of parameter errors. This covariance matrix will be used down the line to determine which parameters must change when we make a new observation. It will also determine the magnitude of the adjustment.

A general Kalman filter uses a state-transition matrix $\mathbf{F}$ in order to advance from one observation to predicting the value of the next:

$$\mathbf{x}_{k|k-1} = \mathbf{F} \mathbf{x}_{k-1|k-1} + \mathbf{w}_{k}$$

Where: $\mathbf{x}_{k|k-1}$ is our best estimate of $\mathbf{x}_k$ given observations up to and including $k-1$. $\mathbf{w}_{k}$ is white noise of the same dimension as the model, drawn from a multivariate normal distribution having covariance $\mathbf{Q}$, representing the process error. Many applications of the filter model physical systems that take velocity and acceleration into account (and behind the scenes, so does SeatGeek). In these cases, $\mathbf{F}$ can include time-varying parameters, but in our simple example, we set:

$$\mathbf{F} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Assuming that in between observations the underlying model dynamics do not change according to any known physical system, we use the identity matrix [1].

## Step two: model the output

Sharp eyes may have noticed that the preceding equation does not use our lovely seat scores quite yet. The reason is our observations do not come in the form of linear models, but rather in observed fair values for seats, i.e. when users express an intent to buy. We have to model our output as $\mathbf{z}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{v}_k$, where $\mathbf{H}_k$ is a 1x2 matrix $\begin{bmatrix} 1 & \theta_k \end{bmatrix}$, $\theta_k$ being the rating of the seat in question. Keeping with Kalman filter assumptions we model our residuals, the difference between our observations and predictions, as Gaussian white noise [2]. In the single-output case, the observation noise can be thought of as the square of our standard estimation error, or how far we allow our predictions to be off before the model updates itself. This variance, $\mathbf{R}$, will be used later on when we update the model.

## Step three: predict

Now that we have a model of our system, we can start making predictions. Using historical data, we can generate $\mathbf{x}_0$, our default parameters, and start predicting prices. Our prediction, of course, is that our observations will lie on the line defined by $\mathbf{x}_0$, shown in the image above. In this case, $\mathbf{x}_0 = \begin{bmatrix} 7.2356 \ , \ .1428 \end{bmatrix}$.
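A minimal sketch of the prediction step, using the $\mathbf{x}_0$ values quoted above. It assumes the model output is a natural-log price, since the post describes the fit as log-linear without saying which log or which units; the function names are made up for this example.

```python
import math

X0 = [7.2356, 0.1428]  # intercept and slope quoted in the post


def observation_row(seat_score):
    """H_k = [1, theta_k] for a seat with quality score theta_k."""
    return [1.0, seat_score]


def predict_log_price(x, seat_score):
    """Model output H_k . x, taken here to be a log price (an assumption)."""
    h = observation_row(seat_score)
    return h[0] * x[0] + h[1] * x[1]


def predict_price(x, seat_score):
    """Undo the log to get a price in whatever units the model was fit in."""
    return math.exp(predict_log_price(x, seat_score))


for score in (0.0, 5.0, 10.0):
    print(score, round(predict_log_price(X0, score), 4))
```

Because the model is linear in the log of the price, each extra point of seat score multiplies the predicted price by the same factor, `exp(0.1428)`.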

We add those parameters to our listing feed, determine seat quality from the data provided to us by the market in question, and predict a price for each listing. We compare the predicted price to the listed price, assign a Deal Score to each listing, and sort your search results accordingly. We live to fight another day.

## Step four: observe

Market dynamics, however, are not so kind as to stay constant, and our models, alas, are unable to perfectly predict every price from the outset. The Kalman filter is thus useful for responding to changing tides. Since observations of changing tides can be few and far between and must inform our predictions on all other tickets, it behooves us to have a degree of certainty about our model, which we represent by $\mathbf{P}$, the 2x2 covariance matrix of our state estimate errors. A good design decision is to start off $\mathbf{P}_0$ with large numbers on the diagonals and zeroes elsewhere, assuming low certainty of model parameters and independence [2]. The filtering process should be able to give you excellent color on their true relationship.

Many of the signals discussed in part one can be interpreted, directly or indirectly, as a fair ticket price, and when we observe a new price, $\mathbf{z}_k$, we model the residual as $y_k = \mathbf{z}_k - \mathbf{H}_k \mathbf{\hat{x}}_{k|k-1}$ where $\mathbf{H}_k \mathbf{\hat{x}}_{k|k-1}$ represents our predicted price for that seat. In the Kalman filter, the residual variance (variance of $y_k$) is modeled as $\mathbf{S}_k = \mathbf{H}_k \mathbf{P}_{k|k-1} \mathbf{H}_{k}^{\mathbf{T}} + \mathbf{R}$. In the general case, these are covariance matrices. Since our model outputs only one value, a predicted price, $\mathbf{S}$ and $\mathbf{R}$ are variances. We have seen $\mathbf{R}$ before: it is the general model of our error variance. $\mathbf{S}_k$ is the variance of this particular observation, which varies depending on the seat score of this particular observation (see dotted lines in the slideshow below). Variance is higher on expensive tickets than it is on cheaper tickets. We will use the residual and its variance in the next step in order to update our parameters.
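For the two-parameter, single-output case, the residual and its variance reduce to a handful of multiplications. The sketch below spells them out in plain Python; the function names and the example numbers for $\mathbf{P}$ and $\mathbf{R}$ are illustrative assumptions.

```python
def residual(z, h, x):
    """y_k = z_k - H_k x_{k|k-1} for a 1x2 H and a 2-element x."""
    return z - (h[0] * x[0] + h[1] * x[1])


def residual_variance(h, p, r):
    """S_k = H_k P_{k|k-1} H_k^T + R for a 1x2 H and a 2x2 P."""
    hp = [h[0] * p[0][0] + h[1] * p[1][0],
          h[0] * p[0][1] + h[1] * p[1][1]]
    return hp[0] * h[0] + hp[1] * h[1] + r


P = [[4.0, 0.0], [0.0, 1.0]]  # illustrative parameter covariance
R = 1.0                       # illustrative observation variance

print(residual_variance([1.0, 0.0], P, R))  # 5.0
print(residual_variance([1.0, 2.0], P, R))  # 9.0
```

The second call uses a higher seat score and yields a larger variance, because uncertainty in the slope is multiplied by the square of the seat score. This matches the remark that variance is higher on expensive tickets.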

## Step five: update

We now come to the key element of the Kalman filter, the gain. The gain takes our a priori estimate covariance $\mathbf{P}_{k|k-1}$, our observation model $\mathbf{H}_k$, and our residual variance $\mathbf{S}_k$ in order to decide how much we should change our model parameters before the next prediction. Our optimal gain is $\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}_{k}^{\mathbf{T}} \mathbf{S}_{k}^{-1}$. Kalman gain is a bit of a tricky nut to crack. If you think of the new parameters $\mathbf{x}_{k+1}$ as a weighted average of the old estimate $\mathbf{x}_k$ and the new observation, $\mathbf{K}$ provides the optimal weight for the observed residual in the new average.

If you’ve been following along, you can see that the larger our a priori uncertainty, the higher this gain factor gets. Now that we have a gain factor, we can start making some updates.

For the model parameters, this is easy. We simply scale the Kalman gain by the measurement residual, yielding us a new estimate. You can see here that if we guessed the price exactly, the slope and intercept do not change (a good sanity check) and that if we were fairly sure about the estimate beforehand, it requires a major miss before we update it substantially:

$$\mathbf{\hat{x}}_{k|k} = \mathbf{\hat{x}}_{k|k-1} + \mathbf{K}_k y_k$$

Similarly, we also update our error covariance matrix:

$$\mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \mathbf{P}_{k|k-1}$$

The error covariance update is a bit of a headache as well; the easiest way to think about it is to remember that $\mathbf{P}$ represents the covariance of our parameter estimation errors $\mathbf{x}_k - \mathbf{\hat{x}}_{k|k-1}$ and to play around with what makes the changes in $\mathbf{P}$ large or small. For example, if the residual variance, $\mathbf{S}_k$, is very large, our gain will be small and our certainty will remain mostly unchanged.
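Putting the gain and both updates together, one full update for the slope-and-intercept model might look like the following. This is a sketch under the simplifications described in this post: $\mathbf{F}$ is the identity, and process noise $\mathbf{Q}$ is ignored, so the previous posterior is used directly as the prior. The function name and argument order are assumptions.

```python
def kalman_update(x, p, theta, z, r):
    """One scalar-observation Kalman update for a 2-parameter linear model.

    x: [intercept, slope], p: 2x2 covariance, theta: seat score,
    z: observed value, r: observation variance. Returns (x_new, p_new).
    """
    h = [1.0, theta]
    # P H^T, a 2-element column.
    ph = [p[0][0] * h[0] + p[0][1] * h[1],
          p[1][0] * h[0] + p[1][1] * h[1]]
    s = h[0] * ph[0] + h[1] * ph[1] + r        # S = H P H^T + R
    k = [ph[0] / s, ph[1] / s]                 # K = P H^T S^-1
    y = z - (h[0] * x[0] + h[1] * x[1])        # residual
    x_new = [x[0] + k[0] * y, x[1] + k[1] * y]
    # P_new = (I - K H) P
    kh = [[k[0] * h[0], k[0] * h[1]],
          [k[1] * h[0], k[1] * h[1]]]
    p_new = [[sum((float(i == m) - kh[i][m]) * p[m][j] for m in range(2))
              for j in range(2)]
             for i in range(2)]
    return x_new, p_new


x, p = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
x, p = kalman_update(x, p, theta=0.0, z=2.0, r=1.0)
print(x)  # [1.0, 0.0]
print(p)  # [[0.5, 0.0], [0.0, 1.0]]
```

With equal prior and observation variance, the intercept moves halfway toward the observation and its variance halves. The slope is untouched because a seat score of zero carries no information about it.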

Now that we understand how the filter works, let’s rejoin our original programming and see it in action!

## Putting it all together

The slideshow below takes you on a visual tour through several steps of the dynamic linear model. In all slides, the dark red line represents our estimate of the mapping from seat quality to expected price and the dashed lines represent our 95% confidence interval for the price.

Before concluding, I’d like to note that a major motivation behind this series was the lack of real-world Kalman filter examples out here on the internet, which is disappointing given its usefulness as an estimator, especially for low-dimensional time-variant systems with small data. Here are some of the better articles I’ve found:

I gladly welcome thoughts on our usage of the filter or critiques of my explanation from those who have a better handle on things. Leave a comment or find me on twitter @steve_rit.

## Notes

• 1: If we wanted to add time-variance to our parameters, we could use something like:
• 2: To cut down on the amount of notation, I’ve removed some symbols representing noise that aren’t directly used in the predict-update process.

# The Math Behind Ticket Bargains

Greetings from SeatGeek Research & Development!

I’m here today to take you behind the curtain of one of SeatGeek’s major features, Deal Score. For the uninitiated, Deal Score is a 0-to-100 rating that reveals whether a ticket is a great bargain or a major rip-off. We humbly believe it’s the best way to find tickets. I’d like to quickly tell you why and then spend most of this post discussing some of the math behind Deal Score’s calculation. This is the first in a series of two blog posts, the second coming soon.

## Sorting vs. Searching

Why have Deal Score? The standard across ticket sites is, of course, sorting by price. On most ticket sites, a prospective buyer can select sections they want to sit in, filter tickets by price range, and spend a solid chunk of their day trying to figure out the best seats for the money. On most aggregators, listings from several ticketing websites are lumped together… and then sorted by price, whereupon the experience repeats itself with the added pleasure of more noisy data.

SeatGeek, however, is more than an aggregator; we’re a search engine. Using Deal Score, we sort tickets by value rather than price. As a quick example, let’s try to find some tickets for the Red Sox-Indians game May 12th at Fenway Park. If I sort the tickets by price, I need to wade through dozens of cheap listings for standing room only tickets and obstructed view seats. Cheap for sure, but anybody who’s been to Fenway Park can tell you there are some places you just don’t want to sit. I need to be vigilant in order to notice a listing for two tickets in the grandstand behind home plate for $53, the same price level as a listing in the back of the bleachers and in two neck-straining outfield grandstand seats. How good of a deal is this? Sorted by price, these three listings look the same, but behind the scenes SeatGeek’s proprietary price prediction has pegged these bleacher seats as being worth $29, the outfield grandstand seats at $34, and the infield seats at $69. Deal Score compares every ticket’s expected price to its listed price and takes the mental leg work out of ticket shopping.

The basic principle behind Deal Score is simple and intuitive: by searching rather than sorting, we can intelligently filter secondary market ticket listings, saving consumers large amounts of time and money.

## How does it work?

The most important element of our Deal Score algorithm is to accurately estimate the current market value of a ticket listed on the secondary market. Most marketplaces have large amounts of transactional data on their products, often with supply and demand-side pricing signals. SeatGeek is in the undesirable position of trying to predict, on a daily basis, the price of millions of event tickets that have, by definition, never sold. Each seat at every event is a unique product; while its eventual price is informed by many other signals, the secondary market is both opaque and noisy.

Given our data constraints and the precision necessary, we made two assumptions about seats:

• Seat quality, within a given venue, has a consistent ordering. This means that for any given Red Sox game, we expect that Infield Grandstand 18, Row 12 is a better place to sit than Center Field Bleachers 37, Row 37.
• The relationship of seat price to seat quality follows a similar pattern across all events at a given venue. This means that a curve plotting sale price against seat quality for a weekend Red Sox-Yankees game at Fenway Park should look similar to a curve for a midweek Red Sox-Royals game, even though the market dynamics would be quite different [1].

The first assumption allows us to use signals from many contexts to inform our predictions. The second assumption allows us to make confident predictions about prices after seeing as few as five or ten prices for each event.  In today’s installment, I’m going to show you the math we use to derive a key metric called “Seat Rank,” the ordinal quality rank of all seats within a venue.

## Seat Rank

In order to make the most of our first assumption, we determine the intrinsic “seat quality” of each seat relative to all others. Teams and promoters deal with this every day; they have to set face values for tens of thousands of seats in a stadium, but they have the advantage of only needing to compute a few dozen price levels, at most. In contrast, secondary markets have row-level pricing granularity, and thus require us to understand how much each row is going to sell for on the open market. Fenway Park, for example, has 4,022 distinct section/row pairs, and we must understand how they all rank on a relative basis. Using a little bit of cleverness along with vector coordinate data from SeatGeek’s venue maps, we reduce the problem slightly: we divide each venue into clusters of seats (we call them “seat groups”) whose physical locations and sale prices tend to be close enough to each other that they can be modeled together. These seat groups allow us to make use of less data to predict more prices.   Some venues have as few as twenty groups; others, well into the thousands. Fenway Park has 993.

To understand Seat Scores, consider a simple example where the set of listings $\mathcal{S}$ consists of three seats indexed by $i$:

$$\mathcal{S} = \{s_1, s_2, s_3\}$$

Suppose these seats are equally priced, despite the fact that their quality $\theta_{i}$ varies. In fact, $s_1$ is twice as good as $s_2$, which is twice as good as $s_3$. Without loss of generality, we arbitrarily set $\theta_{1} = 1$ and can define a vector $\Theta$ of relative seat qualities:

$$\Theta = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{4} \end{bmatrix}$$

Unfortunately, while SeatGeek has a lot of data, we cannot directly observe the relative true quality $\Theta$ of these seats. However, we use a group of different signals, including clicks on “buy” buttons and the physical location of a seat within a venue, to arrive at an estimated quality $\hat{\Theta}$. One of these signals is pairwise comparison. Shoppers constantly make pairwise comparisons among seats. We use this tendency to our advantage. In particular, we obtain our estimate of $\Theta$ by assuming that users’ historical choices are proportional to the true relative quality of seats, revealing information about the true $\Theta$. For simplicity’s sake, assume that:

$$P(s_i \text{ chosen over } s_j) = \frac{\theta_i}{\theta_i + \theta_j}$$

For example, when faced with a choice between $s_1$ and $s_3$, users will pick $s_1$ with probability $\tfrac{1}{1 + 1/4} = 80\%$. In reality, the data will be much noisier. Each data point is a random realization of a user’s perception of relative seat values. Some pick the first listing they see, others have disparate opinions about what makes for a quality seat, etc. [2]
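The 80% figure follows from a proportional-choice rule in which a seat is picked with probability equal to its quality divided by the sum of both qualities. A quick check in Python, using the three-seat example (the names `THETA` and `pick_probability` are made up for this sketch):

```python
def pick_probability(theta_i, theta_j):
    """Probability that seat i is chosen over seat j under proportional choice."""
    return theta_i / (theta_i + theta_j)


THETA = {"s1": 1.0, "s2": 0.5, "s3": 0.25}

print(pick_probability(THETA["s1"], THETA["s3"]))  # 0.8
```

The two probabilities for any pair always sum to one, so `pick_probability(THETA["s3"], THETA["s1"])` is the remaining 20%.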

Continuing with the Fenway Park example, after processing our input signals, we have a square matrix $\mathbf{R}$ where each cell represents the processed results of pairwise comparisons between seat groups. We define each cell $r_{i,j}$ as the observed relative quality of $s_i$ as compared to $s_j$.

The rough values for $\mathbf{R}$ are fairly noisy, as shown in the matrix below. The matrix below is sorted left-to-right, top-to-bottom by the raw “winning percentage” of each seat in pairwise comparisons. Each cell represents, roughly, the fraction of the time that a user clicked on the seat in the row (y-axis) when the seat in the column (x-axis) was available at an equal or lesser price. A row with mostly red is a seat that “wins” many comparisons, a row with mostly green tends to lose.

The initial $\Theta$’s implied by these raw winning percentages are a good start, but these data are far too noisy to be used as reliable estimates. This is a visual representation of what Fenway Park looks like with these raw seat scores:

To estimate $\hat{\Theta}$ in the presence of noisy data, we use a method called maximum likelihood estimation, which iterates over candidate values for $\hat{\Theta}$ to maximize the probability of observing the real data. We start with rough parameter values, $\hat{\Theta}$, and follow two steps: (1) calculate the probability of observing the data conditional on these values [3]:

(2) adjust the parameter values to increase this likelihood.
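As an illustration of this kind of iterative fit, the sketch below estimates qualities from pairwise win counts, assuming the choice rule $P(s_i \text{ over } s_j) = \theta_i / (\theta_i + \theta_j)$. It uses the standard minorization-maximization update for the Bradley-Terry model, which is a stand-in chosen for this example and not necessarily the procedure SeatGeek runs. The input is a matrix of raw win counts, whereas the post's $\mathbf{R}$ holds processed signals.

```python
def fit_qualities(wins, iterations=200):
    """Estimate relative qualities from pairwise outcomes.

    wins[i][j] is the number of times seat i was chosen over seat j.
    Every seat is assumed to win at least once. The result is scaled
    so that the first seat has quality 1.
    """
    n = len(wins)
    theta = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (theta[i] + theta[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        # Only ratios matter, so fix the scale after every pass.
        theta = [t / updated[0] for t in updated]
    return theta


# Win counts that match true qualities (1, 1/2, 1/4) exactly.
wins = [
    [0, 20, 40],
    [10, 0, 20],
    [10, 10, 0],
]
theta_hat = fit_qualities(wins)
print([round(t, 3) for t in theta_hat])  # [1.0, 0.5, 0.25]
```

Each pass raises the likelihood of the observed outcomes, which mirrors the calculate-then-adjust loop described above. With noisy real data the estimates would settle on the best compromise rather than recovering exact values.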

Watch below as the seat scores converge from our initial values to the maximum likelihood (use the controls below to navigate):

Presto! Once we’re finished, we end up with something that looks very similar to Fenway’s actual seating chart, only with much more granular distinctions on price levels. With these seat scores, we would expect $\mathbf{R}$ to look like this filled-in matrix instead of the noisy, sparse mess from above.

With these powerful seat scores in hand, we’re halfway to our goal of predicting accurate prices for live events at any venue in the country. Come back for our next post to see how we go from our seat scores to market value predictions for thousands of events every day. UPDATE: View part 2: Using a Kalman Filter to Predict Ticket Prices

## Credits

In case you’re wondering what technology we use for these projects, here’s a sampling:

• pandas: a python data analysis library, for signal processing
• R: for statistical analysis and postprocessing
• ggplot2: to make the heatmaps seen above

## Notes

• 1: If you read this far and wondered whether we were ever going to get around to this, then you’ll want to come back for part 2, when we explain how price predictions are derived from these seat scores.
• 2: Fenway Park is actually a good example of this phenomenon: Green Monster seats in particular are heavily disagreed upon by our signals.
• 3: $r_{ij} = 0$ whenever $i = j$, so we need not exclude these cases.

# The SeatGeek Platform

Over the past two and a half years, we’ve poured countless hours into building a canonical database of live events in the US. Not only have we cataloged when and where each event is happening, but we also built a system that attaches copious metadata to each event–e.g., the latitude/longitude of the venue, the number of tickets currently listed, etc. Thus far, that database has been used exclusively to power the pages on SeatGeek.com. But we mustn’t be selfish! Thus, we recently announced The SeatGeek Platform. Developers can use the Platform to add live event info to existing apps or as a foundation for entirely new apps that deal with live events.

The SeatGeek Platform is composed of our event, performer, and venue data, a REST API, our Partner Program, and a developer support community. The API exposes a mother lode of live event info–nearly all of the data you see available on SeatGeek.com, plus a lot more. Full documentation is here. The Partner Program gives Platform users an easy way to monetize. Anyone who signs up earns a 50/50 rev share whenever a user buys tickets using one of their links. For current partners, that has worked out to about $11 every time one of their users buys a ticket. A few of us on the SeatGeek dev team are closely tracking posts on the support forum, so if you have any questions about the API, just post there and you will get a prompt, thorough response.

How might someone use this thing? A quick example: Let’s say that Sarah runs a site for her indie record label, SBeats, which gets a lot of traffic from fans. Since the record business isn’t massively lucrative these days (shocking, I know!) Sarah is looking for new ways to monetize. She’d also like to add a bit more content to her label’s site. She uses the SeatGeek API to pull in data about which of her artists are touring. She displays that info in a module on each artist’s page. To give users a bit of context, she pulls the “low price” field from the API to show the cost of the cheapest ticket for each show. Whenever a user clicks on a link for a show and buys a ticket, Sarah earns $11, on average.

We’re pumped about this launch. For the first time, we’re exposing our data to developers everywhere. I can’t wait to see what people build.

# Removing Price Forecasts

When we launched SeatGeek back in the fall of 2009, we positioned ourselves as a site that forecasts how ticket prices move on the secondary market. That was our “one thing”: forecasts. Russ had spent months building scrapers to collect ticket data. I’d spent months messing with that data in STATA, building models that could accurately forecast prices. We figured we could get traction by helping consumers time their ticket purchases optimally.

Much has changed. The past two years have been all about expanding the vision (trite but true) of what SeatGeek can be. Four months after we launched, we moved from being a price forecaster to a price forecaster that also had pretty good ticket search. The ticket search was getting a better response from users than the forecasts, so we continued to focus on that. Five months after that, we launched our own interactive mapping platform. That really opened the doors on how we could approach ticket search. We created something called Deal Score, which allowed us to use a lot of the data and analytical tools we’d built for forecasts, and that got a great response. In 2011, we added dozens more ticket sellers to our search results. Near the end of the year, we began to make the leap into being a full-fledged one-stop site for live entertainment.

Through all of this, the forecasting feature has been lost in the shuffle. We’ve continued to support it, but it has become a hassle rather than a cause for excitement. Most users stopped paying attention to it; the site had other things that were more compelling. We stopped mentioning it when we described what SeatGeek does. Forecasts ceased to be relevant to our core mission: making ticket buying elegant and simple. Optimally timing a ticket purchase is a complicated, tricky value prop, and it holds us back in our pursuit of removing complication from ticket buying.

Thus, within the next week or two, we’ll be removing price forecasts from our site. We want to avoid feature bloat. We need to maintain clarity of purpose, for both our users and ourselves. Drop us a line at hi@seatgeek.com if you think this is a terrible idea; if we get enough emails, we’ll reconsider. Otherwise, we will begin 2012 without forecasts, which will help us focus on the things that matter most as we try to upend the way people attend live events.