Welcome back! In case you missed part one of this series, we’re opening up the hood on Deal Score, one of SeatGeek’s most popular features.
In part one we gave a brief overview of why we sort ticket listings by Deal Score rather than by price. We gave you our two main assumptions:
- Seat quality, within a given venue, has a consistent ordering.
- The relationship of seat price to seat quality follows a similar pattern across all events at a given venue.
In the last post we discussed how we take advantage of the first assumption. Today, we’ll explain the second assumption and how it lets us accurately value thousands of tickets every day.
This second assumption means that all we need to do in order to predict prices is find a function, unique to each event, that maps our seat score vector to a vector of expected prices. Since we’re working with limited data here and surely do not have enough data to induce the structure of this curve for each event, we make a further simplifying assumption: for each venue, the curve will look similar for every event at that venue, whether that curve is a straight line, polynomial, or otherwise.
This does not mean that we can assume a premium seat will carry the same relative premium over a nosebleed for every game. For a game that will not sell out, for example, the value of a nosebleed should be negative: this ticket will not sell, and given the additional expenses involved in attending an event (gas, parking, food, etc.), someone would have to be paid to sit in that seat. In such a situation, box seats are infinitely more valuable than bleachers. For a premium game, on the other hand, such as a playoff game, opening day, or any game at Fenway Park, all seats will sell out, as there is significant value in just getting through the gates to see the event. Perhaps the box seat is only worth two or three times as much as the nosebleed in these cases.
At this point, we break out a terrific tool for processing small amounts of noisy data, the Kalman filter. Heavily used in the guidance and control of spacecraft and aircraft as well as with time-series data in economic and financial spheres, the Kalman filter is an algorithm that uses state estimates of model parameters combined with estimates of their variance to make predictions about the output of a linear dynamic system. I’ll spare you the obligatory rocket science joke and jump straight into a tutorial on how to use a Kalman filter to make predictions in the face of noisy and limited ticket data.
Since every observed price is going to be the output of a noisy system at a point in time, we are most interested in the likely state of the system as of the last observation, making a recursive estimator such as the Kalman filter an excellent choice.
Step one: model the system
In SeatGeek’s case, our assumption is that the underlying structure of price dynamics is similar across events, so we can generate a curve mapping seat quality to expected prices using the same functional form each time. As an example, here is a plot with seat scores for Knicks games at Madison Square Garden on the x-axis and historical average sale price for those seats on the y-axis.
I’ve added a best-fit line, which shows that sale prices tend to grow exponentially with seat quality at this particular venue. As such, we will model our price predictions as log-linear with respect to seat quality. We’re about to do a lot of math here, so feel free to skip ahead.
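If you’d like to follow along in code, here is a minimal sketch of that best-fit line using numpy; the seat scores and prices below are invented for illustration and are not real SeatGeek data:

```python
import numpy as np

# Invented (seat score, average sale price) pairs for one venue.
seat_scores = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0, 100.0])
avg_prices = np.array([28.0, 45.0, 70.0, 115.0, 180.0, 290.0, 460.0])

# Fit a line to the *log* of price: log(p) = m * score + b,
# i.e. prices grow exponentially with seat quality.
m, b = np.polyfit(seat_scores, np.log(avg_prices), deg=1)

def expected_price(score: float) -> float:
    """Expected sale price for a seat score under the log-linear model."""
    return float(np.exp(m * score + b))

print(expected_price(50.0))  # the model's price for a mid-quality seat
```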
The Kalman filter maintains the state of the filter at step $k$ with two variables:

- $\hat{x}_{k|k}$: the parameters of the model given observations up to and including step $k$
- $P_{k|k}$: the covariance matrix of parameter errors, a measure of the confidence the model has in its parameters

In our simple case, the state is just the slope and intercept of our log-linear pricing model:

$$\hat{x}_{k|k} = \begin{bmatrix} m_k \\ b_k \end{bmatrix}$$

A general Kalman filter uses a state-transition matrix $F_k$ to carry the state forward to each new step:

$$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1}$$

$$P_{k|k-1} = F_k P_{k-1|k-1} F_k^T$$

Where:

- $\hat{x}_{k|k-1}$ and $P_{k|k-1}$ are the a priori state and covariance estimates, i.e. our best guesses at step $k$ before seeing the new observation

Assuming that in between observations the underlying model dynamics do not change according to any known physical system, we use the identity matrix: $F_k = I$.¹
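As a sketch of what this state looks like in code (initial values invented for illustration), the transition with an identity matrix is pleasantly boring:

```python
import numpy as np

# Filter state: parameters of log(price) = m * score + b, plus our
# uncertainty about them. Values are placeholders, not real estimates.
x = np.array([0.03, 2.5])        # [slope m, intercept b]
P = np.diag([1e-4, 1e-1])        # error covariance of [m, b]

# With F = I and no process noise, the a priori estimate at the
# next step is just the previous posterior.
F = np.eye(2)
x_prior = F @ x                  # equals x
P_prior = F @ P @ F.T            # equals P
```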
Step two: model the output
Sharp eyes may have noticed that the preceding equations do not use our lovely seat scores quite yet. The reason is that our observations do not come in the form of linear models, but rather as observed fair values for seats, i.e. when users express an intent to buy. We have to model our output as

$$z_k = H_k x_k$$

where $z_k$ is an observed log price and $H_k = \begin{bmatrix} s_k & 1 \end{bmatrix}$ is the observation matrix built from the seat score $s_k$ of the ticket in question.²
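In code, the observation matrix for a single ticket is just its seat score with a 1 appended for the intercept (the values below are again placeholders):

```python
import numpy as np

x_prior = np.array([0.03, 2.5])          # a priori [m, b]

def observation_matrix(seat_score: float) -> np.ndarray:
    """H_k: maps model parameters to the expected log price of one seat."""
    return np.array([seat_score, 1.0])

H = observation_matrix(62.0)             # a seat with quality score 62
log_price_hat = H @ x_prior              # predicted log price
print(np.exp(log_price_hat))             # predicted price in dollars
```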
Step three: predict
Now that we have a model of our system, we can start making predictions. Using historical data, we can generate initial estimates of $\hat{x}$ and $P$ for each venue; from there, the predicted log price for a listing with seat score $s_k$ is simply $H_k \hat{x}_{k|k-1}$.
We add those parameters to our listing feed, determine seat quality from the data provided to us by the market in question, and predict a price for each listing. We compare the predicted price to the listed price, assign a Deal Score to each listing, and sort your search results accordingly. We live to fight another day.
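Here is a toy version of that pricing-and-sorting pass. To be clear, ranking by the ratio of listed price to predicted price is a deliberately simplified stand-in; the real Deal Score calculation involves more than this one ratio:

```python
import numpy as np

x = np.array([0.03, 2.5])                # venue parameters [m, b], illustrative

def expected_price(score: float) -> float:
    return float(np.exp(score * x[0] + x[1]))

# Invented listings for one event: (seat score, listed price).
listings = [(88.0, 150.0), (35.0, 95.0), (61.0, 80.0)]

# Sort best deals first: lowest listed price relative to predicted value.
ranked = sorted(listings, key=lambda sp: sp[1] / expected_price(sp[0]))
for score, price in ranked:
    print(f"score={score:5.1f}  listed=${price:7.2f}  "
          f"predicted=${expected_price(score):7.2f}")
```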
Step four: observe
Market dynamics, however, are not so kind as to stay constant, and our models, alas, are unable to perfectly predict every price from the outset. The Kalman filter is thus useful for responding to changing tides. Since observations of changing tides can be few and far between and must inform our predictions on all other tickets, it behooves us to have a degree of certainty about our model, which we represent by the a priori error covariance $P_{k|k-1}$.
Many of the signals discussed in part one can be interpreted, directly or indirectly, as a fair ticket price, and when we observe a new price $z_k$ for a seat with score $s_k$, we compute the measurement residual

$$y_k = z_k - H_k \hat{x}_{k|k-1}$$

the difference between what we observed and what our model predicted.
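Computing the residual for one such observed price is a one-liner on top of what we already have (numbers invented):

```python
import numpy as np

x_prior = np.array([0.03, 2.5])          # a priori [m, b]
H = np.array([62.0, 1.0])                # observation matrix for a score-62 seat

# Suppose a signal implies this seat is worth about $95.
z = np.log(95.0)                         # we observe log prices

y = z - H @ x_prior                      # measurement residual
```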
Step five: update
We now come to the key element of the Kalman filter, the gain. The gain takes our a priori estimate covariance $P_{k|k-1}$, along with the variance of the measurement noise $R_k$, and determines how strongly a residual should pull on our parameters:

$$K_k = P_{k|k-1} H_k^T \left( H_k P_{k|k-1} H_k^T + R_k \right)^{-1}$$
If you’ve been following along, you can see that the larger our a priori uncertainty, the higher this gain factor gets. Now that we have a gain factor, we can start making some updates.
For the model parameters, this is easy. We simply scale the Kalman gain by the measurement residual, yielding us a new estimate. You can see here that if we guessed the price exactly, the slope and intercept do not change (a good sanity check) and that if we were fairly sure about the estimate beforehand, it requires a major miss before we update it substantially:

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k y_k$$
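In code, the gain and the parameter update together look like this (the measurement noise variance R is an assumed value in log-price units):

```python
import numpy as np

x_prior = np.array([0.03, 2.5])          # a priori [m, b]
P_prior = np.diag([1e-4, 1e-1])          # a priori error covariance
H = np.array([62.0, 1.0])                # observation matrix for this seat
R = 0.05                                 # measurement noise variance (assumed)

y = np.log(95.0) - H @ x_prior           # residual from the observe step

# Kalman gain: the innovation variance S is a scalar here, since we
# observe one price at a time.
S = H @ P_prior @ H + R
K = P_prior @ H / S

# Update: a zero residual leaves [m, b] untouched; a small P_prior
# (high confidence) keeps K small, so only big misses move the model.
x_post = x_prior + K * y
```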
Similarly, we also update our error covariance matrix, $P_{k|k}$:

$$P_{k|k} = \left( I - K_k H_k \right) P_{k|k-1}$$
The error covariance update is a bit of a headache as well; the easiest way to think about it is to remember that $K_k H_k$ behaves like a fraction between zero and one: the more weight the gain places on a new observation, the more our uncertainty shrinks once that observation has been folded in.
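Adding the covariance update gives us the complete predict-update cycle; here is a sketch of the whole thing as one function (again with an assumed R):

```python
import numpy as np

def kalman_update(x, P, seat_score, observed_price, R=0.05):
    """One predict-update cycle of the log-linear pricing filter.

    With F = I, the predict step is a no-op, so this is effectively
    just the update. R, the measurement noise variance, is an assumed
    placeholder value.
    """
    H = np.array([seat_score, 1.0])
    y = np.log(observed_price) - H @ x         # residual
    S = H @ P @ H + R                          # innovation variance
    K = P @ H / S                              # gain
    x_new = x + K * y                          # parameter update
    P_new = (np.eye(2) - np.outer(K, H)) @ P   # uncertainty shrinks by K H
    return x_new, P_new
```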
Now that we understand how the filter works, let’s rejoin our original programming and see it in action!
Putting it all together
The slideshow below takes you on a visual tour through several steps of the dynamic linear model. In all slides, the dark red line represents our estimate of the mapping from seat quality to expected price, and the dashed lines represent our 95% confidence interval for the price.
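If you want to recreate something like those slides yourself, here is a self-contained sketch that folds a stream of invented observations into the filter and prints the predicted price with a 95% interval (±1.96 standard deviations of the predicted log price):

```python
import numpy as np

def kalman_update(x, P, seat_score, observed_price, R=0.05):
    """Fold one observed (seat score, fair price) signal into the filter."""
    H = np.array([seat_score, 1.0])
    y = np.log(observed_price) - H @ x
    K = P @ H / (H @ P @ H + R)
    return x + K * y, (np.eye(2) - np.outer(K, H)) @ P

def price_with_interval(x, P, seat_score, R=0.05):
    """Predicted price and a 95% interval for one seat score."""
    H = np.array([seat_score, 1.0])
    mean, var = H @ x, H @ P @ H + R
    half = 1.96 * np.sqrt(var)
    return np.exp(mean), np.exp(mean - half), np.exp(mean + half)

# Venue-level priors (invented), then a stream of observed fair prices.
x, P = np.array([0.03, 2.5]), np.diag([1e-4, 1e-1])
for score, price in [(80.0, 140.0), (20.0, 35.0), (55.0, 85.0)]:
    x, P = kalman_update(x, P, score, price)
    mid, lo, hi = price_with_interval(x, P, 62.0)
    print(f"score-62 seat: ${mid:6.2f}  (95% CI ${lo:6.2f} - ${hi:6.2f})")
```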
Before concluding, I’d like to note that a major motivation behind this series was the lack of real-world Kalman filter examples out there on the internet, which is disappointing given its usefulness as an estimator, especially for low-dimensional time-variant systems with small data. Here are some of the better articles I’ve found:
- Giovanni Petris’ monograph on dynamic linear models in R
- his excellent description of dlm, his R package for dynamic linear modeling
- a wonderful cheat sheet on the dimensionality of higher order Kalman filters
- an introduction to Kalman filters for predicting basketball game scores, written by the incomparable Dean Oliver
I gladly welcome thoughts on our usage of the filter or critiques of my explanation from those who have a better handle on things. Leave a comment or find me on Twitter @steve_rit.
Notes
- 1: If we wanted to add time-variance to our parameters, we could use something like $P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k$, where $Q_k$ is a small process-noise covariance that allows the slope and intercept to drift between observations.
- 2: To cut down on the amount of notation, I’ve removed some symbols representing noise that aren’t directly used in the predict-update process.