Jul 8th, 2011

FuzzyWuzzy: Fuzzy String Matching in Python

seatgeek open sourced seatgeek/fuzzywuzzy

Fuzzy String Matching in Python

We’ve made it our mission to pull in event tickets from every corner of the internet, showing you them all on the same screen so you can compare them and get to your game/concert/show as quickly as possible.

Of course, a big problem with most corners of the internet is labeling. One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns).

To pick an example completely at random, Cirque du Soleil has a show running in New York called “Zarkana”. When we scour the web to find tickets for sale, mostly those tickets are identified by a title, date, time, and venue. Here is a selection of some titles we’ve actually seen for this show:

Cirque du Soleil Zarkana New York
Cirque du Soleil-Zarkana
Cirque du Soleil: Zarkanna
Cirque Du Soleil - Zarkana Tickets 8/31/11 (New York)
Cirque Du Soleil - ZARKANA (Matinee) (New York)
Cirque du Soleil - New York

As far as the internet goes, this is not too bad. An normal human intern would have no trouble picking up that all of these listings are for the same show. And a normal human intern would have no trouble picking up that those listings are different than the ones below:

Cirque du Soleil Kooza New York
Cirque du Soleil: KA
Cirque du Soleil Zarkana Las Vegas

But as you might imagine, we have far too many events (over 60,000) to be able to just throw interns at the problem. So we want to do this programmatically, but we also want our programmatic results to pass the “intern” test, and make sense to normal users.

To achieve this, we’ve built up a library of “fuzzy” string matching routines to help us along. And good news! We’re open sourcing it. The library is called “Fuzzywuzzy”, the code is pure python, and it depends only on the (excellent) difflib python library. It is available on Github right now.

String Similarity

The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar:

NEW YORK METS
NEW YORK MEATS

Looks like a harmless misspelling. Can we quantify it? Using python’s difflib, that’s pretty easy

from difflib import SequenceMatcher
m = SequenceMatcher(None, "NEW YORK METS", "NEW YORK MEATS")
m.ratio() ⇒ 0.962962962963

So it looks like these two strings are about 96% the same. Pretty good! We use this pattern so frequently, we wrote a helper method to encapsulate it

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96

Great, so we’re done! Not quite. It turns out that the standard “string closeness” measurement works fine for very short strings (such as a single word) and very long strings (such as a full book), but not so much for 3-10 word labels. The naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.

Partial String Similarity

Here’s a good illustration:

fuzz.ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 75

This doesn’t pass the intern test. The first two strings are clearly referring to the same team, but the second two are clearly referring to different ones. Yet, the score of the “bad” match is higher than the “right” one.

Inconsistent substrings are a common problem for us. To get around it, we use a heuristic we call “best partial” when two strings are of noticeably different lengths (such as the case above). If the shorter string is length m, and the longer string is length n, we’re basically interested in the score of the best matching length-m substring.

In this case, we’d look at the following combinations

fuzz.ratio("YANKEES", "NEW YOR") ⇒ 14
fuzz.ratio("YANKEES", "EW YORK") ⇒ 28
fuzz.ratio("YANKEES", "W YORK ") ⇒ 28
fuzz.ratio("YANKEES", " YORK Y") ⇒ 28
...
fuzz.ratio("YANKEES", "YANKEES") ⇒ 100

and conclude that the last one is clearly the best. It turns out that “Yankees” and “New York Yankees” are a perfect partial match…the shorter string is a substring of the longer. We have a helper function for this too (and it’s far more efficient than the simplified algorithm I just laid out)

fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69

That’s more like it.

Out Of Order

Substrings aren’t our only problem. We also have to deal with differences in string construction. Here is an extremely common pattern, where one seller constructs strings as “<HOME_TEAM> vs <AWAY_TEAM>” and another constructs strings as “<AWAY_TEAM> vs <HOME_TEAM>”

fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 45
fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 45

Again, these low scores don’t pass the intern test. If these listings are for the same day, they’re certainly referring to the same baseball game. We need a way to control for string construction.

To solve this, we’ve developed two different heuristics: The “token_sort” approach and the “token_set” approach. I’ll explain both.

Token Sort

The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:

"new york mets vs atlanta braves"   →→  "atlanta braves mets new vs york"

We then compare the transformed strings with a simple ratio(). That nicely solves our ordering problem, as our helper function below indicates:

fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 100

Token Set

The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.

Here is an illustrative example:

s1 = "mariners vs angels"
s2 = "los angeles angels of anaheim at seattle mariners"

Using the token sort method isn’t that helpful, because the second (longer) string has too many extra tokens that get interleaved with the sort. We’d end up comparing:

t1 = "angels mariners vs"
t2 = "anaheim angeles angels los mariners of seattle vs"

Not very useful. Instead, the set method allows us to detect that “angels” and “mariners” are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form

t0 = [SORTED_INTERSECTION]
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]

And then compare each pair.

The intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar. In our example

t0 = "angels mariners"
t1 = "angels mariners vs"
t2 = "angels mariners anaheim angeles at los of seattle"
fuzz.ratio(t0, t1) ⇒ 90
fuzz.ratio(t0, t2) ⇒ 46
fuzz.ratio(t1, t2) ⇒ 50
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") ⇒ 90

There are other ways to combine these values. For example, we could have taken an average, or a min. But in our experience, a “best match possible” approach seems to provide the best real life outcomes. And of course, using a set means that duplicate tokens get lost in the transformation.

fuzz.token_set_ratio("Sirhan, Sirhan", "Sirhan") ⇒ 100

Conclusion

So there you have it. One of the secrets of SeatGeek revealed. There are more tidbits in the library (available on Github), including convenience methods for matching values into a list of options. Happy hunting.

Feb 14th, 2011

Announcing Soulmate

seatgeek open sourced seatgeek/soulmate

Redis-backed service for fast autocompleting

Have you ever felt so close to someone that it seemed like the two of you were finishing each other’s sentences? Well, as a Valentine’s Day gift to the community, we at SeatGeek have distilled some of Cupid’s magic into a Redis-backed service for doing exactly that: Soulmate is a tool for building fast autocompleters.

Give it a try right now on SeatGeek.

Inspired by Auto Complete with Redis, Soulmate uses sorted sets to build an index of partially completed words and the corresponding top matching items, and provides a simple sinatra app to query them.

Here’s a quick overview of what the initial version of Soulmate supports:

Provide suggestions for multiple types of items in a single query (at SeatGeek we’re autocompleting for performers, events, and venues)
Results are ordered by a user-specified score<
Arbitrary metadata for each item (at SeatGeek we’re storing both a url and a subtitle)

Checkout the github repo for instructions on how to use it, or if you’re feeling saucy, just gem install soulmate.

Oct 25th, 2010

Henceforth, All Job Applicants Must Hack Into Our Backend

screenshot

The early stage of the hiring process has a huge signal-to-noise problem. A job posting on one of the standard career sites garners hundreds of resumes, but most are poor, and sorting cruft takes countless hours. Outstanding web developers do not generally spend their time trolling the job listings on Craigslist. They do, however, enjoy puzzles.

Therefore, we’re changing the application process for our web developer position. All applicants must now submit their resume by solving a puzzle: they must hack into our backend jobs admin panel.

Admittedly, this is contrived. We didn’t have a backend jobs admin till last week, when Eric and Mike made it for the purpose of this challenge. But it should be a fun challenge for any dev up to the task. Anyone who successfully submits their resume will be carefully considered. Even if you aren’t looking for a job, feel free to give it a spin and drop us a line at hi@seatgeek.com to let us know what you think.

Note: All errors you run into are intentional. A blank page should be considered an error. If you see a blank page, your resume has not been submitted.

Here’s the job description: https://seatgeek.com/jobs

Note: This challenge has been discontinued.

Oct 21st, 2010

Announcing DJJob - a PHP Port of Delayed_job

seatgeek open sourced seatgeek/djjob

Database backed asynchronous priority queue – A PHP port of delayed_job

DJJob is our database-backed job system that allows PHP web applications to process long-running tasks asynchronously. It is a nearly direct port of delayed_job, one of the most popular Ruby/Rails job processing systems.

A few months ago, I went searching for a PHP-friendly equivalent of the many popular queue/worker-based job systems available to Ruby apps. But the best I could find was a few pointers to pcntl_fork. There are a few language agnostic-systems, but in general they seem more complicated and often rely on additional moving parts.

Delayed_job’s design is attractive because most people are already running a SQL db, and storing jobs in the db instantly buys you a pretty full featured job management system in phpMyAdmin.

The result is that we decided to port delayed_job to PHP and bring simple, robust job processing to PHP web apps.

Take a look at the github repo to see how to use it and to get the code.

Sep 18th, 2009

SeatGeek Launches at TechCrunch 50

A few days ago we publicly launched SeatGeek at the TechCrunch50 conference in San Francisco. We got up on the main stage at TC50 and, in front of a live audience of 2,000+ and a streaming audience of another 8,000+, we described what SeatGeek does and how we think we can revolutionize the way people buy tickets. And then, right after we hopped off stage, Russ FTP’ed a few files and voilà, the site went live. Here’s a link to the presentation video.

Things have been quiet on this blog recently; apologies for that. TC50 is a superb event for startups, but the participation requirements stipulated we not talk publicly about our business nor our forthcoming launch. So, instead of awkwardly blogging about SeatGeek without even saying what the hell we do, Russ and I put our heads down and focused on making the site as useful and stunning as possible for our public launch.

But, thankfully, the moratorium has now been lifted. So expect more activity here. We’ve gotten some great positive press after our launch, nearly all of it positive, including articles by the LA Times, Fast Company, VentureBeat, and CNET.

Near term plans for Russ and I include:

Building our team by hiring a CTO and a few developers
Raising money–we’ve begun raising a seed round of 500K-1M
Adding NFL forecasts. They’re nearly ready, and they’re looking awesome.
Launching our premium broker service, SeatGeek Pro

Blog Archives Newer →

Code, Design, and Growth at SeatGeek

Jobs at SeatGeek