Crunching subway data- a New Yorker’s busiest stations

There are many reasons to complain about the subway system here in New York. It’s underfunded, the air conditioning breaks, and if you’ve ever tried relying on the G line you’ve probably ended up with a deep, serious commitment-phobia. But there are many bright spots of the subway system, and as a tech-head developer I’d like to draw your attention to one in particular- data.

The MTA makes a ton of data available. The entire subway and bus system are available as GTFS feeds, allowing you to set up your own instance of OpenTripPlanner for all your subway routing needs- something I used in the aftermath of Sandy to set up an emergency trip planner (and OpenPlans then used to create some great heatmaps). It has data on each and every Metrocard swipe in the city, and, er, Pantone colours for each of the subway lines. It also has a crazy amount of data on each subway turnstile in the city, which is what I’ve been playing around with lately.

The most popular stations in New York are already known- the MTA themselves has them listed. Unsurprisingly, Times Square tops the list by a large margin, with every other large station following closely behind. They don’t share how they came to that number, but I assume that it includes everyone who travels on the subway- commuters, tourists, even the Mariachi guys traveling from train to train. For a heatmap side project I’m currently working on, I wanted to know what the most popular weekday, commuter stations are.

So I downloaded some turnstile data from mid-January to late April of this year to use as my sample set. The format the MTA uses is… weird, to say the least. The basics are there- it doesn’t record every turnstile turn, but rather keeps cumulative totals during the day- 3am, 7am, 12pm, and so on. For some reason it has eight repeating sets of columns that should really be rows, so I threw together a quick and dirty node.js script to flatten these out (and merge all my CSV files into one) and imported the data into a Postgres database.

First off, I needed to get my commuter totals. I did this by creating a view that ran an exceptionally messy SELECT statement which selected the first row of exit data available after or on 3am, then matched it up with the first result after 11am- while excluding all weekend results:

As I said, awful SQL. If anyone has any suggestions for improvements I’d love them. But it worked. For each turnstile, I now have the number of exits taken during the morning rush hour(…ish). Unfortunately, I quickly realised that I’d need to clean the data up- it appears that at certain points, turnstiles just go absolutely haywire and you end up with -200000 exits on one day, which can really mess with your totals. I discovered that I could easily chop this data out just by calculating how far that day’s result deviated from the turnstile’s overall median. The anomalous results were so different that I could set the cut off point at 10x the median and still exclude them.

With that done, it was only a short step to aggregate the data up to the station level, and discover the most popular commuter stations, or, The Stations With The Most Turnstile Exits During Peak-ish Hours on Weekdays:

  1. 14th St Union Square
  2. 42nd St Grand Central
  3. 42nd St Times Square
  4. 34th St Penn Station
  5. Fulton St
  6. 47-50th St Rockefeller Plaza
  7. 34th St Herald Square
  8. 23rd St (6)
  9. Chambers St
  10. 59th St – Columbus

So while many of the results are similar to the overall station popularity, there are some definite differences- Union Square jumping to the top being one of the most noticeable. Be careful not to take too much out of these numbers- as I said, it’s based on a limited dataset of a few months. And I’d welcome any corrections on my working from people smarter than myself!