This is a Pandas script that matches the start/end coordinates of Citibike rides to NYC neighborhoods.
Lyft, the owner of Citibike, publishes anonymized ridership data at the start of every month. These datasets record where and when each ride starts and ends, and whether the rider held a Citibike subscription. While Lyft uses the data in its own reports, it is also available to the public.
One thing Lyft does not do is identify which neighborhoods the start/end coordinates fall in. I was working on a data story about the effects of congestion pricing in NYC and found that gap very disappointing.
I used the official shapefile for the 2020 NYC Neighborhood Tabulation Areas and Nominatim to match each pair of coordinates to the neighborhood it falls in. I created two additional columns (`start_neighborhood` and `end_neighborhood`) so the data would be easier to work with.
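For readers new to spatial joins, here is a minimal sketch of the idea, not the notebook's actual code. It assumes GeoPandas and Shapely are installed, that the ride data has `start_lat`, `start_lng`, `end_lat`, and `end_lng` columns, and it uses two toy rectangles in place of the real shapefile. The helper `add_neighborhood` and the `neighborhood` column name are hypothetical. With the real shapefile you would load the polygons with `gpd.read_file(...)` and may need to reproject them with `to_crs("EPSG:4326")` so they match the latitude/longitude points.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy stand-ins for neighborhood polygons (assumption: the real ones
# would come from the shapefile in the shapefiles folder).
neighborhoods = gpd.GeoDataFrame(
    {"neighborhood": ["West", "East"]},
    geometry=[
        box(-74.02, 40.70, -74.00, 40.72),
        box(-74.00, 40.70, -73.98, 40.72),
    ],
    crs="EPSG:4326",
)

# Toy rides; the column names are assumed, check them against your file.
rides = pd.DataFrame({
    "start_lat": [40.71, 40.71],
    "start_lng": [-74.01, -73.99],
    "end_lat": [40.71, 40.80],
    "end_lng": [-73.99, -73.99],
})


def add_neighborhood(rides, neighborhoods, prefix):
    """Add a '<prefix>_neighborhood' column via a point-in-polygon join."""
    # Turn the lat/lng pair into point geometries (x = longitude, y = latitude).
    points = gpd.GeoDataFrame(
        rides.copy(),
        geometry=gpd.points_from_xy(rides[f"{prefix}_lng"], rides[f"{prefix}_lat"]),
        crs="EPSG:4326",
    )
    # Left join keeps every ride; points outside all polygons get NaN.
    joined = gpd.sjoin(points, neighborhoods, how="left", predicate="within")
    # Keep one match per ride in case polygons overlap.
    joined = joined[~joined.index.duplicated(keep="first")]
    out = rides.copy()
    out[f"{prefix}_neighborhood"] = joined["neighborhood"]
    return out


result = add_neighborhood(rides, neighborhoods, "start")
result = add_neighborhood(result, neighborhoods, "end")
print(result[["start_neighborhood", "end_neighborhood"]])
```

The left join is a deliberate choice here: rides that start or end outside every polygon stay in the table with a missing neighborhood instead of being silently dropped.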
I worked specifically with the March 2024 and March 2025 datasets, but you don't have to. Go to the Lyft website and look up the name of the file you wish to import. By following the scripts outlined here, you can learn to perform a spatial join from start to finish for your own projects.
As for my personal project, here are some graphics (created with Datawrapper) that I was able to include in my data story.
Feel free to use this code however you want. Perhaps in the future I can turn this into a library of useful Pandas scripts for data journalists.
- Please do not remove or modify the `shapefiles` folder.
- This notebook contains functions specific to my project (I was researching the rides between Brooklyn and Manhattan, specifically the congestion relief zone). Feel free to remove them before running the script on your machine.