Citibike Data Mining and Visualization


As an avid CitiBike rider, I created this project to examine CitiBike ridership statistics and key metrics around cycling in New York City. The main source of information is the public CitiBike archive of ridership statistics; additional sources are the NYC public health data portal's air quality and commuting method data.
The resulting Tableau Story can be found above, while the Python code used to mine the CitiBike archive is displayed below.

Inspired by this visualization created by Ben Oldenburg.

All operations were performed in a Google Colab environment.

To import and manage the data, the following libraries are used:


#data import and folder management libraries (requests also covers URL error handling)
import requests
import os
import shutil
import zipfile
import re
from google.colab import files

#data libraries
import pandas as pd
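Since the archive follows a predictable URL pattern, a small helper (hypothetical, not part of the notebook) can build the download URL for a given two-digit year:

```python
def archive_url(year):
	"""Build the S3 URL of the yearly CitiBike archive from a two-digit year."""
	return f"https://s3.amazonaws.com/tripdata/20{year:02d}-citibike-tripdata.zip"

print(archive_url(13))  # https://s3.amazonaws.com/tripdata/2013-citibike-tripdata.zip
```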

Data Import into Local Directory

The following functions are created to import and unzip the archive data:


#import files stored in the CitiBike archive
def ImportFiles(year):
	url = f"https://s3.amazonaws.com/tripdata/20{year}-citibike-tripdata.zip"
	try:
		!wget $url
		print(f'Imported: {url}\n')
	except Exception:
		print(f'Unable to import: {url}\n')

#unzip files into the local directory
def UnzipFiles(year):
	location = f"/content/20{year}-citibike-tripdata.zip"
	try:
		!unzip $location
		#from 2020 onward the yearly archive contains nested monthly zips
		if year >= 20:
			for month in range(1, 13):
				#the CitiBike archive omits the leading 0 for some months without a clear pattern,
				#so different filenames need to be accounted for
				if month < 10:
					sublocation = f"/content/20{year}-citibike-tripdata/20{year}0{month}-citibike-tripdata.zip"
				else:
					sublocation = f"/content/20{year}-citibike-tripdata/20{year}{month}-citibike-tripdata.zip"
				!unzip $sublocation
				os.remove(sublocation)
	except Exception:
		print(f'Unable to unzip: {location}\n')

#remove zip files from contents
def RemoveZips(year):
	location = f"/content/20{year}-citibike-tripdata.zip"
	try:
		os.remove(location)
		print(f'Removed: {location}\n')
	except OSError:
		print(f'Unable to remove: {location}\n')

#remove incorrectly set up directories 
#only needed if specific directories were created incorrectly
#not required for normal program flow
def RemoveDirectory(year):
	try:
		shutil.rmtree(f'/content/20{year}-citibike-tripdata')
		print("Directory deleted successfully.")
	except FileNotFoundError:
		print("Directory not found.")
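Outside Colab, where the !wget and !unzip shell magics are unavailable, the extraction step can be done in pure Python with the already-imported zipfile module. A sketch, demonstrated on a small in-memory archive rather than a real download (the filename is illustrative):

```python
import io
import os
import tempfile
import zipfile

def unzip_to(zip_source, dest_dir):
	"""Extract a zip archive (path or file-like object) into dest_dir, replacing !unzip."""
	with zipfile.ZipFile(zip_source) as zf:
		zf.extractall(dest_dir)

#build a tiny in-memory archive to demonstrate the helper
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
	zf.writestr('201309-citibike-tripdata.csv', 'tripduration,starttime\n')
buf.seek(0)

dest = tempfile.mkdtemp()
unzip_to(buf, dest)
print(os.listdir(dest))  # ['201309-citibike-tripdata.csv']
```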


The import functions are then called for specific year ranges:
E.g., range(13, 25) will import the entire archive, while range(23, 24) will only import 2023 data. Depending on the intended analysis or computational limitations, importing a subset of the data might be preferable.


for i in range(13,24):
	ImportFiles(i)

for i in range(13,24):
	UnzipFiles(i)

for i in range(13,24):
	RemoveZips(i)

Extraction of Monthly Trips by Year

To extract the total number of trips for any given year, the following function is created:


#this function takes a two digit year in the function call and extracts the total trips by month for the given year
#due to changes and errors in file names and paths, several edge cases are included in the function
def MonthlyTrips(year):
	#for simplified loops and clearer structure the following lists are created, "months" takes into account the CitiBike Archive naming convention  
	months = ['1_January', '2_February', '3_March', '4_April', '5_May', '6_June', '7_July', '8_August', '9_September', '10_October', '11_November', '12_December']
	monthNumber = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
	monthName = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

	#empty list to store extracted data
	data = []

	#loop through each month, month number, and month name
	for month, month_num, month_name in zip(months, monthNumber, monthName):
		monthlyTotal = 0  #initialize monthly total to accumulate trip counts

		#load files for each month (with index j from 1 to 9)
		for j in range(1, 10):
			#first possible file path format
			file = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata_{j}.csv'

			try:
				#check if the file exists in the directory --> different naming conventions can exist within the same year
				if os.path.exists(file):
					df = pd.read_csv(file)
					monthlyTotal += len(df)  #add the number of rows to the monthly total
				elif os.path.exists(f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'):
					firstAlternative = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'
					df = pd.read_csv(firstAlternative, low_memory=False)  #low_memory=False to address mixed types in the same column
					monthlyTotal += len(df)
				elif os.path.exists(f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'):
					secondAlternative = f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'
					df = pd.read_csv(secondAlternative, low_memory=False)
					monthlyTotal += len(df)
			except FileNotFoundError:  #if a file vanishes between the existence check and the read, warn and continue
				print(f'Unable to load: {file}')
				continue
		#append year, month name, and total trips for each month, taking year from function call
		data.append([year, month_name, monthlyTotal])

	#create a DataFrame with Year, Month, and Trips columns
	totals = pd.DataFrame(data, columns=['Year', 'Month', 'Trips'])

	#set the Year as the index
	totals.set_index('Year', inplace=True)

	#return given year as df
	return totals
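The elif chain over candidate paths can also be factored into a small helper (a sketch, not used by the notebook) that returns the first existing path, which makes adding further naming conventions easier:

```python
import os
import tempfile

def first_existing(paths):
	"""Return the first path in `paths` that exists on disk, or None."""
	for p in paths:
		if os.path.exists(p):
			return p
	return None

#demonstrate with one real temporary file and one made-up path
with tempfile.NamedTemporaryFile(delete=False) as tmp:
	real_path = tmp.name

print(first_existing(['/content/does-not-exist.csv', real_path]) == real_path)  # True
print(first_existing(['/content/does-not-exist.csv']))  # None
```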

Next the function is called and the outputs are verified:


#create empty dataframe to be appended
tripsByYear = pd.DataFrame()

#loop through the given range of years (matching the years imported above)
for i in range(13, 24):
	totalTrips = MonthlyTrips(i)  #run function with the two-digit year as the argument
	tripsByYear = pd.concat([tripsByYear, totalTrips])

#optional: add 2000 to the years in the index since two-digit years were used in the function call
tripsByYear.index = tripsByYear.index + 2000

#verify results
for i in range(2013, 2024):
	print(tripsByYear[tripsByYear.index == i])
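The concatenation and index shift above can be illustrated on toy frames shaped like the MonthlyTrips output (the trip numbers are made up):

```python
import pandas as pd

#toy per-year frames mirroring the MonthlyTrips output (two-digit Year index)
y13 = pd.DataFrame({'Month': ['January'], 'Trips': [100]},
                   index=pd.Index([13], name='Year'))
y14 = pd.DataFrame({'Month': ['January'], 'Trips': [120]},
                   index=pd.Index([14], name='Year'))

combined = pd.concat([y13, y14])
combined.index = combined.index + 2000  #two-digit years -> four-digit
print(list(combined.index))  # [2013, 2014]
```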

The following lines convert the extracted data into a .xlsx file and prompt the download:


tripsByYear.to_excel('NYC Total Monthly Citibike Trips 2013-23.xlsx', index=True)
if input('Type y to confirm download: ') == 'y':
	files.download('NYC Total Monthly Citibike Trips 2013-23.xlsx')
	print("File downloaded successfully!")
else:
	print("Export cancelled.")

Extract Station Locations by Year

To map the expansion of the station network by year, the following function is created and called for a given year. It loops through that year's September .csv files (a single representative month keeps the extraction manageable) and extracts all stations with their latitude and longitude.


def UniqueStations(year):
	#month and month number for September
	month = '9_September'
	month_num = '09'

	#initialize a DataFrame to store data
	uniqueStations = pd.DataFrame()

	#loop through all possible files for the month of September (looping through 1 to 9 .csv files)
	for j in range(1, 10):
		try:
			#generate first possible file path format
			file = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata_{j}.csv'

			#check if the file exists and read it
			if os.path.exists(file):
				df = pd.read_csv(file)
			elif os.path.exists(f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'):
				file = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'
				df = pd.read_csv(file, low_memory=False)
			elif os.path.exists(f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'):
				file = f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'
				df = pd.read_csv(file, low_memory=False)
			else:
				continue  #if the file does not exist, move to the next iteration

			#the column naming convention was changed after 2020, so the kept columns need to be adjusted accordingly
			#keep only the necessary columns and use .copy() to avoid a SettingWithCopyWarning
			if year < 20:
				dfUnique = df[['start station name', 'start station latitude', 'start station longitude']].copy()
			else:
				dfUnique = df[['start_station_name', 'start_lat', 'start_lng']].copy()

			#assign the 'Year' column using .loc; add 2000 since only the last two digits are used as the function argument
			dfUnique.loc[:, 'Year'] = year + 2000
			dfUnique.set_index('Year', inplace=True)

			#rename the columns
			dfUnique.columns = ['Station Name', 'Latitude', 'Longitude']

			#append the data to uniqueStations
			uniqueStations = pd.concat([uniqueStations, dfUnique])

		except FileNotFoundError:
			print(f'Unable to load: {file}')
			continue

	#drop duplicates based on 'Station Name', keeping the first instance of each station for the year
	uniqueStations = uniqueStations.drop_duplicates(subset=['Station Name'], keep='first')

	return uniqueStations
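The effect of drop_duplicates with keep='first' can be seen on a toy station frame (the coordinates are illustrative):

```python
import pandas as pd

stations = pd.DataFrame({
    'Station Name': ['W 52 St & 11 Ave', 'W 52 St & 11 Ave', 'E 17 St & Broadway'],
    'Latitude': [40.7673, 40.7673, 40.7362],
    'Longitude': [-73.9939, -73.9939, -73.9901],
})

#each station survives once, keeping its first-seen coordinates
unique = stations.drop_duplicates(subset=['Station Name'], keep='first')
print(len(unique))  # 2
```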

The function is then called and the station data verified:


stationsByYear = pd.DataFrame()

#call function for a custom year range and add to a dataframe
for i in range(13, 24):
	yearData = UniqueStations(i)
	stationsByYear = pd.concat([stationsByYear, yearData])

#verify extracted data
for i in range(2013, 2024):
	year_data = stationsByYear[stationsByYear.index == i]
	if not year_data.empty:  #check if the DataFrame is empty
		print(f"\n\nYear: {i}")
		print(year_data.value_counts())
	else:
		print(f"No data found for year: {i}")

stationsByYear.shape

Prompt .xlsx export of data:


stationsByYear.to_excel('NYC Unique Citibike Stations 2013-23.xlsx', index=True)

if input('Type y to confirm download: ') == 'y':
	files.download('NYC Unique Citibike Stations 2013-23.xlsx')
	print("File downloaded successfully!")
else:
	print("Export cancelled.")

Analysis of a Single Month

Since extracting data for entire years can be computationally demanding and time consuming, investigating single months or years for specific features might be more convenient. The following code extracts the data for the top ten starting stations in August 2023. Out of close to four million total trips, a sheet of more than 100,000 trips is exported, allowing specific trends or features to be investigated. The code can easily be adapted to other time periods or features of the dataset.


#initialize an empty dataframe, then add the data from all four August files
august23 = pd.DataFrame()
for i in range(1, 5):
	df2 = pd.read_csv(f'/content/202308-citibike-tripdata_{i}.csv', low_memory=False)
	august23 = pd.concat([august23, df2])

#verify outputs
august23.sample(5)

august23.shape

august23['start_station_id'].value_counts()

Filter data by top ten starting stations:


top10Ids = august23['start_station_id'].value_counts().head(10).index
top10Stations = august23[august23['start_station_id'].isin(top10Ids)]
top10Stations
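The same value_counts/isin pattern scales down to a toy example (the station ids are made up):

```python
import pandas as pd

trips = pd.DataFrame({'start_station_id': ['A', 'A', 'B', 'C', 'A', 'B']})

#two most frequent start stations, then all trips that started there
top2_ids = trips['start_station_id'].value_counts().head(2).index
top2_trips = trips[trips['start_station_id'].isin(top2_ids)]
print(len(top2_trips))  # 5
```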

Export data to .xlsx:


top10Stations.to_excel('August 2023 Top 10 CitiBike Stations.xlsx', index=False)