I am an MS Information Systems - Data Analytics graduate currently working at the Rudolf Steiner Fellowship Foundation.
Hi, my name is Jan Mauritz Kortenkamp. I graduated with an MS in Information Systems - Data Analytics from Baruch College. My undergraduate degree is a Bachelor of Business Administration, specializing in Operations Management. Through my academic journey I have obtained a robust skillset in business process analysis and optimization, enabling me to streamline operations and enhance efficiency. Complementing my analytical expertise, I am proficient in several programming languages, which empowers me to leverage data-driven insights and develop innovative solutions. I am passionate about integrating technology and business strategy to drive impactful results. If you would like to know more, please see my full resume below!
The below paper is the final term project for the fall 2024 CIS 9665 - Applied Natural Language Processing course at Baruch College.
Using a dataset of categorized Amazon reviews hosted by the University of California San Diego Rady School of Management, our team created a binary classification model to predict customer review verification status. Given the increasing importance of online marketplaces, we sought to create a solution with a wide range of applications. Applying several natural language processing techniques, such as sentiment analysis and the classification of review phrasing and content, we built a Random Forest classification model with 86% accuracy. Since the model was created on a subset of the data, the preprocessing steps and model creation can be easily reproduced and applied in academic and commercial settings.
Team member names have been redacted for privacy considerations.
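For readers who want a quick sense of the approach, the sketch below shows a minimal version of the pipeline: text features feeding a Random Forest classifier. It is an illustration under assumed names, not the paper's exact pipeline; the file name and column names (reviews_subset.json, reviewText, verified) are placeholders.
#a minimal sketch of the modeling approach, assuming hypothetical file and column names
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#load a subset of labeled reviews (placeholder file name)
reviews = pd.read_json('reviews_subset.json', lines=True)
#turn the review text into TF-IDF features
X = TfidfVectorizer(max_features=5000, stop_words='english').fit_transform(reviews['reviewText'])
y = reviews['verified']
#hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))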
As an avid CitiBike rider, I created this project to examine CitiBike ridership statistics and key metrics around cycling in New York City. The main source of information is the public CitiBike archive of ridership statistics. Additional sources are the NYC public health data portal's air quality and commuting method data. The resulting Tableau Story can be found above, while the Python code used to mine the CitiBike archive is displayed below.
Inspired by this visualization created by Ben Oldenburg.
All operations were performed in a Google Colab environment.
To import and manage the data, the following libraries are used:
#data import and folder management libraries
import requests
import os
import shutil
import zipfile
import re
from google.colab import files
#data libraries
import pandas as pd
The following functions are created to import and unzip the archive data:
#import files stored in the CitiBike archive
def ImportFiles(year):
    try:
        url = f"https://s3.amazonaws.com/tripdata/20{year}-citibike-tripdata.zip"
        !wget $url
        print(f'Imported: {url}\n')
    except:
        print(f'Unable to import: {url}\n')
#unzip files into the local directory
def UnzipFiles(year):
    try:
        location = f"/content/20{year}-citibike-tripdata.zip"
        !unzip $location
        #archives from 2020 onward contain nested monthly zip files
        if year >= 20:
            for month in range(1, 13):
                #the CitiBike archive is inconsistent about leading zeros in some
                #filenames, so months are zero-padded explicitly here
                sublocation = f"/content/20{year}-citibike-tripdata/20{year}{month:02d}-citibike-tripdata.zip"
                !unzip $sublocation
                os.remove(sublocation)
    except:
        print(f'Unable to unzip: {location}\n')
#remove zip files from contents
def RemoveZips(year):
    try:
        location = f"/content/20{year}-citibike-tripdata.zip"
        os.remove(location)
        print(f'Removed: {location}\n')
    except:
        print(f'Unable to remove: {location}\n')
#remove incorrectly set up directories
#only needed if specific directories were created incorrectly
#not required for normal program flow
def RemoveDirectory(year):
    try:
        shutil.rmtree(f'/content/20{year}-citibike-tripdata')
        print("Directory deleted successfully.")
    except FileNotFoundError:
        print("Directory not found.")
The import functions are then called for specific year ranges: e.g., range(13, 25) will import the entire archive, while range(23, 24) will only import 2023 data. Depending on the intended analysis or computational limitations, importing a subset of the data might be preferable.
for i in range(13, 24):
    ImportFiles(i)
for i in range(13, 24):
    UnzipFiles(i)
for i in range(13, 24):
    RemoveZips(i)
To extract the total number of trips for any given year, the following function is created:
#this function takes a two-digit year as its argument and extracts the total trips by month for the given year
#due to changes and errors in file names and paths, several edge cases are included in the function
def MonthlyTrips(year):
    #for simplified loops and a clearer structure the following lists are created; "months" follows the CitiBike archive naming convention
    months = ['1_January', '2_February', '3_March', '4_April', '5_May', '6_June', '7_July', '8_August', '9_September', '10_October', '11_November', '12_December']
    monthNumber = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
    monthName = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    #empty list to store extracted data
    data = []
    #loop through each month folder, month number, and month name
    for month, month_num, month_name in zip(months, monthNumber, monthName):
        monthlyTotal = 0 #initialize monthly total to accumulate trip counts
        #load files for each month (with index j from 1 to 9)
        for j in range(1, 10):
            try:
                #first possible file path format
                file = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata_{j}.csv'
                #check which naming convention the file follows --> different conventions can exist within the same year
                if os.path.exists(file):
                    df = pd.read_csv(file)
                    monthlyTotal += len(df) #add the number of rows to the monthly total
                elif os.path.exists(f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'):
                    firstAlternative = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'
                    df = pd.read_csv(firstAlternative, low_memory=False) #low_memory=False to address different types in the same df column
                    monthlyTotal += len(df)
                elif os.path.exists(f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'):
                    secondAlternative = f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'
                    df = pd.read_csv(secondAlternative, low_memory=False)
                    monthlyTotal += len(df)
            except FileNotFoundError: #if no path format can be read, print a warning and continue to the next iteration
                print(f'Unable to load: {file}')
                continue
        #append year, month name, and total trips for each month, taking year from the function call
        data.append([year, month_name, monthlyTotal])
    #create a DataFrame with Year, Month, and Trips columns
    totals = pd.DataFrame(data, columns=['Year', 'Month', 'Trips'])
    #set the Year as the index
    totals.set_index('Year', inplace=True)
    #return the given year as a df
    return totals
Next, the function is called and the outputs are verified:
#create an empty dataframe to be appended to
tripsByYear = pd.DataFrame()
#loop through the given range of years
for i in range(13, 25):
    totalTrips = MonthlyTrips(i) #run the function with the year as the argument
    tripsByYear = pd.concat([tripsByYear, totalTrips])
#optional: add 2000 to the years in the index column since a 2-digit year was used in the function call
tripsByYear.index = tripsByYear.index + 2000
#verify results
for i in range(2012, 2025):
    print(tripsByYear[tripsByYear.index == i])
The following lines convert the extracted data into a .xlsx file and prompt the download:
tripsByYear.to_excel('NYC Total Monthly CitiBike Trips 2013-23.xlsx', index=True)
if input('Type y to confirm download: ') == 'y':
    files.download('NYC Total Monthly CitiBike Trips 2013-23.xlsx')
    print("File downloaded successfully!")
else:
    print("Export cancelled.")
To map the expansion of stations by year, the following function is created and called for each year in a given range. For every year, it loops through the existing September .csv files and extracts all stations with their latitude and longitude.
def UniqueStations(year):
    #month folder and month number for September
    month = '9_September'
    month_num = '09'
    #initialize a DataFrame to store data
    uniqueStations = pd.DataFrame()
    #loop through all possible files for the month of September (looping through 1 to 9 .csv files)
    for j in range(1, 10):
        try:
            #generate the first possible file path format
            file = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata_{j}.csv'
            #check if the file exists and read it
            if os.path.exists(file):
                df = pd.read_csv(file)
            elif os.path.exists(f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'):
                file = f'/content/20{year}-citibike-tripdata/{month}/20{year}{month_num}-citibike-tripdata.csv_{j}.csv'
                df = pd.read_csv(file, low_memory=False)
            elif os.path.exists(f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'):
                file = f'/content/20{year}{month_num}-citibike-tripdata_{j}.csv'
                df = pd.read_csv(file, low_memory=False)
            else:
                continue #if the file does not exist, move to the next iteration
            #the naming convention for columns was changed after 2020, so the kept columns need to be adjusted accordingly
            #keep only the necessary columns and use .copy() to avoid a SettingWithCopyWarning
            if year < 20:
                dfUnique = df[['start station name', 'start station latitude', 'start station longitude']].copy()
            else:
                dfUnique = df[['start_station_name', 'start_lat', 'start_lng']].copy()
            #assign the 'Year' column using .loc
            dfUnique.loc[:, 'Year'] = year + 2000 #add 2000 since only the last two digits are used as the function argument
            dfUnique.set_index('Year', inplace=True)
            #rename the columns
            dfUnique.columns = ['Station Name', 'Latitude', 'Longitude']
            #append the data to uniqueStations
            uniqueStations = pd.concat([uniqueStations, dfUnique])
        except FileNotFoundError:
            print(f'Unable to load: {file}')
            continue
    #drop duplicates based on 'Station Name', keeping the first instance of each station for the year
    uniqueStations = uniqueStations.drop_duplicates(subset=['Station Name'], keep='first')
    return uniqueStations
The function is then called and the station data verified:
stationsByYear = pd.DataFrame()
#call the function for a custom year range and add the results to a dataframe
for i in range(13, 24):
    yearData = UniqueStations(i)
    stationsByYear = pd.concat([stationsByYear, yearData])
#verify extracted data
for i in range(2013, 2024):
    year_data = stationsByYear[stationsByYear.index == i]
    if not year_data.empty: #check if the DataFrame is empty
        print(f"\n\nYear: {i}")
        print(year_data.value_counts())
    else:
        print(f"No data found for year: {i}")
stationsByYear.shape
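As an optional sanity check before building the Tableau map (not part of the original workflow), a minimal matplotlib sketch can preview the station locations for a single year:
#optional: quick scatter plot of station coordinates for one year
import matplotlib.pyplot as plt

sample = stationsByYear[stationsByYear.index == 2023]
plt.scatter(sample['Longitude'], sample['Latitude'], s=2)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('CitiBike stations, 2023')
plt.show()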
Prompt .xlsx export of data:
stationsByYear.to_excel('NYC Unique CitiBike Stations 2013-23.xlsx', index=True)
if input('Type y to confirm download: ') == 'y':
    files.download('NYC Unique CitiBike Stations 2013-23.xlsx')
    print("File downloaded successfully!")
else:
    print("Export cancelled.")
Since the functions that extract data for entire years can be computationally demanding and time consuming, investigating single months or years for specific features might be more convenient. The following code showcases extracting data for the top ten starting stations in the month of August 2023. Out of a total of close to four million trips, a sheet of more than 100,000 trips is exported, allowing for the investigation of specific trends or features. The code can easily be adapted to investigate other time periods or features of the dataset.
#create an empty dataframe, then add the data from all four files for August to it
august23 = pd.DataFrame()
for i in range(1, 5):
    df2 = pd.read_csv(f'/content/202308-citibike-tripdata_{i}.csv', low_memory=False)
    august23 = pd.concat([august23, df2])
#verify outputs
august23.sample(5)
august23.shape
august23['start_station_id'].value_counts()
Filter data by top ten starting stations:
top10Ids = august23['start_station_id'].value_counts().head(10).index
top10Stations = august23[august23['start_station_id'].isin(top10Ids)]
top10Stations
Export data to .xlsx:
top10Stations.to_excel('August 2023 Top 10 CitiBike Stations.xlsx', index=False)
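As one example of adapting the code to another feature, the sketch below estimates the average trip duration for the same month. The column names started_at and ended_at are assumed from the post-2020 CitiBike schema; adjust them if the files differ:
#assumed post-2020 column names: started_at, ended_at
august23['started_at'] = pd.to_datetime(august23['started_at'])
august23['ended_at'] = pd.to_datetime(august23['ended_at'])
#trip duration in minutes
august23['duration_min'] = (august23['ended_at'] - august23['started_at']).dt.total_seconds() / 60
print(august23['duration_min'].mean())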
This program can be used with a high-accuracy scale to count pills, pellets, capsules, etc. By averaging n user-entered weights of three items each and dividing the total weight of all items by the resulting average, the program calculates the item count to an accuracy of ±3 per 100. This is especially useful for larger quantities of small items.
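For example, if three samples of three pills weigh 1.51 g, 1.49 g, and 1.50 g, the average three-pill weight is 1.50 g; a total weight of 75 g then yields (75 / 1.50) × 3 = 150 pills.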
Front-end and downloadable .exe still pending.
#include <iostream>
#include <string>
#include <vector>
#include <limits>
using namespace std;
bool ContProg();
/*Precondition: User is prompted for choice to continue program.
Postcondition: true/false return depending on user input.*/
int WeightToCount();
/*Precondition: The weights of sampled pills and the total weight are entered.
Postcondition: Returns the count of pills.*/
string MedInput();
/*Precondition: User is prompted for a medication name.
Postcondition: Returns the medication name as a string.*/
bool ContProg() {
    string yesNoResponse;
    //Prompt user input.
    cin >> yesNoResponse;
    //Run while loop until correct input is given.
    while ((yesNoResponse != "yes") && (yesNoResponse != "no")) {
        cout << "Incorrect response, please enter \'yes\' or \'no\': ";
        cin >> yesNoResponse;
    }
    /* Remove any remaining newline characters that could cause
    issues with the input if the main function runs again.*/
    cin.ignore();
    //Return condition based on input.
    if (yesNoResponse == "yes") {
        return true;
    }
    else {
        return false;
    }
}
int WeightToCount() {
    int counter = 0;
    double input, threePillWeight = 0, pillCount, avgWeight,
        totalWeight;
    do {
        cout << "Enter weight of three pills: ";
        while (!(cin >> input)) {
            cout << "Invalid input. Please enter a decimal or integer value: ";
            cin.clear();
            cin.ignore(numeric_limits<streamsize>::max(), '\n');
        }
        threePillWeight += input;
        counter++;
        if (counter < 2) {
            cout << "You have entered " << counter
                << " weight. Type \'yes\' to enter another weight.\n";
        }
        else {
            cout << "You have entered " << counter
                << " weights. Type \'yes\' to enter another weight.\n";
        }
    } while (ContProg());
    cout << "Enter total weight of pills in bottle: ";
    cin >> totalWeight;
    avgWeight = threePillWeight / counter;
    pillCount = (totalWeight / avgWeight) * 3;
    //Round to the nearest whole pill before returning.
    return static_cast<int>(pillCount + 0.5);
}
string MedInput() {
    string input;
    cout << "Enter medication name: ";
    getline(cin, input);
    return input;
}
int main() {
    //Initialize empty vectors to store medication names and pill counts.
    vector<string> medName;
    vector<int> medCount;
    int count = 0;
    char end = 'n';
    do
    {
        medName.push_back(MedInput());
        medCount.push_back(WeightToCount());
        count++;
        if (count < 2) {
            cout << "You have entered " << count
                << " medication. Type \'yes\' to enter another medication.\n";
        }
        else {
            cout << "You have entered " << count
                << " medications. Type \'yes\' to enter another medication.\n";
        } //Correct input is evaluated by the ContProg() call in the while condition.
    } while (ContProg());
    for (size_t i = 0; i < medCount.size(); i++) {
        cout << medName[i] << " has " << medCount[i] <<
            " pills left.\n";
    }
    do {
        cout << "Type 'y' to terminate the program: \n";
        cin >> end;
    } while (end != 'y');
    cout << "Thank you for using this program!\n\n";
    return 0;
}
The below presentation is the final term project for the spring 2023 interdisciplinary BPL 5100 - Business Policy course at Baruch College.
With the goal of solving strategic issues for a specific US company, our team examined Pfizer Inc.'s 2023 corporate finances and product pipeline and recommended the acquisition of a smaller biotech firm. Given the extensive cash reserves resulting from COVID-19 vaccine revenues, as well as the pending expiration of patent protection for several high-revenue drugs, we recommended an investment in either Seagen Inc. or BeiGene Inc. for strategic positioning. Both firms had extensive portfolios with several oncology pharmaceuticals in the later development stages; as such, they offered to complement Pfizer's strategic position perfectly. In December of 2023, Pfizer completed the acquisition of Seagen Inc.
The below presentation is the final term project for the spring 2023 OPM 4500 - Cases in Operations Management capstone course at Baruch College.
The "Bottlenecks and batching in dragon fruit jam production" case published by the Singapore Management University, outlines the challenges faces by a small artisanal jam producer that has developed the need to produce at scale, while maintaining product differentiation factors and high quality. By reducing variability, batching and creating buffers, our team successfully identified removed the bottleneck and improved service quality overall.