Opportunity
Taxi drivers could increase their earnings by changing their strategy.
Questions to solve
- How much can a taxi driver increase their monthly earnings just by skipping trips under defined conditions?
- How much can a taxi driver increase their monthly earnings just by changing their initial zone and start time?
Business success criteria
Develop a strategy to increase NYC taxi drivers’ monthly earnings by 20%.
Project scope
This project is limited to Juno, Uber, Via and Lyft drivers working in New York City, and to trips that start and end in any zone of Manhattan, Brooklyn or Queens (the most active boroughs).
To answer those questions we are going to use the Cross-Industry Standard Process for Data Mining (CRISP-DM).
The articles on this portfolio website are organized according to its steps:
- Business Understanding
  - Business Understanding Overview
  - Defining Base Line
- Data Understanding
  - Defining Development Environment
  - Data Collection Process
  - Data Sampling
  - Initial Exploration
  - Expanding Geospatial Information
  - Exploring Transportation and Socioeconomic Patterns
- Data Preparation
  - (Pending)
- Modeling
  - (Pending)
- Evaluation
  - (Pending)
- Deployment
  - (Pending)
In this project, we will use a subset of the TLC Trip Record Data from 2022 to 2023 for High Volume For-Hire Vehicles (HVFHV), with the columns described in its data dictionary.
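As a rough illustration of how the project scope maps onto this data, the sketch below filters a few made-up trips down to the in-scope companies and boroughs. The `hvfhs_license_num` and `driver_pay` columns and the license codes come from the TLC data dictionary; the `pickup_borough` and `dropoff_borough` columns are hypothetical names, assuming the boroughs were already obtained by joining `PULocationID` and `DOLocationID` to the TLC taxi zone lookup table. It is not the code used in the project articles.

```r
library(dplyr)

# HVFHS license numbers as listed in the TLC data dictionary.
in_scope_companies <- c(
  HV0002 = "Juno",
  HV0003 = "Uber",
  HV0004 = "Via",
  HV0005 = "Lyft"
)

# Boroughs covered by the project scope.
in_scope_boroughs <- c("Manhattan", "Brooklyn", "Queens")

# Toy trips with made-up values, used only to show the filter.
# pickup_borough and dropoff_borough are hypothetical column names.
trips <- data.frame(
  hvfhs_license_num = c("HV0003", "HV0005", "HV0003", "HV0004"),
  pickup_borough    = c("Manhattan", "Queens", "Bronx", "Brooklyn"),
  dropoff_borough   = c("Brooklyn", "Queens", "Manhattan", "Staten Island"),
  driver_pay        = c(18.5, 12.0, 25.3, 30.1)
)

# Keep only trips by in-scope companies that start AND end in scope,
# then add a readable company name.
trips_in_scope <- trips |>
  filter(
    hvfhs_license_num %in% names(in_scope_companies),
    pickup_borough %in% in_scope_boroughs,
    dropoff_borough %in% in_scope_boroughs
  ) |>
  mutate(company = unname(in_scope_companies[hvfhs_license_num]))

print(trips_in_scope)
```

Only the first two toy trips survive the filter: the third starts in the Bronx and the fourth ends in Staten Island, so both fall outside the scope.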
(Pending)
Core Language: R
Key Ecosystems & Frameworks:
- Tidyverse: Heavily utilized, including core packages like:
  - dplyr (Data Manipulation)
  - ggplot2 (Data Visualization)
  - tidyr (Data Tidying)
  - readr (Data Import)
  - purrr (Functional Programming)
  - stringr (String Manipulation)
  - lubridate (Dates/Times)
  - forcats (Factor Handling)
  - tibble (Modern Data Frames)
- Geospatial Analysis: Extensive use of spatial packages:
  - sf (Simple Features - Modern standard for spatial data)
  - leaflet & tmap (Interactive and static thematic mapping)
  - terra & raster (Raster data processing)
  - osmdata (OpenStreetMap data access)
  - Supporting spatial packages (sp, lwgeom, s2, units, proj4, wk, etc.)
- Modeling & Preprocessing:
  - recipes (Data preprocessing pipelines for modeling)
  - rpart (Decision Trees)
  - Potentially others depending on usage (MASS, nnet, e1071, ipred, correlationfunnel)
  - broom (Tidying model outputs)
  - infer (Statistical inference)
- Data Handling & Access:
  - data.table (High-performance data manipulation)
  - DBI & duckdb (Database connectivity and in-process analytics database)
  - httr, httr2, curl, rvest (Web data access and scraping)
  - fst, qs2 (Fast data serialization)
  - vroom, readxl (Data import)
- Reporting, Visualization & Apps:
  - rmarkdown & knitr (Report generation)
  - shiny (Interactive web applications)
  - plotly (Interactive plots)
  - Various HTML widget-based visualization packages (DiagrammeR, networkD3, visNetwork, ggiraph)
- Workflow & Parallel Processing:
  - renv (Project environment management)
  - here (Project path management)
  - future, future.apply, parallelly (Parallel and asynchronous processing)
This project relies on strong assumptions because the data used for the analysis does not include a unique identifier for taxi drivers, which would have helped us deliver more realistic results.
In addition, although this project aims to increase taxi driver earnings, applying its strategy at scale could also produce the following side effects:
- Reduced service quality: Drivers focusing solely on maximizing earnings may avoid less profitable areas or times, potentially leaving some passengers underserved.
- Increased congestion: Drivers congregating in high-profit areas could worsen traffic in already busy parts of the city.
In conclusion, this project was created to demonstrate my abilities as a Data Scientist, but it should not be implemented because of these considerations.