This project is a lab exercise completed during the Machine Learning Engineer Learning Path course. It demonstrates how to use BigQuery ML to construct and assess a linear regression model for predicting taxi fares in New York City.
Utilizing the nyc-tlc.yellow.trips_2015 dataset
, the project walks through a workflow that includes exploring the data, preparing it for training by selecting relevant features and performing feature engineering, building and evaluating an initial model, and then refining the model through techniques like data filtering and retraining to achieve improved prediction accuracy.
Run the query trips_per_month.sql
to calculate the number of trips Yellow taxis took each month in 2015 from the NYC Yellow taxi trip dataset. The query groups the data by month and presents the results in chronological order.
Run the query speed_per_hour.sql
to calculate the average speed of Yellow taxi trips for each hour of the day in 2015, considering only trips with valid distance, fare-to-distance ratios, and pickup/dropoff times.
This step (query: training_data.sql
) prepares the training data for a taxi fare prediction model. It selects relevant features, performs feature engineering (calculating total_fare
, extracting day of the week and hour of the day), filters out invalid data, and splits the data into training and evaluation sets using hashing.
CREATE or REPLACE MODEL taxi.taxifare_model
OPTIONS
(model_type='linear_reg', labels=['total_fare']) AS -- specify the model type: linear regression
WITH params AS (
SELECT
1 AS TRAIN,
2 AS EVAL
),
daynames AS
(SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS total_fare,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
passenger_count AS passengers
FROM
`nyc-tlc.yellow.trips`, daynames, params
WHERE
trip_distance > 0 AND fare_amount > 0
AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN
)
SELECT *
FROM taxitrips
(query in: taxifare_model.sql
)
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL taxi.taxifare_model,
(
WITH params AS (
SELECT
1 AS TRAIN,
2 AS EVAL
),
daynames AS
(SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS total_fare,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
passenger_count AS passengers
FROM
`nyc-tlc.yellow.trips`, daynames, params
WHERE
trip_distance > 0 AND fare_amount > 0
AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL
)
SELECT *
FROM taxitrips
))
(query in: evaluate_taxifare_model.sql
)
In this step (query predict_taxi_fare.sql
), the taxi.taxifare_model is used to predict taxi fare amounts. The query applies the model to a subset of the nyc-tlc.yellow.trips
dataset, generating predictions based on trip features such as pickup/dropoff locations, time of day, and passenger count.
This step focuses on improving the taxi fare prediction model by using feature engineering and data filtering techniques. The goal is to identify and select the most relevant features and data points for training a more accurate model. This is achieved by running a sequence of three queries:
- 7.1. Explore initial fare statistics to understand the overall fare distribution
- 7.2. Filter out-of-range fares to avoid learning on outliers
- 7.3. Limit to NYC geographic coordinates to ensure that the model is trained on trips that are relevant to the target area
(complete sequency of queries here:
feature_engineering.sql
)
Next, this query taxifare_model_2.sql
helps to improve the accuracy of the taxi fare prediction model by incorporating new features, filtering out irrelevant data, and retraining the model using a refined dataset.
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL taxi.taxifare_model_2,
(
WITH params AS (
SELECT
1 AS TRAIN,
2 AS EVAL
),
daynames AS
(SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS total_fare,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
SQRT(POW((pickup_longitude - dropoff_longitude),2) + POW(( pickup_latitude - dropoff_latitude), 2)) as dist, #Euclidean distance between pickup and drop off
SQRT(POW((pickup_longitude - dropoff_longitude),2)) as longitude, #Euclidean distance between pickup and drop off in longitude
SQRT(POW((pickup_latitude - dropoff_latitude), 2)) as latitude, #Euclidean distance between pickup and drop off in latitude
passenger_count AS passengers
FROM
`nyc-tlc.yellow.trips`, daynames, params
WHERE trip_distance > 0 AND fare_amount BETWEEN 6 and 200
AND pickup_longitude > -75 #limiting of the distance the taxis travel out
AND pickup_longitude < -73
AND dropoff_longitude > -75
AND dropoff_longitude < -73
AND pickup_latitude > 40
AND pickup_latitude < 42
AND dropoff_latitude > 40
AND dropoff_latitude < 42
AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL
)
SELECT *
FROM taxitrips
))
The RMSE decreased from approximately +-9.47 to +-5.12, indicating a significant gain in accuracy. Since the RMSE quantifies the typical magnitude of prediction errors, this reduction demonstrates that the retrained linear regression model is considerably more precise.
(query in: evaluate_taxifare_model_2.sql
)
- A Google Cloud Project
- Access to BigQuery
This project utilizes the following public dataset:
bigquery-public-data.new_york.tlc_yellow_trips_2015