Skip to content

This project uses BigQuery ML to build and evaluate a linear regression model for predicting New York City taxi fares based on trip data.

Notifications You must be signed in to change notification settings

larisanti/taxi-fare-ml

Repository files navigation

Taxi Fare Forecasting using BigQuery ML

BigQuery ML

This project is a lab exercise completed during the Machine Learning Engineer Learning Path course. It demonstrates how to use BigQuery ML to construct and assess a linear regression model for predicting taxi fares in New York City.

Utilizing the nyc-tlc.yellow.trips_2015 dataset, the project walks through a workflow that includes exploring the data, preparing it for training by selecting relevant features and performing feature engineering, building and evaluating an initial model, and then refining the model through techniques like data filtering and retraining to achieve improved prediction accuracy.

Workflow

1. Explore the dataset: Calculate the number of trips

Run the query trips_per_month.sql to calculate the number of trips Yellow taxis took each month in 2015 from the NYC Yellow taxi trip dataset. The query groups the data by month and presents the results in chronological order.


2. Explore the dataset: Calculate the average speed

Run the query speed_per_hour.sql to calculate the average speed of Yellow taxi trips for each hour of the day in 2015, considering only trips with valid distance, fare-to-distance ratios, and pickup/dropoff times.


3. Prepare the training data

This step (query: training_data.sql) prepares the training data for a taxi fare prediction model. It selects relevant features, performs feature engineering (calculating total_fare, extracting day of the week and hour of the day), filters out invalid data, and splits the data into training and evaluation sets using hashing.


4. Create and train the taxifare_model

CREATE or REPLACE MODEL taxi.taxifare_model
OPTIONS
  (model_type='linear_reg', labels=['total_fare']) AS -- specify the model type: linear regression

WITH params AS (
    SELECT
    1 AS TRAIN,
    2 AS EVAL
    ),

  daynames AS
    (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),

  taxitrips AS (
  SELECT
    (tolls_amount + fare_amount) AS total_fare,
    daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
    EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count AS passengers
  FROM
    `nyc-tlc.yellow.trips`, daynames, params
  WHERE
    trip_distance > 0 AND fare_amount > 0
    AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN
  )

  SELECT *
  FROM taxitrips

ss 3

(query in: taxifare_model.sql)


5. Evaluate the model using ML.EVALUATE

SELECT
  SQRT(mean_squared_error) AS rmse
FROM
  ML.EVALUATE(MODEL taxi.taxifare_model,
  (

  WITH params AS (
    SELECT
    1 AS TRAIN,
    2 AS EVAL
    ),

  daynames AS
    (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),

  taxitrips AS (
  SELECT
    (tolls_amount + fare_amount) AS total_fare,
    daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
    EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count AS passengers
  FROM
    `nyc-tlc.yellow.trips`, daynames, params
  WHERE
    trip_distance > 0 AND fare_amount > 0
    AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL
  )

  SELECT *
  FROM taxitrips

  ))

ss 4

(query in: evaluate_taxifare_model.sql)


6. Predict taxi fares

In this step (query predict_taxi_fare.sql), the taxi.taxifare_model is used to predict taxi fare amounts. The query applies the model to a subset of the nyc-tlc.yellow.trips dataset, generating predictions based on trip features such as pickup/dropoff locations, time of day, and passenger count.


7. Improve the model with Feature Engineering

This step focuses on improving the taxi fare prediction model by using feature engineering and data filtering techniques. The goal is to identify and select the most relevant features and data points for training a more accurate model. This is achieved by running a sequence of three queries:

  • 7.1. Explore initial fare statistics to understand the overall fare distribution
  • 7.2. Filter out-of-range fares to avoid learning on outliers
  • 7.3. Limit to NYC geographic coordinates to ensure that the model is trained on trips that are relevant to the target area (complete sequency of queries here: feature_engineering.sql)

8. Retrain the model and create taxi.taxifare_model_2

Next, this query taxifare_model_2.sql helps to improve the accuracy of the taxi fare prediction model by incorporating new features, filtering out irrelevant data, and retraining the model using a refined dataset.


9. Evaluate taxifare_model_2

SELECT
  SQRT(mean_squared_error) AS rmse
FROM
  ML.EVALUATE(MODEL taxi.taxifare_model_2,
  (

  WITH params AS (
    SELECT
    1 AS TRAIN,
    2 AS EVAL
    ),

  daynames AS
    (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),

  taxitrips AS (
  SELECT
    (tolls_amount + fare_amount) AS total_fare,
    daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
    EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
    SQRT(POW((pickup_longitude - dropoff_longitude),2) + POW(( pickup_latitude - dropoff_latitude), 2)) as dist, #Euclidean distance between pickup and drop off
    SQRT(POW((pickup_longitude - dropoff_longitude),2)) as longitude, #Euclidean distance between pickup and drop off in longitude
    SQRT(POW((pickup_latitude - dropoff_latitude), 2)) as latitude, #Euclidean distance between pickup and drop off in latitude
    passenger_count AS passengers
  FROM
    `nyc-tlc.yellow.trips`, daynames, params
WHERE trip_distance > 0 AND fare_amount BETWEEN 6 and 200
    AND pickup_longitude > -75 #limiting of the distance the taxis travel out
    AND pickup_longitude < -73
    AND dropoff_longitude > -75
    AND dropoff_longitude < -73
    AND pickup_latitude > 40
    AND pickup_latitude < 42
    AND dropoff_latitude > 40
    AND dropoff_latitude < 42
    AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL
  )

  SELECT *
  FROM taxitrips

  ))

ss 7

The RMSE decreased from approximately +-9.47 to +-5.12, indicating a significant gain in accuracy. Since the RMSE quantifies the typical magnitude of prediction errors, this reduction demonstrates that the retrained linear regression model is considerably more precise.

(query in: evaluate_taxifare_model_2.sql)


Prerequisites

  • A Google Cloud Project
  • Access to BigQuery

Dataset

This project utilizes the following public dataset:

  • bigquery-public-data.new_york.tlc_yellow_trips_2015

About

This project uses BigQuery ML to build and evaluate a linear regression model for predicting New York City taxi fares based on trip data.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published