This project implements, visualizes, and compares three common variants of the Gradient Descent (GD) optimization algorithm:
- Batch Gradient Descent (BGD)
- Stochastic Gradient Descent (SGD)
- Mini-Batch Gradient Descent (MBGD)
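
All three variants apply the same basic update rule and differ only in how much of the dataset is used to estimate the gradient at each step (shown here for reference, with $\eta$ the learning rate and $B$ the set of samples used for one update):

$$
\theta \leftarrow \theta - \eta \, \nabla_\theta J_B(\theta), \qquad J_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \ell\big(f_\theta(x_i),\, y_i\big)
$$

where $|B| = N$ (the full dataset) for BGD, $|B| = 1$ for SGD, and $1 < |B| < N$ for MBGD.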
The comparison is performed on two distinct optimization problems to highlight the strengths and weaknesses of each method under different conditions:
- 🎯 Problem 1: Simple Linear Regression
  - Goal: Find the best-fit line (`y = wx + b`) for a synthetically generated dataset with noise.
  - Nature: A convex optimization problem where the objective is to minimize the Mean Squared Error (MSE) loss function. This serves as a baseline for understanding the fundamental behavior of the optimizers on a well-behaved surface.
  - Parameters: The slope `w` and the y-intercept `b` of the line.
  - Expected Outcome: All methods should converge close to the true parameters used to generate the data. We'll analyze convergence speed (per epoch and per update), path stability, and final accuracy.
- ⛰️ Problem 2: Rosenbrock Function Optimization
  - Goal: Find the minimum of the challenging non-convex Rosenbrock function, often referred to as the "banana function" due to its curved valley shape.
  - Nature: A non-convex optimization problem famous for its narrow, parabolic valley leading to the minimum, which is located at `(w, b) = (1, 1)`. Optimizing this function tests the algorithms' ability to navigate difficult terrain with potentially poor gradient directions.
  - Parameters: The two input variables `w` and `b` of the Rosenbrock function.
  - Expected Outcome: Convergence towards `(1, 1)` can be slow, especially along the valley floor. We expect to observe characteristic behaviors such as zig-zagging, sensitivity to the learning rate, and differences in how well each method follows the curved valley.
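
For reference, the two objectives can be written as follows (using the conventional MSE definition, where $N$ is the number of samples in the regression dataset; the project may scale the loss slightly differently):

$$
J_{\mathrm{MSE}}(w, b) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - (w x_i + b)\big)^2, \qquad f_{\mathrm{Rosenbrock}}(w, b) = (1 - w)^2 + 100 \, (b - w^2)^2
$$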
The primary objective is to gain a visual and numerical understanding of the operational dynamics, convergence speed, computational efficiency, and optimization paths taken by each algorithm across these distinct scenarios.
The project unfolded through the following key stages:
- Synthetic Dataset Creation (Linear Regression): 📊
  - Generated `X` and `y` data based on the linear relationship `y = 2x + 5`, incorporating random noise to simulate real-world data.
  - Defined the cost function: Mean Squared Error (MSE).
  - Derived and implemented the function to calculate the MSE gradients with respect to the model parameters (`w` and `b`).
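
  A minimal sketch of this stage, assuming NumPy (the seed, sample count, and helper names such as `mse_gradients` are illustrative, not necessarily the project's actual choices):

  ```python
  import numpy as np

  rng = np.random.default_rng(42)              # illustrative seed
  X = rng.uniform(0, 10, size=200)             # 200 samples over an assumed range
  y = 2 * X + 5 + rng.normal(0, 1, size=200)   # y = 2x + 5 plus Gaussian noise

  def mse(w, b, X, y):
      """Mean Squared Error of the line y_hat = w*x + b."""
      return np.mean((y - (w * X + b)) ** 2)

  def mse_gradients(w, b, X, y):
      """Partial derivatives of the MSE with respect to w and b."""
      error = (w * X + b) - y
      dw = 2 * np.mean(error * X)   # dMSE/dw
      db = 2 * np.mean(error)       # dMSE/db
      return dw, db
  ```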
- Implementation of GD Variants (Linear Regression): ⚙️
  - BGD: Computed the gradient using the entire dataset in each epoch, performing one parameter update per epoch.
  - SGD: Computed the gradient using only one randomly selected data sample at each step, resulting in frequent updates (equal to the number of samples per epoch).
  - MBGD: Computed the gradient using a small batch (mini-batch) of data, updating parameters once per mini-batch.
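
  A condensed sketch of the three update schemes, reusing the `mse_gradients` helper from the Stage 1 sketch (the learning rate, epoch count, and batch size are assumed values):

  ```python
  import numpy as np
  # mse_gradients(w, b, X, y) is the helper sketched under Stage 1.

  def run_bgd(X, y, lr=0.01, epochs=100):
      # One gradient over the full dataset -> one update per epoch.
      w, b = 0.0, 0.0
      for _ in range(epochs):
          dw, db = mse_gradients(w, b, X, y)
          w, b = w - lr * dw, b - lr * db
      return w, b

  def run_sgd(X, y, lr=0.01, epochs=100):
      # One update per randomly selected sample -> len(X) updates per epoch.
      w, b = 0.0, 0.0
      for _ in range(epochs):
          for i in np.random.permutation(len(X)):
              dw, db = mse_gradients(w, b, X[i:i + 1], y[i:i + 1])
              w, b = w - lr * dw, b - lr * db
      return w, b

  def run_mbgd(X, y, lr=0.01, epochs=100, batch_size=16):
      # One update per mini-batch of `batch_size` samples.
      w, b = 0.0, 0.0
      for _ in range(epochs):
          idx = np.random.permutation(len(X))
          for start in range(0, len(X), batch_size):
              batch = idx[start:start + batch_size]
              dw, db = mse_gradients(w, b, X[batch], y[batch])
              w, b = w - lr * dw, b - lr * db
      return w, b
  ```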
- Analysis & Visualization (Linear Regression): 🔍📈
  - Metric Tracking: Recorded the cost function value (Loss) per epoch, the total number of parameter updates, and the trajectory of the weights (`w`, `b`) in the parameter space.
  - Execution Time Measurement: Timed the execution of each optimization algorithm. ⏱️
  - Plotting: Generated comparative plots:
    - Loss vs. Epoch (comparing convergence over time).
    - Loss vs. Number of Updates (comparing computational efficiency).
    - Loss function contour plot overlaid with the weight trajectories (visualizing the search path).
    - Final learned regression lines for each method alongside the true line and the original data points.
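
  As one example of these plots, a hedged sketch of the contour-plus-trajectory figure; it assumes the `mse` helper and data from the Stage 1 sketch, plus a recorded list of `(w, b)` pairs named `trajectory` (all illustrative names):

  ```python
  import numpy as np
  import matplotlib.pyplot as plt

  # Assumes `mse`, `X`, `y` from the Stage 1 sketch and a list of (w, b)
  # pairs named `trajectory`, appended to after every parameter update.
  w_grid, b_grid = np.meshgrid(np.linspace(0, 4, 100), np.linspace(3, 7, 100))
  loss_grid = np.array([[mse(wi, bi, X, y) for wi in w_grid[0]]
                        for bi in b_grid[:, 0]])

  plt.contour(w_grid, b_grid, loss_grid, levels=30)
  path = np.array(trajectory)
  plt.plot(path[:, 0], path[:, 1], marker=".", label="optimizer path")
  plt.xlabel("w")
  plt.ylabel("b")
  plt.legend()
  plt.title("Loss contours and weight trajectory")
  plt.show()
  ```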
- Introducing a New Challenge: The Rosenbrock Function: 🍌⛰️
  - Objective Function Definition: Switched the target to the famous Rosenbrock function `f(w, b) = (1 - w)^2 + 100 * (b - w^2)^2`, known for its challenging narrow, curved valley with a minimum at `(1, 1)`.
  - Rosenbrock Gradient Calculation: Derived and implemented the partial derivatives of the function with respect to `w` and `b`.
  - Algorithm Adaptation: Adjusted training parameters (significantly smaller learning rate, potentially more epochs) and adapted the SGD/MBGD loops to simulate frequent updates (as there's no data to sample from, the difference lies purely in update frequency).
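
  A minimal sketch of the function and its analytic gradient (helper names are illustrative):

  ```python
  def rosenbrock(w, b):
      """f(w, b) = (1 - w)^2 + 100 * (b - w^2)^2, with its minimum at (1, 1)."""
      return (1 - w) ** 2 + 100 * (b - w ** 2) ** 2

  def rosenbrock_gradients(w, b):
      """Partial derivatives of the Rosenbrock function w.r.t. w and b."""
      dw = -2 * (1 - w) - 400 * w * (b - w ** 2)
      db = 200 * (b - w ** 2)
      return dw, db
  ```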
- Execution & Analysis (Rosenbrock): 🔬📉
  - Re-ran the three optimization methods on the Rosenbrock function, starting from a defined initial point (e.g., `(-1.5, -1.0)`).
  - Tracked and compared metrics: Rosenbrock function value (instead of Loss), update counts, execution time, and parameter trajectories.
  - Generated similar plots:
    - Function value vs. Epoch.
    - Function value vs. Number of Updates.
    - Contour plot of the Rosenbrock function showing the optimization paths (highly illustrative for this function, revealing the valley-following challenge).
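
  For illustration, a plain gradient-descent loop on the Rosenbrock function from the starting point mentioned above, reusing the Stage 4 helpers; the learning rate and iteration count are assumed values and typically need tuning, and convergence toward `(1, 1)` remains slow:

  ```python
  # Plain gradient descent on the Rosenbrock function; reuses the
  # rosenbrock / rosenbrock_gradients helpers sketched under Stage 4.
  w, b = -1.5, -1.0             # starting point used in the write-up
  lr, iterations = 1e-3, 20000  # assumed values; progress along the valley is slow

  history = [(w, b)]
  for _ in range(iterations):
      dw, db = rosenbrock_gradients(w, b)
      w, b = w - lr * dw, b - lr * db
      history.append((w, b))

  print(f"final (w, b) = ({w:.4f}, {b:.4f}), f = {rosenbrock(w, b):.6f}")
  ```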
- Final Comparison & Conclusion: ✨
  - Summarized the performance of each algorithm on both problems.
  - Discussed the trade-offs of each method (speed, stability, learning rate sensitivity, convergence quality).
The code is organized into two scripts:
- `linear_regression_gd.py` (or similar): Contains the code for the linear regression problem (Stages 1-3 and the final line plot).
- `rosenbrock_optimization.py` (or similar): Contains the code dedicated to optimizing the Rosenbrock function (Stages 4-5).
Requirements:
- Python 3.x 🐍
- NumPy: `pip install numpy`
- Matplotlib: `pip install matplotlib`
To run the notebook:
- Open the `Mountaineers_in_Search_of_the_Global_Minimum.ipynb` notebook in Google Colab.
  - You can open it directly from GitHub (if hosted there) using the Colab integration, or download the `.ipynb` file and upload it via `File -> Upload notebook...`.
- Run the cells sequentially.
  - Use the `Runtime -> Run all` option, or execute each cell individually using `Shift + Enter`.
- The notebook includes explanatory text, code blocks, numerical outputs, and generated plots inline.
The scripts will print numerical results (final parameter values, execution times, update counts) to the console and display the comparative plots.
Expected results:
- Linear Regression: All three methods should converge close to the true parameters (`w = 2`, `b = 5`). SGD and MBGD typically show faster initial progress per epoch (due to more updates) but exhibit noisier trajectories. BGD follows a smoother path but can be slower per epoch on large datasets. Overall runtime comparisons vary with implementation details and dataset size. The final learned lines should be very close to the true line.
- Rosenbrock Function: This function highlights the optimizers' challenges. BGD may progress very slowly along the narrow valley or exhibit significant zig-zagging. SGD and MBGD (whose higher update frequency simulates the effect of stochasticity in other contexts) may navigate the curve slightly better, but all methods converge slowly towards `(1, 1)` and require careful tuning of the learning rate. The contour plot visualizes this behavior clearly.
This project provides a practical demonstration of the differences between various gradient descent flavors and how the characteristics of the objective function landscape influence their performance.
Author: Sayyed Hossein Hosseini DolatAbadi