- 📌 Project Overview
- 📁 Step 1. Dataset & Table Structure
- 🧹 Step 2. Data Cleaning in SQLite
- ⏱ Step 3. Delay & Lag Analysis
- 💸 Step 4. Profitability & Margin Diagnostics
- 🏭 Step 5. Process Bottlenecks & Delay Causes
- 📈 Step 6. Advanced Operational Metrics
- ✅ Step 7. Conclusion & Process Recommendations
This project presents a comprehensive business process optimization and operational performance analysis using Python and SQLite. It dives deep into:
- End-to-end delay analysis across the fulfillment pipeline (from procurement to delivery)
- Diagnostics of fulfillment lags at each process stage
- Identification of bottlenecks by product, confirming delays are systemic rather than isolated to specific warehouses or branches
- Order-level profitability analysis to test whether fulfillment delays impact gross margins
- Downstream performance indicators for business process improvement
All analysis is performed using Python and SQLite. Visuals are built with matplotlib, seaborn, and plotly. The dataset represents three years (2018–2020) of transactional data.
We simulate an SQL table named sales_data with the following structure:
CREATE TABLE sales_data (
InvoiceNumber TEXT,
SalesChannel TEXT,
WarehouseLocationCode TEXT,
ProcurementDate TEXT,
SalesOrderDate TEXT,
DispatchDate TEXT,
DeliveryDate TEXT,
Currency TEXT,
SalesRepID INTEGER,
CustomerID INTEGER,
BranchCode TEXT,
ProductCode INTEGER,
QuantityOrdered INTEGER,
DiscountRate REAL,
UnitSalePrice REAL,
UnitProductionCost REAL
);
Each row captures the full cycle of an order — from procurement to dispatch to delivery — including costs, quantities, and prices.
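For readers following along, the schema above can be materialized from Python via the standard-library sqlite3 module. This is only a sketch: the in-memory database and the sample row are made up for illustration, not part of the project's actual data.

```python
import sqlite3

# In-memory database for illustration; the project would use a file on disk
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales_data (
    InvoiceNumber TEXT,
    SalesChannel TEXT,
    WarehouseLocationCode TEXT,
    ProcurementDate TEXT,
    SalesOrderDate TEXT,
    DispatchDate TEXT,
    DeliveryDate TEXT,
    Currency TEXT,
    SalesRepID INTEGER,
    CustomerID INTEGER,
    BranchCode TEXT,
    ProductCode INTEGER,
    QuantityOrdered INTEGER,
    DiscountRate REAL,
    UnitSalePrice REAL,
    UnitProductionCost REAL
)
""")

# One made-up row to confirm the schema accepts a full order record
conn.execute(
    "INSERT INTO sales_data VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
    ("INV-0001", "Online", "WH-A", "2018-01-05", "2018-04-20",
     "2018-05-02", "2018-05-07", "IRR", 12, 501, "BR-01",
     1032, 3, 0.10, 250.0, 160.0),
)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM sales_data").fetchone()[0]
print(row_count)
```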
All cleaning is executed directly within SQLite to ensure data integrity before the table is loaded into Python for analysis.
-- 1. Drop rows missing key identifiers
DELETE FROM sales_data
WHERE
    InvoiceNumber IS NULL OR
    SalesOrderDate IS NULL OR
    CustomerID IS NULL OR
    ProductCode IS NULL;

-- 2. Drop rows with invalid prices or quantities
DELETE FROM sales_data
WHERE
    UnitSalePrice IS NULL OR UnitSalePrice <= 0 OR
    QuantityOrdered IS NULL OR QuantityOrdered <= 0;

-- 3. Keep a single currency (IRR) so amounts are comparable
DELETE FROM sales_data
WHERE Currency != 'IRR';

-- 4. Restrict to the 2018-2020 analysis window
DELETE FROM sales_data
WHERE
    SalesOrderDate < '2018-01-01' OR
    SalesOrderDate > '2020-12-31';

-- 5. Trim stray whitespace from code fields
UPDATE sales_data
SET
    InvoiceNumber = TRIM(InvoiceNumber),
    SalesChannel = TRIM(SalesChannel),
    WarehouseLocationCode = TRIM(WarehouseLocationCode),
    BranchCode = TRIM(BranchCode);

-- 6. De-duplicate: keep the first row per invoice/customer/product
DELETE FROM sales_data
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM sales_data
    GROUP BY InvoiceNumber, CustomerID, ProductCode
);

-- 7. Snapshot the cleaned result into a new table
CREATE TABLE sales_data_cleaned AS
SELECT * FROM sales_data;
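The cleaned table then needs to reach Python as sales_data_cleaned.csv (the file read in Step 3). One plausible bridge, sketched here on a tiny in-memory table with made-up columns, is pandas.read_sql_query followed by to_csv:

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in for the project's SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data_cleaned (InvoiceNumber TEXT, SalesOrderDate TEXT)")
conn.executemany(
    "INSERT INTO sales_data_cleaned VALUES (?, ?)",
    [("INV-0001", "2018-04-20"), ("INV-0002", "2019-07-11")],
)
conn.commit()

# Pull the cleaned table into pandas and export the CSV consumed in Step 3
df = pd.read_sql_query("SELECT * FROM sales_data_cleaned", conn)
df.to_csv("sales_data_cleaned.csv", index=False)
print(df.shape)
```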
Now that the cleaned dataset has been loaded into Python, we begin by calculating the key time intervals across the order lifecycle:
- Time between procurement and sales order
- Time between sales order and dispatch
- Time between dispatch and delivery
- Full fulfillment time (procurement → delivery)
These metrics help identify inefficiencies, bottlenecks, and inconsistencies across products, branches, and warehouses.
import pandas as pd
# Load cleaned data into a DataFrame
df = pd.read_csv("sales_data_cleaned.csv")
# Convert dates
date_cols = ['ProcurementDate', 'SalesOrderDate', 'DispatchDate', 'DeliveryDate']
for col in date_cols:
df[col] = pd.to_datetime(df[col], errors='coerce')
df['Procurement_to_Order'] = (df['SalesOrderDate'] - df['ProcurementDate']).dt.days
df['Order_to_Dispatch'] = (df['DispatchDate'] - df['SalesOrderDate']).dt.days
df['Dispatch_to_Delivery'] = (df['DeliveryDate'] - df['DispatchDate']).dt.days
df['Total_Fulfillment_Time'] = (df['DeliveryDate'] - df['ProcurementDate']).dt.days
df[['Procurement_to_Order', 'Order_to_Dispatch', 'Dispatch_to_Delivery', 'Total_Fulfillment_Time']].describe()
| | Procurement_to_Order | Order_to_Dispatch | Dispatch_to_Delivery | Total_Fulfillment_Time |
|---|---|---|---|---|
| count | 7991.000000 | 7991.000000 | 7991.000000 | 7991.000000 |
| mean | 109.243149 | 15.168940 | 5.503942 | 129.916031 |
| std | 34.371490 | 7.793951 | 2.855809 | 35.297567 |
| min | 31.000000 | 2.000000 | 1.000000 | 40.000000 |
| 25% | 83.000000 | 8.000000 | 3.000000 | 103.000000 |
| 50% | 108.000000 | 15.000000 | 5.000000 | 129.000000 |
| 75% | 135.000000 | 22.000000 | 8.000000 | 156.000000 |
| max | 189.000000 | 28.000000 | 10.000000 | 219.000000 |
Our lag analysis revealed that the biggest bottleneck in the business process is the time between procurement and sales order — a whopping 109 days on average! This long lead time dwarfs the delays in order fulfillment and delivery, which are both relatively quick (just 15 days and 5 days, respectively).
Understanding the overall spread and range of each lag metric is essential to see if delays are normal, clustered, or if there are frequent outliers.
Here, we use histograms to quickly spot if most orders cluster around a typical value or if there’s a long “tail” of unusually slow steps.
import matplotlib.pyplot as plt
import seaborn as sns
lag_columns = ['Procurement_to_Order', 'Order_to_Dispatch', 'Dispatch_to_Delivery', 'Total_Fulfillment_Time']
plt.figure(figsize=(16, 8))
for i, col in enumerate(lag_columns, 1):
plt.subplot(2, 2, i)
sns.histplot(df[col], kde=True, bins=30, color='skyblue')
plt.title(f'Distribution of {col.replace("_", " ")}')
plt.xlabel('Days')
plt.tight_layout()
plt.show()
Most of the process is quite predictable—there are no strong outliers or extreme delays visible in the histograms. However, the main opportunity for improvement is in the procurement-to-order lag, which dominates the total fulfillment time.
We’ll next use boxplots to further confirm if any individual extreme outliers are present.
Boxplots provide a quick way to spot outliers and compare the overall spread of each process step. This helps us see if a few rare cases are driving up the average, or if the delays are common across most orders.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(14, 5))
sns.boxplot(data=df[lag_columns], orient='h', palette='pastel')
plt.title('Boxplot of Fulfillment Lag Metrics')
plt.xlabel('Days')
plt.show()
- Procurement to Order and Total Fulfillment Time both show a wide spread but no extreme outliers. This means the long delays are common across the majority of orders, not caused by a few rare cases.
- Order to Dispatch and Dispatch to Delivery are tightly clustered, confirming that once an order is placed, fulfillment and delivery are fast and consistent.
- No individual lag step shows significant outliers—delays are a general, process-wide feature rather than isolated incidents.
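The visual "no significant outliers" reading can be quantified with the same 1.5×IQR rule a boxplot's whiskers use. The sketch below runs on synthetic lag values (drawn to resemble the Procurement_to_Order summary statistics); in the project it would run on each column of df[lag_columns].

```python
import numpy as np
import pandas as pd

# Synthetic lag values resembling Procurement_to_Order (mean ~109, std ~34)
rng = np.random.default_rng(0)
lags = pd.Series(rng.normal(109, 34, 500).round())

# Count points beyond the boxplot whiskers (1.5 * IQR past the quartiles)
q1, q3 = lags.quantile(0.25), lags.quantile(0.75)
iqr = q3 - q1
outliers = lags[(lags < q1 - 1.5 * iqr) | (lags > q3 + 1.5 * iqr)]
print(f"Outliers beyond 1.5*IQR: {len(outliers)} of {len(lags)}")
```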
Having identified long fulfillment delays (Step 3), our next question is:
Are these delays actually hurting our margins, or is profitability stable across the business?
We calculate gross profit and margin for each order:
df['Revenue'] = df['UnitSalePrice'] * df['QuantityOrdered']
df['ProductionCost'] = df['UnitProductionCost'] * df['QuantityOrdered']
df['GrossProfit'] = df['Revenue'] - df['ProductionCost']
df['GrossMarginPercent'] = (df['GrossProfit'] / df['Revenue']).replace([float('inf'), -float('inf')], 0) * 100
df[['Revenue', 'ProductionCost', 'GrossProfit', 'GrossMarginPercent']].describe()
| | Revenue | ProductionCost | GrossProfit | GrossMarginPercent |
|---|---|---|---|---|
| count | 7991.000000 | 7991.000000 | 7991.000000 | 7991.000000 |
| mean | 10348.232587 | 6484.551694 | 3863.680892 | 37.369416 |
| std | 10021.420331 | 6556.445655 | 4196.856592 | 13.225177 |
| min | 167.500000 | 73.700000 | 25.125000 | 15.000000 |
| 25% | 3195.900000 | 1833.924000 | 1013.509000 | 26.000000 |
| 50% | 6994.800000 | 4205.992000 | 2398.600000 | 37.000000 |
| 75% | 14780.200000 | 8975.856000 | 5094.144000 | 49.000000 |
| max | 52313.600000 | 42308.088000 | 30905.760000 | 60.000000 |
Margins are strong across almost all orders, with no sign of process-wide losses—even though orders wait a long time to move through the pipeline (Step 3). This means that reducing delays is a direct path to improved cash flow and volume, not just margin repair.
Having established that overall margins are healthy, the next question is whether longer fulfillment times are associated with lower profitability at the order level.
If longer delays lead to reduced gross margin, process improvement is not just about speed, but also about protecting profit.
We group orders into fulfillment speed quartiles and compare their average gross margins to test this relationship.
import pandas as pd
# Create fulfillment speed quartiles
df['FulfillmentSpeed'] = pd.qcut(
df['Total_Fulfillment_Time'],
4,
labels=['Fastest', 'Fast', 'Slow', 'Slowest']
)
# For bar plot: Fastest vs. Slowest
grouped = df.groupby('FulfillmentSpeed', observed=False)['GrossMarginPercent'].mean().reindex(['Fastest', 'Slowest'])
grouped.plot(kind='bar', color=['seagreen', 'indianred'])
plt.title('Average Margin: Fastest vs. Slowest Fulfillment')
plt.ylabel('Average Gross Margin (%)')
plt.xlabel('Fulfillment Speed')
plt.tight_layout()
plt.show()
The average gross margin is nearly identical between the fastest- and slowest-fulfilled orders. This confirms that even the slowest processes are not eroding profitability. Therefore, any improvements to reduce fulfillment delays will improve business performance without risking margin—a clear win for operational change.
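A complementary check is the Pearson correlation between fulfillment time and margin; a value near zero tells the same story as the quartile bars. The sketch below uses synthetic, independent columns for illustration; the project would call df['Total_Fulfillment_Time'].corr(df['GrossMarginPercent']) directly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: lag and margin drawn independently, mirroring the finding
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Total_Fulfillment_Time": rng.normal(130, 35, 1000),
    "GrossMarginPercent": rng.uniform(15, 60, 1000),
})

# A near-zero r supports "delays do not erode margin"
corr = demo["Total_Fulfillment_Time"].corr(demo["GrossMarginPercent"])
print(f"Pearson r = {corr:.3f}")
```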
Now that we know where delays happen and that they don’t destroy profitability, our next goal is to identify the root causes behind those delays.
Purpose of this step:
- To see if bottlenecks are tied to specific products, or other order attributes.
- To highlight which part(s) of the business process need attention for maximum efficiency gains.
We’ll start by reviewing how lag times vary across different products.
# Example: Compare by product
product_lags = df.groupby('ProductCode')['Total_Fulfillment_Time'].mean().sort_values(ascending=False)
product_lags.head(10) # Show top 10 slowest products
| ProductCode | Average_Fulfillment_Time (days) |
|---|---|
| 1032 | 139.37 |
| 1065 | 137.84 |
| 1070 | 137.59 |
| 1000 | 136.55 |
| 1021 | 136.53 |
| 1033 | 136.27 |
| 1001 | 135.95 |
| 1046 | 135.42 |
| 1083 | 135.22 |
| 1045 | 135.18 |
The analysis shows that certain products consistently have the longest fulfillment times—with top product codes averaging 135–140 days from procurement to delivery.
This pattern indicates that delays are more strongly tied to specific product lines than to location or order attributes.
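One way to back this claim is to compare how much the mean lag varies across products versus across warehouses. The sketch below builds synthetic orders in which lag depends only on ProductCode (an assumption mirroring the finding); in the project, the same two groupby calls would run on df with its real columns.

```python
import numpy as np
import pandas as pd

# Synthetic orders: lag driven by product, not warehouse (illustrative assumption)
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "ProductCode": rng.choice(range(1000, 1010), n),
    "WarehouseLocationCode": rng.choice(["WH-A", "WH-B", "WH-C"], n),
})
demo["Total_Fulfillment_Time"] = (
    100 + (demo["ProductCode"] - 1000) * 4 + rng.normal(0, 5, n)
)

# If a grouping key explains the delays, its group means should spread widely
for key in ["ProductCode", "WarehouseLocationCode"]:
    means = demo.groupby(key)["Total_Fulfillment_Time"].mean()
    print(f"{key}: spread of group means = {means.max() - means.min():.1f} days")
```

Here the product-level spread dwarfs the warehouse-level spread, which is exactly the signature seen in the real data.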
With delays, bottlenecks, and profitability fully analyzed, this step explores additional metrics that provide deeper insight into the business process and highlight further opportunities for operational improvement.
Purpose of this step:
- Quantify efficiency and reliability of order fulfillment.
- Reveal “hidden” trends or risks not captured by margin or lag analysis alone.
- Deliver actionable metrics that can guide next steps in business process optimization.
We will cover:
- Repeat order rate: How many customers return for multiple purchases (a proxy for customer satisfaction and process reliability)?
- Order volume over time: Are process changes (or external shocks) impacting throughput?
- Discount analysis: Are high discount rates related to specific lag patterns or product lines?
# Percentage of customers placing more than one order
repeat_customers = (
df.groupby('CustomerID')['InvoiceNumber']
.nunique()
.reset_index(name='OrderCount')
)
repeat_customer_pct = (repeat_customers['OrderCount'] > 1).mean() * 100
print(f"🔁 Repeat order rate: {repeat_customer_pct:.1f}%")
🔁 Repeat order rate: 99.9%
A high repeat order rate suggests that—despite long fulfillment lags—customers remain loyal. If the rate is low, delays may be undermining long-term relationships.
# Monthly order volume
df['OrderMonth'] = pd.to_datetime(df['SalesOrderDate']).dt.to_period('M')
orders_per_month = df.groupby('OrderMonth')['InvoiceNumber'].nunique()
orders_per_month.plot(kind='bar', figsize=(10, 4), color='steelblue')
plt.title('Monthly Order Volume')
plt.ylabel('Number of Orders')
plt.xlabel('Month')
plt.tight_layout()
plt.show()
Order volumes remained relatively stable throughout the 2018–2020 period, with only minor fluctuations month-to-month. There are no obvious periods of sustained decline or sharp spikes, suggesting that—despite fulfillment delays—the business maintained consistent throughput. This stability indicates strong underlying demand and process resilience, though the absence of significant volume growth may point to a hidden opportunity for further improvement if fulfillment speed is increased.
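The "no significant volume growth" observation can be made precise by fitting a linear trend to the monthly counts; a slope near zero means flat throughput. The sketch below uses synthetic monthly counts; in the project, orders_per_month.values would replace them.

```python
import numpy as np
import pandas as pd

# Synthetic monthly order counts for 2018-2020, fluctuating around 220
rng = np.random.default_rng(7)
months = pd.period_range("2018-01", "2020-12", freq="M")
counts = pd.Series(220 + rng.integers(-15, 16, len(months)), index=months)

# Least-squares line through the series: slope ~ 0 means no sustained growth
x = np.arange(len(counts))
slope, intercept = np.polyfit(x, counts.values, 1)
print(f"Trend slope: {slope:+.2f} orders per month")
```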
To determine if longer fulfillment times are being compensated by higher discounts—potentially eroding profitability.
Analysis:
We calculated the average discount rate for each fulfillment speed group (Fastest, Fast, Slow, Slowest).
print(df.groupby('FulfillmentSpeed', observed=False)['DiscountRate'].mean())

| FulfillmentSpeed | Average_Discount_Rate |
|---|---|
| Fastest | 0.1158 |
| Fast | 0.1118 |
| Slow | 0.1155 |
| Slowest | 0.1145 |
The average discount rate is nearly identical (roughly 11–12%) across all fulfillment speed groups. This confirms that pricing is stable and unaffected by order processing delays.
Summary of Key Findings:
- Order fulfillment is consistently slow across all products and process stages, with average total lead times over four months.
- No single branch or warehouse is responsible for delays—bottlenecks are systemic, not location-specific.
- Profitability remains strong and stable regardless of fulfillment speed. Delays do not erode gross margin or force higher discounts.
- Customer loyalty is resilient: Repeat order rates are high, and discounting practices are consistent, suggesting that core business value is preserved despite lag.
- Order volume is steady over time, indicating reliable demand and process stability, but not breakthrough growth.
Recommendations for Business Process Improvement:
- Prioritize process speed-ups: Reducing fulfillment lag will directly improve cash flow, asset turnover, and customer experience—without risking profitability.
- Target systemic improvements: Since bottlenecks are not tied to a specific branch or warehouse, focus on end-to-end process reengineering (e.g., procurement automation, inventory management, supplier integration).
- Leverage customer goodwill: Loyal customers have tolerated delays so far; faster fulfillment could unlock new growth and retention without requiring price incentives.
- Monitor after changes: Track profitability, order volume, and customer repeat rates as improvements are made to ensure performance gains are sustained and risks are controlled.