Commit 6492afe

committed
final update, enhancements and mp4 + gif
1 parent e63d4f4 commit 6492afe

5 files changed

+475
-190
lines changed

markdown/cve_data_stories/vendor_cve_trends/03_analysis.md

Lines changed: 12 additions & 12 deletions
@@ -16,9 +16,9 @@ jupyter:
1616

1717

1818

19-
## Calculate Cumulative CVE Counts by Vendor (Starting from 1996)
19+
## Calculate Cumulative CVE Counts by Vendor (Starting from 1999)
2020

21-
This script processes a CSV file containing monthly CVE counts for each vendor, filters the data to start at 1996, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.
21+
This script processes a CSV file containing monthly CVE counts for each vendor, filters the data to start at 1999, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.
2222

2323
### Steps in the Script
2424

@@ -29,8 +29,8 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
2929
- Generates a range of dates from the earliest to the latest `Year` and `Month` in the dataset.
3030
- Ensures no months are missing for any vendor by creating a complete time series for all vendors.
3131

32-
3. **Filter Data to Start at 1996**:
33-
- After generating the complete date range, filters the data to include only years starting from 1996. This ensures the dataset focuses on meaningful trends and avoids sparse data from earlier years.
32+
3. **Filter Data to Start at 1999**:
33+
- After generating the complete date range, filters the data to include only years starting from 1999. This ensures the dataset focuses on meaningful trends and avoids sparse data from earlier years.
3434

3535
4. **Build a DataFrame for All Vendors and Dates**:
3636
- Combines the list of unique vendors with the filtered date range using a multi-index.
@@ -55,7 +55,7 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
5555
### Key Features
5656

5757
- **Filters Sparse Early Data**:
58-
- Focuses on data from 1996 onwards for improved analysis and visualization.
58+
- Focuses on data from 1999 onwards for improved analysis and visualization.
5959

6060
- **Handles Missing Data**:
6161
- Ensures every month is accounted for, even if no CVEs were reported for a vendor in a given month.
@@ -71,11 +71,11 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
7171
- The final output is a CSV file (`vendor_cumulative_counts.csv`) containing:
7272
| Vendor | Year | Month | Count | Cumulative_Count |
7373
|-----------|------|-------|-------|-------------------|
74-
| freebsd | 1996 | 1 | 5 | 5 |
75-
| freebsd | 1996 | 2 | 0 | 5 |
76-
| freebsd | 1996 | 3 | 8 | 13 |
77-
| redhat | 1996 | 1 | 0 | 0 |
78-
| redhat | 1996 | 2 | 15 | 15 |
74+
| freebsd | 1999 | 1 | 5 | 5 |
75+
| freebsd | 1999 | 2 | 0 | 5 |
76+
| freebsd | 1999 | 3 | 8 | 13 |
77+
| redhat | 1999 | 1 | 0 | 0 |
78+
| redhat | 1999 | 2 | 15 | 15 |
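
The `Cumulative_Count` column in the sample rows above can be reproduced with a per-vendor running sum. The following is a minimal sketch, not the script's actual code: the inline DataFrame only re-enters the five sample rows, and using `groupby(...).cumsum()` is an assumption about how such a column could be computed.

```python
import pandas as pd

# Sample rows copied from the table above (illustrative data only).
sample = pd.DataFrame(
    {
        "Vendor": ["freebsd", "freebsd", "freebsd", "redhat", "redhat"],
        "Year": [1999, 1999, 1999, 1999, 1999],
        "Month": [1, 2, 3, 1, 2],
        "Count": [5, 0, 8, 0, 15],
    }
)

# Sort chronologically within each vendor, then take a running sum per vendor.
sample = sample.sort_values(["Vendor", "Year", "Month"]).reset_index(drop=True)
sample["Cumulative_Count"] = sample.groupby("Vendor")["Count"].cumsum()

print(sample)
```

Grouping by `Vendor` before summing keeps one vendor's counts from leaking into another's total, which is why `redhat` restarts at 0 in the table.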
7979

8080

8181
```python
@@ -108,8 +108,8 @@ df_full = pd.DataFrame(index=full_index).reset_index()
108108
df_full["Year"] = df_full["Date"].dt.year
109109
df_full["Month"] = df_full["Date"].dt.month
110110

111-
# Filter to include only years from 1996 onwards
112-
df_full = df_full[df_full["Year"] >= 1996]
111+
# Filter to include only years from 1999 onwards
112+
df_full = df_full[df_full["Year"] >= 1999]
113113

114114
# Merge with the original data, filling missing counts with 0
115115
df = pd.merge(df_full, df, on=["Vendor", "Year", "Month"], how="left").fillna({"Count": 0})

markdown/cve_data_stories/vendor_cve_trends/05_visualizations.md

Lines changed: 179 additions & 77 deletions
Original file line numberDiff line numberDiff line change
@@ -15,84 +15,90 @@ jupyter:
1515
# CVE Data Stories: Vendor CVE Trends - Visualizations
1616

1717

18+
```python
19+
import warnings
1820

19-
## Bar Chart Race: Top 10 CVE Vendors (1996–2024)
21+
import matplotlib.pyplot as plt
22+
import pandas as pd
23+
from bar_chart_race import bar_chart_race
24+
from matplotlib.colors import to_hex
25+
```
2026

21-
This script generates a dynamic bar chart race showcasing the top 10 vendors by cumulative CVE count over time (1996–2024). CVE data offers critical insights into vendor-specific trends in cybersecurity vulnerabilities, highlighting shifts in the security landscape across two decades.
2227

23-
---
2428

25-
### Steps in the Script
2629

27-
1. **Import Necessary Libraries**:
28-
- `pandas`: For efficient data manipulation and preprocessing.
29-
- `bar_chart_race`: To create the bar chart race animation.
30-
- `matplotlib`: For additional visual customizations, including fonts and color palettes.
30+
## Bar Chart Race: Top CVE Vendors (1999–2024)
3131

32-
2. **Load and Preprocess Data**:
33-
- Reads a CSV file (`vendor_top_20.csv`) containing cumulative CVE counts for vendors by year and month.
34-
- Normalizes vendor names for consistency.
35-
- Ensures inclusion of all vendors that appeared in the top 20 during the analyzed period.
32+
This script generates dynamic bar chart race visualizations that showcase the top vendors by cumulative CVE count over time, covering the years 1999–2024. The project provides insights into long-term trends in vendor-specific vulnerabilities, highlighting shifts in the cybersecurity landscape over more than two decades.
3633

37-
3. **Pivot and Format Data**:
38-
- Prepares the dataset for visualization by transforming it into a pivot table:
39-
- **Rows**: Time (`Year`, `Month`).
40-
- **Columns**: Vendors.
41-
- **Values**: Cumulative CVE counts.
42-
- Combines `Year` and `Month` into a `Date` column (`YYYY-MM`) for a continuous time index.
34+
---
4335

44-
4. **Assign Colors**:
45-
- **Brand Colors**: Maps vendors to their official brand colors for easy recognition.
46-
- **Fallback Colors**: Assigns visually distinct colors to vendors without defined brand colors.
36+
### Purpose
4737

48-
5. **Generate the Bar Chart Race**:
49-
- Animates the top 10 vendors dynamically over time:
50-
- Bars update their positions and lengths based on cumulative CVE counts.
51-
- Parameters enhance readability and visual storytelling.
52-
- Saves the animation as an `.mp4` file for high-quality sharing.
38+
- **Analyze Vulnerability Trends**: Understand which vendors have consistently had the most reported vulnerabilities and how rankings have evolved over time.
39+
- **Engage Through Visualization**: Present data in a visually compelling way that draws attention to key trends in cybersecurity.
40+
- **Inspire Data-Driven Discussions**: Encourage conversations about how this data can inform risk management strategies.
5341

5442
---
5543

56-
### Key Parameters
44+
### Workflow
5745

58-
- **Top Vendors (`n_bars`)**: Displays the top 10 vendors based on cumulative CVE counts.
59-
- **Dynamic Ordering (`fixed_order=False`)**: Updates the bar order dynamically to reflect changes in rankings.
60-
- **Y-Axis Consistency (`fixed_max=True`)**: Maintains a consistent y-axis scale to enable meaningful visual comparisons.
61-
- **Smooth Transitions (`steps_per_period=10`)**: Creates fluid animations between monthly time steps.
62-
- **Frame Duration (`period_length=400`)**: Each time step lasts 400 milliseconds for optimal pacing.
46+
1. **Setup and Data Loading**:
47+
- Imports libraries for data manipulation (`pandas`), visualization (`bar_chart_race`, `matplotlib`), and warning control (`warnings`).
48+
- Suppresses irrelevant warnings to streamline outputs.
49+
- Reads a preprocessed CSV file (`vendor_top_20.csv`) containing cumulative CVE counts by vendor, year, and month.
6350

64-
---
51+
2. **Vendor Name Normalization**:
52+
- Ensures vendor names are clean and consistent using a mapping dictionary.
53+
- Handles variations in vendor naming for accurate aggregation.
54+
55+
3. **Data Transformation**:
56+
- Converts the `Year` and `Month` columns into a `datetime` format for proper sorting and animation.
57+
- Pivots the dataset to create a table where:
58+
- **Rows**: Time intervals (monthly or yearly).
59+
- **Columns**: Vendors.
60+
- **Values**: Cumulative CVE counts.
61+
- Prepares both monthly and yearly datasets for separate animations.
6562

66-
### Customization
63+
4. **Color Assignment**:
64+
- Assigns official brand colors to vendors where available for consistent identification.
65+
- Generates fallback colors for vendors without official brand palettes, ensuring a visually distinct output.
6766

68-
- **Visual Enhancements**:
69-
- Clear labels with larger fonts (`bar_label_size=12`) improve readability.
70-
- High resolution (`dpi=300`) ensures professional-quality visuals suitable for presentations and reports.
71-
- **Colors**:
72-
- Brand colors make it easy to identify key vendors.
73-
- Fallback colors ensure distinction for all other vendors.
67+
5. **Bar Chart Race Generation**:
68+
- Creates animations for:
69+
- **Monthly Data**: Top 10 vendors shown dynamically across monthly time steps, saved as an `.mp4` file.
70+
- **Yearly Data**: Top 5 vendors aggregated by year, optimized as a `.gif` file for LinkedIn sharing.
71+
- Configures parameters for animation smoothness, readability, and file size optimization.
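
The data transformation in step 3 can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own code: the rows are invented, the names `df_long` and `df_wide` are hypothetical, and the `Cumulative_Count` column name is assumed to carry over from the analysis step.

```python
import pandas as pd

# Hypothetical long-format rows; the real input is `vendor_top_20.csv`.
df_long = pd.DataFrame(
    {
        "Vendor": ["freebsd", "redhat", "freebsd", "redhat"],
        "Year": [1999, 1999, 1999, 1999],
        "Month": [1, 1, 2, 2],
        "Cumulative_Count": [5, 0, 5, 15],
    }
)

# Build a first-of-month timestamp from the Year and Month columns.
date_strings = (
    df_long["Year"].astype(str) + "-" + df_long["Month"].astype(str).str.zfill(2) + "-01"
)
df_long["Date"] = pd.to_datetime(date_strings, format="%Y-%m-%d")

# One row per date, one column per vendor, cumulative counts as values.
df_wide = df_long.pivot(index="Date", columns="Vendor", values="Cumulative_Count")
print(df_wide)
```

The wide layout (dates as rows, vendors as columns) is the shape that `bar_chart_race` expects for its `df` argument.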
7472

7573
---
7674

77-
### Output
75+
### Parameters for Customization
7876

79-
- **Video File**:
80-
- The animation is saved as `top_10_vendors_cve_trends_2002_2024.mp4`, ready for sharing and embedding.
77+
- **Top Vendors (`n_bars`)**:
78+
- Displays the top 10 vendors for monthly visualizations and top 5 for yearly GIFs.
79+
- **Dynamic Rankings (`fixed_order=False`)**:
80+
- Bar positions adjust dynamically based on rankings in each time interval.
81+
- **Value-Axis Consistency (`fixed_max=True`)**:
82+
- Maintains a fixed scale on the value axis (the x-axis for horizontal bars) across time intervals for meaningful comparisons.
83+
- **Transition Smoothness (`steps_per_period`)**:
84+
- Sets the number of animation frames per period: 10 for the monthly video, 5 for the yearly GIF to keep the file small.
85+
- **Animation Speed (`period_length`)**:
86+
- Sets the duration of each period in milliseconds: 400 for the monthly video, 200 for the faster, LinkedIn-friendly yearly GIF.
8187

82-
- **Insights**:
83-
- Tracks the dynamic evolution of CVE counts by vendor.
84-
- Highlights key shifts and emerging trends in vulnerability disclosures across two decades, providing actionable insights into the cybersecurity landscape.
88+
---
8589

90+
### Outputs
8691

87-
```python jupyter={"is_executing": true}
88-
import os
89-
import warnings
92+
1. **Monthly Animation (`.mp4`)**:
93+
- High-quality video highlighting the top 10 vendors month by month.
94+
- Saved as `top_10_vendors_cve_trends_1999_2024.mp4`.
9095

91-
import matplotlib.pyplot as plt
92-
import pandas as pd
93-
from bar_chart_race import bar_chart_race
94-
from matplotlib.colors import to_hex
96+
2. **Yearly Animation (`.gif`)**:
97+
- Lightweight GIF optimized for LinkedIn, showing top 5 vendors per year.
98+
- Saved as `top_5_vendors_cve_trends_1999_2024.gif`.
9599

100+
101+
```python
96102
# Suppress font warnings
97103
warnings.filterwarnings("ignore", category=UserWarning)
98104

@@ -239,35 +245,131 @@ colors = [
239245
brand_colors.get(vendor, fallback_colors[i % len(fallback_colors)])
240246
for i, vendor in enumerate(df_pivot.columns)
241247
]
248+
```
249+
250+
### Generate Monthly MP4 Bar Chart Race
251+
In this step, we generate a bar chart race video in MP4 format that visualizes cumulative CVE counts by vendor over time, aggregated monthly.
252+
253+
- The output video will display the **top 10 vendors** ranked by their cumulative CVE counts for each month from 1999 to 2024.
254+
- The `period_length` and `steps_per_period` control the animation speed and smoothness.
255+
- The resolution (`dpi=300`) ensures high-quality output.
256+
257+
The resulting MP4 file will be saved to the specified path.
258+
259+
260+
```python jupyter={"is_executing": true}
261+
# Output file path
262+
output_file = "../../../data/cve_data_stories/vendor_cve_trends/processed/top_10_vendors_cve_trends_1999_2024.mp4"
263+
264+
# Generate bar chart race
265+
bar_chart_race(
266+
df=df_pivot, # Pivoted DataFrame with cumulative CVE counts by vendor over time
267+
filename=output_file, # Path to save the output video (e.g., .mp4). Set to None to display inline
268+
orientation="h", # Display bars horizontally to show vendor trends over time
269+
sort="desc", # Sort vendors by descending CVE count for each time period
270+
n_bars=10, # Display the top 10 vendors at any given time
271+
fixed_order=False, # Allow dynamic changes in the order of vendors as CVE counts update
272+
fixed_max=True, # Keep the value-axis maximum (x-axis for horizontal bars) fixed across all time periods
273+
steps_per_period=10, # Number of animation frames to transition between each month
274+
period_length=400, # Duration (in milliseconds) of each month in the animation
275+
interpolate_period=True, # Smoothly interpolate CVE counts between months for fluid animation
276+
label_bars=True, # Display the CVE count as a label on each bar
277+
bar_size=0.85, # Thickness of each bar as a fraction of the available space
278+
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize date label size and position for each month
279+
period_fmt="%Y-%m", # Format of the date label displayed for each time period (e.g., "2023-01")
280+
title="Top Vendors by CVE", # Title of the bar chart animation
281+
title_size=20, # Font size for the chart title
282+
bar_label_size=12, # Font size for the CVE count labels displayed on each bar
283+
tick_label_size=10, # Font size for axis tick labels (representing CVE counts)
284+
cmap=colors, # Colors for each vendor's bar, using brand or fallback colors
285+
dpi=300, # Resolution of the output video (higher DPI produces better quality)
286+
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value)
287+
)
288+
289+
print(f"Bar chart race mp4 saved to {output_file}.")
290+
```
291+
292+
### Prepare Data for Yearly GIF
293+
To simplify the visualization for LinkedIn, the CVE data is aggregated by year instead of monthly intervals. This reduces the size and complexity of the bar chart race while maintaining key trends.
294+
295+
#### Steps:
296+
1. **Convert Index to Datetime**:
297+
- The date index is converted to a datetime format for proper resampling.
298+
299+
2. **Resample by Year-End**:
300+
- Calling `resample('YE').last()` keeps the **last value of each year**, so the cumulative data reflects each vendor's total CVE count at year end. The `'YE'` alias requires pandas 2.2 or later; earlier versions use `'Y'`.
301+
302+
3. **Format the Index**:
303+
- The index is updated to show only the year as a string for clarity in the visualization.
304+
305+
4. **Handle Missing Data**:
306+
- Any missing values (`NaN`) are filled with `0` to prevent gaps in the animation.
307+
308+
5. **Avoid Rendering Issues**:
309+
- A small value (`1e-5`) is added to the data to avoid potential rendering artifacts during animation.
310+
311+
6. **Ensure Complete Year Range**:
312+
- The data is reindexed to include all years in the range, filling any missing years with `0`.
313+
314+
```python
315+
# Convert index to datetime and resample
316+
df_pivot.index = pd.to_datetime(df_pivot.index)
317+
df_yearly = df_pivot.resample('YE').last() # Use last value of each year for cumulative data
318+
319+
# Update index to show only the year
320+
df_yearly.index = df_yearly.index.year.astype(str) # Convert years to strings for proper formatting
321+
322+
# Fill NaN values
323+
df_yearly = df_yearly.fillna(0)
324+
325+
# Add a small value to avoid rendering issues
326+
df_yearly += 1e-5
327+
328+
# Ensure all years are present
329+
all_years = [str(year) for year in range(int(df_yearly.index[0]), int(df_yearly.index[-1]) + 1)]
330+
df_yearly = df_yearly.reindex(all_years, fill_value=0)
331+
```
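
For readers on a pandas version without the `'YE'` alias, the year-end selection can also be written as a group-by on the calendar year. The sketch below is an alternative formulation on invented data, not the notebook's code; the names `monthly` and `yearly` are hypothetical, and it assumes every year in the range has at least one monthly row.

```python
import pandas as pd

# Hypothetical monthly cumulative counts for two vendors across two years.
idx = pd.date_range("1999-01-01", "2000-12-01", freq="MS")
monthly = pd.DataFrame(
    {
        "freebsd": list(range(1, 25)),
        "redhat": list(range(10, 250, 10)),
    },
    index=idx,
)

# Keep the last monthly row of each calendar year.
yearly = monthly.groupby(monthly.index.year).last()

# Show only the year, as a string, to match the yearly animation's index.
yearly.index = yearly.index.astype(str)
print(yearly)
```

Taking the last row rather than a sum matters here: the values are already cumulative, so summing the months would overcount.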
332+
333+
### Generate Yearly GIF Bar Chart Race
334+
Using the aggregated yearly data, we create a **GIF optimized for LinkedIn**.
335+
336+
- The GIF shows the **top 5 vendors** ranked by cumulative CVE counts for each year from 1999 to 2024.
337+
- To ensure the file size is within LinkedIn's 8MB limit:
338+
- Resolution is reduced (`dpi=150`).
339+
- Animation transitions are faster (`period_length=200` milliseconds).
340+
- Fewer steps per period (`steps_per_period=5`) reduce frame count.
341+
342+
The resulting GIF will be saved to the specified path.
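
To see why these settings shrink the file, a rough estimate of frame count and running time helps. The helper below is a hypothetical back-of-the-envelope sketch, not part of `bar_chart_race`: it assumes each period contributes `steps_per_period` frames and lasts `period_length` milliseconds, and that the data covers every period from 1999 to 2024.

```python
def estimate_animation(n_periods: int, steps_per_period: int, period_length_ms: int):
    """Return (approximate frame count, approximate duration in seconds)."""
    frames = n_periods * steps_per_period
    seconds = n_periods * period_length_ms / 1000
    return frames, seconds


# Assumes complete coverage of 1999 to 2024: 26 yearly periods, 312 monthly periods.
yearly_estimate = estimate_animation(26, 5, 200)
monthly_estimate = estimate_animation(312, 10, 400)

print("yearly GIF:", yearly_estimate)
print("monthly MP4:", monthly_estimate)
```

Under these assumptions the yearly GIF comes to roughly 130 frames over about 5 seconds, against roughly 3,120 frames over about two minutes for the monthly video. The lower `dpi` then reduces the size of each frame.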
343+
242344

345+
```python
243346
# Output file path
244-
output_file = "../../../data/cve_data_stories/vendor_cve_trends/processed/top_10_vendors_cve_trends_1996_2024.mp4"
245-
os.makedirs(os.path.dirname(output_file), exist_ok=True)
347+
output_file = "../../../data/cve_data_stories/vendor_cve_trends/processed/top_5_vendors_cve_trends_1999_2024.gif"
246348

247349
# Generate bar chart race
248350
bar_chart_race(
249-
df=df_pivot, # The pivoted DataFrame containing cumulative CVE counts by vendor over time.
250-
filename=output_file, # Path to save the output video (e.g., .mp4). Set to None to display inline in a notebook.
251-
orientation="h", # Display bars horizontally to show vendor trends over time.
252-
sort="desc", # Sort vendors by descending CVE count for each time period.
253-
n_bars=10, # Number of top CVE vendors to display at any given time.
254-
fixed_order=False, # Allow the order of vendors to change dynamically as CVE counts update over time.
255-
fixed_max=True, # Keep the maximum CVE count consistent across all time periods for better comparison.
256-
steps_per_period=10, # Number of animation frames to transition between each month.
257-
period_length=400, # Duration (in milliseconds) for each month in the animation.
258-
interpolate_period=True, # Smoothly interpolate CVE counts between months for fluid animation.
259-
label_bars=True, # Display the CVE count as a label on each bar.
260-
bar_size=0.85, # Thickness of each bar as a fraction of the available space for the month.
261-
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize the date label for each month (size and position).
262-
period_fmt="%Y-%m", # Format of the date label displayed for each time period (e.g., "2023-01").
263-
title="Top Vendors by CVE", # Title of the bar chart animation.
264-
title_size=20, # Font size for the chart title.
265-
bar_label_size=12, # Font size for the CVE count labels displayed on each bar.
266-
tick_label_size=10, # Font size for axis tick labels (representing CVE counts).
267-
cmap=colors, # Colors for each vendor's bar, using brand colors or fallback colors if unspecified.
268-
dpi=300, # Resolution of the output video (higher DPI produces better quality but larger files).
269-
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value).
351+
df=df_yearly, # Aggregated DataFrame with yearly cumulative CVE counts by vendor
352+
filename=output_file, # Path to save the output GIF (optimized for LinkedIn)
353+
orientation="h", # Display bars horizontally to show vendor trends over time
354+
sort="desc", # Sort vendors by descending CVE count for each year
355+
n_bars=5, # Display the top 5 vendors at any given time
356+
fixed_order=False, # Allow dynamic changes in the order of vendors as CVE counts update
357+
fixed_max=True, # Keep the value-axis maximum (x-axis for horizontal bars) fixed across all time periods
358+
steps_per_period=5, # Number of animation frames to transition between each year
359+
period_length=200, # Duration (in milliseconds) of each year in the animation
360+
interpolate_period=False, # Disable interpolation to avoid rendering artifacts
361+
label_bars=True, # Display the CVE count as a label on each bar
362+
bar_size=0.85, # Thickness of each bar as a fraction of the available space
363+
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize date label size and position for each year
364+
period_fmt="{x}", # Display the year as it appears in the DataFrame index
365+
title="Top Vendors by CVE (Yearly)", # Title of the bar chart animation
366+
title_size=18, # Font size for the chart title
367+
bar_label_size=10, # Font size for the CVE count labels displayed on each bar
368+
tick_label_size=8, # Font size for axis tick labels (representing CVE counts)
369+
cmap=colors, # Colors for each vendor's bar, using brand or fallback colors
370+
dpi=150, # Resolution of the output GIF (optimized for smaller file size)
371+
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value)
270372
)
271373

272-
print(f"Bar chart race saved to {output_file}.")
374+
print(f"Bar chart race gif saved to {output_file}.")
273375
```
