markdown/cve_data_stories/vendor_cve_trends/03_analysis.md
+12 -12 (12 additions & 12 deletions)
@@ -16,9 +16,9 @@ jupyter:
16
16
17
17
18
18
19
-
## Calculate Cumulative CVE Counts by Vendor (Starting from 1996)
19
+
## Calculate Cumulative CVE Counts by Vendor (Starting from 1999)
20
20
21
-
This script processes a CSV file containing monthly CVE counts for each vendor, filters the data to start at 1996, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.
21
+
This script processes a CSV file containing monthly CVE counts for each vendor, filters the data to start at 1999, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.
22
22
23
23
### Steps in the Script
24
24
@@ -29,8 +29,8 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
29
29
- Generates a range of dates from the earliest to the latest `Year` and `Month` in the dataset.
30
30
- Ensures no months are missing for any vendor by creating a complete time series for all vendors.
31
31
32
-
3. **Filter Data to Start at 1996**:
33
-
- After generating the complete date range, filters the data to include only years starting from 1996. This ensures the dataset focuses on meaningful trends and avoids sparse data from earlier years.
32
+
3. **Filter Data to Start at 1999**:
33
+
- After generating the complete date range, filters the data to include only years starting from 1999. This ensures the dataset focuses on meaningful trends and avoids sparse data from earlier years.
34
34
35
35
4. **Build a DataFrame for All Vendors and Dates**:
36
36
- Combines the list of unique vendors with the filtered date range using a multi-index.
@@ -55,7 +55,7 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
55
55
### Key Features
56
56
57
57
- **Filters Sparse Early Data**:
58
-
- Focuses on data from 1996 onwards for improved analysis and visualization.
58
+
- Focuses on data from 1999 onwards for improved analysis and visualization.
59
59
60
60
- **Handles Missing Data**:
61
61
- Ensures every month is accounted for, even if no CVEs were reported for a vendor in a given month.
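The steps and features described above can be sketched on toy data as follows. This is a minimal sketch, not the notebook's actual code: the `Vendor` and `Count` column names, the `Cumulative_Count` output name, and the sample rows are assumptions made for illustration (only `Year` and `Month` are named in the text). It also assumes that counts from before the start year are dropped rather than carried into the running total.

```python
import pandas as pd

START_YEAR = 1999  # filter threshold described in the text

# Hypothetical monthly CVE counts; column names other than Year/Month are assumed.
monthly = pd.DataFrame({
    "Vendor": ["acme", "acme", "globex", "acme"],
    "Year": [1998, 1999, 1999, 1999],
    "Month": [12, 1, 3, 3],
    "Count": [5, 2, 1, 4],
})
monthly["Date"] = pd.to_datetime(
    monthly["Year"].astype(str) + "-" + monthly["Month"].astype(str).str.zfill(2) + "-01"
)

# Complete month-start range from the earliest to the latest date in the data.
all_dates = pd.date_range(monthly["Date"].min(), monthly["Date"].max(), freq="MS")

# Keep only dates from the start year onwards.
all_dates = all_dates[all_dates.year >= START_YEAR]

# Vendor x date grid, so every vendor has a row for every month.
vendors = sorted(monthly["Vendor"].unique())
grid = pd.MultiIndex.from_product([vendors, all_dates], names=["Vendor", "Date"])

# Align the counts to the grid (missing months become 0), then accumulate per vendor.
counts = monthly.set_index(["Vendor", "Date"])["Count"].reindex(grid, fill_value=0)
cumulative = (
    counts.groupby(level="Vendor").cumsum().rename("Cumulative_Count").reset_index()
)

print(cumulative)
```

Because the grid is built from the filtered date range, the 1998 row falls outside it and is discarded by `reindex`; carrying earlier counts forward instead would require accumulating before filtering.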
@@ -71,11 +71,11 @@ This script processes a CSV file containing monthly CVE counts for each vendor,
71
71
- The final output is a CSV file (`vendor_cumulative_counts.csv`) containing:
# CVE Data Stories: Vendor CVE Trends - Visualizations
16
16
17
17
18
+
```python
19
+
import warnings
18
20
19
-
## Bar Chart Race: Top 10 CVE Vendors (1996–2024)
21
+
import matplotlib.pyplot as plt
22
+
import pandas as pd
23
+
from bar_chart_race import bar_chart_race
24
+
from matplotlib.colors import to_hex
25
+
```
20
26
21
-
This script generates a dynamic bar chart race showcasing the top 10 vendors by cumulative CVE count over time (1996–2024). CVE data offers critical insights into vendor-specific trends in cybersecurity vulnerabilities, highlighting shifts in the security landscape across two decades.
22
27
23
-
---
24
28
25
-
### Steps in the Script
26
29
27
-
1. **Import Necessary Libraries**:
28
-
- `pandas`: For efficient data manipulation and preprocessing.
29
-
- `bar_chart_race`: To create the bar chart race animation.
30
-
- `matplotlib`: For additional visual customizations, including fonts and color palettes.
30
+
## Bar Chart Race: Top CVE Vendors (1999–2024)
31
31
32
-
2. **Load and Preprocess Data**:
33
-
- Reads a CSV file (`vendor_top_20.csv`) containing cumulative CVE counts for vendors by year and month.
34
-
- Normalizes vendor names for consistency.
35
-
- Ensures inclusion of all vendors that appeared in the top 20 during the analyzed period.
32
+
This script generates bar chart race animations showing the top vendors by cumulative CVE count from 1999 to 2024. The project provides insight into long-term trends in vendor-specific vulnerabilities and highlights shifts in the cybersecurity landscape over roughly 25 years.
36
33
37
-
3. **Pivot and Format Data**:
38
-
- Prepares the dataset for visualization by transforming it into a pivot table:
39
-
- **Rows**: Time (`Year`, `Month`).
40
-
- **Columns**: Vendors.
41
-
- **Values**: Cumulative CVE counts.
42
-
- Combines `Year` and `Month` into a `Date` column (`YYYY-MM`) for a continuous time index.
34
+
---
43
35
44
-
4. **Assign Colors**:
45
-
- **Brand Colors**: Maps vendors to their official brand colors for easy recognition.
46
-
- **Fallback Colors**: Assigns visually distinct colors to vendors without defined brand colors.
36
+
### Purpose
47
37
48
-
5. **Generate the Bar Chart Race**:
49
-
- Animates the top 10 vendors dynamically over time:
50
-
- Bars update their positions and lengths based on cumulative CVE counts.
51
-
- Parameters enhance readability and visual storytelling.
52
-
- Saves the animation as an `.mp4` file for high-quality sharing.
38
+
- **Analyze Vulnerability Trends**: Understand which vendors have consistently had the most reported vulnerabilities and how rankings have evolved over time.
39
+
- **Engage Through Visualization**: Present data in a visually compelling way that draws attention to key trends in cybersecurity.
40
+
- **Inspire Data-Driven Discussions**: Encourage conversations about how this data can inform risk management strategies.
53
41
54
42
---
55
43
56
-
### Key Parameters
44
+
### Workflow
57
45
58
-
- **Top Vendors (`n_bars`)**: Displays the top 10 vendors based on cumulative CVE counts.
59
-
- **Dynamic Ordering (`fixed_order=False`)**: Updates the bar order dynamically to reflect changes in rankings.
60
-
- **Y-Axis Consistency (`fixed_max=True`)**: Maintains a consistent y-axis scale to enable meaningful visual comparisons.
61
-
- **Smooth Transitions (`steps_per_period=10`)**: Creates fluid animations between monthly time steps.
62
-
- **Frame Duration (`period_length=400`)**: Each time step lasts 400 milliseconds for optimal pacing.
46
+
1. **Setup and Data Loading**:
47
+
- Imports libraries for data manipulation (`pandas`), visualization (`bar_chart_race`, `matplotlib`), and warning control (`warnings`).
48
+
- Suppresses irrelevant warnings to streamline outputs.
49
+
- Reads a preprocessed CSV file (`vendor_top_20.csv`) containing cumulative CVE counts by vendor, year, and month.
63
50
64
-
---
51
+
2. **Vendor Name Normalization**:
52
+
- Ensures vendor names are clean and consistent using a mapping dictionary.
53
+
- Handles variations in vendor naming for accurate aggregation.
54
+
55
+
3. **Data Transformation**:
56
+
- Converts the `Year` and `Month` columns into a `datetime` format for proper sorting and animation.
57
+
- Pivots the dataset to create a table where:
58
+
- **Rows**: Time intervals (monthly or yearly).
59
+
- **Columns**: Vendors.
60
+
- **Values**: Cumulative CVE counts.
61
+
- Prepares both monthly and yearly datasets for separate animations.
65
62
66
-
### Customization
63
+
4. **Color Assignment**:
64
+
- Assigns official brand colors to vendors where available for consistent identification.
65
+
- Generates fallback colors for vendors without official brand palettes, ensuring a visually distinct output.
67
66
68
-
- **Visual Enhancements**:
69
-
- Clear labels with larger fonts (`bar_label_size=12`) improve readability.
70
-
- High resolution (`dpi=300`) ensures professional-quality visuals suitable for presentations and reports.
71
-
- **Colors**:
72
-
- Brand colors make it easy to identify key vendors.
73
-
- Fallback colors ensure distinction for all other vendors.
67
+
5. **Bar Chart Race Generation**:
68
+
- Creates animations for:
69
+
- **Monthly Data**: Top 10 vendors shown dynamically across monthly time steps, saved as an `.mp4` file.
70
+
- **Yearly Data**: Top 5 vendors aggregated by year, optimized as a `.gif` file for LinkedIn sharing.
71
+
- Configures parameters for animation smoothness, readability, and file size optimization.
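Steps 2 to 4 of the workflow above can be sketched on toy data as follows. This is an illustrative sketch under stated assumptions, not the notebook's code: the `VENDOR_NAMES` mapping, the `BRAND_COLORS` hex value, the `Cumulative_Count` column name, the choice of the `tab20` palette for fallback colors, and the sample rows are all hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_hex

# Hypothetical input in the shape described for vendor_top_20.csv.
df = pd.DataFrame({
    "Vendor": ["microsoft", "Microsoft Corp.", "acme", "acme"],
    "Year": [1999, 1999, 1999, 1999],
    "Month": [1, 2, 1, 2],
    "Cumulative_Count": [10, 15, 3, 4],
})

# Step 2: normalize vendor names through a mapping dictionary (assumed entries).
VENDOR_NAMES = {
    "microsoft": "Microsoft",
    "microsoft corp.": "Microsoft",
    "acme": "Acme",
}
key = df["Vendor"].str.strip().str.lower()
df["Vendor"] = key.map(VENDOR_NAMES).fillna(df["Vendor"])  # unmapped names stay as-is

# Step 3: build a datetime column, then pivot to dates x vendors.
df["Date"] = pd.to_datetime(
    df["Year"].astype(str) + "-" + df["Month"].astype(str).str.zfill(2) + "-01"
)
df_pivot = (
    df.pivot_table(index="Date", columns="Vendor",
                   values="Cumulative_Count", aggfunc="sum")
    .sort_index()
    .fillna(0)
)

# Step 4: brand colors where defined, palette colors otherwise.
BRAND_COLORS = {"Microsoft": "#00A4EF"}  # illustrative value, not a verified brand color
fallback = plt.get_cmap("tab20")
colors = [
    BRAND_COLORS.get(vendor, to_hex(fallback(i % fallback.N)))
    for i, vendor in enumerate(df_pivot.columns)
]

print(df_pivot)
print(dict(zip(df_pivot.columns, colors)))
```

The resulting `colors` list is ordered like the pivot table's columns, which is the form the `cmap=colors` argument in the monthly `bar_chart_race` call expects.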
74
72
75
73
---
76
74
77
-
### Output
75
+
### Parameters for Customization
78
76
79
-
- **Video File**:
80
-
- The animation is saved as `top_10_vendors_cve_trends_2002_2024.mp4`, ready for sharing and embedding.
77
+
- **Top Vendors (`n_bars`)**:
78
+
- Displays the top 10 vendors for monthly visualizations and top 5 for yearly GIFs.
79
+
- **Dynamic Rankings (`fixed_order=False`)**:
80
+
- Bar positions adjust dynamically based on rankings in each time interval.
81
+
- **Y-Axis Consistency (`fixed_max=True`)**:
82
+
- Maintains a fixed scale across time intervals for meaningful comparisons.
83
+
- **Transition Smoothness (`steps_per_period`)**:
84
+
- Controls animation fluidity, with fewer steps used for smaller file sizes.
85
+
- **Animation Speed (`period_length`)**:
86
+
- Adjusted for LinkedIn-friendly GIFs with faster transitions.
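One way to keep the two animations consistent is to hold the shared settings in one place and override only what differs. The sketch below assumes this layout; it is not how the notebook is organized. The monthly values come from the `bar_chart_race` call in this notebook, while the yearly `steps_per_period` and `period_length` values are placeholders chosen for illustration. The `render` and `approx_duration_seconds` helpers are hypothetical.

```python
# Settings shared by both animations (taken from the monthly call in this notebook).
COMMON = {
    "orientation": "h",
    "sort": "desc",
    "fixed_order": False,
    "fixed_max": True,
    "interpolate_period": True,
}

# Monthly .mp4: values as used in the notebook.
MONTHLY = {**COMMON, "n_bars": 10, "steps_per_period": 10, "period_length": 400}

# Yearly .gif: n_bars=5 is from the text; the other two values are assumptions.
YEARLY = {**COMMON, "n_bars": 5, "steps_per_period": 5, "period_length": 250}


def approx_duration_seconds(n_periods: int, period_length_ms: int) -> float:
    """Rough animation length: each period (row) is shown for period_length ms."""
    return n_periods * period_length_ms / 1000


def render(df_pivot, filename, preset):
    """Render one animation; imported lazily so presets can be inspected without the package."""
    from bar_chart_race import bar_chart_race

    bar_chart_race(df=df_pivot, filename=filename, **preset)


# 1999-2024 is 26 years, so 312 monthly rows or 26 yearly rows.
print(approx_duration_seconds(26 * 12, MONTHLY["period_length"]))
print(approx_duration_seconds(26, YEARLY["period_length"]))
```

With the pivoted data in hand, a call such as `render(df_yearly, "top_5_vendors_cve_trends_1999_2024.gif", YEARLY)` would produce the yearly output; lowering `steps_per_period` reduces the number of frames and therefore the file size.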
81
87
82
-
- **Insights**:
83
-
- Tracks the dynamic evolution of CVE counts by vendor.
84
-
- Highlights key shifts and emerging trends in vulnerability disclosures across two decades, providing actionable insights into the cybersecurity landscape.
88
+
---
85
89
90
+
### Outputs
86
91
87
-
```python jupyter={"is_executing": true}
88
-
import os
89
-
import warnings
92
+
1. **Monthly Animation (`.mp4`)**:
93
+
- High-quality video highlighting the top 10 vendors month by month.
94
+
- Saved as `top_10_vendors_cve_trends_1999_2024.mp4`.
90
95
91
-
import matplotlib.pyplot as plt
92
-
import pandas as pd
93
-
from bar_chart_race import bar_chart_race
94
-
from matplotlib.colors import to_hex
96
+
2. **Yearly Animation (`.gif`)**:
97
+
- Lightweight GIF optimized for LinkedIn, showing top 5 vendors per year.
98
+
- Saved as `top_5_vendors_cve_trends_1999_2024.gif`.
df=df_pivot, # Pivoted DataFrame with cumulative CVE counts by vendor over time
267
+
filename=output_file, # Path to save the output video (e.g., .mp4). Set to None to display inline
268
+
orientation="h", # Display bars horizontally to show vendor trends over time
269
+
sort="desc", # Sort vendors by descending CVE count for each time period
270
+
n_bars=10, # Display the top 10 vendors at any given time
271
+
fixed_order=False, # Allow dynamic changes in the order of vendors as CVE counts update
272
+
fixed_max=True, # Keep the value-axis maximum (the x-axis for horizontal bars) fixed across all time periods
273
+
steps_per_period=10, # Number of animation frames to transition between each month
274
+
period_length=400, # Duration (in milliseconds) of each month in the animation
275
+
interpolate_period=True, # Smoothly interpolate CVE counts between months for fluid animation
276
+
label_bars=True, # Display the CVE count as a label on each bar
277
+
bar_size=0.85, # Thickness of each bar as a fraction of the available space
278
+
period_label={"size": 16, "x": 0.85, "y": 0.25}, # Customize date label size and position for each month
279
+
period_fmt="%Y-%m", # Format of the date label displayed for each time period (e.g., "2023-01")
280
+
title="Top Vendors by CVE", # Title of the bar chart animation
281
+
title_size=20, # Font size for the chart title
282
+
bar_label_size=12, # Font size for the CVE count labels displayed on each bar
283
+
tick_label_size=10, # Font size for axis tick labels (representing CVE counts)
284
+
cmap=colors, # Colors for each vendor's bar, using brand or fallback colors
285
+
dpi=300, # Resolution of the output video (higher DPI produces better quality)
286
+
bar_kwargs={"alpha": 0.85}, # Set the transparency of the bars (alpha value)
287
+
)
288
+
289
+
print(f"Bar chart race mp4 saved to {output_file}.")
290
+
```
291
+
292
+
### Prepare Data for Yearly GIF
293
+
To simplify the visualization for LinkedIn, the CVE data is aggregated by year instead of monthly intervals. This reduces the size and complexity of the bar chart race while maintaining key trends.
294
+
295
+
#### Steps:
296
+
1. **Convert Index to Datetime**:
297
+
- The date index is converted to a datetime format for proper resampling.
298
+
299
+
2. **Resample by Year-End**:
300
+
- Using the `resample('YE').last()` method, we extract the **last value of each year**. This ensures that the cumulative data accurately reflects the total CVE count for each vendor by the end of the year.
301
+
302
+
3. **Format the Index**:
303
+
- The index is updated to show only the year as a string for clarity in the visualization.
304
+
305
+
4. **Handle Missing Data**:
306
+
- Any missing values (`NaN`) are filled with `0` to prevent gaps in the animation.
307
+
308
+
5. **Avoid Rendering Issues**:
309
+
- A small value (`1e-5`) is added to the data to avoid potential rendering artifacts during animation.
310
+
311
+
6. **Ensure Complete Year Range**:
312
+
- The data is reindexed to include all years in the range, filling any missing years with `0`.
313
+
314
+
```python
315
+
# Convert index to datetime and resample
316
+
df_pivot.index = pd.to_datetime(df_pivot.index)
317
+
df_yearly = df_pivot.resample('YE').last() # Use last value of each year for cumulative data
318
+
319
+
# Update index to show only the year
320
+
df_yearly.index = df_yearly.index.year.astype(str) # Convert years to strings for proper formatting
321
+
322
+
# Fill NaN values
323
+
df_yearly = df_yearly.fillna(0)
324
+
325
+
# Add a small value to avoid rendering issues
326
+
df_yearly += 1e-5
327
+
328
+
# Ensure all years are present
329
+
all_years = [str(year) for year in range(int(df_yearly.index[0]), int(df_yearly.index[-1]) + 1)]