Commit 4be78d8

add Vendor CVE Trends

1 parent e2e85ae

9 files changed: +1601 −0 lines
---
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
    jupytext_version: 1.16.6
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

# CVE Data Stories - Data Collection

```python
import zipfile
from pathlib import Path

import requests
```

## Project Setup

Before collecting any data, we make sure the directory for storing raw data is in place. Keeping this structure consistent matters when working with multiple datasets over time.

The following code checks whether the `raw` directory under `cve_data_stories` exists and creates it if not, so the environment is set up correctly before any data processing begins, even on a new machine or a fresh clone of the repository.

```python
# Target directory for raw data
DATA_DIR = Path("../../data/cve_data_stories/raw")
DATA_DIR.mkdir(parents=True, exist_ok=True)
```

# Data Collection

To automate downloading, unzipping, and saving the required datasets, execute the Python code in the **next cell**.

This script will:

- Download the NIST NVD yearly CVE feeds (2002-2024).
- Extract the JSON files from their ZIP archives.
- Save all files to the project's `data/cve_data_stories/raw/` directory.

Once the script has run successfully, proceed to the data preprocessing steps in the next notebook.

```python
# Target directory for raw data
DATA_DIR = Path("../../data/cve_data_stories/raw")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Generate URLs for NVD CVE datasets (2002-2024)
BASE_URL = "https://nvd.nist.gov/feeds/json/cve/1.1/"
DATASETS = {f"nvdcve-1.1-{year}.json.zip": f"{BASE_URL}nvdcve-1.1-{year}.json.zip" for year in range(2002, 2025)}


def download_file(url, dest):
    """Download a file from a URL to a destination."""
    print(f"Downloading: {url}")
    response = requests.get(url, stream=True, timeout=60)  # Timeout so a stalled feed doesn't hang the loop
    if response.status_code == 200:
        with open(dest, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024):
                file.write(chunk)
        print(f"Saved to: {dest}")
    else:
        print(f"Failed to download {url} - Status code: {response.status_code}")


def unzip_file(zip_path, dest_dir):
    """Unzip a file to a destination directory."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
    print(f"Unzipped {zip_path} to {dest_dir}")


# Main execution
for filename, url in DATASETS.items():
    dest_path = DATA_DIR / filename

    # Download the file
    download_file(url, dest_path)

    # If it's a ZIP file, extract its contents
    if filename.endswith(".zip"):
        unzip_file(dest_path, DATA_DIR)
        dest_path.unlink()  # Remove the ZIP file after extraction
```
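
Before kicking off the downloads, it can help to sanity-check the generated feed URLs. This small sketch rebuilds the same `DATASETS` mapping as the cell above, with no network access needed:

```python
# Rebuild the feed mapping used above and spot-check it
BASE_URL = "https://nvd.nist.gov/feeds/json/cve/1.1/"
DATASETS = {f"nvdcve-1.1-{year}.json.zip": f"{BASE_URL}nvdcve-1.1-{year}.json.zip"
            for year in range(2002, 2025)}

print(len(DATASETS))  # 23 yearly feeds, 2002 through 2024
print(DATASETS["nvdcve-1.1-2024.json.zip"])
```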
---
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
    jupytext_version: 1.16.6
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

# CVE Data Stories: Vendor CVE Trends - Data Cleaning

```python
import csv
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path
```
24+
25+
## Project Setup
26+
27+
Before proceeding with data processing, we need to ensure that the necessary directory for storing processed data is in place. This step is crucial to maintain an organized structure for our project, especially when working with multiple datasets over time.
28+
29+
The following Python code will check if the required `processed` directory under `data/cve_data_stories/vendor_cve_trends/` exists, and if not, it will create it. This approach ensures that the environment is always correctly set up before any data processing begins, even if you're running this notebook on a new machine or a fresh clone of the repository.
30+
31+
```python
32+
# Target directory for processed data
33+
DATA_DIR = Path("../../../data/cve_data_stories/vendor_cve_trends/processed")
34+
DATA_DIR.mkdir(parents=True, exist_ok=True)
35+
```
36+
37+
## Collecting Monthly CVE Counts by Vendor
38+
39+
This script processes JSON files containing CVE data (downloaded from NVD) and extracts monthly counts of CVEs for each vendor. The output is saved as a CSV file for further analysis.
40+
41+
### Steps in the Script
42+
43+
1. **Define Datasets**:
44+
- A dictionary is created where each key is a year (2002–2024) and each value is the corresponding JSON file name:
45+
```python
46+
DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}
47+
```
48+
49+
2. **Define a Function to Extract Monthly Counts**:
50+
- The function `collect_monthly_counts` processes a single JSON file and:
51+
- Extracts the `publishedDate` of each CVE to determine the year and month.
52+
- Extracts vendor names from the `cpe23Uri` field in the `configurations` section.
53+
- Updates a running tally of CVE counts for each `(vendor, year, month)`.
54+
55+
3. **Handle Missing or Invalid Data**:
56+
- Skips CVEs without a valid `publishedDate`.
57+
- Handles missing files, JSON decoding errors, or other exceptions gracefully by logging a message.
58+
59+
4. **Iterate Over All Datasets**:
60+
- Each year’s JSON file is processed in a loop:
61+
- Loads the file.
62+
- Extracts monthly CVE counts by vendor.
63+
- Uses a `defaultdict` to store cumulative counts for all `(vendor, year, month)` combinations.
64+
65+
5. **Write Results to a CSV File**:
66+
- Saves the data to a CSV file (`vendor_monthly_counts.csv`) with the following structure:
67+
| Vendor | Year | Month | Count |
68+
|-----------|------|-------|-------|
69+
| microsoft | 2023 | 1 | 12 |
70+
| adobe | 2023 | 1 | 8 |
71+
| redhat | 2023 | 1 | 5 |
72+
73+
### Key Features
74+
75+
- **Handles Duplicate Vendors**:
76+
- Each CVE might list a vendor multiple times, but the script uses a `set` to ensure each vendor is counted only once per CVE.
77+
78+
- **Efficient Storage**:
79+
- Uses a `defaultdict(int)` to avoid repetitive checks for existing keys, ensuring the data structure is memory-efficient.
80+
81+
- **Error Handling**:
82+
- Logs errors for missing files, invalid JSON, or unexpected issues, allowing the script to continue processing other datasets.
83+
84+
### Output
85+
- **CSV File**:
86+
- The final output is a CSV file (`vendor_monthly_counts.csv`) containing:
87+
- Vendor name.
88+
- Year and month.
89+
- CVE count for that vendor in the given month.
90+
91+
92+
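The vendor-extraction step can be seen in isolation on a few `cpe23Uri` strings (the URIs below are illustrative examples, not taken from a specific feed):

```python
# Hypothetical cpe23Uri values; field 3 (0-indexed) is the vendor
sample_uris = [
    "cpe:2.3:o:microsoft:windows_10:-:*:*:*:*:*:*:*",
    "cpe:2.3:a:microsoft:edge:44.17763.1.0:*:*:*:*:*:*:*",
    "cpe:2.3:a:adobe:acrobat_reader:20.001:*:*:*:*:*:*:*",
]
vendors = set()  # A set, so repeated vendors count once per CVE
for uri in sample_uris:
    parts = uri.split(":")
    if len(parts) > 4:  # Ensure valid CPE format
        vendors.add(parts[3])
print(sorted(vendors))  # ['adobe', 'microsoft']
```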
```python
# Define datasets
DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}


def collect_monthly_counts(json_file, month_counts):
    try:
        with open(json_file, 'r') as f:
            data = json.load(f)

        for item in data.get('CVE_Items', []):
            published_date = item.get('publishedDate', None)

            # Parse year and month from the published date
            if published_date:
                date = datetime.strptime(published_date, "%Y-%m-%dT%H:%MZ")
                pub_year = date.year
                pub_month = date.month
            else:
                continue  # Skip if no published date

            # Extract vendor info
            vendors = set()  # Avoid duplicate vendors per CVE
            for node in item.get('configurations', {}).get('nodes', []):
                for cpe in node.get('cpe_match', []):
                    cpe_uri = cpe.get('cpe23Uri', '')
                    if cpe_uri:
                        parts = cpe_uri.split(':')
                        if len(parts) > 4:  # Ensure valid CPE format
                            vendors.add(parts[3])  # Extract vendor

            # Update monthly counts
            for v in vendors:
                month_counts[(v, pub_year, pub_month)] += 1

    except FileNotFoundError:
        print(f"File not found: {json_file}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON: {json_file}")
    except Exception as e:
        print(f"An error occurred: {e}")


# Define data folder
data_folder = Path("../../../data/cve_data_stories/raw")

# Initialize defaultdict to hold monthly counts
monthly_counts = defaultdict(int)

# Process each dataset
for year, file_name in DATASETS.items():
    input_file = data_folder / file_name
    print(f"Processing {input_file}")
    collect_monthly_counts(input_file, monthly_counts)

# Write monthly counts to a CSV
output_csv = DATA_DIR / "vendor_monthly_counts.csv"
with open(output_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Vendor", "Year", "Month", "Count"])  # Header row
    for (vendor, year, month), count in sorted(monthly_counts.items()):
        writer.writerow([vendor, year, month, count])

print(f"Monthly counts written to {output_csv}")
```
---
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
    jupytext_version: 1.16.6
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

# CVE Data Stories: Vendor CVE Trends - Analysis

## Calculate Cumulative CVE Counts by Vendor (Starting from 1996)

This script loads the monthly CVE counts per vendor, filters the data to start at 1996, and calculates cumulative totals over time. The output is saved as a new CSV file for further analysis.

### Steps in the Script

1. **Load the Monthly Counts CSV**:
   - Reads `vendor_monthly_counts.csv`, which contains CVE counts grouped by `Vendor`, `Year`, and `Month`.

2. **Create a Complete Date Range**:
   - Generates a range of dates from the earliest to the latest `Year` and `Month` in the dataset.
   - Ensures no months are missing for any vendor by creating a complete time series.

3. **Filter Data to Start at 1996**:
   - Keeps only years from 1996 onwards, so the dataset focuses on meaningful trends and avoids sparse data from earlier years.

4. **Build a DataFrame for All Vendors and Dates**:
   - Combines the list of unique vendors with the filtered date range using a multi-index.
   - The resulting DataFrame represents every `(vendor, year, month)` combination, even months with no CVEs.

5. **Merge and Fill Missing Counts**:
   - Merges the original data with the complete DataFrame, filling missing `Count` values with `0`.

6. **Sort the Data**:
   - Sorts by `Vendor`, `Year`, and `Month` to ensure proper order for the cumulative calculation.

7. **Calculate Cumulative Totals**:
   - For each vendor, computes a running total of CVE counts with `cumsum`, stored as integers.

8. **Drop Unnecessary Columns**:
   - Removes the `Date` helper column to reduce file size and simplify the output.

9. **Save Results to a New CSV**:
   - Writes the processed data, including cumulative totals, to `vendor_cumulative_counts.csv`.

### Key Features

- **Filters Sparse Early Data**:
  - Focuses on data from 1996 onwards for cleaner analysis and visualization.

- **Handles Missing Data**:
  - Every month is accounted for, even if no CVEs were reported for a vendor in a given month.

- **Efficient Cumulative Calculation**:
  - Uses `groupby` and `cumsum` to compute running totals per vendor.

- **Clean and Sorted Output**:
  - The final CSV is sorted and ready for visualizations or further analysis.

### Output

- **CSV File**:
  - The final output is `vendor_cumulative_counts.csv`, for example:

    | Vendor  | Year | Month | Count | Cumulative_Count |
    |---------|------|-------|-------|------------------|
    | freebsd | 1996 | 1     | 5     | 5                |
    | freebsd | 1996 | 2     | 0     | 5                |
    | freebsd | 1996 | 3     | 8     | 13               |
    | redhat  | 1996 | 1     | 0     | 0                |
    | redhat  | 1996 | 2     | 15    | 15               |

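The per-vendor running total from step 7 can be illustrated on a tiny hand-made frame (the numbers mirror the sample table above, not real CVE data):

```python
import pandas as pd

# Tiny illustrative frame using the sample values from the table above
sample = pd.DataFrame({
    "Vendor": ["freebsd", "freebsd", "freebsd", "redhat", "redhat"],
    "Year":   [1996] * 5,
    "Month":  [1, 2, 3, 1, 2],
    "Count":  [5, 0, 8, 0, 15],
})
# cumsum restarts for each vendor because of the groupby
sample["Cumulative_Count"] = sample.groupby("Vendor")["Count"].cumsum()
print(sample["Cumulative_Count"].tolist())  # [5, 5, 13, 0, 15]
```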
```python
import pandas as pd

# Load the monthly counts CSV (written by the data-cleaning notebook)
input_csv = "../../../data/cve_data_stories/vendor_cve_trends/processed/vendor_monthly_counts.csv"
output_csv = "../../../data/cve_data_stories/vendor_cve_trends/processed/vendor_cumulative_counts.csv"

# Read data into a DataFrame
df = pd.read_csv(input_csv)

# Ensure all months are represented for each vendor:
# create a complete date range from the earliest year and month to the latest
date_range = pd.date_range(
    start=f"{df['Year'].min()}-{df['Month'].min()}-01",
    end=f"{df['Year'].max()}-{df['Month'].max()}-01",
    freq="MS"  # Month Start frequency
)

# Create a DataFrame for all vendors and the complete date range
vendors = df["Vendor"].unique()
full_index = pd.MultiIndex.from_product(
    [vendors, date_range],
    names=["Vendor", "Date"]
)
df_full = pd.DataFrame(index=full_index).reset_index()

# Extract Year and Month from the full date range
df_full["Year"] = df_full["Date"].dt.year
df_full["Month"] = df_full["Date"].dt.month

# Filter to include only years from 1996 onwards
df_full = df_full[df_full["Year"] >= 1996]

# Merge with the original data, filling missing counts with 0
df = pd.merge(df_full, df, on=["Vendor", "Year", "Month"], how="left").fillna({"Count": 0})

# Drop the Date column (no longer needed) and restore Count to integer
# (fillna on the merged column leaves it as float)
df = df.drop(columns=["Date"])
df["Count"] = df["Count"].astype(int)

# Sort data by vendor, year, and month
df = df.sort_values(by=["Vendor", "Year", "Month"])

# Calculate cumulative totals per vendor
df["Cumulative_Count"] = df.groupby("Vendor")["Count"].cumsum()

# Save to a new CSV
df.to_csv(output_csv, index=False)

print(f"Cumulative totals saved to {output_csv}")
```
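
As a quick post-run check, cumulative totals should never decrease within a vendor. This is a sketch; the small hand-made frame below stands in for the real `vendor_cumulative_counts.csv`:

```python
import pandas as pd

# Hypothetical output frame standing in for vendor_cumulative_counts.csv
out = pd.DataFrame({
    "Vendor": ["freebsd", "freebsd", "redhat", "redhat"],
    "Year": [1996, 1996, 1996, 1996],
    "Month": [1, 2, 1, 2],
    "Count": [5, 8, 0, 15],
    "Cumulative_Count": [5, 13, 0, 15],
})

# Cumulative totals must be non-decreasing within each vendor
ok = out.groupby("Vendor")["Cumulative_Count"].apply(
    lambda s: s.is_monotonic_increasing
).all()
print(ok)  # True
```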
