---
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
    jupytext_version: 1.16.6
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

# CVE Data Stories: Vendor CVE Trends - Data Cleaning

```python
import csv
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path
```

## Project Setup

Before processing any data, we need to make sure the directory for storing processed data is in place. Keeping this structure consistent matters when working with multiple datasets over time.

The following code checks whether the `processed` directory under `data/cve_data_stories/vendor_cve_trends/` exists and creates it if not. This ensures the environment is set up correctly before any data processing begins, even on a new machine or a fresh clone of the repository.

```python
# Target directory for processed data
DATA_DIR = Path("../../../data/cve_data_stories/vendor_cve_trends/processed")
DATA_DIR.mkdir(parents=True, exist_ok=True)
```

## Collecting Monthly CVE Counts by Vendor

This script processes JSON files containing CVE data (downloaded from NVD) and extracts monthly counts of CVEs for each vendor. The output is saved as a CSV file for further analysis.

### Steps in the Script

1. **Define Datasets**:
   - A dictionary is created where each key is a year (2002–2024) and each value is the corresponding JSON file name:
     ```python
     DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}
     ```

2. **Define a Function to Extract Monthly Counts**:
   - The function `collect_monthly_counts` processes a single JSON file and:
     - Parses the `publishedDate` of each CVE to determine the year and month.
     - Extracts vendor names from the `cpe23Uri` fields in the `configurations` section.
     - Updates a running tally of CVE counts for each `(vendor, year, month)`.

3. **Handle Missing or Invalid Data**:
   - Skips CVEs without a `publishedDate`.
   - Handles missing files, JSON decoding errors, and other exceptions gracefully by logging a message.

4. **Iterate Over All Datasets**:
   - Each year's JSON file is processed in a loop that:
     - Loads the file.
     - Extracts monthly CVE counts by vendor.
     - Accumulates counts for all `(vendor, year, month)` combinations in a shared `defaultdict`.

5. **Write Results to a CSV File**:
   - Saves the data to a CSV file (`vendor_monthly_counts.csv`) with the following structure:

     | Vendor    | Year | Month | Count |
     |-----------|------|-------|-------|
     | microsoft | 2023 | 1     | 12    |
     | adobe     | 2023 | 1     | 8     |
     | redhat    | 2023 | 1     | 5     |

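The date parsing and vendor extraction described in step 2 can be sketched in isolation. This is illustrative only: the CPE URI below is a fabricated example, not a record from the dataset.

```python
from datetime import datetime

# NVD 1.1 feeds publish dates without seconds, e.g. "2023-01-05T18:15Z"
date = datetime.strptime("2023-01-05T18:15Z", "%Y-%m-%dT%H:%MZ")
print(date.year, date.month)  # -> 2023 1

# A CPE 2.3 URI is colon-delimited; the vendor sits in the fourth field
# (index 3), right after "cpe", the version "2.3", and the part ("a"/"o"/"h").
cpe_uri = "cpe:2.3:a:microsoft:edge:44.17763.1.0:*:*:*:*:*:*:*"
parts = cpe_uri.split(':')
print(parts[3])  # -> microsoft
```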
### Key Features

- **Handles Duplicate Vendors**:
  - A CVE may list the same vendor multiple times, so the script uses a `set` to ensure each vendor is counted only once per CVE.

- **Efficient Storage**:
  - Uses a `defaultdict(int)` so that missing keys default to zero, avoiding repetitive existence checks when updating counts.

- **Error Handling**:
  - Logs errors for missing files, invalid JSON, or unexpected issues, allowing the script to continue processing other datasets.
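The deduplication and tally pattern can be sketched with toy data (the records below are made up for illustration, not real counts):

```python
from collections import defaultdict

month_counts = defaultdict(int)  # missing keys start at 0

# Toy CVE records: (year, month, vendor fields as they appear, repeats and all)
toy_cves = [
    (2023, 1, ["adobe", "adobe", "microsoft"]),  # "adobe" listed twice
    (2023, 1, ["adobe"]),
]

for year, month, vendor_fields in toy_cves:
    for v in set(vendor_fields):  # set() counts each vendor once per CVE
        month_counts[(v, year, month)] += 1

print(sorted(month_counts.items()))
# -> [(('adobe', 2023, 1), 2), (('microsoft', 2023, 1), 1)]
```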

### Output

- **CSV File**:
  - The final output is a CSV file (`vendor_monthly_counts.csv`) containing:
    - Vendor name.
    - Year and month.
    - CVE count for that vendor in the given month.

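As a quick sanity check on the output format, the CSV can be read back with `csv.DictReader`. The rows below mirror the sample table above, not actual results:

```python
import csv
from io import StringIO

# In-memory stand-in for vendor_monthly_counts.csv
sample = """Vendor,Year,Month,Count
microsoft,2023,1,12
adobe,2023,1,8
"""

rows = list(csv.DictReader(StringIO(sample)))
print(rows[0]["Vendor"], rows[0]["Count"])  # -> microsoft 12
```

Note that `DictReader` returns every field as a string, so `Year`, `Month`, and `Count` need an `int()` conversion before any arithmetic.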
```python
# Define datasets: one NVD 1.1 JSON feed per year
DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}


def collect_monthly_counts(json_file, month_counts):
    try:
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)

        for item in data.get('CVE_Items', []):
            published_date = item.get('publishedDate')

            # Parse year and month from the published date
            if published_date:
                date = datetime.strptime(published_date, "%Y-%m-%dT%H:%MZ")
                pub_year = date.year
                pub_month = date.month
            else:
                continue  # Skip if no published date

            # Extract vendor info from top-level configuration nodes
            # (nested child nodes are not traversed here)
            vendors = set()  # Avoid duplicate vendors per CVE
            for node in item.get('configurations', {}).get('nodes', []):
                for cpe in node.get('cpe_match', []):
                    cpe_uri = cpe.get('cpe23Uri', '')
                    if cpe_uri:
                        parts = cpe_uri.split(':')
                        if len(parts) > 4:  # Ensure valid CPE format
                            vendors.add(parts[3])  # Vendor is the fourth field

            # Update monthly counts
            for v in vendors:
                month_counts[(v, pub_year, pub_month)] += 1

    except FileNotFoundError:
        print(f"File not found: {json_file}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON: {json_file}")
    except Exception as e:
        print(f"An error occurred while processing {json_file}: {e}")


# Define data folder
data_folder = Path("../../../data/cve_data_stories/raw")

# Initialize defaultdict to hold monthly counts
monthly_counts = defaultdict(int)

# Process each dataset
for year, file_name in DATASETS.items():
    input_file = data_folder / file_name
    print(f"Processing {input_file}")
    collect_monthly_counts(input_file, monthly_counts)

# Write monthly counts to a CSV
output_csv = DATA_DIR / "vendor_monthly_counts.csv"
with open(output_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Vendor", "Year", "Month", "Count"])  # Header row
    for (vendor, year, month), count in sorted(monthly_counts.items()):
        writer.writerow([vendor, year, month, count])

print(f"Monthly counts written to {output_csv}")

```