---
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
    jupytext_version: 1.16.6
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

# CVE Data Stories: Vendor CVE Trends - Data Cleaning

```python
import csv
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path
```

## Project Setup

Before processing any data, we need to make sure the directory for storing processed data is in place. Keeping this structure consistent matters when working with multiple datasets over time.

The following code checks whether the `processed` directory under `data/cve_data_stories/vendor_cve_trends/` exists and creates it if not. This ensures the environment is set up correctly before any data processing begins, even on a new machine or a fresh clone of the repository.

```python
# Target directory for processed data
DATA_DIR = Path("../../../data/cve_data_stories/vendor_cve_trends/processed")
DATA_DIR.mkdir(parents=True, exist_ok=True)
```

## Collecting Monthly CVE Counts by Vendor

This script processes JSON files containing CVE data (downloaded from NVD) and extracts monthly counts of CVEs for each vendor. The output is saved as a CSV file for further analysis.

### Steps in the Script

1. **Define Datasets**:
   - A dictionary is created where each key is a year (2002–2024) and each value is the corresponding JSON file name:
     ```python
     DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}
     ```

2. **Define a Function to Extract Monthly Counts**:
   - The function `collect_monthly_counts` processes a single JSON file and:
     - Parses the `publishedDate` of each CVE to determine the year and month.
     - Extracts vendor names from the `cpe23Uri` fields in the `configurations` section.
     - Updates a running tally of CVE counts for each `(vendor, year, month)`.

3. **Handle Missing or Invalid Data**:
   - Skips CVEs without a `publishedDate`.
   - Handles missing files, JSON decoding errors, and other exceptions gracefully by logging a message.

4. **Iterate Over All Datasets**:
   - Each year's JSON file is processed in a loop that:
     - Loads the file.
     - Extracts monthly CVE counts by vendor.
     - Accumulates counts for all `(vendor, year, month)` combinations in a shared `defaultdict`.

5. **Write Results to a CSV File**:
   - Saves the data to a CSV file (`vendor_monthly_counts.csv`) with the following structure:

     | Vendor    | Year | Month | Count |
     |-----------|------|-------|-------|
     | microsoft | 2023 | 1     | 12    |
     | adobe     | 2023 | 1     | 8     |
     | redhat    | 2023 | 1     | 5     |

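The date parsing and vendor extraction described in step 2 can be sketched in isolation. This is illustrative only: the CPE URI below is a fabricated example, not a record from the dataset.

```python
from datetime import datetime

# NVD 1.1 feeds publish dates without seconds, e.g. "2023-01-05T18:15Z"
date = datetime.strptime("2023-01-05T18:15Z", "%Y-%m-%dT%H:%MZ")
print(date.year, date.month)  # -> 2023 1

# A CPE 2.3 URI is colon-delimited; the vendor sits in the fourth field
# (index 3), right after "cpe", the version "2.3", and the part ("a"/"o"/"h").
cpe_uri = "cpe:2.3:a:microsoft:edge:44.17763.1.0:*:*:*:*:*:*:*"
parts = cpe_uri.split(':')
print(parts[3])  # -> microsoft
```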
### Key Features

- **Handles Duplicate Vendors**:
  - A CVE may list the same vendor multiple times, so the script uses a `set` to ensure each vendor is counted only once per CVE.

- **Efficient Storage**:
  - Uses a `defaultdict(int)` so that missing keys default to zero, avoiding repetitive existence checks when updating counts.

- **Error Handling**:
  - Logs errors for missing files, invalid JSON, or unexpected issues, allowing the script to continue processing other datasets.
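The deduplication and tally pattern can be sketched with toy data (the records below are made up for illustration, not real counts):

```python
from collections import defaultdict

month_counts = defaultdict(int)  # missing keys start at 0

# Toy CVE records: (year, month, vendor fields as they appear, repeats and all)
toy_cves = [
    (2023, 1, ["adobe", "adobe", "microsoft"]),  # "adobe" listed twice
    (2023, 1, ["adobe"]),
]

for year, month, vendor_fields in toy_cves:
    for v in set(vendor_fields):  # set() counts each vendor once per CVE
        month_counts[(v, year, month)] += 1

print(sorted(month_counts.items()))
# -> [(('adobe', 2023, 1), 2), (('microsoft', 2023, 1), 1)]
```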

### Output

- **CSV File**:
  - The final output is a CSV file (`vendor_monthly_counts.csv`) containing:
    - Vendor name.
    - Year and month.
    - CVE count for that vendor in the given month.

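As a quick sanity check on the output format, the CSV can be read back with `csv.DictReader`. The rows below mirror the sample table above, not actual results:

```python
import csv
from io import StringIO

# In-memory stand-in for vendor_monthly_counts.csv
sample = """Vendor,Year,Month,Count
microsoft,2023,1,12
adobe,2023,1,8
"""

rows = list(csv.DictReader(StringIO(sample)))
print(rows[0]["Vendor"], rows[0]["Count"])  # -> microsoft 12
```

Note that `DictReader` returns every field as a string, so `Year`, `Month`, and `Count` need an `int()` conversion before any arithmetic.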
```python
# Define datasets: one NVD 1.1 JSON feed per year
DATASETS = {year: f"nvdcve-1.1-{year}.json" for year in range(2002, 2025)}


def collect_monthly_counts(json_file, month_counts):
    try:
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)

        for item in data.get('CVE_Items', []):
            published_date = item.get('publishedDate')

            # Parse year and month from the published date
            if published_date:
                date = datetime.strptime(published_date, "%Y-%m-%dT%H:%MZ")
                pub_year = date.year
                pub_month = date.month
            else:
                continue  # Skip if no published date

            # Extract vendor info from top-level configuration nodes
            # (nested child nodes are not traversed here)
            vendors = set()  # Avoid duplicate vendors per CVE
            for node in item.get('configurations', {}).get('nodes', []):
                for cpe in node.get('cpe_match', []):
                    cpe_uri = cpe.get('cpe23Uri', '')
                    if cpe_uri:
                        parts = cpe_uri.split(':')
                        if len(parts) > 4:  # Ensure valid CPE format
                            vendors.add(parts[3])  # Vendor is the fourth field

            # Update monthly counts
            for v in vendors:
                month_counts[(v, pub_year, pub_month)] += 1

    except FileNotFoundError:
        print(f"File not found: {json_file}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON: {json_file}")
    except Exception as e:
        print(f"An error occurred while processing {json_file}: {e}")


# Define data folder
data_folder = Path("../../../data/cve_data_stories/raw")

# Initialize defaultdict to hold monthly counts
monthly_counts = defaultdict(int)

# Process each dataset
for year, file_name in DATASETS.items():
    input_file = data_folder / file_name
    print(f"Processing {input_file}")
    collect_monthly_counts(input_file, monthly_counts)

# Write monthly counts to a CSV
output_csv = DATA_DIR / "vendor_monthly_counts.csv"
with open(output_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Vendor", "Year", "Month", "Count"])  # Header row
    for (vendor, year, month), count in sorted(monthly_counts.items()):
        writer.writerow([vendor, year, month, count])

print(f"Monthly counts written to {output_csv}")

```