
Quarter Billion Records EDA On MacBook Air (Work in Progress)

Exploratory Data Analysis on a quarter billion records of Reddit data, done on a MacBook Air with 16 GB of RAM.

Dataset (5 files)

  • Reddit Comments 2015 dataset
  • The data is in JSONL format, i.e. a single JSON object per line (a parsing sketch follows the sample record below).
  • Total records: 266,268,920, which is about a quarter billion. A raw line count reports 529,610,375 (about 530 million), but that overcounts because a record is not always exactly one line; see "Get Total Records" below.

Sample record from the Reddit Comments 2015 JSONL

{
  "score_hidden": false,
  "name": "t1_cnas8zv",
  "link_id": "t3_2qyrla",
  "body": "Most of us have some family members like this. *Most* of my family is like this.",
  "downs": 0,
  "created_utc": "1420070400",
  "score": 14,
  "author": "YoungModern",
  "distinguished": null,
  "id": "cnas8zv",
  "archived": false,
  "parent_id": "t3_2qyrla",
  "subreddit": "exmormon",
  "author_flair_css_class": null,
  "author_flair_text": null,
  "gilded": 0,
  "retrieved_on": 1425124282,
  "ups": 14,
  "controversiality": 0,
  "subreddit_id": "t5_2r0gj",
  "edited": false
}
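Because each line is a self-contained JSON object, a record can be parsed with nothing more than the standard library. The sketch below is only an illustration: the file name is a placeholder (adjust it to one of the five raw files), and it assumes that file really is one object per line. It prints a few fields from the first five records:

import json

# Hypothetical path to one of the raw JSONL files; adjust to your layout.
sample_path = '/Volumes/TenTB/five_files_130GB_reddit_comments/RC_2015-01.json'

# Stream line by line: memory use stays constant regardless of file size.
with open(sample_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record['subreddit'], record['score'], record['body'][:60])
        if i == 4:  # peek at the first 5 records only
            break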

External Storage

I am using a MacBook Air with an M1 chip, 16 GB of RAM, and 500 GB of internal storage, so I obviously cannot load terabytes of data and run EDA on it directly. Instead, I used a 10 TB external drive.

Goals

  • Stream files (not load all into RAM)
  • Auto-detect format (JSON array vs JSONL)
  • Write merged output incrementally
  • Minimize temp file and memory use
  • Safe for external drive I/O

Best Approach: Line-by-Line Streaming Merge with Format Detection

Here's a memory-safe script (shown in the merge section below) that:

  • Opens each file.
  • Peeks at the first non-whitespace character:
    • [ → JSON Array (use streaming with ijson)
    • { → JSONL or single object per line
  • Merges all entries line by line to avoid RAM bloat.
  • Uses tqdm for progress.
  • Handles I/O safely on large external drives.

Install Required Packages

!pip install ijson tqdm
!pip install polars duckdb datatable
!pip install "modin[ray]"
!pip install "dask[complete]"
!pip install "ray[default]"
!pip install csvkit  # provides the csvstat command
# Restart the kernel after this
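After restarting the kernel, a quick import check confirms the environment is ready. This is a minimal sketch covering only the packages actually used later in this README:

from importlib.metadata import version

# Confirm the key packages resolve after the kernel restart.
for pkg in ["ijson", "tqdm", "polars", "duckdb"]:
    print(pkg, version(pkg))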

Merge all 5 JSON files

Merge two files at a time: first file1 and file2, then that merged output with file3, and so on. Merging all 5 at once did not work for me for some reason. The total size of the merged output (merged.jsonl) is 160.07 GB.

import os
import json
import ijson
from tqdm import tqdm

input_dir = '/Volumes/TenTB/five_files_130GB_reddit_comments'
output_path = '/Volumes/TenTB/merged_output/merged.jsonl'

files = [f for f in os.listdir(input_dir) if f.endswith('.json') or f.endswith('.jsonl')]
print(f"Found {len(files)} files to process.\n")

count = 0

with open(output_path, 'w', encoding='utf-8') as outfile:
    for filename in tqdm(files, desc="Merging files"):
        file_path = os.path.join(input_dir, filename)

        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                # Peek at first character
                first_char = ''
                while True:
                    c = f.read(1)
                    if not c:
                        break
                    if not c.isspace():
                        first_char = c
                        break
                f.seek(0)  # rewind

                if first_char == '[':
                    # Stream JSON array with ijson
                    for item in ijson.items(f, 'item'):
                        json.dump(item, outfile)
                        outfile.write('\n')
                        count += 1
                elif first_char == '{':
                    # Assume NDJSON / JSONL: copy lines through, making sure every
                    # record ends with a newline so two files don't run together.
                    for line in f:
                        if not line.strip():
                            continue
                        outfile.write(line if line.endswith('\n') else line + '\n')
                        count += 1
                else:
                    print(f"[!] Skipped {filename}: Unknown format")

        except Exception as e:
            print(f"[!] Error reading {filename}: {e}")

print(f"\n✅ Merged {count} total JSON objects into:\n{output_path}")

Convert to Valid JSON

The merged file is not a single valid JSON document. If you try to parse it as one, you will get IncompleteJSONError: parse error: trailing garbage. This typically means:

  • Your JSON file is not a valid single JSON object or array.
  • You're trying to parse a file that contains multiple JSON objects without being in a list.
  • Example of bad JSON:
{"a": 1}
{"b": 2}

This is not valid when parsed as a single JSON document. We can correct it by wrapping the objects in an array and separating them with commas, as shown below. The resulting correct.json is 160.33 GB.

with open("/Volumes/TenTB/merged_output/merged.json", "r", encoding="utf-8") as infile, \
     open("/Volumes/TenTB/correct_json/correct.json", "w", encoding="utf-8") as outfile:

    outfile.write("[\n")
    first = True
    for line in infile:
        line = line.strip()
        if not line:
            continue
        if not first:
            outfile.write(",\n")
        outfile.write(line)
        first = False
    outfile.write("\n]")
    print(f"JSON Array file saved to: {output_csv_path}")

JSON to CSV

Use ijson, a streaming JSON parser that reads the file incrementally, so it doesn't blow up your memory. In contrast, json.load(json_file) would try to load the entire 160 GB JSON file into RAM, which far exceeds 16 GB and causes Jupyter (and possibly your entire system) to hang or crash. The resulting reddit_comments_2015.csv is 85.07 GB.

import ijson
import csv
import os

# Paths
input_json_path = '/Volumes/TenTB/correct_json/correct.json'
output_dir = '/Volumes/TenTB/csv_output/'
output_csv_path = os.path.join(output_dir, 'reddit_comments_2015.csv')

# Ensure output directory exists
os.makedirs(output_dir, exist_ok=True)

# First pass: collect all keys from all rows
print("Scanning all keys...")
all_keys = set()
with open(input_json_path, 'rb') as json_file:
    items = ijson.items(json_file, 'item')
    for item in items:
        all_keys.update(item.keys())

fieldnames = sorted(all_keys)

# Second pass: write to CSV
print("Writing CSV...")
with open(input_json_path, 'rb') as json_file, open(output_csv_path, 'w', newline='', encoding='utf-8') as csv_file:
    items = ijson.items(json_file, 'item')
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for item in items:
        writer.writerow(item)

print(f"✅ CSV file saved to: {output_csv_path}")

Get Total Records

  • Don't start file names with digits; some shell commands and tools have problems with them.
  • Launch "jupyter notebook" after navigating to the external storage directory in the terminal.
  • Running wc -l /Volumes/TenTB/csv_output/reddit_comments_2015.csv in the terminal will not give the correct number of records unless each record is exactly one line; it counts lines, not records, and reported 529,610,375.
  • The snippet below gives the correct count: 266,268,920 records, which is about a quarter billion:
import polars as pl # Super fast compared to csvstat

lf = pl.scan_csv('/Volumes/TenTB/csv_output/reddit_comments_2015.csv', infer_schema_length=1000)
row_count = lf.select(pl.len()).collect()[0, 0]
print(f"Total rows: {row_count}")
