For context, here’s the machine I used for testing:
- MacBook Pro 13", 2020
- OS: macOS Sequoia Version 15.3.1
- CPU: 2 GHz Quad-Core Intel Core i5
- Graphics: Intel Iris Plus Graphics 1536 MB
- RAM: 16 GB 3733 MHz LPDDR4X
- Storage: 512 GB
The challenge is as simple as it is daunting: you have a dataset containing 1,000,000,000 rows. Each row includes:
- A string identifier (e.g., PROPERTY_A)
- A floating-point value (e.g., 123456.78)
You must compute the minimum, average, and maximum price for each unique property.
Example row:
PROPERTY_A;123456.78
By the end, you want a map (or any form of summary) that gives you something like:
{
PROPERTY_A=100000.0/175000.0/250000.0,
PROPERTY_B=95000.0/210500.4/399999.9,
...
}
The format is Min/Mean/Max for each identifier.
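To make the expected output concrete, here is a minimal Python sketch that aggregates a handful of sample rows into the Min/Mean/Max summary described above (the sample values are illustrative, not from the real dataset):

```python
# Aggregate min/mean/max per identifier for a few sample rows
# in the PROPERTY;VALUE format used by the challenge.
from collections import defaultdict

rows = [
    "PROPERTY_A;100000.0",
    "PROPERTY_A;250000.0",
    "PROPERTY_B;95000.0",
    "PROPERTY_B;399999.9",
]

# Per identifier, track: [min, max, sum, count]
stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
for row in rows:
    name, raw = row.split(";")
    value = float(raw)
    s = stats[name]
    s[0] = min(s[0], value)
    s[1] = max(s[1], value)
    s[2] += value
    s[3] += 1

summary = {
    name: f"{mn}/{total / count}/{mx}"
    for name, (mn, mx, total, count) in stats.items()
}
print(summary)
# {'PROPERTY_A': '100000.0/175000.0/250000.0', 'PROPERTY_B': '95000.0/247499.95/399999.9'}
```

The same min/max/sum/count accumulator is the core of every solution that follows; only how fast we can feed rows into it changes.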
The real challenge is doing it fast. Let’s start with the most straightforward method: one thread, reading lines one by one.
To simplify the process, I created a Python script in Jupyter to generate the data. I used only 3 distinct properties for it.
First, download and install Python (recommended version: Python 3.8 or later):
- Windows & macOS: Download from python.org and follow the installation steps.
- Linux (Debian/Ubuntu): Install via terminal:
sudo apt update && sudo apt install python3 python3-pip -y
- Linux (Fedora):
sudo dnf install python3 python3-pip -y
- Open a terminal or command prompt and install Jupyter using pip:
pip install jupyter
- After installation, launch Jupyter Notebook by running:
jupyter notebook
- In Jupyter, open the file generaBigFile.ipynb and execute the script.
Progress: 100.00% | Time: 4602.91s
File 'big_file.txt' successfully generated!
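The notebook itself isn't reproduced here, but a generator like generaBigFile.ipynb might look like the sketch below (the function name, value range, and progress reporting are my assumptions, not the notebook's actual code):

```python
# Minimal sketch of a big-file generator: write num_rows lines of
# 'PROPERTY_X;value' to path, reporting progress as it goes.
# (Names and value range are assumptions for illustration.)
import random

def generate_big_file(path, num_rows,
                      properties=("PROPERTY_A", "PROPERTY_B", "PROPERTY_C")):
    step = max(1, num_rows // 100)  # report progress roughly every 1%
    with open(path, "w") as f:
        for i in range(num_rows):
            name = random.choice(properties)
            value = random.uniform(50_000.0, 500_000.0)
            f.write(f"{name};{value:.2f}\n")
            if (i + 1) % step == 0:
                print(f"Progress: {100 * (i + 1) / num_rows:.2f}%", end="\r")
    print(f"\nFile '{path}' successfully generated!")

# generate_big_file("big_file.txt", 1_000_000_000)
# The full 1B-row run took ~4600 s on my machine.
```

Writing line by line through a buffered file object is slow for a billion rows; batching many lines per write() call would speed generation up, but for a one-off data file it hardly matters.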
- In Java
- In Scala
- In Python
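The single-threaded baseline described earlier, one thread reading lines one by one, can be sketched in Python as follows (the function name `baseline` is mine):

```python
# Single-threaded baseline: stream the file line by line and keep
# running (min, max, sum, count) per identifier.
def baseline(path):
    stats = {}  # name -> [min, max, sum, count]
    with open(path) as f:
        for line in f:
            name, raw = line.rstrip("\n").split(";")
            value = float(raw)
            s = stats.get(name)
            if s is None:
                stats[name] = [value, value, value, 1]
            else:
                if value < s[0]:
                    s[0] = value
                if value > s[1]:
                    s[1] = value
                s[2] += value
                s[3] += 1
    # Format each entry as Min/Mean/Max
    return {
        name: f"{mn}/{total / count:.1f}/{mx}"
        for name, (mn, mx, total, count) in sorted(stats.items())
    }
```

This is deliberately naive: no parallelism, no mmap, no custom parsing. It sets the reference time that the Java, Scala, and Python optimizations below are measured against.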