For context, here’s the machine I used for testing:
- MacBook Pro 13", 2020
- OS: macOS Sequoia Version 15.3.1
- CPU: 2 GHz Quad-Core Intel Core i5
- Graphics: Intel Iris Plus Graphics 1536 MB
- RAM: 16 GB 3733 MHz LPDDR4X
- Storage: 512 GB
The challenge is as simple as it is daunting: you have a dataset containing 1,000,000,000 rows. Each row includes:
- A string identifier (e.g., PROPERTY_A)
- A floating-point value (e.g., 123456.78)
You must compute the minimum, average, and maximum price for each unique property.
Example row:
PROPERTY_A;123456.78
By the end, you want a map (or any form of summary) that gives you something like:
{
PROPERTY_A=100000.0/175000.0/250000.0,
PROPERTY_B=95000.0/210500.4/399999.9,
...
}
The format is Min/Mean/Max for each identifier.
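To make the expected output concrete, here is a minimal Python sketch that aggregates a handful of sample rows into the Min/Mean/Max summary described above (the sample values are illustrative, not from the real dataset):

```python
# Aggregate min/mean/max per identifier for a few sample rows
# in the PROPERTY;VALUE format used by the challenge.
from collections import defaultdict

rows = [
    "PROPERTY_A;100000.0",
    "PROPERTY_A;250000.0",
    "PROPERTY_B;95000.0",
    "PROPERTY_B;399999.9",
]

# Per identifier, track: [min, max, sum, count]
stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
for row in rows:
    name, raw = row.split(";")
    value = float(raw)
    s = stats[name]
    s[0] = min(s[0], value)
    s[1] = max(s[1], value)
    s[2] += value
    s[3] += 1

summary = {
    name: f"{mn}/{total / count}/{mx}"
    for name, (mn, mx, total, count) in stats.items()
}
print(summary)
# {'PROPERTY_A': '100000.0/175000.0/250000.0', 'PROPERTY_B': '95000.0/247499.95/399999.9'}
```

The same min/max/sum/count accumulator is the core of every solution that follows; only how fast we can feed rows into it changes.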
The real challenge is doing it fast. Let’s start with the most straightforward method: one thread, reading lines one by one.
To simplify the process, I created a Python script in Jupyter to generate the data. I used only 3 distinct properties for it.
First, download and install Python (recommended version: Python 3.8 or later):
- Windows & macOS: Download from python.org and follow the installation steps.
- Linux (Debian/Ubuntu): Install via terminal:
sudo apt update && sudo apt install python3 python3-pip -y
- Linux (Fedora):
sudo dnf install python3 python3-pip -y
- Open a terminal or command prompt and install Jupyter using pip:
pip install jupyter
- After installation, launch Jupyter Notebook by running:
jupyter notebook
- In Jupyter, open the file generaBigFile.ipynb and execute the script.
Progress: 100.00% | Time: 4602.91s
File 'big_file.txt' successfully generated!
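The notebook itself isn't reproduced here, but a generator like generaBigFile.ipynb might look like the sketch below (the function name, value range, and progress reporting are my assumptions, not the notebook's actual code):

```python
# Minimal sketch of a big-file generator: write num_rows lines of
# 'PROPERTY_X;value' to path, reporting progress as it goes.
# (Names and value range are assumptions for illustration.)
import random

def generate_big_file(path, num_rows,
                      properties=("PROPERTY_A", "PROPERTY_B", "PROPERTY_C")):
    step = max(1, num_rows // 100)  # report progress roughly every 1%
    with open(path, "w") as f:
        for i in range(num_rows):
            name = random.choice(properties)
            value = random.uniform(50_000.0, 500_000.0)
            f.write(f"{name};{value:.2f}\n")
            if (i + 1) % step == 0:
                print(f"Progress: {100 * (i + 1) / num_rows:.2f}%", end="\r")
    print(f"\nFile '{path}' successfully generated!")

# generate_big_file("big_file.txt", 1_000_000_000)
# The full 1B-row run took ~4600 s on my machine.
```

Writing line by line through a buffered file object is slow for a billion rows; batching many lines per write() call would speed generation up, but for a one-off data file it hardly matters.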
- In Java
- In Scala
- In Python
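The single-threaded baseline described earlier, one thread reading lines one by one, can be sketched in Python as follows (the function name `baseline` is mine):

```python
# Single-threaded baseline: stream the file line by line and keep
# running (min, max, sum, count) per identifier.
def baseline(path):
    stats = {}  # name -> [min, max, sum, count]
    with open(path) as f:
        for line in f:
            name, raw = line.rstrip("\n").split(";")
            value = float(raw)
            s = stats.get(name)
            if s is None:
                stats[name] = [value, value, value, 1]
            else:
                if value < s[0]:
                    s[0] = value
                if value > s[1]:
                    s[1] = value
                s[2] += value
                s[3] += 1
    # Format each entry as Min/Mean/Max
    return {
        name: f"{mn}/{total / count:.1f}/{mx}"
        for name, (mn, mx, total, count) in sorted(stats.items())
    }
```

This is deliberately naive: no parallelism, no mmap, no custom parsing. It sets the reference time that the Java, Scala, and Python optimizations below are measured against.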