xmacedo/ReadBigFilesBenchMark

One Billion Row Challenge

Hardware & Environment Setup

For context, here’s the machine I used for testing:

  • MacBook Pro 13", 2020
  • OS: macOS Sequoia Version 15.3.1
  • CPU: 2 GHz Quad-Core Intel Core i5
  • Graphics: Intel Iris Plus Graphics 1536 MB
  • RAM: 16 GB 3733 MHz LPDDR4X
  • Storage: 512 GB

What Is the One Billion Row Challenge?

The challenge is as simple as it is daunting:

You have a dataset containing 1,000,000,000 rows. Each row includes:

  • A string identifier (e.g., PROPERTY_A)
  • A floating-point value (e.g., 123456.78)

You must compute the minimum, average, and maximum price for each unique property.

Example row:

PROPERTY_A;123456.78

By the end, you want a map (or any form of summary) that gives you something like:

{
PROPERTY_A=100000.0/175000.0/250000.0,
PROPERTY_B=95000.0/210500.4/399999.9,
...
}

The format is Min/Mean/Max for each identifier.

Goal:

The real challenge is doing it fast. Let’s start with the most straightforward method: one thread, reading lines one by one.
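As a point of reference, the single-threaded, line-by-line approach could be sketched in Python roughly as follows. This is only a minimal illustration, not the repository's implementation: the function names `summarize` and `format_summary` are hypothetical, and rounding the output to one decimal place is an assumption based on the example above.

```python
def summarize(lines):
    """Return {identifier: (min, mean, max)} for an iterable of 'ID;value' lines."""
    # Per identifier we keep [min, max, running sum, count].
    stats = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, raw = line.partition(";")
        value = float(raw)
        entry = stats.get(name)
        if entry is None:
            stats[name] = [value, value, value, 1]
        else:
            if value < entry[0]:
                entry[0] = value
            if value > entry[1]:
                entry[1] = value
            entry[2] += value
            entry[3] += 1
    return {
        name: (lo, total / count, hi)
        for name, (lo, hi, total, count) in stats.items()
    }


def format_summary(summary):
    """Render the summary as {ID=min/mean/max, ...}, sorted by identifier."""
    parts = [
        f"{name}={lo:.1f}/{mean:.1f}/{hi:.1f}"
        for name, (lo, mean, hi) in sorted(summary.items())
    ]
    return "{" + ", ".join(parts) + "}"


if __name__ == "__main__":
    sample = ["PROPERTY_A;100000.0", "PROPERTY_A;250000.0", "PROPERTY_B;95000.0"]
    print(format_summary(summarize(sample)))
    # prints {PROPERTY_A=100000.0/175000.0/250000.0, PROPERTY_B=95000.0/95000.0/95000.0}
```

Because `summarize` accepts any iterable of lines, an open file object can be passed directly (for example `with open("big_file.txt") as f: summarize(f)`), so the file is streamed rather than loaded into memory at once.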

How to Generate the File

To simplify the process, I wrote the generator as a Python script that runs in a Jupyter notebook. The generated file uses only 3 distinct properties.

1. Install Python

First, download and install Python (recommended version: Python 3.8 or later):

  • Windows & macOS: Download from python.org and follow the installation steps.
  • Linux (Debian/Ubuntu): Install via terminal:
sudo apt update && sudo apt install python3 python3-pip -y
  • Linux (Fedora): Install via terminal:
sudo dnf install python3 python3-pip -y

2. Install Jupyter

  • Once Python is installed, open a terminal or command prompt and install Jupyter using pip:
pip install jupyter

3. Running Jupyter Notebook

  • After installation, launch Jupyter Notebook by running:
jupyter notebook

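4. Generate the File in Jupyter

  • Run the generator script in a notebook cell. When it finishes, the output looks like this:
Progess: 100.00% | Time: 4602.91s
File 'big_file.txt' successfully generated!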

On my machine, generation took about 1 hour, 16 minutes, and 43 seconds (4602.91 s).
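The notebook itself is not reproduced in this README, so the following is only a hedged sketch of what such a generator might look like. The function name `generate_file`, the third property name `PROPERTY_C`, the value range, and the batch size are all assumptions made for illustration; only the `IDENTIFIER;value` row format, the use of 3 properties, and the progress-line style come from the text above.

```python
import random
import time


def generate_file(path, rows,
                  properties=("PROPERTY_A", "PROPERTY_B", "PROPERTY_C"),
                  low=50_000.0, high=500_000.0,
                  batch_size=100_000, seed=None):
    """Write `rows` lines of 'IDENTIFIER;value' to `path`; return elapsed seconds."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    start = time.perf_counter()
    written = 0
    with open(path, "w", encoding="utf-8") as out:
        while written < rows:
            n = min(batch_size, rows - written)
            # Build one batch in memory, then write it in a single call.
            batch = [
                f"{rng.choice(properties)};{rng.uniform(low, high):.2f}\n"
                for _ in range(n)
            ]
            out.writelines(batch)
            written += n
            elapsed = time.perf_counter() - start
            print(f"\rProgress: {100 * written / rows:.2f}% | Time: {elapsed:.2f}s", end="")
    print()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Start with a small row count; one billion rows takes a long time and a lot of disk space.
    generate_file("small_file.txt", rows=1_000_000, seed=42)
    print("File 'small_file.txt' successfully generated!")
```

Writing in batches keeps memory use bounded while avoiding one `write` call per row. It is worth testing with a small `rows` value first and checking the file size before scaling up.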

Languages Used to Solve the Problem

  • Java
  • Scala?
  • Python
