A high-performance, multi-threaded compression system for IEX market data feeds, optimizing storage while maintaining data integrity and accessibility.
Foziea Garada · LinkedIn · Resume
- Bio: Foziea is a senior studying Computer Science at the University of Illinois Urbana-Champaign, as well as a published researcher, TEDx speaker, and startup cofounder. She has years of experience in the security and networking space, including internships at the MIT Lincoln Lab and Forcepoint Federal, as well as in the financial technology sector, such as with Jane Street and Synchrony. She is also very passionate about equity and education, having researched state-wide STEM education reforms for underrepresented populations across the country. Foziea starts to feel self-conscious after writing a few sentences about herself in the third person, and encourages anyone reading this not to be shy about connecting on LinkedIn or saying hi in person!
- Focus: Delta compression implementation, command-line interface, PCAP comparison/testing
- Bio: KB is an undergraduate Computer Science Major at the University of Illinois Urbana-Champaign. He has a strong interest in system-level software development, machine learning, and financial technology.
- Focus: Payload compression, creating a randomized sample PCAP generator
Shria Halkoda · LinkedIn · Resume
- Bio: Shria is a Computer Science Major at the University of Illinois Urbana-Champaign with a minor in Statistics, focusing on optimization technologies! She is an eager engineer with a passion for developing innovative solutions. Through past internships and research roles, Shria has honed her skills by implementing machine learning web tools, building augmented reality apps, and leading high-performance computing systems at the EPOCH Machine Learning Cluster. Outside of her academic and technical pursuits, Shria is deeply passionate about sustainable innovation, exemplified by her work with Tobelli LLC, where she leverages biodegradable packaging solutions to empower underserved farming communities. She is always open to new connections and opportunities! She will be working in Chicago as a CME Group SWE Intern in Summer 2025.
- Focus: Trading Message Partitioner, creating a tool for recording metrics
Jasmine Liu · LinkedIn · Resume
- Bio: Jasmine is an undergraduate Computer Science Major at the University of Illinois Urbana-Champaign with a minor in Statistics. She is a motivated problem solver interested in high-frequency trading, machine learning, finance, and sustainable design, and passionate about diversity and mentorship. In the past, she has interned with the U.S. Navy and at Stanford University, doing machine learning and robotics research. She also serves as a Data Structures Course Assistant and as Website Lead and Mentorship Chair for ACM@UIUC. She will be working in Chicago as a SWE intern at JPMC in Spring and Summer 2025.
- Focus: Field-Based Partitioner, packet counter tool for testing
- Authors
- Introduction
- Keywords
- About
- Current Architecture
- Getting Started
- User Guide
- Tools & Resources
- File Structure
- Project Components
- Projected Architecture
- Roadmap
- License
The IEX Market Data PCAP Compressor represents a novel approach to managing and optimizing high-volume market data storage through specialized compression techniques. This project addresses the growing challenges faced by financial institutions and market data providers in efficiently storing and processing large volumes of IEX market data packets while maintaining data integrity and accessibility.
- PCAP (Packet Capture): A file format used to save network traffic data captured during network analysis. It contains detailed information about packets transmitted over a network, essential for understanding and debugging communication protocols.
- IEX (Investors Exchange): A U.S. stock exchange that provides detailed market data feeds such as DEEP, which include trading messages, price updates, and system event data.
- Lossless Compression: A data compression method that reduces file size without any loss of original data, ensuring perfect reconstruction of the original content.
- Multi-threading: A programming technique allowing multiple parts of a program (threads) to run concurrently, significantly improving performance for resource-intensive operations like compression.
- Field-Based Partitioning: A method of organizing data based on specific fields (e.g., IP address, timestamp) for better statistical analysis and compression efficiency.
- Delta Compression: A technique that stores differences (or deltas) between sequential data points instead of entire data sets, reducing redundancy and storage requirements.
- Payload: The actual data or message content transmitted within a network packet, excluding headers or metadata.
- Trading Message: A standardized communication format used in financial markets to transmit information about trades, orders, and market conditions. It typically contains essential data such as price, volume, and transaction details, enabling efficient exchange of information between market participants.
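As a concrete illustration of the trading messages this tool processes, the sketch below models a DEEP-style trade report as a packed struct. The field layout is an assumption loosely based on the public IEX DEEP specification, and should be verified against the official documentation before use:

```cpp
#include <cstdint>

// Illustrative sketch of an IEX DEEP trade report message.
// Field names and widths are assumptions modeled on the public DEEP
// specification; verify against the official documentation.
#pragma pack(push, 1)
struct TradeReportMessage {
    char     message_type;    // 'T' for trade report
    uint8_t  sale_condition;  // sale condition flags
    uint64_t timestamp_ns;    // nanoseconds since the POSIX epoch
    char     symbol[8];       // space-padded security identifier
    uint32_t size;            // trade size in shares
    int64_t  price;           // fixed-point price, units of 1e-4 USD
    uint64_t trade_id;        // exchange-assigned trade identifier
};
#pragma pack(pop)

static_assert(sizeof(TradeReportMessage) == 38, "unexpected padding");
```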
In today's high-frequency trading environment, managing and storing massive volumes of market data efficiently is crucial. Our system provides a sophisticated solution for compressing IEX market data feeds while ensuring:
- Lossless data compression
- High-speed processing through multi-threading
- Maintenance of data integrity
- Easy accessibility of compressed data
Built using modern C++17/20 and Python, with robust libraries such as libpcap and Boost, the system offers a scalable and maintainable solution for market data compression.
Our solution combines four innovative compression approaches and is parallelizable through a multi-threaded architecture. At its core, a Trading Message Partitioner efficiently processes and categorizes IEX market data packets based on message types, from system events to price updates, creating optimized streams for different categories of market messages. This is complemented by a sophisticated Payload Compression Engine that analyzes and compresses repetitive payload information while maintaining protocol validation, working alongside a Field-Based Partitioner that categorizes packets across multiple dimensions including IP addresses, ports, protocols, and timestamps.
The system's multi-threaded design enables parallel processing of these different compression algorithms, optimizing performance and resource utilization while maintaining the integrity of the market data. A Delta Compression component completes the architecture, efficiently encoding differences between consecutive packets rather than storing complete packet information, significantly reducing storage requirements while ensuring lossless reconstruction. This comprehensive approach not only addresses current market data storage challenges but, thanks to its modular design, also provides a flexible foundation for future enhancements, including adaptation in later iterations to market data formats from other exchanges.
```mermaid
flowchart LR
subgraph Input
U[User Option]
P[PCAP File]
end
subgraph Compression
A[Algorithm Selector]
A1[Algo 1]
A2[Algo 2]
A3[Algo 3]
A4[Algo 4]
Z[ZLIB]
end
subgraph Output
C[Compressed File]
end
subgraph Decompression
DZ[ZLIB Decompress]
DA[Algorithm Decompress]
end
subgraph Result
R[Reconstructed PCAP]
end
U & P --> A
A --> |Option 1| A1
A --> |Option 2| A2
A --> |Option 3| A3
A --> |Option 4| A4
A1 & A2 & A3 & A4 --> Z
Z --> C
C --> DZ
DZ --> DA
DA --> R
style A fill:#f9f,stroke:#333
style Z fill:#bbf,stroke:#333
style DZ fill:#bbf,stroke:#333
style DA fill:#f9f,stroke:#333
```
We offer four separate compression algorithms, each trading off compression speed against maximal space savings. From the command line, the user supplies a PCAP file and an option indicating which algorithm to run. We run the PCAP through the chosen algorithm, further compress the result with zlib, and output the compressed file.
To decompress, the user runs the tool again, indicating which decompression algorithm to use. The tool first decompresses the file with zlib, then applies the matching custom decompression algorithm, enabling us to losslessly reconstruct the original PCAP file.
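The zlib stage of this pipeline can be sketched as follows (the `compress2`/`uncompress` calls are real zlib APIs; the surrounding buffer handling is illustrative rather than the project's actual code):

```cpp
#include <zlib.h>
#include <stdexcept>
#include <vector>

// Compress the output of a custom algorithm with zlib (deflate).
std::vector<unsigned char> zlibCompress(const std::vector<unsigned char>& input) {
    uLongf destLen = compressBound(input.size());
    std::vector<unsigned char> out(destLen);
    if (compress2(out.data(), &destLen, input.data(), input.size(),
                  Z_BEST_COMPRESSION) != Z_OK)
        throw std::runtime_error("zlib compression failed");
    out.resize(destLen);
    return out;
}

// Inverse step: inflate back to the algorithm's intermediate format.
// The caller must know the original size, e.g. from a stored header.
std::vector<unsigned char> zlibDecompress(const std::vector<unsigned char>& input,
                                          size_t originalSize) {
    std::vector<unsigned char> out(originalSize);
    uLongf destLen = originalSize;
    if (uncompress(out.data(), &destLen, input.data(), input.size()) != Z_OK)
        throw std::runtime_error("zlib decompression failed");
    out.resize(destLen);
    return out;
}
```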
1. Clone the Repository

   ```bash
   git clone https://gitlab.engr.illinois.edu/ie421_hft_fall_2024_group_06/group_06_project.git
   cd group_06_project
   ```

2. Install Dependencies

   ```bash
   # Install system dependencies
   sudo apt-get update
   sudo apt-get install -y build-essential cmake python3.11 python3-pip

   # Install library dependencies
   sudo apt-get install -y libpcap-dev libboost-all-dev zlib1g-dev

   # Install Python dependencies
   pip3 install pathlib tqdm psutil pyshark
   ```

3. Build the Project

   ```bash
   g++ -std=c++17 main.cpp -o compression_tool -lpcap -I/usr/local/include
   ```
To compress:

```bash
./all <COMPRESS> <input.pcap> <pick one option: -a -b -c -d>
```

To decompress:

```bash
./all <DECOMPRESS> <input.pcap> <pick one option: -a -b -c -d>
```

- `-a`: delta compression
- `-b`: payload compression
- `-c`: field-based compression
- `-d`: trading message partitioner
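For example, to run delta compression on the bundled sample capture and then reconstruct it (the exact file name here is hypothetical; check the `data/` directory):

```bash
./all COMPRESS data/sample.pcap -a      # custom delta pass, then zlib
./all DECOMPRESS data/sample.pcap -a    # zlib inflate, then delta decode
```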
- Python 3.11: Used for developing the randomized PCAP generator and data analysis scripts
- C++17/20: Primary language for implementing core compression algorithms and performance-critical components
- Bash: Used for automation scripts and build system management
- C18: Required for low-level PCAP handling and system interactions
- Google Colab: Enabled collaborative development and testing of Python components
- Git/GitLab: Version control and collaborative development platform
- GDB: Debug tool for C++ components and memory analysis
- TSan: Thread Sanitizer for detecting race conditions in multi-threaded code
- Virtual Machine: Used for testing compression performance against large PCAP datasets
- YubiKey: Hardware security key for secure SSH authentication to development servers
- GNU Compiler for C++: Primary compiler providing optimized binary generation
- CMake 3.20+: Cross-platform build system configuration and management
- libpcap: Essential for reading and processing PCAP network capture files
- Zlib: Provides core compression algorithms and utilities
- pcap: Required for packet capture and analysis functionality
- Pathlib: Python library for cross-platform file path handling
- Tqdm: Progress bar implementation for long-running operations
- Psutil: System and process monitoring during compression
- Pyshark: Python wrapper for Wireshark dissectors
- pip 23.2.1: Used for installing dependencies in the program that generates randomized PCAP data
- Catch2: Modern C++ testing framework for unit and integration testing
- Wireshark: Used for analyzing PCAP data
- IEX DEEP book data: Used for identifying specific trading data
- Prof. David Lariviere: Professor customer service! Set up our VM and answered our questions.
- Laptop with at least 50 GB of free disk space
```
build/
data/          # Pre-loaded sample pcap of shortened IEX data for the user to utilize
docs/          # Author resumes
src/           # Heart of the project
  algorithms/
    delta_compression/
    dict_trading_msg/
    field_compression/
    kb_compressor/
  testing/     # Comparator scripts
Makefile
main.cpp
README.md
```
The project follows a modular architecture with four main components:
- Trading Message Partitioner
  - Processes IEX market data packets
  - Dynamic multi-threaded implementation that categorizes packets by trading message type
  - Creates a separate compressed output stream (.bin) for each message type
  - Zips the compressed .bin files
- Payload Compression Engine
  - Analyzes IEX trading messages
  - Compresses repetitive payload information
  - Manages protocol validation
  - Handles template storage
- Field-Based Compressor
  - Iterates through packets using pcap_next, extracting headers and payloads
  - Extracts and stores common fields from the first packet as a template
  - Verifies each subsequent packet matches the template
  - Saves only the payload and metadata
- Delta Compression System (see the sketch after this list)
  - Implements efficient delta encoding for timestamps and sizes
  - Takes values from the first packet, then delta-encodes the remaining packets in the file against them
  - Stores the initial value in the filename of the output
  - Provides lossless reconstruction
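A minimal sketch of the delta-encoding idea behind the last component (illustrative only, not the project's actual code; as noted above, the real tool stores the initial value in the output filename rather than in the stream):

```cpp
#include <cstdint>
#include <vector>

// Delta-encode packet timestamps: keep the first value as a base,
// then store only each packet's difference from its predecessor.
std::vector<int64_t> deltaEncode(const std::vector<int64_t>& ts) {
    std::vector<int64_t> deltas;
    deltas.reserve(ts.size());
    int64_t prev = 0;
    for (int64_t t : ts) {
        deltas.push_back(t - prev);  // first entry is the base value itself
        prev = t;
    }
    return deltas;
}

// Lossless inverse: a running sum restores the original timestamps.
std::vector<int64_t> deltaDecode(const std::vector<int64_t>& deltas) {
    std::vector<int64_t> ts;
    ts.reserve(deltas.size());
    int64_t acc = 0;
    for (int64_t d : deltas) {
        acc += d;
        ts.push_back(acc);
    }
    return ts;
}
```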
The ideal system implements a three-layer architecture:
```mermaid
flowchart TD
subgraph Input[Input Layer]
PCAP[PCAP Files]
end
subgraph Processing[Processing Layer]
subgraph ParallelComp[Parallel Compression Components]
DC[Delta Compression]
PCE[Payload Compression]
FBP[Field-Based Compression]
end
subgraph Final[Final Compression]
TMP[Trading Message Partitioner]
end
end
subgraph Output[Output Layer]
CO[Compressed Output Files]
Stats[Performance Statistics]
end
PCAP --> DC & PCE & FBP
DC --> TMP
PCE --> TMP
FBP --> TMP
TMP --> CO
TMP --> Stats
classDef processing fill:#f9f,stroke:#333,stroke-width:2px
classDef input fill:#bbf,stroke:#333,stroke-width:2px
classDef output fill:#bfb,stroke:#333,stroke-width:2px
class ParallelComp,Final processing
class Input input
class Output output
```
The architecture consists of:
- Input Layer:
  - Handles PCAP file ingestion
  - Validates input data integrity
- Processing Layer:
  - Parallel Compression Components:
    - Delta Compression: compresses differences between consecutive packets
    - Payload Compression Engine: compresses repetitive payload data
    - Field-Based Partitioner: organizes data by fields
  - Final Compression:
    - Trading Message Partitioner: categorizes market data messages
- Output Layer:
  - Produces compressed output files
  - Generates performance statistics and logs
- Testing on larger datasets was difficult: we crashed the VM and ran into other resource constraints
- Merging our algorithms was difficult because of all the custom file types we were using
In an ideal system, a user could run a single PCAP through all the algorithms in succession to maximize the storage savings on the original file.
The current implementation of our tool is specifically tailored to IEX DEEP market data feeds. Going forward, development would focus on supporting other exchanges' data, as well as IEX TOPS, so that the compression system can handle market data from multiple exchanges.
By implementing a thread pool architecture, these algorithms - Delta Compression, Payload Compression, and Field-Based Compression - can operate concurrently. An optimal thread allocation would be 2-3 threads per compression component, suggesting a total pool of 6-9 threads for the core compression operations. This parallel implementation would significantly reduce latency.
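A minimal sketch of such a thread pool, using only the C++ standard library (illustrative; the class and method names are our own, not part of the current codebase):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size worker pool: compression tasks (delta, payload, field-based)
// are enqueued and executed concurrently by the worker threads.
class ThreadPool {
public:
    explicit ThreadPool(size_t n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx_);
                        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // run outside the lock so workers stay parallel
                }
            });
    }
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mtx_;
    std::condition_variable cv_;
    bool stop_ = false;
};

// Example: 2 threads per component across three components.
// ThreadPool pool(6);
// pool.enqueue([] { /* run delta compression on one chunk */ });
```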
A promising integration to our compression system would be the implementation of Dynamic Markov Chain modeling specifically for price data compression. This approach would analyze price movement patterns and create adaptive prediction models for more efficient encoding of price changes, all losslessly. Given the inherent patterns in market price movements and the success of Markov models in similar applications, we estimate this addition could improve our total compression ratio for price-related messages, particularly during periods of regular trading activity.
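As a rough illustration of the idea, the sketch below maintains an order-1 adaptive model over price-change direction; the type and method names are hypothetical, and a production version would feed these probabilities into an entropy coder:

```cpp
#include <array>
#include <cstdint>

// Order-1 adaptive Markov model over price-change direction.
// Counts transitions between {down, flat, up} so an entropy coder
// could assign shorter codes to the more likely next move.
class PriceMoveModel {
public:
    // 0 = down, 1 = flat, 2 = up
    static int classify(int64_t delta) {
        return delta < 0 ? 0 : (delta == 0 ? 1 : 2);
    }
    void update(int64_t delta) {
        int next = classify(delta);
        counts_[prev_][next]++;
        prev_ = next;
    }
    // Estimated probability of the next move given the last one,
    // with add-one smoothing so no transition has zero probability.
    double probability(int next) const {
        uint64_t total = 0;
        for (uint64_t c : counts_[prev_]) total += c;
        return double(counts_[prev_][next] + 1) / double(total + 3);
    }
private:
    std::array<std::array<uint64_t, 3>, 3> counts_{};
    int prev_ = 1;  // start from "flat"
};
```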
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This project is actively maintained by the University of Illinois Urbana-Champaign High Frequency Trading Team. For support or questions, please open an issue in the GitLab repository.