
IEX Market Data Feeds Compression

A high-performance, multi-threaded compression system for IEX market data feeds, optimizing storage while maintaining data integrity and accessibility.

Report Bug · Request Feature

Authors

Foziea Garada · LinkedIn · Resume

  • Bio: Foziea is a senior studying Computer Science at the University of Illinois Urbana-Champaign, as well as a published researcher, TEDx speaker, and startup cofounder. She has years of experience in the security and networking space, including internships at the MIT Lincoln Lab and Forcepoint Federal, as well as in the financial technology sector, such as with Jane Street and Synchrony. She is also very passionate about equity and education, having researched state-wide STEM education reforms for underrepresented populations across the country. Foziea starts to feel self-conscious after writing a few sentences about herself in the third person, and encourages anyone reading this not to be shy about connecting on LinkedIn or saying hi in person!
  • Focus: Delta Compression implementation, command-line interface, PCAP comparison/testing

KB Cho · LinkedIn · Resume

  • Bio: KB is an undergraduate Computer Science Major at the University of Illinois Urbana-Champaign. He has a strong interest in systems-level software development, machine learning, and financial technology.
  • Focus: Payload Compression, creating randomized sample pcap generator

Shria Halkoda · LinkedIn · Resume

  • Bio: Shria is a Computer Science Major at the University of Illinois Urbana-Champaign with a minor in Statistics, focusing on optimization technologies! She is an eager engineer with a passion for developing innovative solutions. Through past internships and research roles, Shria has honed her skills by implementing machine learning web tools, building augmented reality apps, and leading the high-performance computing systems at the EPOCH Machine Learning Cluster. Outside of her academic and technical pursuits, Shria is deeply passionate about sustainable innovation, exemplified by her work with Tobelli LLC, where she leverages biodegradable packaging solutions to empower underserved farming communities. She is always open to new connections and opportunities! She will be working in Chicago as a CME Group SWE Intern in Summer 2025.
  • Focus: Trading Message Partitioner, creating tool for recording metrics

Jasmine Liu · LinkedIn · Resume

  • Bio: Jasmine is an undergraduate Computer Science Major at the University of Illinois Urbana-Champaign with a minor in Statistics. She is a motivated problem solver interested in high-frequency trading, machine learning, finance, and sustainable design, and passionate about diversity and mentorship. In the past, she has interned at the U.S. Navy and Stanford University doing machine learning and robotics research. She also serves as a Data Structures Course Assistant and as Website Lead and Mentorship Chair for ACM@UIUC. She will be working in Chicago as a SWE intern at JPMC in Spring and Summer 2025.
  • Focus: Field-Based Partitioner, packet counter tool for testing


Introduction

The IEX Market Data PCAP Compressor represents a novel approach to managing and optimizing high-volume market data storage through specialized compression techniques. This project addresses the growing challenges faced by financial institutions and market data providers in efficiently storing and processing large volumes of IEX market data packets while maintaining data integrity and accessibility.

Keywords

PCAP (Packet Capture)

A file format used to save network traffic data captured during network analysis. It contains detailed information about packets transmitted over a network, essential for understanding and debugging communication protocols.

IEX (Investors Exchange)

A U.S. stock exchange that provides detailed market data feeds such as DEEP (IEX's depth-of-book feed), which includes trading messages, price updates, and system event data.

Lossless Compression

A data compression method that reduces file size without any loss of original data, ensuring perfect reconstruction of the original content.

Multi-Threading

A programming technique allowing multiple parts of a program (threads) to run concurrently, significantly improving performance for resource-intensive operations like compression.

Field-Based Partitioning

A method of organizing data based on specific fields (e.g., IP address, timestamp) for better statistical analysis and compression efficiency.

Delta Compression

A technique that stores differences (or deltas) between sequential data points instead of entire data sets, reducing redundancy and storage requirements.

Payload

The actual data or message content transmitted within a network packet, excluding headers or metadata.

Trading Message

A standardized communication format used in financial markets to transmit information about trades, orders, and market conditions. It typically contains essential data such as price, volume, and transaction details, enabling efficient exchange of information between market participants.

About

In today's high-frequency trading environment, managing and storing massive volumes of market data efficiently is crucial. Our system provides a sophisticated solution for compressing IEX market data feeds while ensuring:

  • Lossless data compression
  • High-speed processing through multi-threading
  • Maintenance of data integrity
  • Easy accessibility of compressed data

Built using modern C++17/20 and Python, with robust libraries such as libpcap and Boost, the system offers a scalable and maintainable solution for market data compression.

Our solution combines four innovative compression approaches and is parallelizable through a multi-threaded architecture. At its core, a Trading Message Partitioner efficiently processes and categorizes IEX market data packets based on message types, from system events to price updates, creating optimized streams for different categories of market messages. This is complemented by a sophisticated Payload Compression Engine that analyzes and compresses repetitive payload information while maintaining protocol validation, working alongside a Field-Based Partitioner that categorizes packets across multiple dimensions including IP addresses, ports, protocols, and timestamps.

The system's multi-threaded design enables parallel processing of these different compression algorithms, optimizing performance and resource utilization while maintaining the integrity of the market data. A Delta Compression component completes the architecture, efficiently encoding differences between consecutive packets rather than storing complete packet information, significantly reducing storage requirements while ensuring lossless reconstruction. This comprehensive approach not only addresses current market data storage challenges but also provides a flexible foundation for future enhancements and adaptations to evolving market data formats and requirements. Its modular nature allows the system to be extended in future iterations to market data from other exchanges.

Current Architecture

```mermaid
flowchart LR
    subgraph Input
        U[User Option]
        P[PCAP File]
    end

    subgraph Compression
        A[Algorithm Selector]
        A1[Algo 1]
        A2[Algo 2]
        A3[Algo 3]
        A4[Algo 4]
        Z[ZLIB]
    end

    subgraph Output
        C[Compressed File]
    end

    subgraph Decompression
        DZ[ZLIB Decompress]
        DA[Algorithm Decompress]
    end

    subgraph Result
        R[Reconstructed PCAP]
    end

    U & P --> A
    A --> |Option 1| A1
    A --> |Option 2| A2
    A --> |Option 3| A3
    A --> |Option 4| A4
    A1 & A2 & A3 & A4 --> Z
    Z --> C
    C --> DZ
    DZ --> DA
    DA --> R

    style A fill:#f9f,stroke:#333
    style Z fill:#bbf,stroke:#333
    style DZ fill:#bbf,stroke:#333
    style DA fill:#f9f,stroke:#333
```

We offer four separate compression algorithms that compress at different speeds with different maximal space savings. From the command line, the user supplies a pcap file along with an option indicating which algorithm to run. We run the pcap through the chosen algorithm, then further compress the result with zlib. Lastly, we output the compressed file.

To decompress, the user again runs the tool, indicating which decompression algorithm to use. Our decompression path first decompresses the file with zlib, then decompresses that output using the matching custom decompression algorithm, enabling us to losslessly reconstruct the original pcap file.

Getting Started

  1. Clone the Repository

    git clone https://gitlab.engr.illinois.edu/ie421_hft_fall_2024_group_06/group_06_project.git
    cd group_06_project
  2. Install Dependencies

    # Install system dependencies
    sudo apt-get update
    sudo apt-get install -y build-essential cmake python3.11 python3-pip
    
    # Install library dependencies
    sudo apt-get install -y libpcap-dev libboost-all-dev zlib1g-dev
    
    # Install Python dependencies
    pip3 install tqdm psutil pyshark   # pathlib ships with the Python 3 standard library
  3. Build the Project

    g++ -std=c++17 main.cpp -o compression_tool -lpcap -lz -I/usr/local/include
    

User Guide

To compress:

./all <COMPRESS> <input.pcap> <pick one option: -a -b -c -d>

To decompress:

./all <DECOMPRESS> <input.pcap> <pick one option: -a -b -c -d>

  • -a: delta compression
  • -b: payload compression
  • -c: field-based compression
  • -d: trading message partitioner

Tools & Resources

I. Tools

A. Programming Languages

  • Python 3.11: Used for developing the randomized PCAP generator and data analysis scripts
  • C++17/20: Primary language for implementing core compression algorithms and performance-critical components
  • Bash: Used for automation scripts and build system management
  • C18: Required for low-level PCAP handling and system interactions

B. Development Tools

  • Google Colab: Enabled collaborative development and testing of Python components
  • Git/GitHub: Version control and collaborative development platform
  • GDB: Debug tool for C++ components and memory analysis
  • TSan: Thread Sanitizer for detecting race conditions in multi-threaded code
  • Virtual Machine: Used for testing compression performance against large PCAP datasets
  • YubiKey: Hardware security key for secure SSH authentication to development servers

C. Build System

  • GNU Compiler for C++: Primary compiler providing optimized binary generation
  • CMake 3.20+: Cross-platform build system configuration and management

D. Libraries

  • libpcap: Essential for reading and processing PCAP network capture files
  • Zlib: Provides core compression algorithms and utilities
  • pcap: Required for packet capture and analysis functionality
  • Pathlib: Python library for cross-platform file path handling
  • Tqdm: Progress bar implementation for long-running operations
  • Psutil: System and process monitoring during compression
  • Pyshark: Python wrapper for Wireshark dissectors

E. Management System

  • pip 23.2.1: Used for installing dependencies in the program that generates randomized PCAP data

F. Testing Frameworks

  • Catch2: Modern C++ testing framework for unit and integration testing

II. Resources

  • Wireshark: Used for analyzing PCAP data
  • IEX DEEP specification: Used for identifying specific trading data
  • Prof. David Lariviere: Professor and customer service! He set up our VM and answered our questions!

III. Suggested Hardware

  • Laptop with at least 50 GB of free disk space

File Structure

  • build/
  • data/ - Pre-loaded sample pcap of shortened IEX Data for the user to utilize
  • docs/ - author resumes
  • src/ - Heart of the project
    • algorithms/
      • delta_compression/
      • dict_trading_msg/
      • field_compression/
      • kb_compressor/
    • testing/ - comparator scripts
    • Makefile
    • main.cpp
  • README.md

Project Components

The project follows a modular architecture with four main components:

  1. Trading Message Partitioner

    • Processes IEX market data packets
    • Dynamic multi-threaded implementation that categorizes packets by trading message type
    • Creates separate compressed output streams (.bin) for each message type
    • Zips the compressed .bin files
  2. Payload Compression Engine

    • Analyzes IEX trading messages
    • Compresses repetitive payload information
    • Manages protocol validation
    • Handles template storage
  3. Field-Based Compression

    • Iterates through packets using pcap_next, extracting headers and payloads.
    • Extracts and stores common fields from the first packet as a template
    • Verifies that each subsequent packet matches the template
    • Saves only the payload and metadata
  4. Delta Compression System

    • Implements efficient delta encoding for timestamps and size
    • Takes the value from the first packet, then delta-encodes the remaining packets in the file against it
    • Initial value is stored in the filename of the output
    • Provides lossless reconstruction

Projected Architecture

The ideal system implements a three-layer architecture:

```mermaid
flowchart TD
    subgraph Input[Input Layer]
        PCAP[PCAP Files]
    end

    subgraph Processing[Processing Layer]
        subgraph ParallelComp[Parallel Compression Components]
            DC[Delta Compression]
            PCE[Payload Compression]
            FBP[Field-Based Compression]
        end

        subgraph Final[Final Compression]
            TMP[Trading Message Partitioner]
        end
    end

    subgraph Output[Output Layer]
        CO[Compressed Output Files]
        Stats[Performance Statistics]
    end

    PCAP --> DC & PCE & FBP
    DC --> TMP
    PCE --> TMP
    FBP --> TMP
    TMP --> CO
    TMP --> Stats

    classDef processing fill:#f9f,stroke:#333,stroke-width:2px
    classDef input fill:#bbf,stroke:#333,stroke-width:2px
    classDef output fill:#bfb,stroke:#333,stroke-width:2px

    class ParallelComp,Final processing
    class Input input
    class Output output
```

The architecture consists of:

  1. Input Layer:

    • Handles PCAP file ingestion
    • Validates input data integrity
  2. Processing Layer:

    • Parallel Compression Components:
      • Delta Compression: Compresses differences between consecutive packets
      • Payload Compression Engine: Compresses repetitive payload data
      • Field-Based Partitioner: Organizes data by fields
    • Final:
      • Trading Message Partitioner: Categorizes market data messages
  3. Output Layer:

    • Produces compressed output files
    • Generates performance statistics and logs

Roadmap

Challenges

  • Testing on larger datasets was difficult because we crashed the VM, among other constraints
  • Merging our algorithms was difficult because of all the custom file types we were using

Next Steps!

Successive Compression

In the most optimal system, a user would be able to run a single pcap through all the algorithms in succession to maximize the storage savings on the original pcap.

Exchange Modularity

The current implementation of our tool is tailored specifically to IEX DEEP (depth-of-book) market data feeds. Going forward, development would focus on supporting other exchanges' data, as well as IEX TOPS, so that the compression system can handle market data from multiple exchanges.

Multithreaded Implementation

By implementing a thread pool architecture, these algorithms - Delta Compression, Payload Compression, and Field-Based Compression - can operate concurrently. An optimal thread allocation would be 2-3 threads per compression component, suggesting a total pool of 6-9 threads for the core compression operations. This parallel implementation would significantly reduce latency.

Dynamic Markov Chaining for Price Compression

A promising integration to our compression system would be the implementation of Dynamic Markov Chain modeling specifically for price data compression. This approach would analyze price movement patterns and create adaptive prediction models for more efficient encoding of price changes, all losslessly. Given the inherent patterns in market price movements and the success of Markov models in similar applications, we estimate this addition could improve our total compression ratio for price-related messages, particularly during periods of regular trading activity.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Note: This project is actively maintained by the University of Illinois Urbana-Champaign High Frequency Trading Team. For support or questions, please open an issue in the GitLab repository.
