Reinforcement Learning for Trade Execution in Limit Order Books

This repository provides an implementation of a logistic-normal policy for optimal trade execution using reinforcement learning, based on the recent publications cited under References and Attribution below.

The Trade Execution Problem

When an institutional trader needs to sell a large position, they face a fundamental trade-off:

  • Immediate execution through market orders provides certainty but incurs significant market impact costs (Almgren & Chriss, 2001)
  • Patient execution through limit orders reduces impact but risks non-execution and adverse price movements (Cartea & Jaimungal, 2015)
  • Time constraints impose a deadline by which the position must be liquidated

Traditional approaches like Time-Weighted Average Price (TWAP) or Volume-Weighted Average Price (VWAP) use predetermined schedules that don't adapt to market conditions. This implementation uses reinforcement learning to discover adaptive policies that respond to real-time market dynamics, building on recent advances in RL for optimal execution (Ning et al., 2021; Hafsi & Vittori, 2024).

Scientific Contributions

This codebase implements a "Logistic-Normal (LN) style" policy parameterization (Cheridito & Weiss, 2025) that addresses several technical challenges in applying RL to trade execution:

  1. Action space design: The agent's actions must satisfy the constraint that allocations sum to at most the remaining inventory. This is achieved through a simplex action space, extending ideas from Pan et al. (2022) on hybrid action spaces.

  2. Continuous-discrete mapping: The policy outputs continuous values that are mapped to discrete lot allocations through a bijective transformation, similar to but more elegant than the dual-policy approach in Pan et al. (2022).

  3. Market microstructure modeling: The LOB simulator captures essential features including price-time priority, queue dynamics, and multiple trader types, following the agent-based approach of Karpe et al. (2020).

Mathematical Framework

Policy Parameterization

The core innovation is the use of a Logistic-Normal distribution to parameterize actions on the K-dimensional probability simplex S_K, i.e., allocation vectors with K+1 non-negative components summing to one. Given a state s, the policy network outputs parameters μ(s) ∈ ℝᴷ, and actions are sampled via:

  1. Sample x ~ N(μ(s), σ²I) from a multivariate Gaussian
  2. Transform x to the simplex via the bijection h: ℝᴷ → S_K

The transformation h is defined component-wise, with all sums running over j = 0, …, K−1, as:

h(x)ₖ = exp(xₖ) / (1 + Σⱼ exp(xⱼ)),  for k = 0, …, K−1
h(x)_K = 1 / (1 + Σⱼ exp(xⱼ))

This ensures actions always form a valid probability distribution over allocation choices.
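
A minimal NumPy sketch of this transformation (illustrative only; the repository's version presumably lives in rlte/utils.py and may differ in detail):

import numpy as np

def h(x):
    """Map x in R^K to the K-simplex: K+1 non-negative components summing to 1."""
    z = np.concatenate([x, [0.0]])   # append the implicit zero logit for the last component
    z = z - z.max()                  # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

# Example: a 6-dimensional Gaussian sample mapped to a 7-component allocation vector
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=6)   # x ~ N(mu(s), sigma^2 I) with mu = 0, sigma = 1
a = h(x)
print(a, a.sum())                            # components are positive and sum to 1.0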

Reward Structure

The reward function balances execution price against market impact:

r = r̄ - γₙ × p_b(0) / M₀

where:

  • r̄ is the cash flow from executed trades
  • γₙ is the number of lots sold in step n
  • p_b(0) is the initial best bid price
  • M₀ is the initial inventory

This formulation penalizes aggressive trading that moves prices unfavorably while rewarding successful execution.
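
A direct transcription of this reward into code, purely for illustration (argument names are hypothetical and the repository's env_lob.py may apply a different normalization):

def step_reward(cash_flow, lots_sold_n, best_bid_0, M0):
    """r = r_bar - gamma_n * p_b(0) / M0, with the quantities defined above."""
    return cash_flow - lots_sold_n * best_bid_0 / M0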

Policy Optimization

The policy parameters θ are optimized using the REINFORCE algorithm with a learned value-function baseline, following the actor-critic approach shown to be effective by Micheli & Monod (2024):

∇_θ L_π = 𝔼[∇_θ log φ_θ(x|s) · A(s,a)]

where φ_θ is the Gaussian density and A is the advantage function estimated via Monte Carlo returns. This approach differs from the Double Deep Q-Learning used in Ning et al. (2021) by directly optimizing a continuous policy rather than discretizing the action space.
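
A compact PyTorch sketch of this update, written as a generic REINFORCE-with-baseline step rather than a copy of train_ln.py (tensor names and the shared optimizer are assumptions):

import torch
import torch.nn.functional as F

def policy_gradient_step(log_probs, returns, values, optimizer):
    """One REINFORCE update with a learned value-function baseline.

    log_probs: log phi_theta(x|s) of the sampled Gaussian pre-actions, shape (batch,)
    returns:   Monte Carlo returns, shape (batch,)
    values:    critic estimates V(s), shape (batch,)
    """
    advantages = (returns - values).detach()        # A(s, a); no gradient flows into the baseline
    policy_loss = -(log_probs * advantages).mean()  # minimizing this ascends E[grad log phi * A]
    value_loss = F.mse_loss(values, returns)
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()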

System Architecture

rlte/
├── config.py              # Global parameters and hyperparameters
├── env_lob.py            # Limit order book simulation environment
├── utils.py              # Mathematical transformations and utilities
├── train_ln.py           # Training loop with vectorized execution
├── baselines.py          # Reference strategies (TWAP, Submit & Leave)
├── evaluate.py           # Performance evaluation framework
└── agents/
    └── ln_actor_critic.py  # Neural network architecture

Core Components

Order Book Environment (env_lob.py)

  • Implements a realistic LOB with price-time priority matching
  • Tracks agent queue positions for accurate fill probability modeling
  • Simulates three types of market participants:
    • Noise traders: Random arrivals following Poisson processes
    • Tactical traders: Momentum-based traders responding to order imbalance
    • Strategic traders: Large traders with scheduled execution

Policy Network (ln_actor_critic.py)

  • Actor network: Maps states to Gaussian parameters μ(s)
  • Critic network: Estimates state values V(s) for variance reduction
  • Orthogonal initialization with specialized bias settings for stable training
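
A condensed sketch of such an actor-critic pair with orthogonal initialization (the layer count, tanh activations, and small output gain are assumptions; the actual architecture in agents/ln_actor_critic.py may differ):

import torch
import torch.nn as nn

def ortho_init(layer, gain=1.0, bias=0.0):
    # Orthogonal weights plus a constant bias, as described above.
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, bias)
    return layer

class LNActorCritic(nn.Module):
    def __init__(self, state_dim, K, hidden=128):
        super().__init__()
        self.actor = nn.Sequential(
            ortho_init(nn.Linear(state_dim, hidden)), nn.Tanh(),
            ortho_init(nn.Linear(hidden, hidden)), nn.Tanh(),
            # mu(s); the bias of -1 follows the note under Numerical Considerations below
            ortho_init(nn.Linear(hidden, K), gain=0.01, bias=-1.0),
        )
        self.critic = nn.Sequential(
            ortho_init(nn.Linear(state_dim, hidden)), nn.Tanh(),
            ortho_init(nn.Linear(hidden, hidden)), nn.Tanh(),
            ortho_init(nn.Linear(hidden, 1)),        # V(s)
        )

    def forward(self, s):
        return self.actor(s), self.critic(s).squeeze(-1)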

Training Infrastructure (train_ln.py)

  • Parallelized data collection across 128 environments
  • Linear variance schedule: σ decreases from 1.0 to 0.1 over training
  • Efficient batched gradient computation
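
The variance schedule can be written in a few lines; this sketch assumes the 400-iteration horizon mentioned under Training Methodology:

def sigma_schedule(iteration, total_iters=400, sigma_start=1.0, sigma_end=0.1):
    """Linearly anneal the policy standard deviation from sigma_start to sigma_end."""
    frac = min(iteration / total_iters, 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

# sigma_schedule(0) == 1.0, sigma_schedule(200) == 0.55, sigma_schedule(400) == 0.1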

Market Models

The implementation includes three increasingly complex market environments, following the progression from pure noise to multi-agent systems demonstrated in Karpe et al. (2020) and refined in Cheridito & Weiss (2025):

Noise Market

A baseline environment with only random trading activity. Order arrivals follow Poisson processes with intensities calibrated to real market data (Abergel & Jedidi, 2013). This environment tests the agent's ability to optimize execution timing without strategic opponents.

Tactical Market

Extends the noise market with momentum traders who respond to order book imbalance, similar to the adaptive agents in de Meer Pardo et al. (2022). The imbalance measure I is computed as:

I = (V_bid - V_ask) / (V_bid + V_ask)

where volumes are weighted by exponentially decaying factors. Tactical traders increase activity in the direction of detected momentum, creating more realistic price dynamics.
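
A sketch of the exponentially weighted imbalance (the decay constant and exact weighting used in the repository may differ):

import numpy as np

def order_book_imbalance(bid_volumes, ask_volumes, decay=0.5):
    """I = (V_bid - V_ask) / (V_bid + V_ask), with deeper levels down-weighted exponentially.

    bid_volumes[0] and ask_volumes[0] are the volumes at the best quotes.
    """
    w = np.exp(-decay * np.arange(len(bid_volumes)))
    v_bid = float(np.dot(w, bid_volumes))
    v_ask = float(np.dot(w, ask_volumes))
    return (v_bid - v_ask) / (v_bid + v_ask + 1e-12)   # epsilon guards an empty book

# Example: more resting volume on the bid side gives a positive (buy-pressure) imbalance
print(order_book_imbalance([120, 80, 60], [70, 65, 50]))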

Strategic Market

The most realistic setting includes large institutional traders executing on fixed schedules. These traders place substantial orders at predetermined times, creating predictable but significant market impact that the agent must learn to navigate.

State Representation

The agent observes both public market information and private execution state, following the microstructure-aware approach of Lin & Beling (2020) but with additional feature engineering based on domain knowledge:

Market Features

  • Price levels: Normalized best bid and ask prices
  • Volume profile: Quantities available at the first K-1 price levels (similar to level-2 data in Lin & Beling, 2020)
  • Order flow: Net market and limit order activity in the previous interval
  • Price dynamics: Mid-price changes capturing momentum

Private Features

  • Time remaining: Fraction of execution horizon elapsed (t/T)
  • Inventory: Remaining lots normalized by initial position (M(t)/M₀)
  • Active orders: Current limit order placements and queue positions
  • Inventory distribution: Allocation across price levels (γ vector)

All features undergo careful normalization to ensure numerical stability during training. Queue positions are clipped to [-50, 50] and price levels to [-K, K] to bound the input space.
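
A small sketch of this normalization step (all names are illustrative; the repository's feature builder may differ):

import numpy as np

def normalize_features(queue_positions, price_levels, inventory, M0, t, T, K):
    """Clip and scale raw observations to a bounded range, as described above."""
    q = np.clip(queue_positions, -50, 50) / 50.0    # queue positions -> [-1, 1]
    p = np.clip(price_levels, -K, K) / float(K)     # price levels    -> [-1, 1]
    inv = inventory / M0                            # remaining lots  -> [0, 1]
    tau = t / T                                     # elapsed horizon -> [0, 1]
    return np.concatenate([q, p, [inv, tau]])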

Training Methodology

Algorithm Overview

The training follows an on-policy actor-critic approach (REINFORCE with a learned value baseline, as described under Policy Optimization), with practical considerations for scalability drawn from Byun et al. (2023):

  1. Collect trajectories using current policy across parallel environments
  2. Compute returns via Monte Carlo estimation over N=10 step windows
  3. Estimate advantages using the learned value function (actor-critic architecture per Micheli & Monod, 2024)
  4. Update networks via gradient ascent on the policy objective
  5. Decay exploration by linearly reducing σ over H=400 iterations
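
A sketch of steps 2 and 3, computing Monte Carlo returns over the N-step window and advantages from the critic (whether the repository bootstraps the tail of the window with the value function is an assumption here):

import numpy as np

def window_returns_and_advantages(rewards, values, bootstrap_value, gamma=1.0):
    """Compute N-step returns and advantages A = G - V(s).

    rewards:         shape (N,), rewards collected under the current policy
    values:          shape (N,), critic estimates V(s_t) for the visited states
    bootstrap_value: scalar estimate V(s_N) used to value the tail of the window
    """
    N = len(rewards)
    returns = np.empty(N)
    g = bootstrap_value
    for t in reversed(range(N)):
        g = rewards[t] + gamma * g      # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    advantages = returns - values       # advantage estimates fed to the policy gradient
    return returns, advantages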

Hyperparameters

Key settings optimized through experimentation:

Parameter   Value   Justification
N_STEPS     10      Balances temporal credit assignment with variance
DT          15.0 s  Matches typical institutional execution granularity
K_SIMPLEX   6       Provides sufficient price level granularity
PAR_ENVS    128     Enables diverse experience collection
HIDDEN      128     Network capacity for function approximation
ADAM_LR     5e-4    Learning rate balanced for stability
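
For reference, these defaults would appear in rlte/config.py roughly as follows (an illustrative excerpt assembled from the table above, not the actual file):

# Illustrative excerpt mirroring the table above; the real rlte/config.py may differ.
N_STEPS   = 10       # Monte Carlo return window (steps)
DT        = 15.0     # decision interval in seconds
K_SIMPLEX = 6        # number of price levels spanned by the simplex action
PAR_ENVS  = 128      # parallel environments for data collection
HIDDEN    = 128      # hidden-layer width of actor and critic networks
ADAM_LR   = 5e-4     # Adam learning rate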

Convergence Monitoring

Training progress is tracked through:

  • Average episodic returns (smoothed with exponential moving average)
  • Policy entropy (implicitly controlled via σ schedule)
  • Value function loss as a proxy for state visitation coverage
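
For the return curve specifically, a simple exponential moving average such as the following can be used for smoothing (an illustrative helper, not taken from the repository):

def ema(values, alpha=0.1):
    """Exponentially smoothed series: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed, s = [], None
    for x in values:
        s = x if s is None else alpha * x + (1 - alpha) * s
        smoothed.append(s)
    return smoothed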

Experimental Setup

Baseline Strategies

The implementation includes two classical baseline strategies for comparison, following evaluation standards from Hafsi & Vittori (2024):

Submit & Leave (SL)

Places the entire inventory as a limit order at the best ask price at t=0. Any unfilled quantity is executed via market order at the terminal time T. This strategy minimizes active management but risks poor execution, serving as a passive baseline.

Time-Weighted Average Price (TWAP)

Divides the inventory into N equal blocks and submits each block over successive time intervals. This provides execution certainty but ignores market conditions. TWAP represents the industry-standard benchmark (Byun et al., 2023).
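
As a concrete illustration of the block split (a simple sketch, not the repository's run_TWAP):

def twap_schedule(M0, n_blocks):
    """Split M0 lots into n_blocks (nearly) equal child orders, front-loading any remainder."""
    base, rem = divmod(M0, n_blocks)
    return [base + (1 if i < rem else 0) for i in range(n_blocks)]

# Example: 20 lots over 8 intervals -> [3, 3, 3, 3, 2, 2, 2, 2]
print(twap_schedule(20, 8))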

The paper also compares against a Dirichlet-policy baseline (not included in this implementation), which uses an alternative probabilistic approach to the simplex action space. The logistic-normal parameterization demonstrates superior performance across all test scenarios.

Performance Metrics

  • Expected execution cost: Mean reward across evaluation episodes
  • Execution risk: Standard deviation of rewards
  • Fill rate: Percentage executed via limit orders vs. market orders
  • Price improvement: Execution price relative to arrival price

Implementation Details

Order Matching Engine

The LOB simulator implements realistic market mechanics:

  • Price-time priority: Orders at better prices execute first; ties broken by arrival time
  • Queue position tracking: Agent orders maintain position relative to non-agent flow
  • Partial fills: Large market orders consume liquidity across multiple price levels
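
A toy illustration of price-time priority with partial fills (the simulator in env_lob.py is considerably richer; this sketch only shows the matching rule):

from collections import deque

def match_market_sell(bids, qty):
    """Match a market sell of `qty` lots against a bid book with price-time priority.

    bids: dict mapping price -> deque of (order_id, size); the best (highest) price is
    matched first and, within a price, the earliest order is filled first.
    Returns a list of (order_id, price, filled).
    """
    fills = []
    for price in sorted(bids, reverse=True):             # better (higher) bids first
        queue = bids[price]
        while queue and qty > 0:
            order_id, size = queue[0]
            filled = min(size, qty)
            fills.append((order_id, price, filled))
            qty -= filled
            if filled == size:
                queue.popleft()                          # order fully consumed
            else:
                queue[0] = (order_id, size - filled)     # partial fill keeps queue priority
        if qty == 0:
            break
    return fills

# Example: a 7-lot market sell sweeps the best bid and partially fills the next level
book = {100: deque([("a", 5)]), 99: deque([("b", 4)])}
print(match_market_sell(book, 7))   # [('a', 100, 5), ('b', 99, 2)]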

Numerical Considerations

Several techniques ensure stable training:

  • Orthogonal initialization: Reduces gradient correlation in deep networks
  • Bias initialization: Policy output layer initialized at -1 to encourage initial exploration
  • Feature normalization: All inputs scaled to approximately [-1, 1] range
  • Gradient clipping: Applied when gradients exceed threshold (not shown in base code)
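
For the last point, global gradient-norm clipping can be added in one line before the optimizer step (the threshold here is an arbitrary illustrative choice, since clipping is not part of the base code):

import torch

def clipped_update(model, optimizer, max_norm=0.5):
    """Clip the global gradient norm, then apply the optimizer step."""
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()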

Computational Efficiency

The implementation prioritizes training speed through:

  • Vectorized environment stepping (128 parallel instances)
  • Batched neural network operations
  • Efficient memory allocation for trajectory storage
  • NumPy/PyTorch operation fusion where possible

Usage Guide

Installation Requirements

# Core dependencies
pip install numpy torch matplotlib

# Optional for GPU acceleration
pip install torch --index-url https://download.pytorch.org/whl/cu118

Training a Model

from rlte.train_ln import train_ln

# Train on tactical market with 60 lot position
results = train_ln(market="tactical", M0=60, device="cuda")

# Access trained policy (requires checkpoint loading)
# policy = load_checkpoint("tactical_M60.pt")

Evaluating Strategies

from rlte.baselines import run_TWAP, run_SL
from rlte.evaluate import eval_all

# Compare baseline strategies
twap_mean, twap_std = run_TWAP(market="noise", M0=20, episodes=1000)
sl_mean, sl_std = run_SL(market="noise", M0=20, episodes=1000)

# Full evaluation across markets
results = eval_all(
    markets=("noise", "tactical", "strategic"),
    lots=(20, 60),
    device="cuda"
)

Custom Market Configuration

from rlte.env_lob import LOBExecutionEnv
import rlte.config as C

# Modify market parameters
C.LAMBDA_M = 0.15  # Increase market order intensity
C.NOISE_SCALE = 2.5  # Increase order size variability

# Create custom environment
env = LOBExecutionEnv(market="tactical", M0=40, seed=42)
state = env.reset()

Research Extensions

This implementation provides a foundation for several research directions:

  1. Alternative policy parameterizations: Explore mixture distributions or normalizing flows
  2. Multi-asset execution: Extend to portfolio liquidation problems
  3. Adversarial training: Include strategic opponents in the training loop
  4. Transfer learning: Adapt policies across different market conditions
  5. Risk constraints: Incorporate value-at-risk or conditional value-at-risk limits

Technical Notes

Reproducibility

All random processes use seeded generators for reproducible results. The configuration seed (C.SEED) propagates to:

  • Environment initialization
  • Neural network initialization
  • Training batch sampling
  • Evaluation episode generation
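
A minimal helper (not part of the repository) showing how a single seed such as C.SEED could be propagated to the Python, NumPy, and PyTorch generators:

import random
import numpy as np
import torch

def seed_everything(seed):
    """Seed the standard library, NumPy, and PyTorch RNGs from one configuration value."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# e.g. seed_everything(C.SEED) after `import rlte.config as C`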

Memory Requirements

Memory usage scales with parallel environments and inventory size:

  • Approximate RAM: 100MB + (PAR_ENVS × M₀ × 0.5MB)
  • GPU memory for networks: ~200MB with default architecture
  • Trajectory buffer: PAR_ENVS × STEPS_PER_ENV × state_size

Performance Considerations

Training time depends on hardware and configuration:

  • CPU training: ~2-4 hours for 400 iterations
  • GPU training: ~30-60 minutes with CUDA-enabled PyTorch
  • Evaluation: ~1-2 minutes per 1000 episodes

References and Attribution

This implementation is based on:

Cheridito & Weiss (2025). Actor-critic RL for trade execution in a full LOB simulator with tactical/strategic agents; actions are simplex allocations over market/limit orders using a logistic-normal policy. Shows gains vs TWAP/SL and a Dirichlet-policy baseline.

Related Literature

Order Placement in LOBs and Agent-Based Market Impact

  1. Pan et al. (IJCAI 2022). Hybrid Action-Space RL for Optimal Execution. Explicitly tackles limit-order placement: a continuous policy scopes a region, a discrete policy chooses the tick; focuses on LOB microstructure and the continuous/discrete duality of prices. This work provides the closest precedent for handling the hybrid nature of order placement decisions.

  2. Lin & Beling (IJCAI 2020). End-to-End PPO for Optimal Execution. Uses raw level-2 LOB inputs and learns execution without feature engineering; assumes only temporary impact. Provides a useful baseline for comparing feature-engineered vs. end-to-end approaches.

  3. Karpe et al. (2020). Multi-Agent RL in a Realistic LOB Market Simulation (ABIDES). Uses ABIDES to train execution agents in an agent-based market; compares to TWAP and discusses convergence behavior. Demonstrates the importance of realistic market simulators for training robust policies.

  4. de Meer Pardo et al. (2022). A Modular Framework for RL Optimal Execution. Details environment design (observations, action processing, execution, rewards) for RL optimal execution; includes limit-order simulation against TWAP. Provides architectural patterns adopted in this implementation.

RL with Market Impact Models

  1. Ning, Lin & Jaimungal (2018 arXiv; 2021 Applied Mathematical Finance). Double Deep Q-Learning for Optimal Execution. DDQN over execution actions using LOB/market features; widely cited baseline for discrete action spaces in execution.

  2. Macrì & Lillo (2024). RL for Optimal Execution when Liquidity is Time-Varying. Trains DDQN inside an Almgren-Chriss-type impact model with time-varying/latent liquidity; validates against analytical solutions when available.

  3. Micheli & Monod (2024). Deep RL for Online Optimal Execution Strategies. DDPG with transient (non-Markovian) impact via decay kernels; shows the actor-critic can learn the optimal schedule and adapt online. Influences our choice of actor-critic architecture.

  4. Moallemi & Wang (2022). A RL Approach to Optimal Execution. Frames execution timing as an optimal stopping problem (when to cross the spread); complements order-placement approaches with timing optimization.

  5. Hendricks & Wilcox (2014). RL Extension to Almgren-Chriss. Early proof-of-concept that RL can adapt an AC schedule using microstructure signals to reduce implementation shortfall.

  6. Hafsi & Vittori (2024). Optimal Execution with RL. Modern treatment using LOB-derived features for execution over a finite horizon; provides contemporary benchmarks and evaluation protocols.

Practice-Oriented Applications

  1. Byun et al. (2023, FinTech). Practical Application of DRL to Optimal Trade Execution. Reports generalization across ~50 stocks and long horizons; compares against VWAP in production-like settings. Demonstrates scalability considerations for real deployment.

Theoretical Foundations

  1. Cartea & Jaimungal (2015, Quantitative Finance). Optimal Execution with Limit and Market Orders. Canonical model optimizing both order types; provides a structural baseline for comparing RL approaches that choose between market and limit placement.

Implementation Heritage

The codebase synthesizes techniques from the above literature:

  • State representation follows the microstructure-aware approach of Pan et al. (2022) and Lin & Beling (2020)
  • Agent-based simulation adopts patterns from Karpe et al. (2020) for realistic market dynamics
  • Actor-critic architecture builds on Micheli & Monod (2024) for continuous control
  • Evaluation protocol follows standards established by Byun et al. (2023) and Hafsi & Vittori (2024)

The unique contribution of this implementation (Cheridito & Weiss, 2025) is the logistic-normal policy parameterization on the simplex, which elegantly handles the constrained action space while maintaining differentiability for policy gradient methods. This approach outperforms both traditional baselines (TWAP, Submit & Leave) and alternative RL parameterizations (Dirichlet policy).

Code structure follows best practices for reproducible RL research, with clear separation between environment dynamics, learning algorithms, and evaluation protocols. German comments in the source files are preserved from the original research implementation for historical context.
