mcSCRAM: Monte Carlo SCRAM

mcSCRAM is a fork of @rakhimov Olzhas Rakhimov's SCRAM that extends the original probabilistic risk assessment tool with multicore CPU, GPU-accelerated Monte Carlo simulation capabilities using AdaptiveCpp's SYCL backend.

Caution

**⚠️ ALPHA ** This project is under active development. The APIs are unstable, interfaces may change without notice until the first release.

Objectives

The primary goals of this project include:

Parallel Monte Carlo Implementation: Developing SYCL-based kernels for massively parallel sampling across GPU compute units
Statistical Precision: Implementing advanced uncertainty quantification with confidence interval estimation
Hardware Optimization: Exploring memory-efficient data structures and optimal kernel configurations for various accelerator architectures
Performance Characterization: Benchmarking scalability and computational efficiency improvements over traditional CPU-based approaches

Technical Implementation

Monte Carlo Engine

The core contribution lies in the parallel Monte Carlo implementation featuring:

Philox PRNG: Counter-based pseudorandom number generation enabling perfect parallelization without synchronization overhead
Bit-packed Sampling: Memory-efficient boolean storage minimizing bandwidth requirements during GPU execution
Layered Graph Execution: Topologically sorted PRA model evaluation with dependency-aware scheduling

Hardware Acceleration

SYCL Backend: Cross-platform acceleration via AdaptiveCpp supporting CUDA, ROCm, Intel oneAPI, and OpenCL
Work-group Optimization: Dynamic kernel configuration adaptation for different hardware architectures
Memory Coalescing: Optimized access patterns for GPU memory hierarchies

Memory Management Architecture

mcSCRAM implements a strict USM-only memory strategy that eliminates SYCL buffers entirely, following AdaptiveCpp performance recommendations:

Device USM for High-Throughput Data

// Large contiguous allocations for computational data
bitpack_t* buffer_block = sycl::malloc_device<bitpack_t>(
    num_events * num_bitpacks, queue);

Sample Data: All Monte Carlo sample data resides in device memory
Contiguous Layout: Single large allocations reduce fragmentation
Zero-Copy: Computational kernels access data directly without transfers

Shared USM for Metadata and Control

// Small metadata structures accessible from host
gate_t* gates = sycl::malloc_shared<gate_t>(num_gates, queue);

Graph Metadata: Event and gate configurations in shared memory
Pointer Arrays: Input/output buffer references for kernel dispatch
Host Access: Configuration and results accessible without explicit copies

Performance Benefits

Eliminated Buffer Overhead: No accessor creation or runtime dependency analysis
Predictable Memory Layout: Static allocation patterns enable optimal caching
Reduced Host Latency: Direct pointer access vs. buffer submission queues

Bit-Packing Optimization

Monte Carlo simulations are memory bandwidth-bound. mcSCRAM addresses this through aggressive bit-packing:

template<typename bitpack_t>  // typically uint64_t
static bitpack_t generate_samples(const sampler_args &args) {
    constexpr uint8_t bits_in_bitpack = sizeof(bitpack_t) * 8;  // 64 bits
    constexpr uint8_t samples_per_pack = bits_in_bitpack / bernoulli_bits_per_generation;
    // Pack 64 boolean samples into single 64-bit integer
}

Memory Bandwidth Optimization

64:1 Compression: 64 boolean samples packed into single uint64_t
Coalesced Access: Contiguous memory layout maximizes GPU memory throughput
Cache Efficiency: Reduced memory footprint improves L1/L2 cache utilization

Configurable Dimensions

Batch Size: Number of simulation trials processed simultaneously
Sample Size: Bit-packs per batch (configurable: 16, 32, 64 typical)
Dynamic Sizing: Runtime optimization based on device memory and compute capabilities

Optimized ATLEAST Gate Implementation

mcSCRAM implements a direct bit-counting algorithm for ATLEAST gates (k-out-of-n logic) that significantly outperforms traditional AND/OR expansion methods. For example, a 3-out-of-5 ATLEAST gate, if expanded, requires 10 intermediate AND/OR gates. mcSCRAM's implementation uses a single kernel with 5 input reads and parallel bit accumulation.

// Per-bit accumulation instead of combinatorial expansion
sycl::marray<bitpack_t, NUM_BITS> accumulated_counts(0);
for (auto i = 0; i < num_inputs; ++i) {
    const bitpack_t val = inputs[i][index];
    #pragma unroll
    for (auto bit_idx = 0; bit_idx < NUM_BITS; ++bit_idx) {
        accumulated_counts[bit_idx] += ((val >> bit_idx) & 1);
    }
}

No Combinatorial Explosion: Traditional ATLEAST implementations expand k-out-of-n into complex trees of AND/OR gates (C(n,k) combinations)
Parallel Bit Processing: All 64 bits processed simultaneously vs. sequential popcount operations
Memory Efficiency: Single pass through inputs vs. multiple intermediate gate evaluations
Optimal GPU Utilization: Vectorized accumulation operations match GPU SIMD architecture

Example:

Build and Installation

Container-based Development (Recommended)

The project provides multi-stage Docker builds for different phases.

For CUDA runtime support, install the NVIDIA Container Toolkit.
For AMD/ROCm runtime support, install the AMD Container Toolkit. -- alternatively, you can try mapping devices /dev/dri, /dev/kfd, but ymmw.
For Intel GPUs (discrete or embedded), you just need to map devices /dev/dri.

# Development environment with full toolchain
docker build --target devimage -t mc-scram:dev .
docker run -it --rm --gpus all -v $(pwd):/workspace mc-scram:dev

## for intel, amd 
docker run -it --rm --device=/dev/dri -v $(pwd):/workspace mc-scram:dev

# Production runtime (minimal dependencies)
docker build --target scramruntime -t mc-scram:runtime .

Build arguments for configurations:

CMAKE_BUILD_TYPE: Debug, Release, RelWithDebInfo, MinSizeRel (default: Release)
APP_MALLOC_TYPE: tcmalloc, jemalloc, malloc (default: tcmalloc)

Native Build

Requirements:

CMake ≥ 3.18.4
C++23 compiler:
- Clang ≥ 18.0 ✅
- GCC ≥ 7.1 ⚠️ Untested
- AppleClang ≥ 9.0 ⚠️ Untested
- Intel ≥ 18.0.1 ⚠️ Untested
AdaptiveCpp ≥ 25.2.0
Memory allocator:
- tcmalloc (default)
- jemalloc
- malloc
Drivers for your CUDA/ROCm/OpenCL/ZE/OpenMP runtimes

Auto-Fetched

LibXML2 (with LZMA, ZLIB, and ICONV support for compressed .xml.gz files)
Boost 1.88.0 libraries (automatically fetched via FetchContent):
- program_options, filesystem, system, random, range
- exception, multi_index, accumulators, multiprecision
- icl, math, dll, regex, unit_test_framework

git clone --recursive https://github.com/a-earthperson/mcSCRAM.git
cd mc-scram
mkdir build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DMALLOC_TYPE=tcmalloc \
  -DBUILD_TESTS=ON \
  -DOPTIMIZE_FOR_NATIVE=ON
make -j$(nproc)

CMake Build Options

Option	Description	Default	Values
`CMAKE_BUILD_TYPE`	Build configuration	Release	Debug, Release, RelWithDebInfo, MinSizeRel
`MALLOC_TYPE`	Memory allocator	tcmalloc	tcmalloc, jemalloc, malloc
`BUILD_TESTS`	Build test suite	ON	ON, OFF
`WITH_COVERAGE`	Enable coverage instrumentation	OFF	ON, OFF
`WITH_PROFILE`	Enable profiling instrumentation	OFF	ON, OFF
`OPTIMIZE_FOR_NATIVE`	Build with -march=native	ON	ON, OFF
`BUILD_SHARED_LIBS`	Build shared libraries	OFF	ON, OFF

Usage

# Container execution
docker run --rm --gpus all \
  -v $(pwd)/input:/input \
  mc-scram:runtime --monte-carlo --num-trials 1000000 /input/model.xml

# Native binary
./scram --monte-carlo --num-trials 1000000 \
        --confidence-intervals input/model.xml

Parameters

Usage:    mcscram [options] input-files...

Monte Carlo Options:
  --monte-carlo                         enable monte carlo sampling
  -N [ --num-trials ] double (=0)       bernoulli trials [N ∈ ℕ, 0=auto]
  --early-stop                          stop on convergence (implied if N=0)
  --seed int (=372)                     philox-4x32-10 seed
  -d [ --delta ] double (=0.001)        compute as ε=δ·p̂ [δ > 0]
  -b [ --burn-in ] double (=1048576)    trials before convergence check [0=off]
  -a [ --confidence ] double            two-sided conf. lvl [α ∈ (0,1)] (0.99)

Graph Compilation Options:
  --no-kn                               expand k/n to and/or [off]
  --no-xor                              expand xor to and/or [off]
  --nnf                                 compile to negation normal form [off]
  -c [ --compilation-passes ] int (=2)  0=off 1=null-only 2=optimize 
                                        3+=multipass

Debug Options:
  -w [ --watch ]                        enable watch mode [off]
  -h [ --help ]                         display this help message
  --no-report                           dont generate analysis report
  -p [ --oracle ] double (=-1)          true µ [µ ∈ [0,∞), -1=off]
  --preprocessor                        stop analysis after preprocessing
  --print                               print analysis results to terminal
  --serialize                           serialize the input model and exit
  -V [ --verbosity ] int                set log verbosity
  -v [ --version ]                      display version information

Legacy Options:
  --project path                        project analysis config file
  --allow-extern                        **UNSAFE** allow external libraries
  --validate                            validate input files without analysis
  --pdag                                perform qualitative analysis with PDAG
  --bdd                                 perform qualitative analysis with BDD
  --zbdd                                perform qualitative analysis with ZBDD
  --mocus                               perform qualitative analysis with MOCUS
  --prime-implicants                    calculate prime implicants
  --probability                         perform probability analysis
  --importance                          perform importance analysis
  --uncertainty                         perform uncertainty analysis
  --ccf                                 compute common-cause failures
  --sil                                 compute safety-integrity-level metrics
  --rare-event                          use the rare event approximation
  --mcub                                use the MCUB approximation
  -l [ --limit-order ] int              upper limit for the product order
  --cut-off double                      cut-off probability for products
  --mission-time double                 system mission time in hours
  --time-step double                    timestep in hours
  --num-quantiles int                   number of quantiles for distributions
  --num-bins int                        number of bins for histograms
  -o [ --output ] path                  output file for reports
  --no-indent                           omit indented whitespace in output XML

Example Run

ACPP_VISIBILITY_MASK=cuda \
ACPP_ADAPTIVITY_LEVEL=2 \
ACPP_ALLOCATION_TRACKING=1 \
ACPP_DEBUG_LEVEL=0 \
ACPP_PERSISTENT_RUNTIME=1 \
ACPP_USE_ACCELERATED_CPU=on \
mcscram \
--pdag \
--monte-carlo \
--probability \
--oracle 0.000713018 \
--compilation-passes 5 \
--watch \
../../../input/Aralia/baobab2.xml

[burn-in]     ::      (ε)= 3.026e-06 |      (ε₀)= 7.129e-07 :: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% [00m:01s<00m:00s] [1/1]                                                                        
[convergence] ::      (ε)= 7.135e-07 |      (ε₀)= 7.135e-07 :: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% [00m:02s<00m:00s] [18/18]                                                                      
[log10-conv]  :: log10(ε)= 9.212e-04 | log10(ε₀)= 1.000e-03 :: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% [00m:00s<00m:00s] [4/4]                                                                        
[estimate]    :: p01= 7.12748e-04  |  p05= 7.12919e-04  |  mu = 7.13462e-04  |  p95= 7.14005e-04  |  p99= 7.14175e-04  |                                                                            
[diagnostics] :: z=  1.602e+00 | p_val=  1.092e-01 | CI95=T | CI99=T | n_req=9287175163 | n_rat=  1.001e+00                                                                                         
[accuracy]    :: true(p)= 7.130e-04 | Δ=  4.436e-07 | δ=  6.222e-04 | b=  4.436e-07 | MSE=  1.968e-13 | log10(Δ)= -6.353e+00 | |log10|=  2.701e-04                                                  
[throughput]  :: 5.34 it/s | 42.79 Gbit/it | 228.57 Gbit/s | 492.36 Mbit/node/it | 2.57 Gbit/node/s                                                                                                 
[info-gain]   :: 0.235169 bit/s | 0.043719 bit/iter | Σ 19.346960 bit

Runtime Environment Variables

AdaptiveCpp environment variables control hardware acceleration behavior, debugging output, and performance tuning. For detailed performance optimization guidance, see the AdaptiveCpp Performance Tuning Guide.

Variable	Description	Values	Default
`ACPP_VISIBILITY_MASK`	Controls which backends are available for execution	`cuda`, `rocm`, `opencl`, `lz`, `omp`, combinations (e.g., `cuda,opencl`), or `all`	`all`
`ACPP_DEBUG_LEVEL`	Controls runtime debug output verbosity	`0` (silent), `1` (fatal), `2` (errors/warnings), `3` (info), `4` extra	`0`
`ACPP_ADAPTIVITY_LEVEL`	Controls JIT kernel optimization and runtime adaptivity	`0` (static), `1` (basic), `2` (standard)	`2`
`ACPP_ALLOCATION_TRACKING`	Enables memory allocation tracking for debugging	`0` (disabled), `1` (enabled)	`0`
`ACPP_PERSISTENT_RUNTIME`	Keeps the runtime running between successive calls	`0` (disabled), `1` (enabled)	`1`

Usage Examples

# Production: CUDA backend with minimal output
export ACPP_VISIBILITY_MASK=cuda ACPP_DEBUG_LEVEL=0
./scram --monte-carlo input/model.xml

# Development: Multiple backends with error reporting and memory tracking
export ACPP_VISIBILITY_MASK=cuda,opencl ACPP_DEBUG_LEVEL=2 ACPP_ALLOCATION_TRACKING=1
./scram --monte-carlo input/model.xml

# Container usage with environment variables
docker run --rm --gpus all \
  -e ACPP_VISIBILITY_MASK=cuda \
  -e ACPP_DEBUG_LEVEL=1 \
  -v $(pwd)/input:/input \
  mc-scram:runtime --monte-carlo /input/model.xml

Performance Considerations

JIT Optimization and Warm-up

mcSCRAM uses AdaptiveCpp's generic compilation target, which performs runtime JIT optimization:

First Run: Kernels compile and optimize for your specific hardware
Subsequent Runs: Optimized kernels load from cache (~/.acpp/apps/)
Recommendation: Run 3-4 iterations to reach peak performance
Adaptivity Level ≥ 2: Enables aggressive optimizations including constant propagation for invariant kernel arguments

# First run - includes JIT compilation time
time ./scram --monte-carlo --num-trials 1000000 input/model.xml

# Subsequent runs - optimized kernel execution
time ./scram --monte-carlo --num-trials 1000000 input/model.xml

Cache Management

When upgrading AdaptiveCpp or GPU drivers, clear the kernel cache to benefit from improvements:

# Clear JIT kernel cache
rm -rf ~/.acpp/apps/*

Memory Layout Optimization

For large models with memory constraints:

Monitor VRAM: Use nvidia-smi, nvtop or similar tools to track memory usage

Backend-Specific Tuning

CUDA/HIP: Optimal for discrete GPUs with high memory bandwidth
OpenCL: Cross-platform compatibility, may require driver-specific tuning
Level Zero: Optimized for Intel discrete GPUs, experimental for integrated GPUs

Contributing

Please see CONTRIBUTING.md for development guidelines and ICLA.md for contributor license requirements.

Licensing

This program is free software distributed under the GNU Affero General Public License v3.0 (AGPL v3).

Important Note: The original SCRAM code (from Olzhas Rakhimov) remains under GPL v3, while mcSCRAM enhancements and new code are licensed under AGPL v3. When combined, the entire project is governed by AGPL v3 terms.

Key implications of AGPL v3:

✅ Freedom to use for any purpose, including research and commercial applications
✅ Freedom to study and modify the source code
✅ Freedom to distribute copies and modifications
⚠️ Copyleft requirement: Derivative works must also be licensed under AGPL v3
⚠️ Source disclosure: Distributed binaries must include or provide access to source code
⚠️ Network provision: If you run AGPL code on a server accessible over a network, you must provide source code to users

For users and researchers:

Publication of results does not require AGPL compliance
Modifications for personal research do not require public release
If you provide the software as a network service, users must be able to access the source code

For developers and redistributors:

Must preserve copyright notices and license terms
Must provide source code when distributing binaries
Must provide source code when offering the software as a network service
Cannot incorporate into proprietary software without AGPL compliance

For commercial users:

Must comply with AGPL if distributing the software or offering it as a service
Network-accessible deployments require source code provision to users

Educational resources on AGPL v3:

Acknowledgments

Original SCRAM: Copyright (C) 2014-2018 Olzhas Rakhimov
Repository: https://github.com/rakhimov/scram
Synthetic Models: OpenPRA Initiative contributors
Testing Infrastructure: Fault tree benchmarks from various PRA/PSA research groups

Development Status

Analysis Type	Status	DirectEval Support	Action Required
FaultTreeAnalysis	✅ Core	Full support	None
ProbabilityAnalysis	✅ Core	Full support	Add `tallies()` accessor
ImportanceAnalysis	⚠️ Stub	Partial	Full implementation needed
UncertaintyAnalysis	❌ Missing	No support	New specialization needed
EventTreeAnalysis	✅ Core	Full support	None
CCF Analysis	✅ Preprocessing	Full support	None
SIL Analysis	✅ Embedded	Full support	None
Cut Set Analysis	⚠️ Stub	Intentional stub	None (by design)

Name		Name	Last commit message	Last commit date
Latest commit History 3,701 Commits
.travis		.travis
benchmarks		benchmarks
cmake		cmake
doc		doc
generated		generated
input		input
scripts		scripts
share		share
src		src
targets/scram		targets/scram
tests		tests
.appveyor.yml		.appveyor.yml
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.codecov.yml		.codecov.yml
.coveragerc		.coveragerc
.cppdep.yml		.cppdep.yml
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
.landscape.yml		.landscape.yml
.style.yapf		.style.yapf
.travis.yml		.travis.yml
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
CPPLINT.cfg		CPPLINT.cfg
Dockerfile		Dockerfile
ICLA.md		ICLA.md
LICENSE		LICENSE
README.md		README.md
crowdin.yml		crowdin.yml
docker-compose.yml		docker-compose.yml
requirements-dev.txt		requirements-dev.txt
requirements-tests.txt		requirements-tests.txt

License

a-earthperson/mcSCRAM

Folders and files

Latest commit

History

Repository files navigation

mcSCRAM: Monte Carlo SCRAM

Objectives

Technical Implementation

Monte Carlo Engine

Hardware Acceleration

Memory Management Architecture

Device USM for High-Throughput Data

Shared USM for Metadata and Control

Performance Benefits

Bit-Packing Optimization

Memory Bandwidth Optimization

Configurable Dimensions

Optimized ATLEAST Gate Implementation

Build and Installation

Container-based Development (Recommended)

Native Build

Requirements:

Auto-Fetched

CMake Build Options

Usage

Parameters

Example Run

Runtime Environment Variables

Usage Examples

Performance Considerations

JIT Optimization and Warm-up

Cache Management

Memory Layout Optimization

Backend-Specific Tuning

Contributing

Licensing

Acknowledgments

Development Status

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages