Binary Variational Autoencoder for Molecular Synthesis Route Generation
A consolidated Python package for generating novel molecules and synthesis routes using a binary VAE, with molecular property optimization capabilities.
brxngenerator combines cutting-edge machine learning with computational chemistry to:
- Generate novel molecular structures using binary latent representations
- Optimize molecular properties (QED, logP, synthetic accessibility)
- Design synthesis routes with template-based reaction planning
- Evaluate generation quality with comprehensive metrics
- ✨ Binary VAE Architecture: Discrete latent space for improved molecular generation
- 🧪 Chemistry Integration: RDKit-based molecular validation and property computation
- 🎯 Property Optimization: Gurobi-based QUBO optimization for molecular properties
- 📊 Comprehensive Metrics: MOSES-compatible evaluation with novelty, uniqueness, and SA scoring
- 🖥️ Multi-Device Support: CUDA, MPS (Apple Silicon), and CPU compatibility
- ⚡ Optimized Training: Mixed precision, early stopping, and progress tracking
# Python 3.8+ required
conda create -n brxngenerator python=3.8
conda activate brxngenerator
# Essential packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install rdkit numpy scipy scikit-learn tqdm
# Optimization (optional but recommended)
pip install gurobi-optimods
# Visualization (optional)
pip install matplotlib seaborn pandas
git clone <repository-url>
cd brxngenerator
pip install -e .
Place your gurobi.lic file in the project root to enable the molecular property optimization features.
brxngenerator/
├── brxngenerator/              # Core package
│   ├── chemistry/              # Chemical utilities
│   │   ├── chemistry_core.py   # Consolidated chemistry utilities
│   │   ├── fragments/          # Fragment processing
│   │   └── reactions/          # Reaction handling
│   ├── core/                   # VAE implementation
│   │   ├── vae.py              # Binary VAE models
│   │   └── binary_vae_utils.py # Training utilities
│   ├── models/                 # Neural architectures
│   │   └── models.py           # Encoders, decoders, networks
│   ├── metrics/                # Evaluation metrics
│   │   └── metrics.py          # Molecular and latent metrics
│   ├── utils/                  # Core utilities
│   │   └── core.py             # Config and device management
│   └── optimization/           # Property optimization
├── data/                       # Training data
├── weights/                    # Model checkpoints
├── trainvae.py                 # Training script
├── sample.py                   # Sampling script
├── mainstream.py               # Optimization pipeline
└── README.md                   # Project overview
Place your molecular reaction data in the data/ directory:
# Expected format: data/data.txt
# Each line should contain reaction SMILES or molecular data
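As a quick sanity check before training, you can verify that each line's leading component parses with RDKit. The check_data helper below is hypothetical (not part of the package); adjust the parsing to your actual reaction format.

from rdkit import Chem

# Hypothetical helper: count how many lines of data/data.txt have a leading
# component that RDKit can parse as a molecule.
def check_data(path="data/data.txt"):
    total, parsable = 0, 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total += 1
            # For reaction SMILES "reactants>agents>products", take the first reactant
            first = line.split(">")[0].split(".")[0]
            if Chem.MolFromSmiles(first) is not None:
                parsable += 1
    print(f"{parsable}/{total} lines have a parsable leading SMILES")

check_data()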
# Basic training with recommended settings
CUDA_VISIBLE_DEVICES=0 python trainvae.py -n 1
# For different model sizes:
python trainvae.py -n 0 # Small model (100,100,2)
python trainvae.py -n 1 # Recommended (200,100,2)
python trainvae.py -n 4 # Larger latent (200,200,2)
python trainvae.py -n 7 # Largest model (500,300,5)
Parameter Sets Available:
- Set 0: (100,100,2) - Small/fast training
- Set 1: (200,100,2) - Recommended balance
- Set 4: (200,200,2) - Larger latent space
- Set 5: (200,300,2) - Large latent space
- Set 7: (500,300,5) - Largest model
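These tuples read as (hidden_size, latent_size, depth), consistent with the hidden_size=200, latent_size=100, depth=2 used in the Python API example below. The mapping here is illustrative only; the authoritative table lives in trainvae.py.

# Illustrative mapping of -n to (hidden_size, latent_size, depth); the actual
# table is defined in trainvae.py and may contain additional sets.
PARAM_SETS = {
    0: (100, 100, 2),  # small / fast training
    1: (200, 100, 2),  # recommended balance
    4: (200, 200, 2),  # larger latent space
    5: (200, 300, 2),  # large latent space
    7: (500, 300, 5),  # largest model
}

hidden_size, latent_size, depth = PARAM_SETS[1]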
# Generate molecules using your trained model
CUDA_VISIBLE_DEVICES=0 python sample.py -n 1 \
--w_save_path weights/bvae_best_model_with.pt \
--subset 500
# Run property optimization pipeline
python mainstream.py --seed 1
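The pipeline's QUBO step can be illustrated with a toy example. The sketch below is not the project's actual formulation: it builds a random symmetric Q matrix and minimizes x^T Q x over binary variables directly with gurobipy (a valid gurobi.lic is required). In the real pipeline, Q would be derived from a property surrogate over the binary latent bits.

import numpy as np
import gurobipy as gp
from gurobipy import GRB

# Toy QUBO: minimize x^T Q x over binary x. Here Q is random; in practice it
# would encode a molecular-property surrogate over the binary latent bits.
n = 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, n))
Q = (Q + Q.T) / 2  # symmetrize

m = gp.Model("toy_qubo")
x = m.addMVar(n, vtype=GRB.BINARY, name="x")
m.setObjective(x @ Q @ x, GRB.MINIMIZE)
m.optimize()

print("best bits:", x.X.astype(int))
print("objective:", m.ObjVal)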
The project provides 5 standardized evaluation metrics:
- Validity: fraction of chemically valid generated molecules, determined via RDKit sanitization and validation
- Uniqueness: fraction of unique molecules among the valid ones, based on canonical SMILES deduplication
- Novelty: fraction of molecules not present in the training set; measures true generative capability vs. memorization
- QED: Quantitative Estimate of Drug-likeness; values >0.67 are considered drug-like
- SA Score: Synthetic Accessibility Score; 1-3 means easy to synthesize, 6+ means difficult
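For reference, the first four metrics can be reproduced directly with RDKit; the packaged compute_molecular_metrics helper (see the Python API section below) wraps equivalent logic. This is a minimal sketch, not the project's exact implementation; the SA score normally comes from RDKit's contrib sascorer module and is omitted here.

from rdkit import Chem
from rdkit.Chem import QED

def quick_metrics(generated_smiles, training_smiles):
    # Validity: fraction of generated SMILES that RDKit can parse/sanitize
    mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    valid = [m for m in mols if m is not None]
    validity = len(valid) / max(len(generated_smiles), 1)

    # Uniqueness: deduplicate valid molecules by canonical SMILES
    canonical = {Chem.MolToSmiles(m) for m in valid}
    uniqueness = len(canonical) / max(len(valid), 1)

    # Novelty: canonical SMILES absent from the (canonicalized) training set
    train_canonical = {
        Chem.MolToSmiles(m)
        for m in (Chem.MolFromSmiles(s) for s in training_smiles)
        if m is not None
    }
    novelty = len(canonical - train_canonical) / max(len(canonical), 1)

    # Drug-likeness: mean QED over valid molecules
    avg_qed = sum(QED.qed(m) for m in valid) / max(len(valid), 1)
    return {"validity": validity, "uniqueness": uniqueness,
            "novelty": novelty, "avg_qed": avg_qed}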
Key training parameters you can modify:
# In trainvae.py or via command line
batch_size = 1000 # Larger batches work well with GPU
patience = 10 # Early stopping patience
learning_rate = 0.001 # Learning rate
beta = 1.0 # KL divergence weight
The system automatically detects the best device:
# Force CPU usage
export DISABLE_MPS=1
# Specify GPU
export CUDA_VISIBLE_DEVICES=0
# Check device detection
python -c "from brxngenerator import get_device; print(get_device())"
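For reference, here is a minimal sketch of the CUDA-then-MPS-then-CPU selection that get_device likely implements; the packaged version may differ in details.

import os
import torch

# Sketch: prefer CUDA, then MPS (unless DISABLE_MPS is set), then CPU.
def pick_device():
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and not os.environ.get("DISABLE_MPS"):
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())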
For limited GPU memory:
# Use smaller model
python trainvae.py -n 0
# Reduce dataset size
python trainvae.py -n 1 --subset 1000
# Reduce batch size (edit in script)
from brxngenerator import bFTRXNVAE, get_device
import torch
# Setup
device = get_device()
# Load vocabularies (implement based on your data)
# fragment_vocab, reactant_vocab, template_vocab = load_vocabularies()
# Initialize model
model = bFTRXNVAE(
    fragment_vocab=fragment_vocab,
    reactant_vocab=reactant_vocab,
    template_vocab=template_vocab,
    hidden_size=200,
    latent_size=100,
    depth=2,
    device=device,
).to(device)
# Load trained weights
checkpoint = torch.load('weights/bvae_best_model_with.pt', map_location=device)
model.load_state_dict(checkpoint)
from brxngenerator import Evaluator
# Initialize evaluator
evaluator = Evaluator(latent_size=100, model=model)
# Generate molecules
ft_latent = evaluator.generate_discrete_latent(50) # Half latent size
rxn_latent = evaluator.generate_discrete_latent(50)
for product, reaction in evaluator.decode_from_prior(ft_latent, rxn_latent, n=10):
    if product:
        print(f"Generated molecule: {product}")
from brxngenerator import compute_molecular_metrics
# Evaluate generated molecules
generated_smiles = ['CCO', 'CC(=O)O', 'c1ccccc1', ...]
training_smiles = ['CCO', 'CC(C)O', ...] # Your training set
metrics = compute_molecular_metrics(
    generated_smiles=generated_smiles,
    training_smiles=training_smiles,
)
print(f"Validity: {metrics['validity']:.3f}")
print(f"Uniqueness: {metrics['uniqueness']:.3f}")
print(f"Novelty: {metrics['novelty']:.3f}")
print(f"Average QED: {metrics['avg_qed']:.3f}")
print(f"Average SA: {metrics['avg_sa']:.3f}")
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.cuda.amp import GradScaler, autocast
# Setup training
optimizer = optim.Adam(model.parameters(), lr=0.001)
scaler = GradScaler() # For mixed precision
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        with autocast():  # Mixed precision forward pass
            loss, *other_losses = model(batch, beta=1.0, temp=0.4)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    print(f"Epoch {epoch}, Loss: {total_loss/len(train_loader):.4f}")
# Basic training
python trainvae.py -n 1
# With options
python trainvae.py -n 1 --subset 5000 --patience 15
# Quick test (small dataset, early stopping)
python trainvae.py -n 0 --subset 500 --patience 3
# Basic sampling
python sample.py -n 1 --w_save_path weights/bvae_best_model_with.pt
# Limited sampling for testing
python sample.py -n 1 --w_save_path weights/bvae_best_model_with.pt --subset 100
# Property optimization
python mainstream.py --seed 42
# Multiple runs with different seeds
for seed in {1..5}; do
    python mainstream.py --seed $seed
done
# Solution 1: Use smaller model
python trainvae.py -n 0
# Solution 2: Reduce data size
python trainvae.py -n 1 --subset 1000
# Solution 3: Use CPU
export CUDA_VISIBLE_DEVICES=""
# Disable MPS acceleration
export DISABLE_MPS=1
python trainvae.py -n 1
# Ensure proper installation
pip install -e .
# Check Python path
python -c "import brxngenerator; print('Installation OK')"
# Reinstall RDKit
conda install -c conda-forge rdkit
# Verify installation
python -c "from rdkit import Chem; print('RDKit OK')"
# Check license file
ls gurobi.lic
# Set environment variable
export GRB_LICENSE_FILE=/path/to/gurobi.lic
# Test Gurobi
python -c "import gurobipy; print('Gurobi OK')"
- Use GPU: 10-50x speedup over CPU
- Larger Batches: Better GPU utilization (1000-3000)
- Mixed Precision: Automatic on GPU, faster training
- Early Stopping: Prevents overfitting and saves time (see the sketch after this list)
- Parameter Sets: Start with Set 1, scale up as needed
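For illustration, a minimal early-stopping sketch matching the patience=10 setting above. train_epoch() and evaluate_epoch() are hypothetical stand-ins for your own per-epoch training and validation passes.

import torch

# Hypothetical early-stopping loop that keeps the best validation checkpoint.
best_val, bad_epochs, patience = float("inf"), 0, 10
for epoch in range(100):
    train_epoch(model, train_loader, optimizer)
    val_loss = evaluate_epoch(model, val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "weights/bvae_best_model_with.pt")  # save best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}")
            break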
# Monitor training progress
# Check generated_reactions.txt for sample outputs
# Watch loss curves in terminal output
# Monitor GPU usage: nvidia-smi -l 1
# Validation checks
assert model.latent_size == 100 # Check model configuration
assert len(data_pairs) > 0 # Ensure data loaded
assert torch.cuda.is_available() # GPU availability
- Ensure reaction data is clean and properly formatted
- Use a subset for initial testing: --subset 1000
- Validate data loading before full training
- Start with parameter set 1 (balanced performance)
- Use set 0 for quick prototyping
- Scale to larger sets (4, 7) for production
- Always use early stopping (patience=10)
- Monitor validation loss trends
- Save multiple checkpoints for comparison
- Generate at least 1000 molecules for reliable metrics (see the sketch after this list)
- Include training set for novelty computation
- Compare multiple model configurations
- Use GPU for best performance
- Set up proper logging and monitoring
- Implement proper error handling
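As a rough end-to-end sketch combining the Evaluator and compute_molecular_metrics examples from the Python API section above, the loop below samples from the prior until roughly 1000 molecules are collected and then scores them. The number of decodes per latent draw depends on your model, so the loop simply accumulates until enough valid products are available.

from brxngenerator import Evaluator, compute_molecular_metrics

# Sample until ~1000 molecules are collected, then score them.
evaluator = Evaluator(latent_size=100, model=model)
generated = []
while len(generated) < 1000:
    ft_latent = evaluator.generate_discrete_latent(50)
    rxn_latent = evaluator.generate_discrete_latent(50)
    for product, reaction in evaluator.decode_from_prior(ft_latent, rxn_latent, n=10):
        if product:
            generated.append(product)

metrics = compute_molecular_metrics(
    generated_smiles=generated,
    training_smiles=training_smiles,  # your training SMILES, needed for novelty
)
print(metrics)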
- MOSES Benchmark: Polykovskiy et al., Frontiers in Pharmacology (2020)
- QED Drug-likeness: Bickerton et al., Nature Chemistry (2012)
- Synthetic Accessibility: Ertl & Schuffenhauer, J. Cheminformatics (2009)
- RDKit Documentation: https://www.rdkit.org/docs/
- PyTorch Documentation: https://pytorch.org/docs/
- Check this guide for common solutions
- Review error messages carefully
- Test with smaller datasets first
- Check device compatibility
- Maintain the consolidated architecture
- Update imports when adding new modules
- Include proper documentation
- Test on multiple devices when possible
This project uses a consolidated architecture for maintainability:
- chemistry/chemistry_core.py - All chemistry utilities
- models/models.py - All neural network components
- metrics/metrics.py - All evaluation metrics
- utils/core.py - Configuration and device management
When extending the project, follow this consolidation pattern.
Happy molecular generation! 🧬✨
For additional support, please refer to the consolidated module documentation within each file.