This repository implements a comprehensive radiomics feature extraction pipeline for Head and Neck Cancer (HNC) CT imaging analysis. The pipeline extracts quantitative imaging biomarkers following Image Biomarker Standardization Initiative (IBSI) guidelines, enabling objective tumor characterization and outcome prediction from medical imaging data.
- 🔍 Automated RT Structure Detection: Intelligent search and validation of tumor contours across DICOM RT structure sets
- 📊 IBSI-Compliant Feature Extraction: 100+ standardized radiomics features (shape, first-order, texture)
- 🎯 Batch Processing: Efficient pipeline for large-scale cohort analysis
- 📈 Quality Control: Built-in data validation and error handling
- 🔬 Research-Ready: Output format compatible with machine learning frameworks
- Tumor characterization and staging
- Treatment response prediction
- Survival analysis and prognostic modeling
- Radiogenomics studies
- Clinical decision support systems
- Installation
- Quick Start
- Pipeline Components
- Dataset Requirements
- Usage Guide
- Output Description
- Configuration
- Troubleshooting
- Citation
- Contributing
- License
- Python 3.8 or higher
- Conda or pip package manager
- Minimum 8GB RAM
- DICOM dataset with CT scans and RT structure sets
# Create dedicated environment
conda create -n pyradiomics python=3.8
conda activate pyradiomics
# Install dependencies
pip install pyradiomics
pip install rt-utils
pip install SimpleITK
pip install pydicom
pip install pandas
pip install numpy
pip install jupyter

# Create virtual environment
python -m venv radiomics_env
source radiomics_env/bin/activate # On Windows: radiomics_env\Scripts\activate
# Install dependencies
pip install -r requirements.txt

import radiomics
import SimpleITK as sitk
import rt_utils
print(f"PyRadiomics version: {radiomics.__version__}")
print(f"SimpleITK version: {sitk.Version.VersionString()}")
print("Installation successful!")

Organize your DICOM data following this structure:
maastro_dataset/
├── Images/
│ ├── PATIENT-001/
│ │ ├── CT/ # CT DICOM series
│ │ └── RTSTRUCT/ # RT structure set
│ ├── PATIENT-002/
│ └── ...
Before feature extraction, identify the correct structure nomenclature in your dataset:
jupyter notebook check_RTstructure_name.ipynb

This notebook will:
- Scan all patient RT structure files
- List available structure names (GTV, CTV, PTV, etc.)
- Generate frequency statistics
- Provide code recommendations for your specific dataset
Example Output:
Structure Names Found:
'GTV-1': 180 patients (90.0%)
'CTV': 175 patients (87.5%)
'PTV': 200 patients (100.0%)
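The frequency report above can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the notebook's actual code: `structure_frequencies` and the example dictionary are hypothetical, and in practice the ROI names would be read from each patient's RTSTRUCT file (for instance with rt-utils) before being passed in.

```python
from collections import Counter


def structure_frequencies(names_by_patient):
    """Return {structure name: (patient count, percentage of patients)}."""
    n_patients = len(names_by_patient)
    counts = Counter()
    for names in names_by_patient.values():
        counts.update(set(names))  # count each name at most once per patient
    return {
        name: (count, 100.0 * count / n_patients)
        for name, count in counts.most_common()
    }


# Hypothetical input: patient ID -> ROI names found in that patient's RTSTRUCT
example = {
    "PATIENT-001": ["GTV-1", "CTV", "PTV"],
    "PATIENT-002": ["GTV-1", "PTV"],
    "PATIENT-003": ["gtv-1", "CTV", "PTV"],
    "PATIENT-004": ["GTV-1", "CTV", "PTV"],
}

freqs = structure_frequencies(example)
for name, (count, pct) in freqs.items():
    print(f"'{name}': {count} patients ({pct:.1f}%)")
```

Names are compared exactly here, so "GTV-1" and "gtv-1" are reported as separate structures, which makes naming inconsistencies visible.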
Run the main extraction pipeline:
jupyter notebook extract_radiomics_features.ipynb

Configure the target structure and processing mode:
TARGET_ROI = "GTV-1" # Update based on check_RTstructure output
FEATURE_MODE = "original" # Options: "original", "wavelet", "log", "all"
OUTPUT_CSV = "radiomics_features.csv"

The pipeline generates a CSV file with extracted features:
import pandas as pd
# Load results
df = pd.read_csv("radiomics_features.csv")
print(f"Patients analyzed: {len(df)}")
print(f"Features extracted: {len(df.columns) - 2}")
print(f"\nFeature categories:")
print(" - Shape features: ~14")
print(" - First-order features: ~18")
print(" - Texture features: ~75")

check_RTstructure_name.ipynb

Purpose: Dataset reconnaissance and structure nomenclature discovery
Functionality:
- Scans all patient RT structure files
- Extracts structure name lists
- Generates frequency distributions
- Identifies naming inconsistencies
- Provides structure selection code recommendations
Output:
- Console report with structure statistics
- Code snippets optimized for your dataset
When to Use:
- First time analyzing a new dataset
- Dataset contains multiple structure variants
- Uncertain about structure naming conventions
extract_radiomics_features.ipynb

Purpose: High-throughput radiomics feature extraction
Functionality:
- Reads DICOM CT series using SimpleITK
- Preserves spatial metadata (spacing, origin, direction)
- Handles multi-slice volumes
- Locates target ROI in RT structure sets
- Converts vector contours to binary masks
- Performs spatial registration with CT
Extracts IBSI-compliant features:
| Category | Features | Description |
|---|---|---|
| Shape | 14 | Geometric properties (volume, surface area, sphericity) |
| First-Order | 18 | Intensity statistics (mean, median, skewness, kurtosis) |
| GLCM | 24 | Gray Level Co-occurrence Matrix (texture) |
| GLRLM | 16 | Gray Level Run Length Matrix (texture) |
| GLSZM | 16 | Gray Level Size Zone Matrix (texture) |
| GLDM | 14 | Gray Level Dependence Matrix (texture) |
| NGTDM | 5 | Neighboring Gray Tone Difference Matrix (texture) |
Total Features: ~107 (original mode)
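As a quick arithmetic check, the per-class counts in the table can be summed to confirm the totals quoted in this README. The dictionary below simply restates the table; it is not read from PyRadiomics.

```python
# Feature counts per class, copied from the table above
features_per_class = {
    "Shape": 14,
    "First-Order": 18,
    "GLCM": 24,
    "GLRLM": 16,
    "GLSZM": 16,
    "GLDM": 14,
    "NGTDM": 5,
}

# The five matrix-based classes are the texture features
texture_classes = ("GLCM", "GLRLM", "GLSZM", "GLDM", "NGTDM")

total = sum(features_per_class.values())
texture_total = sum(features_per_class[name] for name in texture_classes)

print(f"Total features (original mode): {total}")
print(f"Texture features: {texture_total}")
```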
- Data validation and error handling
- Missing value detection
- Success rate reporting
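One way to picture the quality-control step is a small summary over per-patient results. The sketch below is an assumption about how such a report could be built, not the pipeline's own implementation; `summarize_quality` and the example values are hypothetical, and a failed extraction is represented as `None`.

```python
import math


def summarize_quality(results):
    """Summarize results given as {patient ID: feature dict, or None on failure}."""
    failed = [pid for pid, feats in results.items() if feats is None]
    with_missing = []
    for pid, feats in results.items():
        if feats is None:
            continue
        # A value counts as missing when it is None or NaN
        if any(
            v is None or (isinstance(v, float) and math.isnan(v))
            for v in feats.values()
        ):
            with_missing.append(pid)
    total = len(results)
    success_rate = 100.0 * (total - len(failed)) / total if total else 0.0
    return {
        "success_rate": success_rate,
        "failed": failed,
        "with_missing": with_missing,
    }


# Hypothetical results for four patients
example = {
    "PATIENT-001": {"original_firstorder_Mean": -142.5},
    "PATIENT-002": None,
    "PATIENT-003": {"original_firstorder_Mean": float("nan")},
    "PATIENT-004": {"original_firstorder_Mean": -156.2},
}

report = summarize_quality(example)
print(f"Success rate: {report['success_rate']:.1f}%")
print(f"Failed: {report['failed']}")
print(f"Missing values: {report['with_missing']}")
```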
Output:
- CSV file with complete feature matrix
- Processing logs and quality metrics
Your dataset must follow this standardized organization:
base_folder/
├── PATIENT-001/
│ ├── CT/ # Or "CT SCAN" (configurable)
│ │ ├── slice001.dcm
│ │ ├── slice002.dcm
│ │ └── ...
│ └── RTSTRUCT/
│ └── RS.1.2.3.dcm
├── PATIENT-002/
└── ...
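Before running the pipeline, the folder layout above can be checked with a short script. This is a minimal sketch under stated assumptions: `find_layout_problems` is a hypothetical helper, the sub-folder names default to "CT" and "RTSTRUCT", and files are recognized only by a lowercase `.dcm` extension. It does not open or validate the DICOM files themselves.

```python
import tempfile
from pathlib import Path


def find_layout_problems(base_folder, ct_name="CT", rtstruct_name="RTSTRUCT"):
    """Return {patient folder name: [problems]} for folders that break the layout."""
    problems = {}
    for patient_dir in sorted(Path(base_folder).iterdir()):
        if not patient_dir.is_dir():
            continue
        issues = []
        for sub in (ct_name, rtstruct_name):
            sub_dir = patient_dir / sub
            if not sub_dir.is_dir():
                issues.append(f"missing folder: {sub}")
            elif not any(sub_dir.glob("*.dcm")):
                issues.append(f"no .dcm files in: {sub}")
        if issues:
            problems[patient_dir.name] = issues
    return problems


# Demo on a throwaway directory tree (hypothetical patient IDs and file names)
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for sub in ("CT", "RTSTRUCT"):
        (base / "PATIENT-001" / sub).mkdir(parents=True)
        (base / "PATIENT-001" / sub / "file.dcm").touch()
    # PATIENT-002 has a CT series but no RTSTRUCT folder
    (base / "PATIENT-002" / "CT").mkdir(parents=True)
    (base / "PATIENT-002" / "CT" / "slice001.dcm").touch()
    problems = find_layout_problems(base)

print(problems)
```

If your dataset uses "CT SCAN" or another folder name, pass it through `ct_name` rather than editing the function.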
CT Series:
- Modality: CT
- Consistent slice spacing
- Complete series (no missing slices)
- Hounsfield Unit calibration
RT Structure Sets:
- Modality: RTSTRUCT
- Contains target tumor structures (GTV, CTV, etc.)
- Spatially registered with CT series
- Valid contour geometry
Common nomenclature patterns:
- Primary Tumor: GTV, GTV-1, GTV-Primary, GTV_1
- Clinical Target: CTV, CTV-High, CTV-Low
- Planning Target: PTV, PTV-1, PTV54, PTV60
- Organs at Risk: Lung_L, Lung_R, Heart, Esophagus, SpinalCord
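The naming patterns above lend themselves to a simple prefix-based grouping. The following sketch is illustrative only: `classify_roi` and its labels are hypothetical, and real datasets may need additional patterns (for example nodal GTVs or site-specific prefixes). Anything that does not start with GTV, CTV, or PTV falls into "Other", which includes organs at risk.

```python
import re

# Ordered (label, pattern) pairs; the first matching prefix wins
ROI_PATTERNS = [
    ("Primary Tumor", re.compile(r"^GTV", re.IGNORECASE)),
    ("Clinical Target", re.compile(r"^CTV", re.IGNORECASE)),
    ("Planning Target", re.compile(r"^PTV", re.IGNORECASE)),
]


def classify_roi(name):
    """Map a structure name to a coarse category based on its prefix."""
    cleaned = name.strip()
    for label, pattern in ROI_PATTERNS:
        if pattern.match(cleaned):
            return label
    return "Other"


for roi in ["GTV_1", "CTV-High", "PTV54", "Lung_L", "SpinalCord"]:
    print(f"{roi}: {classify_roi(roi)}")
```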
# 1. Import libraries
import os
from pathlib import Path
# 2. Configure paths
base_folder = "/path/to/maastro_dataset/Images"
os.chdir(base_folder)
# 3. Set analysis parameters
TARGET_ROI = "GTV-1"
FEATURE_MODE = "original"
OUTPUT_CSV = "radiomics_features.csv"
# 4. Run batch processing
# process_multiple_patients() is assumed to be defined earlier in the notebook
df = process_multiple_patients(
    base_folder=base_folder,
    output_csv=OUTPUT_CSV,
    target_roi=TARGET_ROI,
    feature_mode=FEATURE_MODE
)

# Mode 1: Original (Fast, ~107 features)
FEATURE_MODE = "original"
# Execution time: ~5-10 min per 100 patients
# Use case: Large cohorts, initial exploration
# Mode 2: Wavelet (Comprehensive, ~856 features)
FEATURE_MODE = "wavelet"
# Execution time: ~40-60 min per 100 patients
# Use case: Deep phenotyping, high-dimensional analysis
# Mode 3: LoG Filtering (~428 features)
FEATURE_MODE = "log"
# Execution time: ~20-30 min per 100 patients
# Use case: Multi-scale texture analysis
# Mode 4: All Transformations (~2000+ features)
FEATURE_MODE = "all"
# Execution time: ~2-3 hours per 100 patients
# Use case: Comprehensive feature discovery

If your dataset has multiple GTV variants:
# Option 1: Accept any GTV
for s in structures:
    if 'GTV' in s.upper():
        selected_structure = s
        break

# Option 2: Priority-based selection
gtv_priority = ['GTV-1', 'GTV', 'GTV-Primary']
for preferred in gtv_priority:
    if preferred in structures:
        selected_structure = preferred
        break

Before batch processing, test on one patient:
test_patient = "PATIENT-001"
features = process_patient_folder(
    test_patient,
    target_roi="GTV-1",
    feature_mode="original"
)
# Save test results
import pandas as pd
df_test = pd.DataFrame([features])
df_test.to_csv("test_output.csv", index=False)
print(f"Features extracted: {len(features) - 2}")

The output CSV file contains:
| PatientID | ROI_Name | original_shape_Volume | original_firstorder_Mean | ... |
|---|---|---|---|---|
| PATIENT-001 | GTV-1 | 15847.3 | -142.5 | ... |
| PATIENT-002 | GTV-1 | 22103.8 | -156.2 | ... |
Features follow PyRadiomics nomenclature:
{imageType}_{featureClass}_{featureName}
Examples:
- original_shape_Volume
- original_firstorder_Mean
- original_glcm_Contrast
- wavelet-LHL_glcm_Correlation (if wavelet mode enabled)
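Because column names follow the three-part pattern above, they can be split back into their components, for example to group features by class. This is a small sketch; `parse_feature_name` and `FeatureName` are hypothetical names, and the function assumes that only the two separating underscores matter, so it splits at most twice.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    image_type: str
    feature_class: str
    feature_name: str


def parse_feature_name(column):
    """Split '{imageType}_{featureClass}_{featureName}' into its three parts."""
    parts = column.split("_", 2)  # at most two splits, so three parts
    if len(parts) != 3:
        raise ValueError(f"Not a radiomics feature column: {column!r}")
    return FeatureName(*parts)


for col in ["original_shape_Volume", "wavelet-LHL_glcm_Correlation"]:
    parsed = parse_feature_name(col)
    print(parsed.image_type, parsed.feature_class, parsed.feature_name)
```

Metadata columns such as PatientID do not match the pattern and raise a `ValueError`, so filter them out before parsing.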
Shape Features (14):
Volume, SurfaceArea, Sphericity, Compactness, MajorAxisLength, MinorAxisLength, LeastAxisLength, Elongation, Flatness, SphericalDisproportion, etc.
First-Order Features (18):
Mean, Median, StandardDeviation, Variance, Skewness, Kurtosis, Energy, Entropy, Minimum, Maximum, Range, RootMeanSquared, etc.
Texture Features (~75):
- GLCM: Contrast, Correlation, Energy, Homogeneity, etc.
- GLRLM: Run length patterns and distributions
- GLSZM: Size zone characteristics
- GLDM: Dependence metrics
- NGTDM: Gray tone differences
If your dataset uses different folder names:
# In process_patient_folder() function
ct_folder = Path(patient_folder) / "CT SCAN" # Change from "CT"
rtstruct_folder = Path(patient_folder) / "STRUCTURES" # Change from "RTSTRUCT"

Customize feature extraction parameters:
from radiomics import featureextractor
extractor = featureextractor.RadiomicsFeatureExtractor()
# Modify settings
extractor.settings['binWidth'] = 25 # Intensity bin width
extractor.settings['resampledPixelSpacing'] = [1, 1, 1] # Resampling
extractor.settings['interpolator'] = 'sitkBSpline' # Interpolation method
# Disable specific feature classes
extractor.disableAllFeatures()
extractor.enableFeatureClassByName('firstorder')
extractor.enableFeatureClassByName('shape')

ValueError: ROI 'GTV-1' not found in any RTSTRUCT file
Solution:
- Run check_RTstructure_name.ipynb to identify correct structure names
- Update the TARGET_ROI parameter to match your dataset
- Check for case sensitivity (e.g., "GTV-1" vs "gtv-1")
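If case or stray whitespace is the only difference, a tolerant lookup can recover the name actually stored in the RTSTRUCT. The helper below is a hypothetical sketch, not part of the pipeline; it returns the stored spelling so that it can be passed on unchanged as the target ROI.

```python
def match_roi(target, available):
    """Return the entry of `available` equal to `target`, ignoring case and
    surrounding whitespace, or None when nothing matches."""
    wanted = target.strip().casefold()
    for name in available:
        if name.strip().casefold() == wanted:
            return name
    return None


# Hypothetical ROI names as they might appear in one RTSTRUCT file
roi_names = ["CTV", "gtv-1 ", "PTV"]
print(repr(match_roi("GTV-1", roi_names)))
print(repr(match_roi("GTV-2", roi_names)))
```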
ValueError: Mask size (512, 512, 120) doesn't match CT size (512, 512, 118)
Solution:
- Verify CT and RTSTRUCT are from same imaging session
- Check for missing CT slices
- Ensure proper spatial registration
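A missing CT slice usually shows up as an irregular step between consecutive slice positions. The sketch below assumes the z-coordinates have already been collected, for example from the third component of each slice's ImagePositionPatient attribute; `find_slice_gaps` is a hypothetical helper and treats the median step as the nominal slice spacing.

```python
import statistics


def find_slice_gaps(z_positions, tolerance=1e-3):
    """Return (z_before, z_after) pairs where the step exceeds the median spacing."""
    zs = sorted(z_positions)
    if len(zs) < 3:
        return []  # too few slices to estimate a nominal spacing
    steps = [b - a for a, b in zip(zs, zs[1:])]
    nominal = statistics.median(steps)
    return [
        (a, b)
        for a, b, step in zip(zs, zs[1:], steps)
        if step > nominal + tolerance
    ]


# Hypothetical positions in mm with the slice at z = 9.0 missing
positions = [0.0, 3.0, 6.0, 12.0, 15.0]
gaps = find_slice_gaps(positions)
print(gaps)
```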
MemoryError: Unable to allocate array
Solution:
- Process in smaller batches
- Use "original" mode instead of "all"
- Increase system RAM or use high-memory computing resources
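Processing in smaller batches can be as simple as slicing the patient list and running the pipeline once per slice. This is a generic sketch; `batched` is a hypothetical helper and the batch size is an arbitrary example, to be tuned to the available memory.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical patient IDs split into batches of two
patient_ids = [f"PATIENT-{i:03d}" for i in range(1, 6)]
batches = list(batched(patient_ids, 2))
for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {batch}")
```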
Error: No DICOM files in RTSTRUCT directory
Solution:
- Verify RTSTRUCT files are .dcm format
- Check file permissions
- Ensure complete data transfer from source
For Large Datasets (>500 patients):
- Use Original Mode: Faster execution, sufficient for most analyses
- Parallel Processing: Modify code to use multiprocessing
- Checkpoint System: Save intermediate results periodically
- Cloud Computing: Consider AWS/GCP for scalability
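The checkpoint idea can be sketched with the standard library alone: append one row per patient as soon as it is extracted, and skip patients already present in the CSV when the run is restarted. `run_with_checkpoint` and `fake_extract` are hypothetical; in the real pipeline the `extract` argument would wrap the per-patient extraction function, and the sketch assumes every patient yields the same feature keys in the same order.

```python
import csv
import tempfile
from pathlib import Path


def run_with_checkpoint(patient_ids, extract, output_csv):
    """Append one CSV row per patient, skipping IDs already in `output_csv`."""
    path = Path(output_csv)
    done = set()
    if path.exists():
        with path.open(newline="") as f:
            done = {row["PatientID"] for row in csv.DictReader(f)}
    processed = []
    for pid in patient_ids:
        if pid in done:
            continue  # already saved by an earlier run
        row = {"PatientID": pid, **extract(pid)}
        write_header = not path.exists() or path.stat().st_size == 0
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if write_header:
                writer.writeheader()
            writer.writerow(row)
        processed.append(pid)
    return processed


def fake_extract(pid):
    """Stand-in for the real extraction; returns a fixed dummy feature."""
    return {"original_firstorder_Mean": -100.0}


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "features.csv"
    first = run_with_checkpoint(["P1", "P2"], fake_extract, out)
    # Simulated restart: only the new patient is processed
    second = run_with_checkpoint(["P1", "P2", "P3"], fake_extract, out)
    with out.open(newline="") as f:
        saved_ids = [row["PatientID"] for row in csv.DictReader(f)]

print(first, second, saved_ids)
```

Writing after every patient costs some I/O but means an interrupted run loses at most the patient in progress.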
If you use this pipeline in your research, please cite:
@software{hnc_radiomics_pipeline,
author = {Shaikh, Hasan},
title = {HNC Radiomics: Quantitative Imaging Analysis Pipeline},
year = {2025},
url = {https://github.com/hash123shaikh/hnc-radiomics}
}

@article{vanGriethuysen2017,
title={Computational Radiomics System to Decode the Radiographic Phenotype},
author={van Griethuysen, Joost JM and Fedorov, Andriy and Parmar, Chintan and Hosny, Ahmed and Aucoin, Nicole and Narayan, Vivek and Beets-Tan, Regina GH and Fillion-Robin, Jean-Christophe and Pieper, Steve and Aerts, Hugo JWL},
journal={Cancer Research},
volume={77},
number={21},
pages={e104--e107},
year={2017},
publisher={AACR}
}

@article{Zwanenburg2020,
title={The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High-Throughput Image-based Phenotyping},
author={Zwanenburg, Alex and Valli{\`e}res, Martin and Abdalah, Mahmoud A and others},
journal={Radiology},
volume={295},
number={2},
pages={328--338},
year={2020},
publisher={Radiological Society of North America}
}

We welcome contributions to improve this pipeline!
- Fork the Repository
- Create Feature Branch: git checkout -b feature/improvement
- Commit Changes: git commit -m "Add new feature"
- Push to Branch: git push origin feature/improvement
- Submit Pull Request
- Multi-threading for batch processing
- Additional feature extraction modes
- Integration with other imaging modalities (MRI, PET)
- Automated quality control metrics
- Visualization tools for feature analysis
- Docker containerization
- Cloud deployment scripts
Found a bug? Please report it:
- Use GitHub Issues
- Provide minimal reproducible example
- Include error messages and system information
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Hasan Shaikh
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
- PyRadiomics Development Team for the feature extraction framework
- MAASTRO Clinic for dataset resources
- Image Biomarker Standardization Initiative (IBSI) for standardization guidelines
- Medical Imaging Community for open-source tools and collaboration
Author: Hasan Shaikh
Email: hasanshaikh3198@gmail.com
GitHub: @hash123shaikh
LinkedIn: Connect with me
For questions, suggestions, or collaboration opportunities, please reach out!
Last Updated: January 2025
Version: 1.0.0
Status: Active Development
Made with ❤️ for the Medical Imaging Research Community