A Customized Version of the Original SUPIR Project
- Removed the heavy LLaVA implementation.
- Added safetensors support.
- Updated dependencies.
- Replaced SoftMax with SDPA for default attention.
- Removed `use_linear_control_scale` (`linear_s_stage2`) and `use_linear_cfg_scale` (`linear_CFG`) arguments. Instead, the start and end scale values determine whether linear scaling has any effect.
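A rough sketch of the idea (illustrative code, not the project's actual implementation): equal start and end values yield a constant scale, so no flag is needed, while differing values produce a linear ramp across the sampling steps.

```python
def linear_schedule(start: float, end: float, steps: int) -> list[float]:
    """Illustrative per-step scale values; equal start/end means no scaling effect."""
    if steps <= 1 or start == end:
        return [start] * max(steps, 1)
    # Linear ramp from start to end across the sampling steps
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]

# Same start and end: constant scale, no scheduling in effect
print(linear_schedule(4.0, 4.0, 4))   # → [4.0, 4.0, 4.0, 4.0]
# Differing values: linear ramp from start to end
print(linear_schedule(2.0, 4.0, 3))   # → [2.0, 3.0, 4.0]
```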
- Renamed arguments to make settings a bit more intuitive (closer alignment with kijai's SUPIR ComfyUI custom nodes):
  - `spt_linear_CFG` → `cfg_scale_start`
  - `s_cfg` → `cfg_scale_end`
  - `spt_linear_s_stage2` → `control_scale_start`
  - `s_stage2` → `control_scale_end`
- Added `--skip_denoise_stage` argument to bypass the artifact-removal preprocessing step that uses the specialized VAE denoise encoder. This step usually leaves the image slightly softened (before the sampling stage), since you do not want artifacts to be treated as detail to be enhanced. You might want to skip it if your image is already high quality.
- Refactor: renamed symbol `upsacle` in the original code to `upscale`.
- Moved CLIP paths to a YAML config file.
- Exposed `sampler_tile_size` and `sampler_tile_stride` to make them overridable when using `TiledRestoreEDMSampler`.
- SUPIR settings saved into PNGInfo metadata.
- Parallel processing for Tiled VAE encoding/decoding
- Improved memory management. On each run, it clears unused GPU memory (VRAM), runs Python garbage collection, and releases unused RAM back to the system (Linux only).
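A minimal sketch of this kind of per-run cleanup (the function name and exact calls here are illustrative, not this repo's code):

```python
import ctypes
import gc
import platform

def release_memory() -> None:
    """Illustrative cleanup: GC, free cached VRAM, return heap pages to the OS."""
    gc.collect()  # collect Python garbage first so tensors are actually freed
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached, unused VRAM to the driver
    except ImportError:
        pass  # torch not installed; nothing GPU-side to clear
    if platform.system() == "Linux":
        try:
            # glibc's malloc_trim returns freed heap memory to the OS (Linux only)
            ctypes.CDLL("libc.so.6").malloc_trim(0)
        except (OSError, AttributeError):
            pass  # non-glibc libc; skip
```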
Processing Times (seconds) with Models Preloaded
VRAM usage: ~12GB
Note: Performance will vary depending on system specs beyond the GPU (CPU speed, memory bandwidth, etc.), so treat this only as a rough guide.
| GPU Model | 1024×1024 | 2048×2048 | 3072×3072 |
|---|---|---|---|
| H100 | 15 s | 95 s | 243 s |
| RTX Pro 6000 | 10 s | 71 s | 190 s |
| RTX 5090 | 14 s | 97 s | 254 s |
| RTX 4090 | 18 s | 133 s | 329 s |
| RTX 3090 | 26 s | 206 s | 560 s |
I’ve found a max upscale between 2048×2048 and 4096×4096 to be the sweet spot for refinement work. 4096×4096 can yield smoother results (depending on the image), but it will be a LOT slower.
When working with large but imperfect images (for example, 45MP+ negative scans that are old or grainy), I split them into 2048×2048 tiles. This lets me refine each section independently while still preserving fine detail. The trade-off is that each tile requires its own prompt, along with some careful blending in Photoshop. Using overlaps between tiles makes this process easier. While it adds extra manual work, the payoff is much greater control. You can adjust prompts to suit the unique details of each region, whether that is faces, textures, text, or backgrounds, instead of relying on a single global prompt that may not work well for the entire image.
It's best to experiment, if you have the patience.
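A hypothetical helper for planning such a tiling (illustrative, not part of this repo): it computes the top-left offsets along one axis so that fixed-size tiles cover the whole dimension with a chosen overlap, clamping the last tile to the edge.

```python
def tile_origins(length: int, tile: int = 2048, overlap: int = 256) -> list[int]:
    """Top-left offsets along one axis so tiles cover `length`, sharing `overlap` px."""
    if length <= tile:
        return [0]  # image fits in a single tile
    stride = tile - overlap
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # clamp a final tile to the edge
    return origins

# A 4500 px wide scan with 2048 px tiles and 256 px overlap:
print(tile_origins(4500))   # → [0, 1792, 2452]
```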
- Python 3.12
- Git
git clone https://github.com/yushan777/SUPIR-Demo.git
cd SUPIR-Demo
# Linux
chmod +x *.sh
./install_linux_local.sh
# Linux (Vast.ai)
./install_vastai.sh
# Windows
install_win_local.bat

You can download the models at the same time while the venv is being installed (in a separate terminal):
# Linux
./download_models.sh
# Windows
download_models.bat

ℹ️ See more information:
If you prefer to download the models manually, or in your own time, the links are below.
Additionally, if you already have these models, you can simply symlink them to these locations to save storage space.
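For example, the symlink idea can be sketched in a few lines of Python (the checkpoint path in the comment is illustrative):

```python
import os

def link_model(existing_path: str, target_path: str) -> None:
    """Symlink an already-downloaded checkpoint into the folder the repo expects."""
    os.makedirs(os.path.dirname(target_path), exist_ok=True)
    if not os.path.lexists(target_path):  # don't clobber an existing file/link
        os.symlink(os.path.abspath(existing_path), target_path)

# e.g. link a shared SDXL checkpoint (hypothetical path) into models/SDXL/
# link_model("/data/checkpoints/juggernaut_xl_v9.safetensors",
#            "models/SDXL/juggernaut_xl_v9.safetensors")
```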
Used for captioning the input image in the Gradio demo.
SmolVLM-500M-Instruct
Place all files into `models/SmolVLM-500M-Instruct`
Unless you have more than 24GB of VRAM, you should download the FP16 variants.

FP16 Versions
SUPIR-v0Q (FP16)
SUPIR-v0F (FP16)
Download and place the model files in the `models/SUPIR/` directory.
FP32 Versions
SUPIR-v0Q (FP32)
SUPIR-v0F (FP32)
Download and place the model files in the `models/SUPIR/` directory.
- CLIP Encoder-1
  Place in `models/CLIP1`
- CLIP Encoder-2
  Place in `models/CLIP2`
- Juggernaut-XL_v9_RunDiffusionPhoto_v2
  Place in `models/SDXL`
You can use your own preferred SDXL model. One that specialises in realism/photography will work better.
There are two SUPIR model variants: v0Q and v0F.
- SUPIR-v0Q: The v0Q model (Quality) is trained on a wide range of degradations, making it robust and effective across varied real-world scenarios. However, this broad generalization comes at a cost: when applied to images with only mild degradation, v0Q might overcompensate, hallucinate, or alter details that are already mostly intact. This behavior stems from its training bias toward assuming significant visual damage.
- SUPIR-v0F: In contrast, the v0F model (Fidelity) is specifically trained on lighter degradation patterns. Its Stage1 encoder is tuned to better preserve fine details and structure, resulting in restorations that are more faithful to the input when the degradation is minimal. As a result, v0F is the preferred choice for high-fidelity restoration where subtle preservation is more critical than aggressive enhancement.
- If necessary, edit the custom paths for checkpoints. Otherwise, leave these alone.
  - `options/SUPIR_v0.yaml` → `SDXL_CKPT`, `SUPIR_CKPT_Q`, `SUPIR_CKPT_F`
  - `options/SUPIR_v0_tiled.yaml` → `SDXL_CKPT`, `SUPIR_CKPT_Q`, `SUPIR_CKPT_F`
# Linux
source venv/bin/activate
python3 run_supir_gradio.py
# or you can start it with the bash script (contains the above two commands)
chmod +x launch_gradio.sh
./launch_gradio.sh
# =======================================
# Windows
venv\Scripts\activate.bat
python run_supir_gradio.py

Default settings can be set in the file defaults.json. If it doesn't exist, just copy and rename defaults_example.json.
# for cli test
python3 run_supir.py --img_path 'input/bottle.png' --save_dir ./output --SUPIR_sign Q --upscale 2 --use_tile_vae --loading_half_params
python3 run_supir.py \
--img_path 'input/woman-low-res-sq.jpg' \
--save_dir ./output \
--SUPIR_sign Q \
--upscale 2 \
--seed 1234567891 \
--img_caption 'A woman has dark brown eyes, dark curly hair wearing a dark scarf on her head. She is wearing a black shirt on with a pattern on it. The wall behind her is brown and green.' \
--edm_steps=50 \
--s_churn=5 \
--cfg_scale_start=2.0 \
--cfg_scale_end=4.0 \
--control_scale_start=0.9 \
--control_scale_end=0.9 \
--loading_half_params \
--use_tile_vae

Sampler: TiledRestoreEDMSampler
Tiled VAE: True
Number of Workers: 1
Linux, 64GB RAM
| Upscale | 4090 Time | 4090 VRAM | 4080 Time | 4080 VRAM | 4070 Time | 4070 VRAM |
|---|---|---|---|---|---|---|
| 2x | 111 secs | 14.0GB | 227 secs | 13.7GB | 301 secs | 11.7GB |
| 3x | 315 secs | 14.1GB | 475 secs | 13.8GB | 652 secs | 11.7GB |
| 4x | 606 secs | 14.6GB | 910 secs | 13.9GB | 1625 secs | 11.7GB |
| 5x | 992 secs | 15.0GB | 1492 secs | 14.6GB | OOM | OOM |
| Argument | Description |
|---|---|
| `img_path` | Path to the input image. (required) |
| `save_dir` | Directory to save the output. |
| `SUPIR_sign` | Model type. Options: `['F', 'Q']`<br>Default: `'Q'`<br>Q model (Quality): Trained on diverse, heavy degradations, making it robust for real-world damage. However, it may overcorrect or hallucinate when used on lightly degraded images due to its bias toward severe restoration.<br>F model (Fidelity): Optimized for mild degradations, preserving fine details and structure. Ideal for high-fidelity tasks where subtle restoration is preferred over aggressive enhancement. |
| `skip_denoise_stage` | Skips the VAE denoiser stage. Default: `False`<br>Bypasses the artifact-removal preprocessing step that uses the specialized VAE denoise encoder. This step usually leaves the image slightly softened (if you inspect it at this stage); this is to avoid SUPIR treating low-res/compression artifacts as detail to be enhanced. You may wish to skip this step if:<br>1) You want to do your own pre-processing, or<br>2) The input image is clean and free of low-res/compression artifacts or other degradations.<br>The denoise stage can sometimes make closeups of skin textures look a bit unnatural. |
| `sampler_mode` | Sampler choice. Options: `['TiledRestoreEDMSampler', 'RestoreEDMSampler']`<br>Default: `'TiledRestoreEDMSampler'` (uses less VRAM) |
| `seed` | Random seed for reproducibility. Default: `1234` |
| Use Upscale to.. | If on, the Upscale to width and Upscale to height values are used for upscaling. If off, the Upscale by factor is used. |
| Upscale to width | Upscale input image width to the specified dimension if Use Upscale to.. is on. Minimum: 1024 |
| Upscale to height | Upscale input image height to the specified dimension if Use Upscale to.. is on. Minimum: 1024 |
| Upscale by | Upscale factor for the input image. Default: `2`<br>Upscaling of the input image is performed before the denoising and sampling stages. Both dimensions are multiplied by the upscale value. If the smaller dimension is still < 1024px, the image is further enlarged to a minimum of 1024px (aspect ratio maintained). |
| *** | Notes about upscaling: the reason for the minimum of 1024 is to give SDXL a comfortable working resolution. Note that dimensions are snapped to the nearest multiple of 64. The sweet spot seems to be between 2x and 4x (1024×1024) or 4x and 8x (512×512). Beyond that, the quality begins to collapse. The higher the scale factor, the slower the process. |
| `min_size` | Minimum output resolution. Default: `1024` |
| `num_samples` | Number of images to generate per input. Default: `1` |
| `img_caption` | Specific caption for the input image. Default: `''`<br>This caption is combined with `a_prompt`. |
| `a_prompt` | Additional positive prompt (appended to the input caption). Default: `Cinematic, High Contrast, highly detailed, taken using a Canon EOS R camera, hyper detailed photo - realistic maximum detail, 32k, Color Grading, ultra HD, extreme meticulous detailing, skin pore detailing, hyper sharpness, perfect without deformations.` |
| `n_prompt` | Negative prompt. Default: `painting, oil painting, illustration, drawing, art, sketch, cartoon, CG Style, 3D render, unreal engine, blurring, dirty, messy, worst quality, low quality, frames, watermark, signature, jpeg artifacts, deformed, lowres, over-smooth` |
| `edm_steps` | Number of diffusion steps. Default: `50` |
| `s_churn` | Controls how much extra randomness is added during the process. This helps the model explore more options and avoid getting stuck on a limited result. Default: `5`<br>0: No noise (deterministic)<br>1–5: Mild/moderate<br>6–10+: Strong |
| `s_noise` | Scales `s_churn` noise strength. Default: `1.003`<br>Slightly < 1: More stable<br>Slightly > 1: More variation |
| `cfg_scale_start` | Prompt guidance strength at the start. Default: `2.0` |
| `cfg_scale_end` | Prompt guidance strength at the end. Default: `4`<br>1.0: Weak (ignores prompt)<br>7.5: Strong (follows prompt closely)<br>If `cfg_scale_start` and `cfg_scale_end` have the same value, no scaling occurs. When these values differ, linear scheduling is applied from start to end. They can also be reversed for creative strategies. |
| CFG Sweep | Enables a mode to test a range of CFG scale values. When checked, multiple images are generated, each with a different CFG scale, stepping from CFG Scale Start to CFG Scale End. The seed is fixed during the sweep to ensure comparability between images. |
| CFG Sweep Step | The increment used to step from the start to the end CFG scale value during a sweep. |
| CFG Sweep Direction | Defines how the start and end value pairs are varied during a sweep.<br>Forward: the start value increases while the end value stays fixed. Example: 2/8 → 3/8 → 4/8 → 5/8 ...<br>Backward: the end value decreases while the start value stays fixed. Example: 2/8 → 2/7 → 2/6 → 2/5 ... |
| Control Guidance Scale | Guides how strongly the overall structure of the input image is preserved. The process moves from a start scale (at the beginning, with high noise) to an end scale (at the end, with low noise).<br>Control Scale Start: structural guidance strength at the beginning of the process. Lower values allow more creative freedom early on.<br>Control Scale End: structural guidance strength at the end of the process. Higher values ensure the final details conform closely to the original image.<br>Example: start=0.0 / end=1.0 begins with high creativity (ignoring the original structure) and ends by strictly adhering to the original image's structure for the final result. |
| `control_scale_start` | Structural guidance from the input image, start strength. Default: `0.9` |
| `control_scale_end` | Structural guidance from the input image, end strength. Default: `0.9`<br>0.0: Disabled<br>0.1–0.5: Light<br>0.6–1.0: Balanced (default)<br>1.1–1.5+: Very strong<br>Same value = fixed. Different values = scheduled. |
| `restoration_scale` | Early-stage restoration strength. Controls how strongly the model pulls the structure of the output image back toward the original image. Only applies during the early stages of sampling when the noise level is high. Default: `0` (disabled). |
| `color_fix_type` | Color adjustment method. Default: `'Wavelet'`<br>Options: `['None', 'AdaIn', 'Wavelet']` |
| `loading_half_params` | Loads the SUPIR model weights in half precision (FP16). Default: `False`<br>Reduces VRAM usage and increases speed at the cost of slight precision loss. |
| `diff_dtype` | Precision for the diffusion model only. Allows overriding the default precision independently, unless `loading_half_params` is set.<br>Default: `'fp16'`<br>Options: `['fp32', 'fp16', 'bf16']` |
| `ae_dtype` | Autoencoder precision. Default: `'bf16'`<br>Options: `['fp32', 'bf16']` |
| `use_tile_vae` | Enables tile-based encoding/decoding for memory efficiency with large images. Default: `False` |
| `encoder_tile_size` | Tile size when encoding (when `use_tile_vae` is enabled). The TileVAE code has recommended tile sizes based on available VRAM if a CUDA device is available.<br>Encoder:<br>VRAM > 16GB: 3072<br>VRAM > 12GB: 2048<br>VRAM > 8GB: 1536<br>VRAM <= 8GB: 960<br>No GPU: 512 |
| `decoder_tile_size` | Tile size when decoding (when `use_tile_vae` is enabled). The TileVAE code has recommended tile sizes based on available VRAM if a CUDA device is available.<br>Decoder:<br>VRAM > 30GB: 256<br>VRAM > 16GB: 192<br>VRAM > 12GB: 128<br>VRAM > 8GB: 96<br>VRAM <= 8GB: 64<br>No GPU: 64 |
| Number of Workers | Number of parallel CPU processes for VAE encoding/decoding. Improves speed on multi-core CPUs by efficiently preparing data for the GPU. Default: `4` |
| `sampler_tile_size` | Tile size for `TiledRestoreEDMSampler`. This is the size of each tile that the image is divided into during tiled sampling. Example: tile_size of 128 → image is split into 128×128 pixel tiles. |
| `sampler_tile_stride` | Tile stride for `TiledRestoreEDMSampler`. Controls overlap between tiles during sampling.<br>Smaller tile_stride = more overlap, better blending, more compute.<br>Larger tile_stride = less overlap, faster, may cause seams.<br>Overlap = tile_size − tile_stride<br>Example: tile_size = 128, stride = 64 → 64 px overlap. |
Images from Pixabay
Original SUPIR Repository
Kijai's SUPIR Custom Nodes for ComfyUI




