Joy Caption is a ComfyUI custom node powered by the LLaVA model for efficient, stylized image captioning. Caption Tools nodes handle batch image processing and automatic separation of caption text.
-
2025/08/22: Update ComfyUI-JoyCaption to v2.0.0 ( update.md )
- Added GGUF model support for better performance and compatibility
- Fixed critical parameter handling issues (top_k, image format)
- Improved console output (no more base64 spam)
- Enhanced error handling and stability
- Added JoyCaption GGUF and JoyCaption GGUF (Advanced) nodes
- Enhanced GGUF Support: Comprehensive support for 12 quantization levels (Q2_K to F16)
- llama_cpp_install folder: Complete installation guides and automated scripts for llama-cpp-python
- Simplified Installation: One-click llama-cpp-python installation with automatic CUDA support
- Cross-Platform Support: Windows, macOS, and Linux installation guides
- Performance Improvements: Optimized model loading and memory management
-
2025/06/15: Update ComfyUI-JoyCaption to v1.2.0 ( update.md )
- Enhanced CUDA performance optimization
- Improved memory management strategies
- Added Global Cache mode for faster processing
-
2025/06/07: Update ComfyUI-JoyCaption to v1.1.1 ( update.md )
-
2025/06/05: Update ComfyUI-JoyCaption to v1.1.0 ( update.md )
-
Add Caption tools
- Simple and user-friendly interface
- Multiple caption types support
- Flexible length control
- Memory optimization options
- GGUF model support - Efficient quantized models for better performance and lower memory usage
- Automatic model download - Models will be downloaded automatically and properly renamed on first use
- Support for multiple caption types
- Support for advanced customization options
- Multiple precision options (fp32, bf16, fp16, 8-bit, 4-bit)
- Enhanced memory management with Global Cache mode
- Clean console output (no debug spam)
- Robust error handling and parameter validation
- llama_cpp_install folder - Complete installation guides and automated scripts for llama-cpp-python
- Clone this repository to your
ComfyUI/custom_nodes
directory:
cd ComfyUI/custom_nodes
git clone https://github.com/1038lab/ComfyUI-JoyCaption.git
- Install the required dependencies:
cd ComfyUI/custom_nodes/ComfyUI-JoyCaption
pip install -r requirements.txt
- For GGUF Models (Recommended): Install llama-cpp-python with CUDA support:
# Option 1: Automated installation (Recommended)
python llama_cpp_install/llama_cpp_install.py
# Option 2: Manual installation
pip install -r requirements_gguf.txt
Installation Guides Available:
The models will be automatically downloaded and renamed on first use, or you can manually download them:
Model | Link | Memory Usage |
---|---|---|
JoyCaption Beta One | Download | ~8-16GB |
JoyCaption Alpha Two | Download | ~8-16GB |
Model | Size | Memory Usage | Quality | Recommended For | Link |
---|---|---|---|---|---|
JoyCaption Beta One (Q2_K) | 3.18GB | ~4GB | Good | Low VRAM (6GB+) | Download |
JoyCaption Beta One (Q3_K_S) | 3.66GB | ~5GB | Good+ | Budget systems (8GB+) | Download |
JoyCaption Beta One (Q3_K_M) | 4.02GB | ~5GB | Better | Balanced performance | Download |
JoyCaption Beta One (Q3_K_L) | 4.32GB | ~6GB | Better+ | Good quality/size ratio | Download |
JoyCaption Beta One (IQ4_XS) | 4.48GB | ~6GB | Very Good | Recommended (8GB+) | Download |
JoyCaption Beta One (Q4_K_S) | 4.69GB | ~6GB | Very Good | Quality focused | Download |
JoyCaption Beta One (Q4_K_M) | 4.92GB | ~7GB | Very Good+ | Balanced choice | Download |
JoyCaption Beta One (Q5_K_S) | 5.60GB | ~7GB | Excellent | High quality (10GB+) | Download |
JoyCaption Beta One (Q5_K_M) | 5.73GB | ~8GB | Excellent+ | Premium quality | Download |
JoyCaption Beta One (Q6_K) | 6.60GB | ~8GB | Near Original | Maximum quality (12GB+) | Download |
JoyCaption Beta One (Q8_0) | 8.54GB | ~10GB | Original- | Full precision alternative | Download |
JoyCaption Beta One (F16) | 16.1GB | ~18GB | Original | Full precision (24GB+) | Download |
Model | Size | Source | Link |
---|---|---|---|
JoyCaption Beta One (Q4_K) | 4.92GB | concedo | Download |
JoyCaption Beta One (Q8_0) | 8.54GB | concedo | Download |
JoyCaption Beta One (F16) | 16.1GB | concedo | Download |
Note: GGUF models also require the vision projection model:
After downloading, place the model files in your ComfyUI/models/LLM/GGUF
directory.
- Add the "JoyCaption" node from the
🧪AILab/📝JoyCaption
category - Connect an image source to the node
- Select the model file (defaults to
llama-joycaption-beta-one-hf-llava
) - Adjust the parameters as needed
- Run the workflow
- Add the "JoyCaption (Advanced)" node from the
🧪AILab/📝JoyCaption
category - Connect an image source to the node
- Select the caption type
- Adjust the parameters as needed
- Run the workflow
- Add the "JoyCaption GGUF" node from the
🧪AILab/📝JoyCaption-GGUF
category - Connect an image source to the node
- Select the GGUF model (e.g., "JoyCaption Beta One (IQ4_XS)")
- Choose processing mode (GPU/CPU)
- Select caption style and length
- Run the workflow
- Add the "JoyCaption GGUF (Advanced)" node from the
🧪AILab/📝JoyCaption-GGUF
category - Connect an image source to the node
- Configure all generation parameters (temperature, top_p, top_k, etc.)
- Set custom prompts if needed
- Run the workflow
This node allows you to load multiple images from a directory for batch processing:
Parameter | Description |
---|---|
image_dir | Directory path containing images |
batch_size | Number of images to load (0 = all images) |
start_from | Start from Nth image (1 = first image) |
sort_method | Image loading order: sequential/reverse/random |
This node saves generated captions to text files:
Parameter | Description |
---|---|
string | Caption text to save |
image_path | Path to the image being captioned |
image | (Optional) Image to save alongside the caption |
custom_output_path | (Optional) Custom output directory path |
custom_file_name | (Optional) Custom filename (without extension) |
overwrite | If true, will overwrite existing files; if false, will add a number to make the filename unique |
Parameter | Description | Default | Range |
---|---|---|---|
Model | The JoyCaption model to use | llama-joycaption-beta-one-hf-llava |
- |
Memory Control | Memory optimization settings | Default | Default (fp32), Balanced (8-bit), Maximum Savings (4-bit) |
Caption Type | Caption style selection | Descriptive | Descriptive, Descriptive (Casual), Straightforward, Tags, Technical, Artistic |
Caption Length | Output length control | medium | any, very short, short, medium, long, very long |
Parameter | Description | Default | Range |
---|---|---|---|
Model | GGUF model to use | JoyCaption Beta One (IQ4_XS) | Q2_K, IQ4_XS, Q5_K_M, Q6_K |
Processing Mode | Hardware acceleration | GPU | GPU, CPU |
Caption Type | Caption style selection | Descriptive | Descriptive, Descriptive (Casual), Straightforward, Tags, Technical, Artistic |
Caption Length | Output length control | any | any, very short, short, medium, long, very long |
Memory Management | Model memory handling | Keep in Memory | Keep in Memory, Clear After Run, Global Cache |
Parameter | Description | Default | Range |
---|---|---|---|
Model | GGUF model to use | JoyCaption Beta One (IQ4_XS) | Q2_K, IQ4_XS, Q5_K_M, Q6_K |
Processing Mode | Hardware acceleration | GPU | GPU, CPU |
Max New Tokens | Maximum tokens to generate | 512 | 1-2048 |
Temperature | Generation randomness | 0.6 | 0.0-2.0 |
Top-p | Nucleus sampling | 0.9 | 0.0-1.0 |
Top-k | Top-k sampling | 0 | 0-100 |
Custom Prompt | Custom prompt template | "" | Any text |
Memory Management | Model memory handling | Keep in Memory | Keep in Memory, Clear After Run, Global Cache |
Mode | Precision | Memory Usage | Speed | Quality | Recommended GPU |
---|---|---|---|---|---|
Default | fp32 | ~16GB | 1x | Best | 24GB+ |
Default | bf16 | ~8GB | 1.5x | Excellent | 16GB+ |
Default | fp16 | ~8GB | 2x | Very Good | 16GB+ |
Balanced | 8-bit | ~4GB | 2.5x | Good | 12GB+ |
Maximum Savings | 4-bit | ~2GB | 3x | Acceptable | 8GB+ |
Quantization | Model Size | Memory Usage | Speed | Quality | Recommended GPU |
---|---|---|---|---|---|
Q2_K | 3.18GB | ~4GB | 4x | Good | 6GB+ |
Q3_K_S | 3.66GB | ~5GB | 3.5x | Good+ | 8GB+ |
Q3_K_M | 4.02GB | ~5GB | 3.5x | Better | 8GB+ |
Q3_K_L | 4.32GB | ~6GB | 3x | Better+ | 8GB+ |
IQ4_XS | 4.48GB | ~6GB | 3x | Very Good | 8GB+ |
Q4_K_S | 4.69GB | ~6GB | 3x | Very Good | 8GB+ |
Q4_K_M | 4.92GB | ~7GB | 2.5x | Very Good+ | 10GB+ |
Q5_K_S | 5.60GB | ~7GB | 2.5x | Excellent | 10GB+ |
Q5_K_M | 5.73GB | ~8GB | 2.5x | Excellent+ | 12GB+ |
Q6_K | 6.60GB | ~8GB | 2x | Near Original | 12GB+ |
Q8_0 | 8.54GB | ~10GB | 1.8x | Original- | 16GB+ |
F16 | 16.1GB | ~18GB | 1.5x | Original | 24GB+ |
Note: GGUF models provide better performance and lower memory usage compared to standard models. IQ4_XS offers the best balance of quality and efficiency.
Parameter | Description | Default | Range |
---|---|---|---|
Extra Options | Additional feature options | [] | Multiple options |
Person Name | Name for person descriptions | "" | Any text |
Max New Tokens | Maximum tokens to generate | 512 | 1-2048 |
Temperature | Generation temperature | 0.6 | 0.0-2.0 |
Top-p | Sampling parameter | 0.9 | 0.0-1.0 |
Top-k | Top-k sampling | 0 | 0-100 |
Precision | Use fp32 for best quality, bf16 for balanced performance, fp16 for maximum speed. 8-bit and 4-bit quantization provide significant memory savings with minimal quality impact |
Mode | Description | Recommended Use |
---|---|---|
Global Cache | Keeps model in memory for fastest processing | High VRAM GPUs (16GB+) |
Keep in Memory | Maintains model in memory until workflow ends | Medium VRAM GPUs (8GB+) |
Clear After Run | Clears model after each generation | Low VRAM GPUs (<8GB) |
Setting | Recommendation |
---|---|
Model Choice | Use GGUF models for better performance and lower memory usage. IQ4_XS offers the best balance of quality and efficiency |
Memory Mode | Based on GPU VRAM: 24GB+ use Global Cache, 12GB+ use Keep in Memory, 8GB+ use Clear After Run |
Processing Mode | Use GPU mode for faster processing, CPU mode for systems with limited VRAM |
Input Resolution | The model works best with images of 512x512 or higher resolution |
Memory Usage | If you encounter memory issues, try GGUF Q2_K model or use Clear After Run mode |
Performance | For batch processing, consider using Global Cache mode with GGUF models if you have sufficient VRAM |
Temperature | Lower values (0.6-0.7) for more stable results, higher values (0.8-1.0) for more diverse results |
Top-k Parameter | Set to 0 to disable, or use values like 40-50 for more focused generation |
This implementation uses LLaVA-based image captioning models, optimized to provide fast and accurate image descriptions.
Model features:
- Free and open weights
- Uncensored content coverage
- Broad style diversity (digital art, photoreal, anime, etc.)
- Fast processing with bfloat16 precision
- High-quality caption generation
- Memory-efficient operation
- Consistent results across various image types
- Support for multiple caption styles
- Optimized for training diffusion models
The models are trained on diverse datasets, ensuring:
- Balanced representation across different image types
- High accuracy in various scenarios
- Robust performance with complex scenes
- Support for both SFW and NSFW content
- Equal coverage of different styles and concepts
- ✅ GGUF model support for better performance
- ✅ Support for 12 GGUF quantization formats (Q2_K to F16)
- ✅ Enhanced error handling and parameter validation
- ✅ Clean console output (no debug spam)
- ✅ Improved memory management options
- ✅ Batch processing support (Caption Tools)
- ✅ Performance optimizations for CPU processing (GGUF CPU mode)
- ✅ Enhanced batch processing features with memory management
- More caption style options and custom prompts
- Advanced fine-tuning support for custom models
- Real-time processing optimizations
- Multi-language caption support
- Integration with more vision-language models
- Original JoyCaption Model: fancyfeast - Creator of the base JoyCaption models
- GGUF Quantized Models: mradermacher - Provider of comprehensive GGUF quantized models (Q2_K to F16)
- Alternative GGUF Models: concedo - Provider of alternative GGUF models with vision projection
- ComfyUI Integration: 1038lab - Developer of this ComfyUI custom node
- LLaVA Framework: Microsoft Research - Base vision-language model architecture
- llama-cpp-python: abetlen - Python bindings for llama.cpp
This repository's code is released under the GPL-3.0 License.