Open-source AI-generated video detection model
Please see the blog post for more details. The model weights are available in our Hugging Face Hub repository.
Install the package with its dependencies:

```bash
pip install cakelens-v5
```
The package provides a command-line tool, `cakelens`, for easy video detection:

```bash
# Using Hugging Face Hub (recommended)
cakelens video.mp4

# Using local model file
cakelens video.mp4 --model-path model.pt
```
- `--model-path`: Path to the model checkpoint file (optional - will load from Hugging Face Hub if not provided)
- `--batch-size`: Batch size for inference (default: 1)
- `--device`: Device to run inference on (`cpu`, `cuda`, `mps`) - auto-detected if not specified
- `--verbose, -v`: Enable verbose logging
- `--output`: Output file path for results (JSON format)
```bash
# Basic detection (uses Hugging Face Hub)
cakelens video.mp4

# Using local model file
cakelens video.mp4 --model-path model.pt

# With custom batch size and device
cakelens video.mp4 --batch-size 4 --device cuda

# Save results to JSON file
cakelens video.mp4 --output results.json

# Verbose output
cakelens video.mp4 --verbose
```
The tool provides:

- Real-time prediction percentages for each label
- Final mean predictions across all frames
- Option to save results in JSON format
- Detailed logging (with the `--verbose` flag)
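The exact schema of the JSON written by `--output` is not documented here, so the sketch below makes no assumptions about its fields; it simply loads the file and pretty-prints it so you can inspect the structure yourself:

```python
import json
import pathlib

# Load the results file produced by `cakelens video.mp4 --output results.json`
# and pretty-print it. The field layout depends on the cakelens version,
# so we deliberately avoid assuming any particular keys.
results = json.loads(pathlib.Path("results.json").read_text())
print(json.dumps(results, indent=2))
```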
You can also use the detection functionality programmatically in your Python code:
```python
import pathlib

import torch  # only needed if you load a local checkpoint

from cakelens.detect import Detector
from cakelens.model import Model

# Create the model and load the weights from Hugging Face Hub
model = Model()
model.load_from_huggingface_hub()
# or, if you have a local model file:
# model.load_state_dict(torch.load("model.pt")["model_state_dict"])

# Create the detector
detector = Detector(
    model=model,
    batch_size=1,
    device="cpu",  # or "cuda", "mps", or None for auto-detection
)

# Run detection
video_path = pathlib.Path("video.mp4")
verdict = detector.detect(video_path)

# Access results
print(f"Video: {verdict.video_filepath}")
print(f"Frame count: {verdict.frame_count}")
print("Predictions:")
for i, prob in enumerate(verdict.predictions):
    print(f"  Label {i}: {prob * 100:.2f}%")
```
The model can detect the following labels:
- AI_GEN: Is the video AI-generated or not?
- ANIME_2D: Is the video in 2D anime style?
- ANIME_3D: Is the video in 3D anime style?
- VIDEO_GAME: Does the video look like a video game?
- KLING: Is the video generated by Kling?
- HIGGSFIELD: Is the video generated by Higgsfield?
- WAN: Is the video generated by Wan?
- MIDJOURNEY: Is the video generated using images from Midjourney?
- HAILUO: Is the video generated by Hailuo?
- RAY: Is the video generated by Ray?
- VEO: Is the video generated by Veo?
- RUNWAY: Is the video generated by Runway?
- SORA: Is the video generated by Sora?
- CHATGPT: Is the video generated using images from ChatGPT?
- PIKA: Is the video generated by Pika?
- HUNYUAN: Is the video generated by Hunyuan?
- VIDU: Is the video generated by Vidu?
Note: The AI_GEN label is the most accurate as it has the most training data. Other labels have limited training data and may be less accurate.
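To attach names to the raw prediction vector, you can zip it with the label list above. The ordering below is an assumption taken from this README; verify it against the label definitions in the cakelens package before relying on it:

```python
# Assumed label order, copied from the list above; confirm it against the
# cakelens package's own label definitions before relying on it.
LABELS = [
    "AI_GEN", "ANIME_2D", "ANIME_3D", "VIDEO_GAME", "KLING", "HIGGSFIELD",
    "WAN", "MIDJOURNEY", "HAILUO", "RAY", "VEO", "RUNWAY", "SORA",
    "CHATGPT", "PIKA", "HUNYUAN", "VIDU",
]

# `verdict` comes from detector.detect(...) as in the example above.
probs = dict(zip(LABELS, verdict.predictions))
for name, prob in probs.items():
    print(f"{name}: {prob * 100:.2f}%")
```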
The PR curve of the model is shown below:
At threshold 0.5, the model has a precision of 0.77 and a recall of 0.74: of the videos it flags as AI-generated, 77% actually are, and it catches 74% of all AI-generated videos. The dataset comprises 5,093 videos for training and 498 videos for validation. Please note that the model is not perfect and may make mistakes.
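A minimal thresholding sketch, building on the `probs` dictionary from the snippet above. The 0.5 cutoff matches the operating point quoted for the PR curve; adjust it if you need a different precision/recall trade-off:

```python
# Apply the 0.5 operating point to the AI_GEN probability.
# Raising the threshold trades recall for precision, and vice versa.
AI_GEN_THRESHOLD = 0.5

is_ai_generated = probs["AI_GEN"] >= AI_GEN_THRESHOLD
print(f"AI-generated: {is_ai_generated} (p={probs['AI_GEN']:.2f})")
```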