# Robust Video Matting (RVM)

<p align="center">English | <a href="README_zh_Hans.md">中文</a></p>

Official repository for the paper [Robust High-Resolution Video Matting with Temporal Guidance](https://peterl1n.github.io/RobustVideoMatting/). RVM is specifically designed for robust human video matting. Unlike existing neural models that process frames as independent images, RVM uses a recurrent neural network to process videos with temporal memory. RVM can perform matting in real time on arbitrary videos without additional inputs. It achieves **4K 76FPS** and **HD 104FPS** on an Nvidia GTX 1080 Ti GPU. The project was developed at [ByteDance Inc.](https://www.bytedance.com/)

<br>

## News

* [Aug 25 2021] Source code and pretrained models are published.
* [Jul 27 2021] Paper is accepted by WACV 2022.

<br>

## Showreel
Watch the showreel video ([YouTube](https://youtu.be/Jvzltozpbpk), [Bilibili](https://www.bilibili.com/video/BV1Z3411B7g7/)) to see the model's performance.

<p align="center">
    <a href="https://youtu.be/Jvzltozpbpk">
        <img src="documentation/image/showreel.gif">
    </a>
</p>

All footage in the video is available in [Google Drive](https://drive.google.com/drive/folders/1VFnWwuu-YXDKG-N6vcjK_nL7YZMFapMU?usp=sharing) and [Baidu Pan](https://pan.baidu.com/s/1igMteDwN5rO1Sn7YIhBlvQ) (code: tb3w).

<br>

## Demo
* [Webcam Demo](https://peterl1n.github.io/RobustVideoMatting/#/demo): Run the model live in your browser. Visualize recurrent states.
* [Colab Demo](https://colab.research.google.com/drive/10z-pNKRnVNsp0Lq9tH1J_XPZ7CBC_uHm?usp=sharing): Test our model on your own videos with a free GPU.

<br>

## Download

We recommend the MobileNetV3 models for most use cases. The ResNet50 models are larger variants with small performance improvements. Our models are available for various inference frameworks. See the [inference documentation](documentation/inference.md) for more instructions.

<table>
    <thead>
        <tr>
            <td>Framework</td>
            <td>Download</td>
            <td>Notes</td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>PyTorch</td>
            <td>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3.pth">rvm_mobilenetv3.pth</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50.pth">rvm_resnet50.pth</a>
            </td>
            <td>
                Official weights for PyTorch. <a href="documentation/inference.md#pytorch">Doc</a>
            </td>
        </tr>
        <tr>
            <td>TorchHub</td>
            <td>
                Nothing to download.
            </td>
            <td>
                Easiest way to use our model in your PyTorch project. <a href="documentation/inference.md#torchhub">Doc</a>
            </td>
        </tr>
        <tr>
            <td>TorchScript</td>
            <td>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_fp32.torchscript">rvm_mobilenetv3_fp32.torchscript</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_fp16.torchscript">rvm_mobilenetv3_fp16.torchscript</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50_fp32.torchscript">rvm_resnet50_fp32.torchscript</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50_fp16.torchscript">rvm_resnet50_fp16.torchscript</a>
            </td>
            <td>
                For inference on mobile, consider exporting int8 quantized models yourself. <a href="documentation/inference.md#torchscript">Doc</a>
            </td>
        </tr>
        <tr>
            <td>ONNX</td>
            <td>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_fp32.onnx">rvm_mobilenetv3_fp32.onnx</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_fp16.onnx">rvm_mobilenetv3_fp16.onnx</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50_fp32.onnx">rvm_resnet50_fp32.onnx</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50_fp16.onnx">rvm_resnet50_fp16.onnx</a>
            </td>
            <td>
                Tested on ONNX Runtime with CPU and CUDA backends. Provided models use opset 12. <a href="documentation/inference.md#onnx">Doc</a>, <a href="https://github.com/PeterL1n/RobustVideoMatting/tree/onnx">Exporter</a>.
            </td>
        </tr>
        <tr>
            <td>TensorFlow</td>
            <td>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_tf.zip">rvm_mobilenetv3_tf.zip</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50_tf.zip">rvm_resnet50_tf.zip</a>
            </td>
            <td>
                TensorFlow 2 SavedModel. <a href="documentation/inference.md#tensorflow">Doc</a>
            </td>
        </tr>
        <tr>
            <td>TensorFlow.js</td>
            <td>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_tfjs_int8.zip">rvm_mobilenetv3_tfjs_int8.zip</a>
            </td>
            <td>
                Run the model on the web. <a href="https://peterl1n.github.io/RobustVideoMatting/#/demo">Demo</a>, <a href="https://github.com/PeterL1n/RobustVideoMatting/tree/tfjs">Starter Code</a>
            </td>
        </tr>
        <tr>
            <td>CoreML</td>
            <td>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_1280x720_s0.375_fp16.mlmodel">rvm_mobilenetv3_1280x720_s0.375_fp16.mlmodel</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_1280x720_s0.375_int8.mlmodel">rvm_mobilenetv3_1280x720_s0.375_int8.mlmodel</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_1920x1080_s0.25_fp16.mlmodel">rvm_mobilenetv3_1920x1080_s0.25_fp16.mlmodel</a><br>
                <a href="https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3_1920x1080_s0.25_int8.mlmodel">rvm_mobilenetv3_1920x1080_s0.25_int8.mlmodel</a>
            </td>
            <td>
                CoreML does not support dynamic resolution. You can export other resolutions yourself. Models require iOS 13+. <code>s</code> denotes <code>downsample_ratio</code>. <a href="documentation/inference.md#coreml">Doc</a>, <a href="https://github.com/PeterL1n/RobustVideoMatting/tree/coreml">Exporter</a>
            </td>
        </tr>
    </tbody>
</table>

All models are available in [Google Drive](https://drive.google.com/drive/folders/1pBsG-SCTatv-95SnEuxmnvvlRx208VKj?usp=sharing) and [Baidu Pan](https://pan.baidu.com/s/1puPSxQqgBFOVpW4W7AolkA) (code: gym7).
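
As a quick reference for the ONNX models above, here is a minimal ONNX Runtime sketch. The input and output names it uses (`src`, `r1i`–`r4i`, `downsample_ratio`, `fgr`, `pha`, `r1o`–`r4o`) are assumptions on our part; verify them against the exported graph and the [inference documentation](documentation/inference.md#onnx).

```python
# Minimal ONNX Runtime sketch. Input/output names are assumed; check your model.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('rvm_mobilenetv3_fp32.onnx')    # CPU by default; add a CUDA provider if available.
rec = [np.zeros([1, 1, 1, 1], dtype=np.float32)] * 4        # Initial recurrent states.
downsample_ratio = np.array([0.25], dtype=np.float32)       # Adjust based on your resolution.

src = np.random.rand(1, 3, 1080, 1920).astype(np.float32)   # Stand-in for a real RGB frame in 0~1.
fgr, pha, *rec = sess.run(None, {
    'src': src,
    'r1i': rec[0], 'r2i': rec[1], 'r3i': rec[2], 'r4i': rec[3],
    'downsample_ratio': downsample_ratio,
})  # Feed the returned recurrent states back in on the next frame.
```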

<br>

## PyTorch Example

1. Install dependencies:
```sh
pip install -r requirements_inference.txt
```

2. Load the model:

```python
import torch
from model import MattingNetwork

model = MattingNetwork('mobilenetv3').eval().cuda()  # or "resnet50"
model.load_state_dict(torch.load('rvm_mobilenetv3.pth'))
```

3. To convert videos, we provide a simple conversion API:

```python
from inference import convert_video

convert_video(
    model,                           # The model, can be on any device (cpu or cuda).
    input_source='input.mp4',        # A video file or an image sequence directory.
    output_type='video',             # Choose "video" or "png_sequence".
    output_composition='output.mp4', # File path if video; directory path if png sequence.
    output_video_mbps=4,             # Output video mbps. Not needed for png sequence.
    downsample_ratio=None,           # A hyperparameter to adjust, or None for auto.
    seq_chunk=12,                    # Process n frames at once for better parallelism.
)
```

4. Or write your own inference code:
```python
import torch
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from inference_utils import VideoReader, VideoWriter

reader = VideoReader('input.mp4', transform=ToTensor())
writer = VideoWriter('output.mp4', frame_rate=30)

bgr = torch.tensor([.47, 1, .6]).view(3, 1, 1).cuda()  # Green background.
rec = [None] * 4                                        # Initial recurrent states.
downsample_ratio = 0.25                                 # Adjust based on your video.

with torch.no_grad():
    for src in DataLoader(reader):                              # RGB tensor normalized to 0 ~ 1.
        fgr, pha, *rec = model(src.cuda(), *rec, downsample_ratio)  # Cycle the recurrent states.
        com = fgr * pha + bgr * (1 - pha)                        # Composite to green background.
        writer.write(com)                                        # Write frame.
```

5. The models and converter API are also available through TorchHub.

```python
import torch

# Load the model.
model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3")  # or "resnet50"

# Converter API.
convert_video = torch.hub.load("PeterL1n/RobustVideoMatting", "converter")
```

Please see the [inference documentation](documentation/inference.md) for details on the `downsample_ratio` hyperparameter, more converter arguments, and more advanced usage.
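
If you want a rough starting point for `downsample_ratio` before reading the documentation, one heuristic (an assumption here, not necessarily the repo's exact auto rule) is to downsample so that the longer side lands around 512 px:

```python
# Hypothetical helper, not part of the repo: pick a starting downsample_ratio
# so the downsampled longer side is roughly 512 px, capped at 1.0.
def rough_downsample_ratio(h: int, w: int) -> float:
    return min(512 / max(h, w), 1.0)

print(rough_downsample_ratio(1080, 1920))  # ~0.27 for HD input
print(rough_downsample_ratio(2160, 3840))  # ~0.13 for 4K input
```

These values are roughly consistent with the settings used in the Speed section below (0.25 for HD, 0.125 for 4K).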

<br>

## Training and Evaluation

Please refer to the [training documentation](documentation/training.md) to train and evaluate your own model.

<br>

## Speed

Speed is measured with `inference_speed_test.py` for reference.

| GPU            | dType | HD (1920x1080) | 4K (3840x2160) |
| -------------- | ----- | -------------- | -------------- |
| RTX 3090       | FP16  | 172 FPS        | 154 FPS        |
| RTX 2060 Super | FP16  | 134 FPS        | 108 FPS        |
| GTX 1080 Ti    | FP32  | 104 FPS        | 74 FPS         |

* Note 1: HD uses `downsample_ratio=0.25`, 4K uses `downsample_ratio=0.125`. All tests use batch size 1 and frame chunk 1.
* Note 2: GPUs before the Turing architecture do not support FP16 inference, so the GTX 1080 Ti uses FP32.
* Note 3: We only measure tensor throughput. The provided video conversion script in this repo is expected to be much slower, because it does not utilize hardware video encoding/decoding and does not transfer tensors on parallel threads. If you are interested in implementing hardware video encoding/decoding in Python, please refer to [PyNvCodec](https://github.com/NVIDIA/VideoProcessingFramework).
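
To make Note 3 concrete, tensor throughput means timing only the forward pass on frames that already live on the GPU. Below is a minimal sketch of such a measurement (not the repo's `inference_speed_test.py`; resolution, dtype, and iteration counts are placeholders):

```python
# Minimal tensor-throughput sketch; weights are irrelevant for timing.
import time
import torch
from model import MattingNetwork

model = MattingNetwork('mobilenetv3').eval().cuda().half()  # FP16; use .float() on pre-Turing GPUs.
src = torch.rand(1, 3, 1080, 1920, device='cuda', dtype=torch.float16)  # Dummy HD frame.
rec = [None] * 4                 # Initial recurrent states.
downsample_ratio = 0.25          # 0.25 for HD, 0.125 for 4K.

with torch.no_grad():
    for i in range(110):
        if i == 10:              # Discard the first 10 iterations as warm-up.
            torch.cuda.synchronize()
            start = time.time()
        fgr, pha, *rec = model(src, *rec, downsample_ratio)
    torch.cuda.synchronize()
    print(f'{100 / (time.time() - start):.1f} FPS')
```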

<br>

## Project Members
* [Shanchuan Lin](https://www.linkedin.com/in/shanchuanlin/)
* [Linjie Yang](https://sites.google.com/site/linjieyang89/)
* [Imran Saleemi](https://www.linkedin.com/in/imran-saleemi/)
* [Soumyadip Sengupta](https://homes.cs.washington.edu/~soumya91/)