
Commit 4100305

[Open-source internship] X_CLIP model application development (#1694)
1 parent 51e0f25 commit 4100305

1 file changed: +379 −0 lines changed
@@ -0,0 +1,379 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4e2bd86c-ed80-44c0-9b52-7d470c82bf8c",
"metadata": {},
"source": [
"## Install the required libraries\n",
"> Please have mindspore and mindnlp installed beforehand.\n",
"\n",
"First, install the required Python libraries:\n",
"- `-q` means quiet installation, which suppresses most of the `Requirement already satisfied` output"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c64baeda-7159-4436-b8ea-60b5a0a7cfa3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARNING: You are using pip version 21.0.1; however, version 24.2 is available.\n",
"You should consider upgrading via the '/home/ma-user/anaconda3/envs/MindSpore/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install -q huggingface_hub ipywidgets opencv-python"
]
},
{
"cell_type": "markdown",
"id": "ca7abaa5-f492-4028-a7b2-d4d0fa0882e1",
"metadata": {},
"source": [
"## Set environment variables\n",
"Point Hugging Face to a mirror site in mainland China to speed up downloads:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6ea48e86-02ac-4259-813e-b6a26a739f95",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"HF_ENDPOINT\"] = \"https://hf-mirror.com\" "
]
},
{
"cell_type": "markdown",
"id": "7bf038e6-af73-4245-9070-6caf552fecfd",
"metadata": {},
"source": [
"## Download and load the video\n",
"- Use `huggingface_hub` to download the video file\n",
"- Use `ipywidgets` to display the video"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "00c9d670-4aa2-473d-9d3f-334c0a253bb0",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d493073379764beca81e0eebb2229939",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Video(value=b'\\x00\\x00\\x00 ftypisom\\x00\\x00\\x02\\x00isomiso2avc1mp41\\x00\\x00\\x00\\x08free...', width='500')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from huggingface_hub import hf_hub_download\n",
"from ipywidgets import Video\n",
"\n",
"file_path = hf_hub_download(\n",
"    repo_id=\"nielsr/video-demo\", filename=\"eating_spaghetti.mp4\", repo_type=\"dataset\"\n",
")\n",
"Video.from_file(file_path, width=500)"
]
},
{
"cell_type": "markdown",
"id": "ea2e6ad0-bf9c-41f4-974a-0dd4fc082269",
"metadata": {},
"source": [
"## Define the sampling function\n",
"`sample_frame_indices` randomly selects a clip within the given frame range and returns the frame indices of that clip.\n",
"- `clip_len`: the number of frames to sample.\n",
"- `frame_sample_rate`: the frame sampling rate, which controls how densely frames are sampled.\n",
"- `seg_len`: the total number of frames in the video."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "be6ef4ac-ed52-461c-a325-3b6c3d01146e",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"np.random.seed(0)\n",
"\n",
"def sample_frame_indices(clip_len, frame_sample_rate, seg_len):\n",
"    # Length of the window actually needed at the given sampling rate\n",
"    converted_len = int(clip_len * frame_sample_rate)\n",
"    # Pick a random end index\n",
"    end_idx = np.random.randint(converted_len, seg_len)\n",
"    # Derive the start index\n",
"    start_idx = end_idx - converted_len\n",
"    # Use np.linspace to generate clip_len evenly spaced indices between start and end\n",
"    indices = np.linspace(start_idx, end_idx, num=clip_len)\n",
"    # Use np.clip to keep all indices in the valid range, then convert to integers\n",
"    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)\n",
"    return indices"
]
},
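{
"cell_type": "markdown",
"id": "added-sampling-check-note",
"metadata": {},
"source": [
"### Optional: sanity-check the sampling function\n",
"An added illustration (not part of the original demo): for a hypothetical clip of 32 frames, `sample_frame_indices(clip_len=4, frame_sample_rate=2, seg_len=32)` should return 4 roughly evenly spaced indices inside a randomly chosen 8-frame window."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-sampling-check",
"metadata": {},
"outputs": [],
"source": [
"# Added sanity check: indices for a hypothetical 32-frame clip\n",
"sample_frame_indices(clip_len=4, frame_sample_rate=2, seg_len=32)"
]
},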
{
"cell_type": "markdown",
"id": "d55a7435-b7db-4291-b5fa-d16136e5f120",
"metadata": {},
"source": [
"## Read video frames\n",
"Read the selected frames with OpenCV."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b045662d-58a9-42e2-9af3-3202c9006089",
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"def read_video(file_path, indices):\n",
"    # Open the video file\n",
"    cap = cv2.VideoCapture(file_path)\n",
"    # List that will collect the frames\n",
"    frames = []\n",
"    # Iterate over the requested frame indices\n",
"    for idx in indices:\n",
"        # Move the capture to the requested frame position\n",
"        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)\n",
"        # Read that frame\n",
"        ret, frame = cap.read()\n",
"        # Append the frame after converting the channel order, since OpenCV returns BGR\n",
"        if ret:\n",
"            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert to RGB\n",
"            frames.append(frame)\n",
"    # Release the video capture object\n",
"    cap.release()\n",
"    # Convert the list of frames to a NumPy array and return it\n",
"    return np.array(frames)\n"
]
},
{
"cell_type": "markdown",
"id": "be45a1c6-467e-4861-bd41-a854663b34b6",
"metadata": {},
"source": [
"## Sample and read the video\n",
"Sample 8 frames and read them:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3bb1023a-7359-4788-9cef-f32fb2feef95",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8, 360, 640, 3)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cap = cv2.VideoCapture(file_path)\n",
"seg_len = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
"indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=seg_len)\n",
"video = read_video(file_path, indices)\n",
"video.shape"
]
},
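{
"cell_type": "markdown",
"id": "added-frame-preview-note",
"metadata": {},
"source": [
"### Optional: preview a sampled frame\n",
"An added illustration (not part of the original demo): since `video` is a NumPy array of RGB frames, a single frame can be rendered inline. This assumes Pillow is available in the environment (it is not installed by the pip cell above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-frame-preview",
"metadata": {},
"outputs": [],
"source": [
"# Added illustration: display the first sampled frame inline.\n",
"# Assumes Pillow is available; frames are already RGB (converted in read_video).\n",
"from PIL import Image\n",
"Image.fromarray(video[0])"
]
},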
{
"cell_type": "markdown",
"id": "3923ae67-85a9-4a92-966c-d25c21ce5949",
"metadata": {},
"source": [
"## Text-video matching with mindnlp\n",
"\n",
"Here we use `X-CLIP`, a framework that adapts language-image foundation models to general video recognition.\n",
"- The overall structure of `X-CLIP` is similar to `CLIP`: two encoders embed the text and the video separately, and classification is done by comparing these features.\n",
"- `X-CLIP` introduces a lightweight, plug-and-play cross-frame attention module to capture temporal information.\n",
"- In addition, the model uses video-specific prompts that generate discriminative visual prompts and improve classification. As a result, `X-CLIP` can effectively leverage pretrained language-image models for video recognition via zero-shot or few-shot learning, without extra data.\n",
"\n",
"Paper information:\n",
"> [**Expanding Language-Image Pretrained Models for General Video Recognition**](https://arxiv.org/abs/2208.02816)<br>\n",
"> accepted by ECCV 2022 as an oral presentation<br>\n",
"> Bolin Ni, [Houwen Peng](https://houwenpeng.com/), [Minghao Chen](https://silent-chen.github.io/), [Songyang Zhang](https://sy-zhang.github.io/), [Gaofeng Meng](https://people.ucas.ac.cn/~gfmeng), [Jianlong Fu](https://jianlong-fu.github.io/), [Shiming Xiang](https://people.ucas.ac.cn/~xiangshiming), [Haibin Ling](https://www3.cs.stonybrook.edu/~hling/)\n",
"\n",
"[[arxiv]](https://arxiv.org/abs/2208.02816)\n",
"[[slides]](https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/xclip-slides.pptx)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dfae1baa-0a63-4552-ac36-585896ac1f97",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.\n",
"  setattr(self, word, getattr(machar, word).flat[0])\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.\n",
"  return self._float_to_str(self.smallest_subnormal)\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.\n",
"  setattr(self, word, getattr(machar, word).flat[0])\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.\n",
"  return self._float_to_str(self.smallest_subnormal)\n",
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache /tmp/jieba.cache\n",
"Loading model cost 1.286 seconds.\n",
"Prefix dict has been built successfully.\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/transformers/tokenization_utils_base.py:1526: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted, and will be then set to `False` by default. \n",
"  warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[MS_ALLOC_CONF]Runtime config: enable_vmm:True vmm_align_size:2MB\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[WARNING] CORE(90475,ffff864fc0b0,python):2024-10-24-20:59:31.816.322 [mindspore/core/utils/ms_context.cc:531] GetJitLevel] Set jit level to O2 for rank table startup method.\n"
]
}
],
"source": [
"from mindnlp.transformers import XCLIPProcessor, XCLIPModel\n",
"model_name = \"microsoft/xclip-base-patch32\"\n",
"processor = XCLIPProcessor.from_pretrained(model_name)\n",
"model = XCLIPModel.from_pretrained(model_name)"
]
},
{
"cell_type": "markdown",
"id": "e820ebf1-67d6-4c54-accf-f490358b30e0",
"metadata": {},
"source": [
"## Set the prompts and build the inputs"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0d0d94f3-94e7-4624-9860-53dece33b536",
"metadata": {},
"outputs": [],
"source": [
"inputs = processor(text=[\"playing sports\", \"eating spaghetti\", \"go shopping\"], videos=list(video), return_tensors=\"ms\")"
]
},
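{
"cell_type": "markdown",
"id": "added-two-encoders-note",
"metadata": {},
"source": [
"### Optional: query the two encoders separately\n",
"As described above, X-CLIP keeps separate text and video encoders. The cell below is an added sketch, assuming mindnlp's `XCLIPModel` mirrors the Hugging Face `get_text_features` / `get_video_features` API; it is not part of the original demo."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-two-encoders",
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: embed the text prompts and the video clip separately.\n",
"# Assumes mindnlp mirrors the Hugging Face X-CLIP API (get_text_features / get_video_features).\n",
"text_embeds = model.get_text_features(input_ids=inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"])\n",
"video_embeds = model.get_video_features(pixel_values=inputs[\"pixel_values\"])\n",
"text_embeds.shape, video_embeds.shape"
]
},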
{
"cell_type": "markdown",
"id": "12e3b2cd-5f64-4dc8-bd50-b0e183322708",
"metadata": {},
"source": [
"## Forward pass"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8fec4c69-b393-4f7a-9174-da1baf49f8d2",
"metadata": {},
"outputs": [],
"source": [
"outputs = model(**inputs)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c358ec73-672f-412d-ac3a-bf749d52ff8a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1, 3], dtype=Float32, value=\n",
"[[ 1.26835327e+01, 2.11186066e+01, 1.28016310e+01]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logits_per_video = outputs.logits_per_video  # video-text similarity scores\n",
"logits_per_video"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a1b9e3fa-deac-4004-9961-813e51dd6da0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1, 3], dtype=Float32, value=\n",
"[[ 2.17016917e-04, 9.99538779e-01, 2.44221010e-04]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We can use softmax to obtain the label probabilities\n",
"from mindspore import ops\n",
"ops.softmax(logits_per_video, 1)"
]
},
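{
"cell_type": "markdown",
"id": "added-label-mapping-note",
"metadata": {},
"source": [
"### Map the probabilities back to the prompts\n",
"An added convenience step (not part of the original demo): pick the highest-probability prompt. The label list below simply repeats the prompts passed to the processor above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-label-mapping",
"metadata": {},
"outputs": [],
"source": [
"# Added convenience step: report the best-matching prompt\n",
"labels = [\"playing sports\", \"eating spaghetti\", \"go shopping\"]\n",
"probs = ops.softmax(logits_per_video, 1).asnumpy()[0]\n",
"best = int(np.argmax(probs))\n",
"print(f\"Predicted action: {labels[best]} (probability {probs[best]:.4f})\")"
]
}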
],
"metadata": {
"kernelspec": {
"display_name": "MindSpore",
"language": "python",
"name": "mindspore"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
