
Commit 4100305

[Open-source internship] X_CLIP model application development (#1694)
1 parent 51e0f25 commit 4100305

1 file changed: +379 −0 lines changed
@@ -0,0 +1,379 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4e2bd86c-ed80-44c0-9b52-7d470c82bf8c",
"metadata": {},
"source": [
"## Install the required libraries\n",
"> Please have mindspore and mindnlp installed beforehand.\n",
"\n",
"First, install the required Python libraries:\n",
"- `-q` means quiet installation, which suppresses most of the `Requirement already satisfied` output"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c64baeda-7159-4436-b8ea-60b5a0a7cfa3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARNING: You are using pip version 21.0.1; however, version 24.2 is available.\n",
"You should consider upgrading via the '/home/ma-user/anaconda3/envs/MindSpore/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install -q huggingface_hub ipywidgets opencv-python"
]
},
{
"cell_type": "markdown",
"id": "ca7abaa5-f492-4028-a7b2-d4d0fa0882e1",
"metadata": {},
"source": [
"## Set environment variables\n",
"Point Hugging Face to a mirror site in mainland China to speed up downloads:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6ea48e86-02ac-4259-813e-b6a26a739f95",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"HF_ENDPOINT\"] = \"https://hf-mirror.com\" "
]
},
{
"cell_type": "markdown",
"id": "7bf038e6-af73-4245-9070-6caf552fecfd",
"metadata": {},
"source": [
"## Download and load the video\n",
"- Use `huggingface_hub` to download the video file\n",
"- Use `ipywidgets` to display the video"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "00c9d670-4aa2-473d-9d3f-334c0a253bb0",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d493073379764beca81e0eebb2229939",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Video(value=b'\\x00\\x00\\x00 ftypisom\\x00\\x00\\x02\\x00isomiso2avc1mp41\\x00\\x00\\x00\\x08free...', width='500')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from huggingface_hub import hf_hub_download\n",
"from ipywidgets import Video\n",
"\n",
"file_path = hf_hub_download(\n",
"    repo_id=\"nielsr/video-demo\", filename=\"eating_spaghetti.mp4\", repo_type=\"dataset\"\n",
")\n",
"Video.from_file(file_path, width=500)"
]
},
{
"cell_type": "markdown",
"id": "ea2e6ad0-bf9c-41f4-974a-0dd4fc082269",
"metadata": {},
"source": [
"## Define the sampling function\n",
"`sample_frame_indices` randomly selects a clip within the given frame range and returns the frame indices of that clip.\n",
"- `clip_len`: the number of frames to sample.\n",
"- `frame_sample_rate`: the frame sampling rate, which controls how densely frames are sampled.\n",
"- `seg_len`: the total number of frames in the video."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "be6ef4ac-ed52-461c-a325-3b6c3d01146e",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"np.random.seed(0)\n",
"\n",
"def sample_frame_indices(clip_len, frame_sample_rate, seg_len):\n",
"    # Length of the window actually needed at the given sampling rate\n",
"    converted_len = int(clip_len * frame_sample_rate)\n",
"    # Pick a random end index\n",
"    end_idx = np.random.randint(converted_len, seg_len)\n",
"    # Derive the start index\n",
"    start_idx = end_idx - converted_len\n",
"    # Use np.linspace to generate clip_len evenly spaced indices between start and end\n",
"    indices = np.linspace(start_idx, end_idx, num=clip_len)\n",
"    # Use np.clip to keep all indices in the valid range, then convert to integers\n",
"    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)\n",
"    return indices"
]
},
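{
"cell_type": "markdown",
"id": "added-sampling-check-note",
"metadata": {},
"source": [
"### Optional: sanity-check the sampling function\n",
"An added illustration (not part of the original demo): for a hypothetical clip of 32 frames, `sample_frame_indices(clip_len=4, frame_sample_rate=2, seg_len=32)` should return 4 roughly evenly spaced indices inside a randomly chosen 8-frame window."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-sampling-check",
"metadata": {},
"outputs": [],
"source": [
"# Added sanity check: indices for a hypothetical 32-frame clip\n",
"sample_frame_indices(clip_len=4, frame_sample_rate=2, seg_len=32)"
]
},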
{
"cell_type": "markdown",
"id": "d55a7435-b7db-4291-b5fa-d16136e5f120",
"metadata": {},
"source": [
"## Read video frames\n",
"Read the selected frames with OpenCV."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b045662d-58a9-42e2-9af3-3202c9006089",
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"def read_video(file_path, indices):\n",
"    # Open the video file\n",
"    cap = cv2.VideoCapture(file_path)\n",
"    # List that will collect the frames\n",
"    frames = []\n",
"    # Iterate over the requested frame indices\n",
"    for idx in indices:\n",
"        # Move the capture to the requested frame position\n",
"        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)\n",
"        # Read that frame\n",
"        ret, frame = cap.read()\n",
"        # Append the frame after converting the channel order, since OpenCV returns BGR\n",
"        if ret:\n",
"            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert to RGB\n",
"            frames.append(frame)\n",
"    # Release the video capture object\n",
"    cap.release()\n",
"    # Convert the list of frames to a NumPy array and return it\n",
"    return np.array(frames)\n"
]
},
{
"cell_type": "markdown",
"id": "be45a1c6-467e-4861-bd41-a854663b34b6",
"metadata": {},
"source": [
"## Sample and read the video\n",
"Sample 8 frames and read them:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3bb1023a-7359-4788-9cef-f32fb2feef95",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8, 360, 640, 3)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cap = cv2.VideoCapture(file_path)\n",
"seg_len = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
"indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=seg_len)\n",
"video = read_video(file_path, indices)\n",
"video.shape"
]
},
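{
"cell_type": "markdown",
"id": "added-frame-preview-note",
"metadata": {},
"source": [
"### Optional: preview a sampled frame\n",
"An added illustration (not part of the original demo): since `video` is a NumPy array of RGB frames, a single frame can be rendered inline. This assumes Pillow is available in the environment (it is not installed by the pip cell above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-frame-preview",
"metadata": {},
"outputs": [],
"source": [
"# Added illustration: display the first sampled frame inline.\n",
"# Assumes Pillow is available; frames are already RGB (converted in read_video).\n",
"from PIL import Image\n",
"Image.fromarray(video[0])"
]
},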
{
"cell_type": "markdown",
"id": "3923ae67-85a9-4a92-966c-d25c21ce5949",
"metadata": {},
"source": [
"## Text-video matching with mindnlp\n",
"\n",
"Here we use `X-CLIP`, a framework that adapts language-image foundation models to general video recognition.\n",
"- The overall structure of `X-CLIP` is similar to `CLIP`: two encoders embed the text and the video separately, and classification is done by comparing these features.\n",
"- `X-CLIP` introduces a lightweight, plug-and-play cross-frame attention module to capture temporal information.\n",
"- In addition, the model uses video-specific prompts that generate discriminative visual prompts and improve classification. As a result, `X-CLIP` can effectively leverage pretrained language-image models for video recognition via zero-shot or few-shot learning, without extra data.\n",
"\n",
"Paper information:\n",
"> [**Expanding Language-Image Pretrained Models for General Video Recognition**](https://arxiv.org/abs/2208.02816)<br>\n",
"> accepted by ECCV 2022 as an oral presentation<br>\n",
"> Bolin Ni, [Houwen Peng](https://houwenpeng.com/), [Minghao Chen](https://silent-chen.github.io/), [Songyang Zhang](https://sy-zhang.github.io/), [Gaofeng Meng](https://people.ucas.ac.cn/~gfmeng), [Jianlong Fu](https://jianlong-fu.github.io/), [Shiming Xiang](https://people.ucas.ac.cn/~xiangshiming), [Haibin Ling](https://www3.cs.stonybrook.edu/~hling/)\n",
"\n",
"[[arxiv]](https://arxiv.org/abs/2208.02816)\n",
"[[slides]](https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/xclip-slides.pptx)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dfae1baa-0a63-4552-ac36-585896ac1f97",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.\n",
"  setattr(self, word, getattr(machar, word).flat[0])\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.\n",
"  return self._float_to_str(self.smallest_subnormal)\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.\n",
"  setattr(self, word, getattr(machar, word).flat[0])\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.\n",
"  return self._float_to_str(self.smallest_subnormal)\n",
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache /tmp/jieba.cache\n",
"Loading model cost 1.286 seconds.\n",
"Prefix dict has been built successfully.\n",
"/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/transformers/tokenization_utils_base.py:1526: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted, and will be then set to `False` by default. \n",
"  warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[MS_ALLOC_CONF]Runtime config: enable_vmm:True vmm_align_size:2MB\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[WARNING] CORE(90475,ffff864fc0b0,python):2024-10-24-20:59:31.816.322 [mindspore/core/utils/ms_context.cc:531] GetJitLevel] Set jit level to O2 for rank table startup method.\n"
]
}
],
"source": [
"from mindnlp.transformers import XCLIPProcessor, XCLIPModel\n",
"model_name = \"microsoft/xclip-base-patch32\"\n",
"processor = XCLIPProcessor.from_pretrained(model_name)\n",
"model = XCLIPModel.from_pretrained(model_name)"
]
},
{
"cell_type": "markdown",
"id": "e820ebf1-67d6-4c54-accf-f490358b30e0",
"metadata": {},
"source": [
"## Set the prompts and build the inputs"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0d0d94f3-94e7-4624-9860-53dece33b536",
"metadata": {},
"outputs": [],
"source": [
"inputs = processor(text=[\"playing sports\", \"eating spaghetti\", \"go shopping\"], videos=list(video), return_tensors=\"ms\")"
]
},
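{
"cell_type": "markdown",
"id": "added-two-encoders-note",
"metadata": {},
"source": [
"### Optional: query the two encoders separately\n",
"As described above, X-CLIP keeps separate text and video encoders. The cell below is an added sketch, assuming mindnlp's `XCLIPModel` mirrors the Hugging Face `get_text_features` / `get_video_features` API; it is not part of the original demo."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-two-encoders",
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: embed the text prompts and the video clip separately.\n",
"# Assumes mindnlp mirrors the Hugging Face X-CLIP API (get_text_features / get_video_features).\n",
"text_embeds = model.get_text_features(input_ids=inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"])\n",
"video_embeds = model.get_video_features(pixel_values=inputs[\"pixel_values\"])\n",
"text_embeds.shape, video_embeds.shape"
]
},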
{
"cell_type": "markdown",
"id": "12e3b2cd-5f64-4dc8-bd50-b0e183322708",
"metadata": {},
"source": [
"## Forward pass"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8fec4c69-b393-4f7a-9174-da1baf49f8d2",
"metadata": {},
"outputs": [],
"source": [
"outputs = model(**inputs)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c358ec73-672f-412d-ac3a-bf749d52ff8a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1, 3], dtype=Float32, value=\n",
"[[ 1.26835327e+01, 2.11186066e+01, 1.28016310e+01]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logits_per_video = outputs.logits_per_video  # video-text similarity scores\n",
"logits_per_video"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a1b9e3fa-deac-4004-9961-813e51dd6da0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1, 3], dtype=Float32, value=\n",
"[[ 2.17016917e-04, 9.99538779e-01, 2.44221010e-04]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We can use softmax to obtain the label probabilities\n",
"from mindspore import ops\n",
"ops.softmax(logits_per_video, 1)"
]
},
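{
"cell_type": "markdown",
"id": "added-label-mapping-note",
"metadata": {},
"source": [
"### Map the probabilities back to the prompts\n",
"An added convenience step (not part of the original demo): pick the highest-probability prompt. The label list below simply repeats the prompts passed to the processor above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-label-mapping",
"metadata": {},
"outputs": [],
"source": [
"# Added convenience step: report the best-matching prompt\n",
"labels = [\"playing sports\", \"eating spaghetti\", \"go shopping\"]\n",
"probs = ops.softmax(logits_per_video, 1).asnumpy()[0]\n",
"best = int(np.argmax(probs))\n",
"print(f\"Predicted action: {labels[best]} (probability {probs[best]:.4f})\")"
]
}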
],
"metadata": {
"kernelspec": {
"display_name": "MindSpore",
"language": "python",
"name": "mindspore"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
