Hi, impressive work! How can I extract features from my own video-text dataset to fine-tune the model?