Stream Data Crawler

Crawl audio content from VTubers and streamers on YouTube and Bilibili.

[EN|ZH]

Usage

Before crawling raw data or generating scripts from the audio content, adjust the VarMap.py file in the corresponding directory.

----- Stream-Data-Crawler
    ----- raw-data
        ----- crawlers
            - __init__.py
            - BiliCrawl.py
            - TubeCrawl.py
        - BiliDataCollect.py
        - TubeDataCollect.py
        - VarMap.py
    ----- script
        - ScriptRecog.py
        - VarMap.py

This repository covers raw audio data collection and audio-to-text transcription.

Raw audio data collection (crawler)
Audio transcript (FunASR)

Raw Audio Data Collection (Crawler)

/raw-data/VarMap.py stores the save address, URLs, and headers required for crawling.

Variable Name	Functionality
AUDIO_SAVE_ADDRESS	Address for saving audio files
BILIBILI_URL	API URL of a playlist on bilibili.com
BILIBILI_HEADER	Header for requesting audio content from Bilibili; usually no need to adjust
TUBE_CSV_ADDRESS	Address of the CSV file storing YouTube playlist and channel URLs
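As an illustration, a filled-in /raw-data/VarMap.py might look like the sketch below. Every value is a placeholder, and the shape of BILIBILI_HEADER (a dict of HTTP headers) is an assumption rather than something confirmed by the repository, so compare it against the file that ships with the code before editing.

```python
# Illustrative sketch of raw-data/VarMap.py. All values are placeholders.

# Directory where downloaded audio files are saved (hypothetical path).
AUDIO_SAVE_ADDRESS = "./data/audio"

# Request URL of the "playurl" network call, copied from the browser.
# Left as a placeholder on purpose: paste your own URL here.
BILIBILI_URL = "<paste the playurl request URL here>"

# Assumed to be a dict of HTTP request headers; these keys are common
# choices for browser-like requests, not values taken from the repository.
BILIBILI_HEADER = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.bilibili.com/",
}

# CSV file listing YouTube channel and playlist URLs (hypothetical path).
TUBE_CSV_ADDRESS = "./youtube_sources.csv"
```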

Run BiliDataCollect.py and TubeDataCollect.py to collect the content from bilibili.com and YouTube.

TubeDataCollect.py collects content from both channels and playlists. Create a CSV file that stores each URL and its type (channel or playlist). See the example CSV.
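A minimal sketch of such a CSV and a way to validate it before crawling is shown below. The column names `url` and `type`, the example URLs, and the helper `parse_sources` are all hypothetical; the example CSV in the repository is the authority on the real layout.

```python
import csv
import io


def parse_sources(csv_text):
    """Parse CSV text into (url, type) pairs, rejecting unknown types."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sources = []
    for row in reader:
        kind = row["type"].strip().lower()
        if kind not in ("channel", "playlist"):
            raise ValueError(f"unknown source type: {kind!r}")
        sources.append((row["url"].strip(), kind))
    return sources


# Assumed layout: one header row, then one source per line.
example_csv = """url,type
https://www.youtube.com/@example-channel,channel
https://www.youtube.com/playlist?list=EXAMPLE_ID,playlist
"""

sources = parse_sources(example_csv)
for url, kind in sources:
    print(kind, url)
```

Checking the type column up front means a typo such as `playlists` fails immediately instead of partway through a long crawl.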

BiliDataCollect.py only collects content from playlists. To get the API address, inspect the webpage (open the browser's developer tools) and then click a video in the playlist. Find the network request named "playurl" and copy its request URL as the value of BILIBILI_URL.

** Possibly because of the anti-crawler features of bilibili.com, the requests made in BiliDataCollect.py might not collect all the contents in the playlist. If items are missing, run the file again until all the contents are collected.
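If rerunning by hand becomes tedious, the same idea can be expressed as a retry loop. The sketch below is not how BiliDataCollect.py works; `collect_until_complete` and the `fetch_once` callable are hypothetical names, and the loop assumes you can list the item ids you expect and tell which ones were saved.

```python
import time


def collect_until_complete(fetch_once, expected_ids, max_rounds=5, delay_seconds=0.0):
    """Call fetch_once repeatedly until every expected id is collected.

    fetch_once(missing_ids) is a hypothetical callable that tries to
    download the given ids and returns the ids it actually saved.
    Returns a (collected, still_missing) pair of sets.
    """
    expected = set(expected_ids)
    collected = set()
    for round_index in range(max_rounds):
        missing = expected - collected
        if not missing:
            break
        # Pause between rounds (not before the first) to avoid hammering the site.
        if round_index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        # Only count ids that were actually requested in this round.
        collected |= set(fetch_once(missing)) & missing
    return collected, expected - collected
```

The function returns the ids that are still missing after `max_rounds`, so a caller can log them instead of looping forever when an item is permanently unavailable.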

Audio Transcript (FunASR)

Audio transcription is built on FunASR. You can check the original repository at https://github.com/modelscope/FunASR.

/script/VarMap.py stores the saving address and model directory required for AI-based audio-to-text recognition.

Variable Name	Functionality
AUDIO_SAVE_ADDRESS	Address of the saved audio files
SCRIPT_SAVE_ADDRESS	Address for saving audio transcripts
MODEL_DIRECTORY	Directory of the ASR models. The directory can be local or on the Hugging Face or ModelScope hub

Run ScriptRecog.py to transcribe the audio contents to text. For now, the model must be loaded and deployed locally, so please check the available models and choose a suitable one.
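For orientation, a transcription loop over a folder of audio files could look like the sketch below. It is not the code in ScriptRecog.py: the helper names, the list of audio suffixes, and the one-text-file-per-audio-file layout are assumptions. It relies on FunASR's `AutoModel` and its `generate` method returning a list of dicts with a `"text"` field, which matches the FunASR documentation at the time of writing but should be checked against your installed version.

```python
from pathlib import Path

# Assumed set of audio formats; adjust to what the crawler actually saves.
AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a", ".flac"}


def load_model(model_directory):
    """Load an ASR model with FunASR (requires the funasr package)."""
    # Imported lazily so transcribe_directory can be used without FunASR installed.
    from funasr import AutoModel
    return AutoModel(model=model_directory)


def transcribe_directory(model, audio_dir, script_dir):
    """Transcribe each audio file in audio_dir into a .txt file in script_dir."""
    out_dir = Path(script_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for audio_path in sorted(Path(audio_dir).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # skip anything that is not an audio file
        results = model.generate(input=str(audio_path))
        # Join the text of all returned segments into one transcript.
        text = " ".join(item.get("text", "") for item in results)
        out_path = out_dir / (audio_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

With the values from /script/VarMap.py, a call would read `transcribe_directory(load_model(MODEL_DIRECTORY), AUDIO_SAVE_ADDRESS, SCRIPT_SAVE_ADDRESS)`. Passing the model in as an argument keeps the loop independent of how the model is loaded.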

Catalog

Cloud model deployment support.
Google Colab inference example.
