Note that to use this, you'll need both a YouTube API key and a Vertex AI API key. The former is required to fetch the transcripts, and the latter to summarize them into articles.
- build_index.py: Reads all .md files in the "chapters" folder and creates an index based on individual tags (i.e. tag + all articles with the tag)
- build_index_consolidated.py: Includes semantic grouping generated with an LLM for game dev purposes. Replacing the TAG_MAP allows you to group tags together in the index.
- channel.py: (Requires YouTube API key) Lists the videos on a channel. In some cases, you'll need to dig up the channel ID by looking into the channel page's source code. You can generally find it by searching for a capitalized "UC".
- chapterize.py: (Requires Vertex AI API key) Loops through the transcripts in the transcripts folder (generated by scrape.py), processes them with an LLM, and stores the results in the chapters folder. The model is currently gemini-2.5-pro-preview-05-06, chosen for its ability to handle large contexts and follow instructions.
- fetch_single.py: Fetches and prints the captions for a single video. The YouTube API can be flaky and sometimes doesn't return the captions. This uses the youtube_transcript_api package, which doesn't require a YouTube API key.
- fetch_via_list.py: (Requires YouTube API key) Takes in a list of URL + Title pairs (e.g. https://www.youtube.com/watch?v=12345678 Random Video) and attempts to fetch the captions. Essentially a retry mechanism.
- get_tags.py: Loops through the chapters to collect tags and outputs them in alphabetical order. Convenient if you want to copy-paste them to an LLM for grouping.
- scrape.py: (Requires YouTube API key) Fetches transcripts for a given channel and stores them under the transcripts folder, with a caption and source URL at the top of each file.
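
The channel ID lookup mentioned for channel.py can be sketched in a few lines. This is only an illustration, not part of the scripts above: the function name `find_channel_ids` is hypothetical, and the pattern assumes channel IDs are "UC" followed by 22 URL-safe characters, which matches commonly seen IDs but is an assumption here. Downloading the page source is left out; the function just scans text you already have.

```python
import re

# Assumption: a channel ID is "UC" plus 22 characters from [A-Za-z0-9_-],
# and is not embedded inside a longer token of the same characters.
CHANNEL_ID_RE = re.compile(
    r"(?<![A-Za-z0-9_-])UC[A-Za-z0-9_-]{22}(?![A-Za-z0-9_-])"
)


def find_channel_ids(page_source: str) -> list[str]:
    """Return channel-ID-like strings, unique, in order of first appearance."""
    found = []
    for candidate in CHANNEL_ID_RE.findall(page_source):
        if candidate not in found:
            found.append(candidate)
    return found
```

If the page yields more than one candidate, the one that appears in a `/channel/...` URL is usually the one to try first with scrape.py.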
- Run scrape.py on your channel: python scrape.py <YOUTUBE_CHANNEL_ID>
- Run chapterize.py
- Run build_index.py or build_index_consolidated.py (the latter might require changing the TAG_MAP)
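
To make the indexing step more concrete, here is a minimal sketch of how a tag index and a TAG_MAP-style grouping could work. It is not the code from build_index.py or build_index_consolidated.py: the `Tags: a, b, c` line format, the helper names `parse_tags` and `build_index`, and the sample TAG_MAP entries are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical grouping: raw tag -> consolidated group name.
TAG_MAP = {
    "shaders": "Rendering",
    "lighting": "Rendering",
    "pathfinding": "AI",
}


def parse_tags(markdown_text):
    """Read tags from an assumed 'Tags: a, b, c' line in a chapter."""
    for line in markdown_text.splitlines():
        if line.lower().startswith("tags:"):
            raw = line.split(":", 1)[1]
            return [t.strip().lower() for t in raw.split(",") if t.strip()]
    return []


def build_index(chapters, tag_map=None):
    """Map each tag (or its group, if tag_map is given) to sorted filenames.

    chapters: dict of filename -> markdown text.
    """
    index = defaultdict(set)
    for name, text in chapters.items():
        for tag in parse_tags(text):
            # Tags missing from tag_map fall through unchanged.
            key = tag_map.get(tag, tag) if tag_map else tag
            index[key].add(name)
    return {key: sorted(names) for key, names in sorted(index.items())}
```

The `chapters` dict could be filled from disk with something like `{p.name: p.read_text(encoding="utf-8") for p in Path("chapters").glob("*.md")}` using `pathlib.Path`. Passing `tag_map=TAG_MAP` gives the consolidated view; omitting it gives one entry per individual tag.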
The rest of the scripts are there for convenience.