Download entire Discourse forums as static HTML files, with support for browser navigation. The tool imitates original Discourse URLs so archived pages can act as drop-in link replacements. Useful for archival and external exports of Discourse forums.
Downloading Glitch Support with its 14k (14,623) threads, including user pages and categories, took 24 hours, 26 minutes, and 39 seconds. Because the tool currently downloads serially, performance will improve once concurrency is implemented, which is a TODO for now. You can view a demo of the final output at https://glitchforum.khaleelgibran.com.
⚠️ This tool is still in beta. It was made to archive the Glitch Support forum (support.glitch.com) and is under heavy development, so it may not work with all Discourse forums. Please report any issues you encounter on the GitHub issues page.
- Download all threads, posts, categories, users, and static pages from a Discourse forum, preserving link formats.
- Generates static HTML files with media embedded, along with JSON files so you can do cool stuff with the data.
- Supports downloading the threads with the latest activity, the last N threads (ordered by creation date), or all threads.
- Resumable downloads using a state file.
When you visit a Discourse page without JavaScript, it serves a clean, static HTML page with all the necessary content and none of the garbage JavaScript. This makes it perfect for archiving: you can download the HTML and media statically, no JavaScript required. Additionally, this tool uses Obelisk, which downloads pages concurrently and embeds any and all media directly within the HTML files, so you can view the forum offline without downloading additional media files. There are literal bugs on my screen as I type this, a consequence of coding outdoors in the rain.

I am not a smelly nerd because I provide EXEs.
- **Download the binary**: download the appropriate binary for your platform from the releases page.
- **Or build from source** (requires Go 1.18+):

```shell
git clone https://github.com/khalby786/discourse-downloader.git
cd discourse-downloader
go build -o discourse_downloader src/*.go
```
By default, the tool downloads the most recent threads and generates an `index.html` in the `downloads` directory.

```shell
./discourse_downloader [flags]
```
| Flag | Default Value | Description |
|---|---|---|
| `-duration` | `recent` | Duration of posts to download: `all` downloads all the threads of the forum, `latest` downloads the last N threads created (as specified by the `-downloadLast` flag), and `recent` downloads the most recently updated threads (called "latest" in Discourse terms). Both `recent` and `latest` do NOT follow state. |
| `-downloadStatic` | `false` | Download static pages (about, guidelines, etc.). |
| `-downloadCategory` | `false` | Download category pages, including a generated index of all threads belonging to each category. |
| `-baseURL` | `https://support.glitch.com/` | Base URL of the Discourse forum to archive. |
| `-baseHost` | `support.glitch.com` | Base host of the Discourse forum. |
| `-stateFile` | `state.json` | Path to the state file for tracking downloads. When you download all threads, this helps you resume progress if your downloads get interrupted. |
| `-userAgent` | `DiscourseDownloaderBot/1.0/khaleelgibran.com` | User agent string for HTTP download requests. |
| `-downloadDir` | `downloads` | Directory to save the downloaded forum to. |
| `-downloadLast` | `50` | Number of latest threads to download (when using the `latest` duration). |
Download the latest 100 threads:

```shell
./discourse_downloader -duration=latest -downloadLast=100
```

Download all threads and static pages:

```shell
./discourse_downloader -duration=all -downloadStatic=true
```

Specify a different forum:

```shell
./discourse_downloader -baseURL=https://meta.discourse.org/ -baseHost=meta.discourse.org
```
When `-duration` is set to `recent` or `latest`, the tool does not follow the state file, meaning it will always download the most recent threads without checking what has already been downloaded. This is useful for getting the latest updates, but may result in duplicate entries in the `threads.json` file, which is used to generate the `index.html` page of all threads. To fix this temporarily, you can use a tool like `jq` to remove duplicate entries and keep the latest ones in `threads.json`:

```shell
jq 'reverse | unique_by(.id) | reverse' threads.json > tmp.json && mv tmp.json threads.json
```
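If you would rather avoid a `jq` dependency, the same dedup can be sketched in Go. This assumes only that each `threads.json` entry carries an `id` field; like the `jq` one-liner, it keeps the last occurrence of each id:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dedupeThreads keeps only the LAST occurrence of each id, mirroring
// jq's `reverse | unique_by(.id) | reverse` on threads.json.
func dedupeThreads(raw []byte) ([]byte, error) {
	var threads []map[string]any
	if err := json.Unmarshal(raw, &threads); err != nil {
		return nil, err
	}
	// Record the index of the final occurrence of each id.
	lastIndex := make(map[float64]int) // JSON numbers decode as float64
	for i, t := range threads {
		if id, ok := t["id"].(float64); ok {
			lastIndex[id] = i
		}
	}
	var out []map[string]any
	for i, t := range threads {
		if id, ok := t["id"].(float64); ok && lastIndex[id] != i {
			continue // older duplicate; a newer copy appears later
		}
		out = append(out, t)
	}
	return json.Marshal(out)
}

func main() {
	in := []byte(`[{"id":1,"title":"old"},{"id":2,"title":"b"},{"id":1,"title":"new"}]`)
	out, _ := dedupeThreads(in)
	fmt.Println(string(out)) // prints [{"id":2,"title":"b"},{"id":1,"title":"new"}]
}
```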
- All downloaded content is saved in the directory specified by `-downloadDir` (default: `downloads`).
- The main page is generated as `index.html` with links to threads, categories, and static pages.
- Each thread, category, and static page is saved as a separate HTML file, preserving the Discourse URL structure.
- JSON files are generated for threads, categories, and users, allowing for easy data manipulation and magic stuff.
- Download threads concurrently to reduce total download time.
- Bundle with Redbean to generate smaller zip files that ship with a built-in HTTP server for distribution.
- Add more configuration options for advanced users.
- Automatically remove duplicate entries in `threads.json`, as mentioned in the caveats section.
MIT License. See LICENSE for details.