discourse downloader

Download entire Discourse forums as static HTML files. The output supports normal browser navigation and mirrors the original Discourse URLs, so existing links work as drop-in replacements. Useful for archiving and external exports of Discourse forums.

Downloading Glitch Support, which has about 14k (14,623) threads, took 24 hours, 26 minutes, and 39.368 seconds, including user pages and categories. The tool currently downloads threads serially; concurrent downloads are still a TODO and should improve this. You can view a demo of the final output at https://glitchforum.khaleelgibran.com.

warning ⚠️

This tool is still in beta and under heavy development. It was made to archive the Glitch Support forum (support.glitch.com) and may not work with all Discourse forums. Please report any issues you encounter on the GitHub issues page.

features

Download all threads, posts, categories, users, and static pages from a Discourse forum, preserving link formats.
Generates static HTML files with media embedded, along with JSON files so you can do cool stuff with the data.
Supports downloading the threads with the latest activity, the last N threads (ordered by creation date), and all threads.
Resumable downloads using a state file.

how it works (if you care) (its okay if you dont, but still)

When you visit a Discourse page without JavaScript, Discourse serves a clean, static HTML page with all the necessary content and none of the garbage JavaScript. This makes it perfect for archiving, as you can download the HTML and media files statically without needing JavaScript. Additionally, the tool uses Obelisk, which downloads each page's assets concurrently and embeds all media directly in the HTML files, so you can view the forum offline without downloading any additional media files. There's literal bugs on my screen as I type this, the consequences of coding outdoors in the rain.

installation

I am not smelly nerd because I provide EXEs.

Download the binary
Download the appropriate binary for your platform from the releases page.

Or build from source
Requires Go 1.18+:

git clone https://github.com/khalby786/discourse-downloader.git
cd discourse-downloader
go build -o discourse_downloader src/*.go

usage

By default, the tool downloads the most recent threads and generates an index.html in the downloads directory.

./discourse_downloader [flags]

config flags

| Flag | Default Value | Description |
| --- | --- | --- |
| `-duration` | `recent` | Which threads to download. `all` downloads every thread in the forum, `latest` downloads the last N threads created (N is set by the `-downloadLast` flag), and `recent` downloads the most recently updated threads (called "latest" in Discourse terms). The `recent` and `latest` options do NOT use the state file. |
| `-downloadStatic` | `false` | Download static pages (about, guidelines, etc.). |
| `-downloadCategory` | `false` | Download category pages, including a generated index of all threads belonging to each category. |
| `-baseURL` | `https://support.glitch.com/` | Base URL of the Discourse forum to archive. |
| `-baseHost` | `support.glitch.com` | Base host of the Discourse forum. |
| `-stateFile` | `state.json` | Path to the state file for tracking downloads. When you download `all` threads, this lets you resume if the download is interrupted. |
| `-userAgent` | `DiscourseDownloaderBot/1.0/khaleelgibran.com` | User agent string sent with HTTP requests while downloading. |
| `-downloadDir` | `downloads` | Directory to save the downloaded forum to. |
| `-downloadLast` | `50` | Number of latest threads to download (when using the `latest` duration). |
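
If you are curious how flags like these map onto Go's standard `flag` package, here is a small sketch. The flag names and defaults are copied from the table above; the `Config` struct, the `parseConfig` helper, and the validation of `-duration` are assumptions made for illustration and are not taken from the tool's source.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config mirrors the flag table; the struct and its field names are hypothetical.
type Config struct {
	Duration         string
	DownloadStatic   bool
	DownloadCategory bool
	BaseURL          string
	BaseHost         string
	StateFile        string
	UserAgent        string
	DownloadDir      string
	DownloadLast     int
}

// parseConfig parses command-line arguments into a Config using the documented defaults.
func parseConfig(args []string) (*Config, error) {
	fs := flag.NewFlagSet("discourse_downloader", flag.ContinueOnError)
	c := &Config{}
	fs.StringVar(&c.Duration, "duration", "recent", "all, latest, or recent")
	fs.BoolVar(&c.DownloadStatic, "downloadStatic", false, "download static pages")
	fs.BoolVar(&c.DownloadCategory, "downloadCategory", false, "download category pages")
	fs.StringVar(&c.BaseURL, "baseURL", "https://support.glitch.com/", "forum base URL")
	fs.StringVar(&c.BaseHost, "baseHost", "support.glitch.com", "forum base host")
	fs.StringVar(&c.StateFile, "stateFile", "state.json", "path to the state file")
	fs.StringVar(&c.UserAgent, "userAgent", "DiscourseDownloaderBot/1.0/khaleelgibran.com", "HTTP user agent")
	fs.StringVar(&c.DownloadDir, "downloadDir", "downloads", "output directory")
	fs.IntVar(&c.DownloadLast, "downloadLast", 50, "thread count for -duration=latest")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	// Assumption: reject unknown durations early. The real tool may behave differently.
	switch c.Duration {
	case "all", "latest", "recent":
	default:
		return nil, fmt.Errorf("invalid -duration %q (want all, latest, or recent)", c.Duration)
	}
	return c, nil
}

func main() {
	c, err := parseConfig(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", *c)
}
```

Note that Go's `flag` package accepts both `-name=value` and `-name value` for non-boolean flags, but boolean flags need the `=` form when you pass an explicit value (for example `-downloadStatic=true`).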

some examples

Download the latest 100 threads:

./discourse_downloader -duration=latest -downloadLast=100

Download all threads and static pages:

./discourse_downloader -duration=all -downloadStatic=true

Specify a different forum:

./discourse_downloader -baseURL=https://meta.discourse.org/ -baseHost=meta.discourse.org
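
In the example above, the `-baseHost` value is simply the host part of `-baseURL`. If you are scripting around the tool, a sketch like the following can derive one from the other with Go's `net/url` package so the two never drift apart. The `hostFromBaseURL` helper is hypothetical and not part of the tool.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostFromBaseURL extracts the host (including any port) from a forum base URL.
func hostFromBaseURL(baseURL string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	// A bare "meta.discourse.org" parses as a path, not a host, so reject it.
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("base URL %q must include a scheme and host", baseURL)
	}
	return u.Host, nil
}

func main() {
	host, err := hostFromBaseURL("https://meta.discourse.org/")
	if err != nil {
		panic(err)
	}
	fmt.Println(host) // prints: meta.discourse.org
}
```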

caveats

When -duration is set to recent or latest, the tool ignores the state file, so it always downloads the most recent threads without checking what has already been downloaded. This is useful for picking up the latest updates, but it may leave duplicate entries in threads.json, the file used to generate the index.html page of all threads. As a temporary workaround, you can use a tool like jq to remove the duplicates and keep the latest entry for each thread:

jq 'reverse | unique_by(.id) | reverse' threads.json > tmp.json && mv tmp.json threads.json

output

All downloaded content is saved in the directory specified by -downloadDir (default: downloads).
The main page is generated as index.html with links to threads, categories, and static pages.
Each thread, category, and static page is saved as a separate HTML file, preserving Discourse URL structure.
JSON files are generated for threads, categories, and users, allowing for easy data manipulation and magic stuff.

todo

Download threads concurrently to reduce total download time.
Bundle with Redbean to generate smaller zip files with an HTTP server for distribution.
Add more configuration options for advanced users.
Automatically remove duplicate entries in threads.json, as mentioned in the caveats section.

license

MIT License. See LICENSE for details.
