khalby786/discourse-downloader

discourse downloader

Download entire Discourse forums as static HTML files. The output supports browser navigation and mirrors the original Discourse URLs, so archived pages work as drop-in link replacements. Useful for archiving and external exports of Discourse forums.

Downloading Glitch Support, which has 14,623 threads, including user pages and categories, took 24 hours, 26 minutes, and 39.368 seconds. The tool currently downloads threads serially; concurrent downloads are on the todo list and should reduce this time. You can view a demo of the final output at https://glitchforum.khaleelgibran.com.

// warning ⚠️

This tool is still in beta and under heavy development. It was made to archive the Glitch Support (support.glitch.com) forum and may not work with all Discourse forums. Please report any issues you encounter on the GitHub issues page.

features

  • Download all threads, posts, categories, users, and static pages from a Discourse forum, preserving link formats.
  • Generates static HTML files with media embedded, along with JSON files so you can do cool stuff with the data.
  • Supports downloading the threads with the latest activity, the last N threads (ordered by creation date), and all threads.
  • Resumable downloads using a state file.

how it works (if you care) (it's okay if you don't, but still)

When you visit a Discourse page without JavaScript, it serves a clean, static HTML page with all the necessary content and none of the garbage JavaScript. This makes it perfect for archiving, as you can download the HTML and media files without needing JavaScript. Additionally, the tool uses Obelisk, which downloads a page's assets concurrently and embeds all media directly within the HTML files, so you can view the forum offline without downloading any additional media files. There's literal bugs on my screen as I type this, the consequences of coding outdoors in the rain.

installation

I am not smelly nerd because I provide EXEs.

  1. Download the binary
    Download the appropriate binary for your platform from the releases page.

  2. Or build from source
    Requires Go 1.18+:

    git clone https://github.com/khalby786/discourse-downloader.git
    cd discourse-downloader
    go build -o discourse_downloader src/*.go

usage

By default, the tool downloads the most recent threads and generates an index.html in the downloads directory.

./discourse_downloader [flags]

config flags

| Flag | Default Value | Description |
| --- | --- | --- |
| -duration | recent | Which threads to download. all downloads every thread in the forum, latest downloads the last N threads created (N is set by the -downloadLast flag), and recent downloads the most recently updated threads (called "latest" in Discourse terms). Both the recent and latest options do NOT follow state. |
| -downloadStatic | false | Download static pages (about, guidelines, etc.). |
| -downloadCategory | false | Download category pages, including a generated index of all threads belonging to each category. |
| -baseURL | https://support.glitch.com/ | Base URL of the Discourse forum to archive. |
| -baseHost | support.glitch.com | Base host of the Discourse forum. |
| -stateFile | state.json | Path to the state file for tracking downloads. When you download all threads, this helps you resume progress if your downloads get interrupted. |
| -userAgent | DiscourseDownloaderBot/1.0/khaleelgibran.com | User agent string for the HTTP requests made while downloading. |
| -downloadDir | downloads | Directory to save the downloaded forum to. |
| -downloadLast | 50 | Number of latest threads to download (when using the latest duration). |

some examples

Download the latest 100 threads:

./discourse_downloader -duration=latest -downloadLast=100

Download all threads and static pages:

./discourse_downloader -duration=all -downloadStatic=true

Specify a different forum:

./discourse_downloader -baseURL=https://meta.discourse.org/ -baseHost=meta.discourse.org

caveats

When -duration is set to recent or latest, the tool does not follow the state file, so it always downloads the most recent threads without checking what has already been downloaded. This is useful for getting the latest updates, but it may leave duplicate entries in the threads.json file, which is used to generate the index.html page of all threads. As a temporary workaround, you can use a tool like jq to remove duplicate entries from threads.json and keep the latest ones.

jq 'reverse | unique_by(.id) | reverse' threads.json > tmp.json && mv tmp.json threads.json
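If you would rather not depend on jq, the same cleanup can be sketched in Go. This assumes, like the jq command does, that threads.json is a top-level JSON array of objects with a numeric id field; dedupeByID is a hypothetical helper and not part of the tool. Unlike the jq version, which ends up sorted by id, this sketch keeps the surviving entries in their original relative order.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dedupeByID (hypothetical helper) keeps the last occurrence of each "id" in
// a JSON array of objects and preserves the relative order of the survivors.
// Entries without an "id" field all decode to id 0 and collapse into one.
func dedupeByID(data []byte) ([]byte, error) {
	var entries []json.RawMessage
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err
	}

	ids := make([]int64, len(entries))
	lastIndex := make(map[int64]int) // id -> index of its last occurrence
	for i, raw := range entries {
		var e struct {
			ID int64 `json:"id"`
		}
		if err := json.Unmarshal(raw, &e); err != nil {
			return nil, err
		}
		ids[i] = e.ID
		lastIndex[e.ID] = i
	}

	kept := make([]json.RawMessage, 0, len(lastIndex))
	for i, raw := range entries {
		if lastIndex[ids[i]] == i {
			kept = append(kept, raw)
		}
	}
	return json.Marshal(kept)
}

func main() {
	in := []byte(`[{"id":1,"title":"old"},{"id":2,"title":"b"},{"id":1,"title":"new"}]`)
	out, err := dedupeByID(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

To use it on a real file, read threads.json with os.ReadFile, pass the bytes through dedupeByID, and write the result back with os.WriteFile.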

output

  • All downloaded content is saved in the directory specified by -downloadDir (default: downloads).
  • The main page is generated as index.html with links to threads, categories, and static pages.
  • Each thread, category, and static page is saved as a separate HTML file, preserving Discourse URL structure.
  • JSON files are generated for threads, categories, and users, allowing for easy data manipulation and magic stuff.

todo

  • Download threads concurrently to reduce the total download time.
  • Bundle with Redbean to generate smaller zip files with an HTTP server for distribution.
  • Add more configuration options for advanced users.
  • Automatically remove duplicate entries in threads.json, as mentioned in the caveats section.

license

MIT License. See LICENSE for details.
