This codebase relates to the GabLeaks release from Distributed Denial of Secrets. While the Gab data itself is limited distribution, this helper code is public.
The original release is a PostgreSQL database dump containing four tables: accounts, statuses, gabgroups, and verifications. In the 'statuses', 'accounts', and 'gabgroups' tables, most of the relevant details are stored as JSON in a single 'data' column.
The mixed JSON / SQL format is not the most convenient for analysis. This script expands the data, creating new statuses_expanded, accounts_expanded, and gabgroups_expanded tables with individual typed fields instead of one big JSON blob. The schemas are as follows:
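As a rough illustration of what "expanding" means here, the sketch below turns one JSON blob into a dictionary of typed values. It is not the code in expand.py: the function names, the handful of fields chosen, and the sample record are all assumptions made up for this example, and the real data may encode values differently (for instance, ids may arrive as strings or as numbers).

```python
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp string, passing None through unchanged."""
    if value is None:
        return None
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def as_int(value):
    """Convert to int (for a bigint column), passing None through unchanged."""
    return None if value is None else int(value)


def as_json_text(value):
    """Re-serialize nested values so they can be stored in a jsonb column."""
    return None if value is None else json.dumps(value)


# Hypothetical subset of the statuses_expanded columns and their converters.
STATUS_FIELDS = {
    "id": as_int,
    "content": lambda v: v,
    "created_at": parse_timestamp,
    "favourites_count": as_int,
    "sensitive": lambda v: v,
    "tags": as_json_text,
}


def expand_row(data_json):
    """Turn the JSON text of one 'data' cell into a dict of typed fields."""
    data = json.loads(data_json)
    return {name: convert(data.get(name)) for name, convert in STATUS_FIELDS.items()}


# Made-up sample record, not taken from the dataset.
sample = (
    '{"id": "123", "content": "<p>hello</p>", '
    '"created_at": "2021-01-05T12:30:00.000Z", '
    '"favourites_count": 2, "sensitive": false, "tags": []}'
)
row = expand_row(sample)
```

Fields missing from a blob come back as None, which maps naturally onto SQL NULL when the row is inserted.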
statuses_expanded
| Field | Type |
|---|---|
| id | bigint |
| bookmark_collection_id | bigint |
| card | jsonb |
| content | text |
| created_at | timestamp with time zone |
| emojis | jsonb |
| expires_at | timestamp with time zone |
| favourited | boolean |
| favourites_count | bigint |
| group_ | jsonb |
| has_quote | boolean |
| in_reply_to_account_id | bigint |
| in_reply_to_id | bigint |
| language | text |
| media_attachments | jsonb |
| mentions | jsonb |
| pinnable | boolean |
| pinnable_by_group | boolean |
| plain_markdown | text |
| poll | jsonb |
| quote | jsonb |
| quote_of_id | bigint |
| reblog | jsonb |
| reblogged | boolean |
| reblogs_count | bigint |
| replies_count | bigint |
| revised_at | text |
| rich_content | text |
| sensitive | boolean |
| spoiler_text | text |
| tags | jsonb |
| url | text |
| visibility | text |
accounts_expanded
| Field | Type |
|---|---|
| id | bigint |
| text | |
| password | text |
| name | text |
| bot | boolean |
| url | text |
| note | text |
| avatar | text |
| emojis | jsonb |
| fields | jsonb |
| header | text |
| is_pro | boolean |
| locked | boolean |
| is_donor | boolean |
| created | timestamp with time zone |
| is_investor | boolean |
| is_verified | boolean |
| display_name | text |
| avatar_static | text |
| header_static | text |
| statuses_count | bigint |
| followers_count | bigint |
| following_count | bigint |
| is_flagged_as_spam | boolean |
gabgroups_expanded
| Field | Type |
|---|---|
| id | int |
| password | text |
| url | text |
| slug | jsonb |
| tags | jsonb |
| title | text |
| created_at | timestamp with time zone |
| is_private | boolean |
| is_visible | boolean |
| description | text |
| is_archived | boolean |
| has_password | boolean |
| member_count | int |
| group_category | jsonb |
| cover_image_url | text |
| description_html | text |
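Once the expanded tables exist, they can be queried with ordinary typed SQL instead of JSON operators. The following is a minimal sketch of such a query from Python. The column names come from the statuses_expanded schema above, but the query, the constant `TOP_STATUSES_SQL`, and the function `top_statuses` are hypothetical examples and not part of this project.

```python
# Illustrative query only; expand.py does not define or run this.
TOP_STATUSES_SQL = """
    SELECT id, created_at, favourites_count, content
    FROM statuses_expanded
    WHERE language = %s
    ORDER BY favourites_count DESC
    LIMIT %s
"""


def top_statuses(conn, language, limit=10):
    """Return the most-favourited statuses written in one language.

    `conn` is assumed to be an open DB-API connection, such as the one
    returned by psycopg2.connect(). Values are passed as query parameters
    rather than being formatted into the SQL string.
    """
    cur = conn.cursor()
    try:
        cur.execute(TOP_STATUSES_SQL, (language, limit))
        return cur.fetchall()
    finally:
        # Always release the cursor, even if the query fails.
        cur.close()
```

With a live connection, `top_statuses(conn, "en", 5)` would return up to five rows as tuples in the column order of the SELECT list.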
First, ensure the dataset is already loaded into PostgreSQL. Installing and configuring PostgreSQL, creating a database and user, and importing the database contents from the SQL dump are out of scope for this README.
Second, install dependencies for this project:
`pip3 install -r requirements.txt`
Or install manually with:
`pip3 install psycopg2 tqdm`
Finally, launch the script:
`./expand.py <sql_host> <sql_user> <sql_db_name>`
For example:
`./expand.py localhost postgres gableaks`
You'll be prompted for your postgres user's password, and the script should take care of everything from there. A progress bar is included: with about 39 million statuses, expanding the statuses table took a little over two hours in development tests.
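For readers curious how a command line like the one above could map onto a database connection, here is a minimal sketch. It is an assumption about the general approach, not a copy of expand.py: the function names `parse_args` and `connect` are hypothetical, and only the three positional arguments and the password prompt are taken from this README.

```python
import argparse
import getpass


def parse_args(argv=None):
    """Parse the three positional arguments shown in the usage line."""
    parser = argparse.ArgumentParser(
        description="Expand JSON columns into typed tables."
    )
    parser.add_argument("sql_host")
    parser.add_argument("sql_user")
    parser.add_argument("sql_db_name")
    return parser.parse_args(argv)


def connect(args):
    """Prompt for a password without echoing it, then open a connection."""
    # Imported here so argument parsing still works if psycopg2 is missing.
    import psycopg2

    password = getpass.getpass("Password for %s: " % args.sql_user)
    return psycopg2.connect(
        host=args.sql_host,
        user=args.sql_user,
        dbname=args.sql_db_name,
        password=password,
    )
```

Reading the password with `getpass` rather than taking it as an argument keeps it out of shell history and process listings.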