- The data collection section consists of the core retrieval scripts and datasets to be used across multiple parts of the research as explained in the data preparation section in our paper.
- The remaining RQ1-3 sections cover steps to preprocess the preliminary dataset and perform their respective analyses.
- See requirements.txt for the required Python packages. venv is recommended for Python package management.
The following sections describe the data collection process for the raw dataset. Any further processing of the raw data is covered in the respective RQ sections.
BigQuery dataset retrieval scripts:
- retrieve-hn-stories.sql: Script used to retrieve Hacker News stories using the public Hacker News dataset.
- retrieve-hn-comments.sql: Script used to retrieve Hacker News comments based on the retrieved Hacker News stories.
- retrieve-gh-metrics.sql: Script used to retrieve GitHub metrics using the public GitHub Archive dataset. Requires the retrieve-hn-stories.sql script to be run first and its resulting dataset to be stored, because the repository URLs are referenced from it.
- retrieve-gh-metadata.sql: Script used to retrieve GitHub metadata based on the retrieved Hacker News stories. Metadata such as the repository creation date is required in certain parts, such as RQ3, when identifying projects that were created at least 6 months prior to HN submission.
Preliminary AI and GitHub repository filtering scripts:
- ai_keywords.txt: List of AI keywords used in the script, with one keyword per line.
- filter-ai-keywords.py: Script used to keep only stories whose titles contain AI keywords defined in the ai_keywords.txt file.
- filter-gh-repos.py: Script used to keep only stories containing GitHub repository URLs.
- remove-duplicate-urls.py: Script used to remove entries containing duplicate GitHub repository URLs, keeping only the first story with the highest HN score among the duplicates. This is needed for RQ3 in order not to analyze duplicated GitHub repositories, but may be useful in general.
Note: The AI and GitHub filtering was done directly within the data collection stage for efficiency, as the filtered datasets are used throughout our research.
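The three filtering steps can be sketched roughly as follows. This is a minimal illustration, not the repository's scripts: the function names, the whole-word case-insensitive matching rule, and the URL normalization are assumptions. Only the column names title, url, and score come from the story dataset.

```python
import re

# Assumed pattern for a GitHub repository URL: github.com/<owner>/<repo>
GITHUB_REPO_RE = re.compile(
    r"https?://(?:www\.)?github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE
)


def load_keywords(text):
    """Parse newline-separated keywords, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def has_ai_keyword(title, keywords):
    """True if any keyword occurs in the title as a whole word (case-insensitive)."""
    return any(
        re.search(r"(?<!\w)" + re.escape(kw) + r"(?!\w)", title, re.IGNORECASE)
        for kw in keywords
    )


def github_repo(url):
    """Return a normalized 'owner/repo' string, or None for non-repository URLs."""
    match = GITHUB_REPO_RE.match(url or "")
    if not match:
        return None
    owner, repo = match.groups()
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{owner}/{repo}".lower()


def remove_duplicate_repos(stories):
    """Keep one story per repository: the first one with the highest score."""
    best = {}
    for story in stories:
        repo = github_repo(story["url"])
        if repo is None:
            continue
        # Strict '>' keeps the earlier story when scores tie.
        if repo not in best or story["score"] > best[repo]["score"]:
            best[repo] = story
    return list(best.values())


# Hypothetical sample rows (not real data).
keywords = load_keywords("AI\nLLM\nmachine learning\n")
stories = [
    {"title": "Show HN: A tiny LLM runtime", "url": "https://github.com/example/tiny-llm", "score": 10},
    {"title": "Tiny LLM runtime, now faster", "url": "https://github.com/Example/tiny-llm/", "score": 42},
    {"title": "Maintaining a mail server", "url": "https://example.com/mail", "score": 99},
]
ai_stories = [s for s in stories if has_ai_keyword(s["title"], keywords)]
unique = remove_duplicate_repos(ai_stories)
```

Whole-word matching is used here so that a short keyword such as AI does not match inside words like "Maintaining"; the actual script may apply a different rule.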
Each tag denotes a different filtering criterion applied to the dataset:
- gh: All stories containing a GitHub repository URL.
- gh-ai: AI stories containing a GitHub repository URL.
- gh-ai-[no-dupes]: AI stories containing a GitHub repository URL, with duplicate GitHub repositories removed. Duplicates occur because multiple HN stories can reference the same repository URL. The non-duplicated dataset is used in RQ3, where each repository must be unique to analyze its metrics.
- gh-nonai: Non-AI stories containing a GitHub repository URL.
- gh-nonai-[no-dupes]: Non-AI stories containing a GitHub repository URL, with duplicate GitHub repositories removed.
Attribute | Description |
---|---|
by | Username of the story's submitter. |
dead | Indicates if the story is flagged as "dead" (removed). Empty if not dead. |
id | Unique identifier for the story. |
score | Upvote count (popularity score). |
text | Self-text content of the story (empty if URL-based). |
time | Unix epoch timestamp of submission time. |
timestamp | Human-readable UTC timestamp of submission. |
title | Title of the story. |
type | Post type (e.g., story, poll, etc.). |
url | External URL linked in the story (if applicable). |
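As a rough illustration of how these columns might be read, the sketch below parses one row with the standard library and converts the time column (Unix epoch) to a timezone-aware UTC datetime. The sample row, the exact text format of the timestamp column, and the helper name parse_story are assumptions made for the example.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical one-row sample in the documented column order (not real data).
SAMPLE = """by,dead,id,score,text,time,timestamp,title,type,url
alice,,1,12,,1700000000,2023-11-14 22:13:20 UTC,Example story,story,https://github.com/example/repo
"""


def parse_story(row):
    """Convert the string fields of one CSV row into typed values."""
    return {
        **row,
        "id": int(row["id"]),
        "score": int(row["score"]),
        "dead": row["dead"] != "",  # an empty field means the story is not dead
        "submitted_at": datetime.fromtimestamp(int(row["time"]), tz=timezone.utc),
    }


stories = [parse_story(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(stories[0]["submitted_at"].isoformat())  # 2023-11-14T22:13:20+00:00
```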
Same tags as the hn_story-[tag].csv dataset for each filtering criterion.
Attribute | Description |
---|---|
comment_id | Unique identifier for the comment. |
commenter | Username of the comment's author. |
comment_text | Text content of the comment. |
comment_time | UTC timestamp of when the comment was posted. |
parent_id | ID of the parent post (matches story_id for top-level comments). |
story_id | ID of the story the comment is associated with. |
story_title | Title of the linked story. |
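Because parent_id equals story_id only for top-level comments, the two columns are enough to separate direct responses to a story from nested replies. A minimal sketch, with a hypothetical helper name and sample rows:

```python
from collections import defaultdict


def split_by_depth(comments):
    """Group comment IDs per story into top-level comments and nested replies."""
    threads = defaultdict(lambda: {"top_level": [], "replies": []})
    for c in comments:
        # A top-level comment's parent is the story itself.
        kind = "top_level" if c["parent_id"] == c["story_id"] else "replies"
        threads[c["story_id"]][kind].append(c["comment_id"])
    return dict(threads)


# Hypothetical rows (not real data).
comments = [
    {"comment_id": 101, "parent_id": 1, "story_id": 1},
    {"comment_id": 102, "parent_id": 101, "story_id": 1},
    {"comment_id": 103, "parent_id": 1, "story_id": 1},
]
threads = split_by_depth(comments)
```

When the rows come straight from a CSV reader, both IDs are strings, so the comparison still works as long as neither column is converted on its own.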
Metric tags include the following:
- hn-gh-ai: Metric data for HN GH-AI stories.
- hn-gh-ai-[6-months-before-after-hn]: Metric data for AI stories containing a GitHub repository URL whose repository was created at least 6 months before HN submission and has metrics for 6 months afterwards. This ensures sufficient project activity; the dataset was used for statistical analysis and DiD. Note that this file does not contain months relative to HN submission, only the raw metrics for the repositories that fit the 6-months-before-and-after criterion, in the same format as the standard metrics file. It was generated by running rq3/3a/filter-6months-before-after-hn.py.
Attribute | Description |
---|---|
repo_full_name | Full name of the GitHub repository in the format owner/repo. |
repo_url | URL link to the GitHub repository. |
month | Monthly timestamp (YYYY-MM) indicating the time period of the metrics. |
hn_submission_date | Date and time when the repository was submitted to Hacker News (if any). |
source | Source of submission information (e.g., "HN" for Hacker News). |
hn_score | Hacker News score (upvotes minus flags) for the submission. |
stars | Number of new GitHub stars the repository received in that month. |
forks | Number of new forks created in that month. |
commits | Number of commits made to the repository in that month. |
PRs | Number of pull requests opened in that month. |
contributors | Number of unique contributors active in that month. |
cumulative_stars | Total number of GitHub stars the repository has received up to that month. |
cumulative_forks | Total number of GitHub forks the repository has received up to that month. |
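The relationship between the monthly columns (stars, forks) and their cumulative counterparts can be illustrated with a running total. This is only a sketch: the helper name and the baseline parameter (the count accumulated before the first recorded month, assumed to be 0 here) are not taken from the retrieval scripts.

```python
from itertools import accumulate


def add_cumulative(rows, metric="stars", baseline=0):
    """Return rows sorted by month with a running total of `metric` added."""
    ordered = sorted(rows, key=lambda r: r["month"])  # 'YYYY-MM' sorts chronologically
    totals = accumulate((r[metric] for r in ordered), initial=baseline)
    next(totals)  # skip the bare baseline value emitted first
    return [{**r, f"cumulative_{metric}": t} for r, t in zip(ordered, totals)]


# Hypothetical monthly rows for one repository (not real data).
rows = [
    {"month": "2023-03", "stars": 5},
    {"month": "2023-01", "stars": 10},
    {"month": "2023-02", "stars": 0},
]
result = add_cumulative(rows)
```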
- hn-gh-ai-5months-[no-dupes]: Metric data for HN GH-AI stories, converted into the 5 months relative to the HN submission date. It was generated by running rq3/3b-hn-effects-sentiment-growth/convert-to-relative-hn-date.py. The relative metrics format is as follows:
Attribute | Description |
---|---|
by | Username of the person who submitted the post on Hacker News. |
discussion_id | Unique identifier of the Hacker News discussion thread. |
score | Hacker News score (upvotes minus flags) at the time of scraping. |
timestamp | Date and time when the Hacker News post was made. |
title | Title of the Hacker News submission. |
url | GitHub URL of the submitted repository. |
stars_at_submission | GitHub stars at the time of Hacker News submission. |
stars_month_1 to _month_5 | Stars gained in each of the first five months after submission. |
commits_at_submission | Total commits in the repo at time of submission. |
commits_month_1 to _month_5 | Number of commits in each of the five months after submission. |
pull_requests_at_submission | Total pull requests at submission time. |
pull_requests_month_1 to _month_5 | Number of pull requests opened in each post-submission month. |
forks_at_submission | GitHub forks at the time of Hacker News submission. |
forks_month_1 to _month_5 | Forks gained in each of the five months following submission. |
contributors_at_submission | Number of contributors at the time of submission. |
contributors_month_1 to _month_5 | Number of contributors in each post-submission month. |
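The conversion from calendar months to months relative to the HN submission could look roughly like the sketch below. It is not the conversion script itself: the function names are hypothetical, and treating everything up to and including the submission month as the "at submission" value is an assumption, since the monthly data cannot resolve the exact submission day.

```python
from datetime import date


def month_offset(month, submission_date):
    """Calendar months from the submission month to `month` ('YYYY-MM')."""
    year, mon = map(int, month.split("-"))
    return (year - submission_date.year) * 12 + (mon - submission_date.month)


def to_relative(monthly_rows, submission_date, metric="stars", months=5):
    """Build <metric>_at_submission and <metric>_month_1..N from monthly rows."""
    out = {f"{metric}_at_submission": 0}
    out.update({f"{metric}_month_{i}": 0 for i in range(1, months + 1)})
    for row in monthly_rows:
        offset = month_offset(row["month"], submission_date)
        if offset <= 0:
            # Assumption: months up to and including the submission month
            # count toward the value "at submission".
            out[f"{metric}_at_submission"] += row[metric]
        elif offset <= months:
            out[f"{metric}_month_{offset}"] += row[metric]
        # Months beyond the window are ignored.
    return out


# Hypothetical monthly rows for one repository (not real data).
rows = [
    {"month": "2022-12", "stars": 3},
    {"month": "2023-01", "stars": 7},
    {"month": "2023-02", "stars": 20},
    {"month": "2023-04", "stars": 4},
    {"month": "2023-07", "stars": 9},
]
relative = to_relative(rows, date(2023, 1, 15))
```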
Attribute | Description |
---|---|
repo_full_name | GitHub repository identifier in owner/repo format. |
owner | GitHub username/organization owning the repository. |
repo_name | Name of the GitHub repository. |
repo_creation_date | UTC timestamp of repository creation. |
contributor_count | Total contributors to the repository. |
push_count | Number of pushes (commits) to the repository. |
star_count | Number of GitHub stars. |
fork_count | Number of repository forks. |
Note: In metadata-[tags].csv, repository stats (star_count, fork_count, etc.) reflect historical snapshots at the time of data collection.
- analyze-gh-repo-creation-date.ipynb: Analyze the distribution of GitHub repository creation dates.
- analyze-gh-usernames.ipynb: Analyze the number of GitHub repository owners who self-promote their projects on Hacker News.
- analyze-hn-stories.ipynb: Visualize the number of Hacker News stories over time and the keyword distribution.
- analyze-hn-comments.ipynb: Analyze Hacker News comments based on the HN GH-AI stories.
- analyze-hn-stories-lda-[colab].ipynb: Perform topic modeling on Hacker News stories to identify dominant topics. This was executed in Google Colab because of the heavy computational requirements of LDA.
- sample-hn-comments-stratified.py: Sample comments using stratified sampling from the HN AI comments.
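For readers unfamiliar with stratified sampling, the idea is to draw the same fraction from every group so that each group stays represented. The sketch below is a generic illustration only: the stratification variable (story_id here), the sampling fraction, and the seed are assumptions, not the settings used in sample-hn-comments-stratified.py.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, fraction, seed=0):
    """Sample the same fraction from every stratum returned by `stratum_of`."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        # Take at least one item so small strata stay represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical comments: 30 comments spread evenly over 3 stories.
comments = [{"comment_id": i, "story_id": i % 3} for i in range(30)]
sample = stratified_sample(comments, lambda c: c["story_id"], fraction=0.2)
```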
- finetuning-bert-hn-comment-[colab].ipynb: Fine-tune BERT on the comment ground truth and evaluate performance using 5-fold cross-validation in Google Colab.
- finetuning-bert-hn-story-[colab].ipynb: Fine-tune BERT on the story ground truth and evaluate performance using 5-fold cross-validation in Google Colab.
- finetuning-roberta-hn-comment-[colab].ipynb: Fine-tune RoBERTa on the comment ground truth and evaluate performance using 5-fold cross-validation in Google Colab.
- finetuning-roberta-hn-story-[colab].ipynb: Fine-tune RoBERTa on the story ground truth and evaluate performance using 5-fold cross-validation in Google Colab.
- finetuning-twitter-roberta-hn-comment-[colab].ipynb: Fine-tune Twitter RoBERTa on the comment ground truth and evaluate performance using 5-fold cross-validation in Google Colab.
- finetuning-twitter-roberta-hn-story-[colab].ipynb: Fine-tune Twitter RoBERTa on the story ground truth and evaluate performance using 5-fold cross-validation in Google Colab.
- prompting-gpt-hn-ai-comment.ipynb: Prompt GPT-4o mini to output sentiment and reason, and evaluate performance using weighted F1-score on the comment ground truth.
- prompting-gpt-hn-ai-story.ipynb: Prompt GPT-4o mini to output sentiment and reason, and evaluate performance using weighted F1-score on the story ground truth.
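The weighted F1-score averages the per-class F1 values, weighting each class by its number of ground-truth examples. A minimal standard-library sketch of the metric follows; the notebooks may well rely on a library implementation instead, and the labels below are made up for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical consensus labels and model outputs (1 positive, 0 neutral, -1 negative).
y_true = [1, 1, 0, -1]
y_pred = [1, 0, 0, -1]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

In this example the per-class F1 values are 2/3 (positive), 2/3 (neutral), and 1 (negative), which with supports 2, 1, and 1 gives 0.75.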
- hn_gh_ai_story_sentiment_analysis.ipynb: Perform sentiment analysis on the HN GH-AI stories.
- hn_gh_ai_comment_sentiment_analysis.ipynb: Perform sentiment analysis on the HN GH-AI comments.
- hn_gh_non_ai_story_sentiment_analysis.ipynb: Perform sentiment analysis on the HN GH stories with no AI keywords.
- hn_gh_non_ai_comment_sentiment_analysis.ipynb: Perform sentiment analysis on all comments in the HN GH stories with no AI keywords.
- analyze-sentiment-result.ipynb: Visualize the story and comment sentiment results using stacked area charts and the reaction using a heatmap.
- statistical-test.ipynb: Perform statistical tests between the HN GH-AI stories and comments and the HN GH stories and comments with no AI keywords.
- Consists of all CSV datasets used or created by the scripts inside rq2.
Data can be either:
- story: Hacker News story
- comment: Hacker News comment
Attribute | Description |
---|---|
discussion_id | ID of Hacker News story |
title | Title of the story |
url | External URL linked in the story |
discussion_date | Date and time of story submission |
comment_id | ID of Hacker News comment |
comment_text | Text content of the comment |
comment_date | Date and time of when the comment was posted |
hu1_[data]_label | Human investigator No.1 [data] label |
hu2_[data]_label | Human investigator No.2 [data] label |
[data]_match | Whether the hu1 and hu2 [data] labels match |
[data]_consensus | Final [data] consensus label |
Attribute | Description |
---|---|
discussion_id | ID of Hacker News story |
title | Title of the story |
url | External URL linked in the story |
discussion_date | Date and time of story submission |
hu1_story_label | Human investigator No.1 story label |
hu2_story_label | Human investigator No.2 story label |
story_match | Whether the hu1 and hu2 story labels match |
story_consensus | Final story consensus label |
senti_prompt0_2shot_gpt | Story sentiment towards AI from GPT-4o mini |
reason_prompt0_2shot_gpt | Reason for the assigned story sentiment from GPT-4o mini |
Attribute | Description |
---|---|
discussion_id | ID of Hacker News story |
title | Title of the story |
url | External URL linked in the story |
discussion_date | Date and time of story submission |
comment_id | ID of Hacker News comment |
comment_text | Text content of the comment |
comment_date | Date and time of when the comment was posted |
hu1_comment_label | Human investigator No.1 comment label |
hu2_comment_label | Human investigator No.2 comment label |
comment_match | Whether the hu1 and hu2 comment labels match |
comment_consensus | Final comment consensus label |
story_consensus | Final story consensus label |
senti_comment_prompt2_1shot_gpt | Comment sentiment towards AI from GPT-4o mini |
reason_comment_prompt2_1shot_gpt | Reason for the assigned comment sentiment from GPT-4o mini |
hn_gh_ai_story_sentiment.csv - Story Sentiment and Reason from GPT-4o mini on HN GH-AI Story Dataset
Attribute | Description |
---|---|
story_id | ID of Hacker News story |
title | Title of the story |
url | External URL linked in the story |
datetime | Unix epoch timestamp of story submission time. |
story_sentiment | Story sentiment towards AI from GPT-4o mini |
story_sentiment_reason | Reason for the assigned story sentiment from GPT-4o mini |
hn_gh_ai_comment_sentiment.csv - Comment Sentiment and Reason from GPT-4o mini on HN GH-AI Comment Dataset
Attribute | Description |
---|---|
comment_id | ID of Hacker News comment |
comment_text | Text content of the comment |
comment_datetime | Unix epoch timestamp of comment posted time. |
story_id | ID of Hacker News story |
story_title | Title of the story |
url | External URL linked in the story |
story_datetime | Unix epoch timestamp of story submission time. |
comment_sentiment | Comment sentiment towards AI from GPT-4o mini |
comment_sentiment_reason | Reason for the assigned comment sentiment from GPT-4o mini |
story_sentiment | Story sentiment towards AI from GPT-4o mini |
story_sentiment_reason | Reason for the assigned story sentiment from GPT-4o mini |
hn_gh_non_ai_story_sentiment.csv - Story Sentiment and Reason from GPT-4o mini on HN GH Story Dataset with No AI Keywords
Attribute | Description |
---|---|
story_id | ID of Hacker News story |
title | Title of the story |
url | External URL linked in the story |
datetime | Unix epoch timestamp of story submission time. |
story_sentiment | Story sentiment towards technology from GPT-4o mini |
story_sentiment_reason | Reason for the assigned story sentiment from GPT-4o mini |
hn_gh_non_ai_comment_sentiment.csv - Comment Sentiment and Reason from GPT-4o mini on All Comments Replying to the HN GH Story Dataset with No AI Keywords
Attribute | Description |
---|---|
comment_id | ID of Hacker News comment |
comment_text | Text content of the comment |
comment_datetime | Unix epoch timestamp of comment posted time. |
story_id | ID of Hacker News story |
story_title | Title of the story |
url | External URL linked in the story |
story_datetime | Unix epoch timestamp of story submission time. |
comment_sentiment | Comment sentiment towards technology from GPT-4o mini |
comment_sentiment_reason | Reason for the assigned comment sentiment from GPT-4o mini |
- analyze-historical-metrics.ipynb: Analyze historical GitHub repository metrics over time. Outlier repositories are filtered out before the analysis.
- stats-test-metrics.ipynb: Perform statistical tests on the metric changes after HN submission, using the historical metrics data. Repositories are first filtered to those with metrics for at least 6 months before and after HN submission, to ensure a sufficient time frame of repository activity.
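One way to express the 6-months-before-and-after criterion is to check that a repository's recorded months span at least six months on each side of the submission month. The sketch below makes that interpretation explicit; the notebook may instead require every month in the window to be present, and the function names are hypothetical.

```python
def month_index(month):
    """Map 'YYYY-MM' to a running month number so months can be subtracted."""
    year, mon = map(int, month.split("-"))
    return year * 12 + mon


def spans_window(months, submission_month, window=6):
    """True if recorded months reach at least `window` months before and after submission."""
    center = month_index(submission_month)
    offsets = [month_index(m) - center for m in months]
    return bool(offsets) and min(offsets) <= -window and max(offsets) >= window


# Hypothetical repositories: recorded metric months and the HN submission month.
repos = {
    "example/long-lived": (["2022-06", "2023-01", "2023-08"], "2023-01"),
    "example/too-new": (["2022-11", "2023-01", "2023-09"], "2023-01"),
}
kept = [name for name, (months, submitted) in repos.items() if spans_window(months, submitted)]
```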
Perform analysis and visualization on GitHub metric data based on their respective sentiment on Hacker News, to find the relationship between Hacker News sentiment and GitHub metric growth.
average_comment_sentiment_Calculator.ipynb
- Merge Hacker News comment dataset file into story dataset file via parent story ID
- Filter out stories without comments, and stories that link to the same GitHub repository as another story, keeping only the oldest one
- Calculate average comment sentiment of each story and classify overall sentiment from the result (positive, neutral, negative)
file_merger.ipynb
- Merge Hacker News dataset file with GitHub dataset file via story URL
- Display amount of data
metric_value_processor.ipynb
- Calculate the cumulative value of each metric (except contributors)
- Calculate the difference between months as both raw and percentage values
sentiment_categorizer.ipynb
- Categorize the dataset based on sentiment (positive, neutral, negative) into separate files
- Display amount of data in each group/file
outliers_filter.ipynb
- Remove outliers of each sentiment group using IQR method
metric_mean_display.ipynb (optional)
- Return or display the mean of the metric values in each sentiment group
visualizer.ipynb
- Visualize the data as box plot (median) and point plot (mean)
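The IQR method mentioned for outliers_filter.ipynb can be sketched as follows. The multiplier k = 1.5 and the inclusive quartile method are conventional choices assumed here, not settings confirmed by the notebook.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical metric values for one sentiment group (not real data).
stars_gained = [1, 2, 3, 4, 5, 6, 7, 8, 100]
filtered = iqr_filter(stars_gained)
```

Here Q1 = 3 and Q3 = 7, so the accepted range is [-3, 13] and the value 100 is removed.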
This contains the HN GH-AI repository metrics converted into relative 5 months from the Hacker News submission date for each story. Month 0 denotes the month of Hacker News submission, month 1 denotes the month after the Hacker News submission, and so on. No duplicated repositories are included as some stories contain the same repository URL.
Attribute | Description |
---|---|
by | Owner of Hacker News discussion |
discussion_id | ID of Hacker News discussion |
timestamp | Date and time of posting |
title | Discussion title |
url | Link of a GitHub repository that the discussion refers to |
[metric]_at_submission | Total value of the GitHub repository's [metric] at the time the Hacker News discussion was posted |
[metric]_at_month_[number] | Additional value of the GitHub repository's [metric] in month [number] after the Hacker News discussion was posted |
Attribute | Description |
---|---|
discussion_id | ID of Hacker News discussion |
title | Discussion title |
url | Link of a GitHub repository that the discussion refers to |
datetime | Date and time of posting |
story_sentiment | Sentiment of the discussion title (positive: 1, neutral: 0, negative: -1) |
story_sentiment_reason | Reason why the discussion title has such sentiment |
Attribute | Description |
---|---|
comment_id | ID of Hacker News comment |
commenter | User that posted the comment |
comment_text | Text or words of comment |
comment_datetime | Date and time that the comment has been posted |
story_id | ID of comment's parent discussion |
story_title | Title of comment's parent discussion |
url | Link of a GitHub repository that comment's parent discussion refers to |
story_date | Date and time that comment's parent discussion has been posted |
comment_sentiment | Sentiment of the comment (positive: 1, neutral: 0, negative: -1) |
comment_sentiment_reason | Reason why the comment has such sentiment |
Attribute | Description |
---|---|
discussion_id | ID of Hacker News discussion |
title | Discussion title |
url | Link of a GitHub repository that the discussion refers to |
datetime | Date and time of posting |
story_sentiment | Sentiment of the discussion title (positive: 1, neutral: 0, negative: -1) |
story_sentiment_reason | Reason why the discussion title has such sentiment |
average_comments_sentiment | Average sentiment value of the discussion's comments (story has no comment: -2) |
overall_comments_sentiment | Overall sentiment of the discussion as a whole (Positive: 1, Neutral: 0, Negative: -1, Story has no comment: -2) |
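The two comment-sentiment columns could be derived along the lines of the sketch below. The -2 sentinel for stories without comments comes from the table; the rule that maps the average to an overall label (the sign of the average, with an optional threshold) and the function name are assumptions.

```python
from collections import defaultdict

NO_COMMENT = -2  # sentinel for stories without comments, as documented in the table


def summarize_comment_sentiment(story_ids, comments, threshold=0.0):
    """Return average and overall comment sentiment per story."""
    scores = defaultdict(list)
    for c in comments:
        scores[c["story_id"]].append(c["comment_sentiment"])
    summary = {}
    for sid in story_ids:
        values = scores.get(sid)
        if not values:
            summary[sid] = {
                "average_comments_sentiment": NO_COMMENT,
                "overall_comments_sentiment": NO_COMMENT,
            }
            continue
        avg = sum(values) / len(values)
        # Assumed rule: classify by the sign of the average.
        overall = 1 if avg > threshold else -1 if avg < -threshold else 0
        summary[sid] = {
            "average_comments_sentiment": avg,
            "overall_comments_sentiment": overall,
        }
    return summary


# Hypothetical comment sentiments (not real data).
comments = [
    {"story_id": 1, "comment_sentiment": 1},
    {"story_id": 1, "comment_sentiment": 1},
    {"story_id": 1, "comment_sentiment": 0},
    {"story_id": 1, "comment_sentiment": -1},
    {"story_id": 2, "comment_sentiment": -1},
    {"story_id": 2, "comment_sentiment": 0},
]
summary = summarize_comment_sentiment([1, 2, 3], comments)
```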
Attribute | Description |
---|---|
by | Owner of Hacker News discussion |
discussion_id | ID of Hacker News discussion |
timestamp | Date and time of posting |
title | Discussion title |
url | Link of a GitHub repository that the discussion refers to |
[metric]_at_submission | Total value of the GitHub repository's [metric] at the time the Hacker News discussion was posted |
[metric]_at_month_[number] | Additional value of the GitHub repository's [metric] in month [number] after the Hacker News discussion was posted |
average_comments_sentiment | Average sentiment value of the discussion's comments |
overall_comments_sentiment | Sentiment value of the discussion as a whole (Positive: 1, Neutral: 0, Negative: -1) |
dist_[metric]([time_period1]-[time_period2]) | Raw change in [metric] between [time_period1] and [time_period2] after the Hacker News discussion was posted |
percent_[metric]([time_period1]-[time_period2]) | Percentage change in [metric] between [time_period1] and [time_period2] after the Hacker News discussion was posted |
sentiment | Text version of overall_comments_sentiment (Positive, Neutral, Negative) |
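The raw and percentage change columns can be illustrated by first accumulating the monthly gains on top of the value at submission and then comparing two points in time. This is a sketch under assumptions: the helper names are hypothetical, and returning None when the starting value is zero is one possible way to handle an undefined percentage.

```python
from itertools import accumulate


def cumulative_series(at_submission, monthly_gains):
    """Cumulative metric values at submission (index 0) and after each month."""
    return list(accumulate(monthly_gains, initial=at_submission))


def change(series, start, end):
    """Return (raw change, percentage change) between two time periods."""
    before, after = series[start], series[end]
    raw = after - before
    # Percentage change is undefined when the starting value is zero.
    percent = raw / before * 100 if before else None
    return raw, percent


# Hypothetical repository: 200 stars at submission, then five monthly gains.
stars = cumulative_series(200, [50, 25, 0, 10, 15])
print(stars)                # [200, 250, 275, 275, 285, 300]
print(change(stars, 0, 1))  # (50, 25.0)
print(change(stars, 0, 5))  # (100, 50.0)
```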