Created and maintained by Honglin Bao, summer 2021 @ Michigan State Department of Communication, Computational Communication Group. Contact: baohlcs@gmail.com
Computational social science research requires processing massive amounts of textual data, ranging from digital traces in social media research to publication data in Science of Science research. This GitHub repository provides an overview of the techniques most frequently used in computational social science (notably political communication) for handling textual data: scraping to obtain datasets, pre-processing to clean them, and finally, automatic classification.
I cover the following subjects:
- Scrapers: API-based or hand-built tools for collecting data from websites and social media platforms such as Twitter and YouTube (check out the corresponding folder; a minimal API sketch appears below).
- Binary classification of Twitter posts to infer their ideology (Republican or Democrat) (check out the corresponding folder).
- Multi-class classification of social media comments to determine their toxicity levels or sentiments (check out the corresponding folder; a baseline sketch covering both this and the binary task appears below).
- Several advanced techniques for unusual situations, such as insufficient text data or classes with imbalanced amounts of text (refer to the slides; an over-sampling sketch appears below).
- Model evaluation: which metrics should we consider when evaluating a machine learning model? (refer to the slides; a metrics sketch appears below).
- A brief introduction to some famous but heavyweight deep learning models that can achieve highly accurate text classification (refer to the slides; a transformer sketch appears below).
Nota bene: items 1, 2, and 3 are basic operations with accompanying code and detailed comments/explanations; items 4, 5, and 6 are more advanced subjects with a substantial body of literature. Please refer to the slides for details. The short sketches below illustrate one possible approach to each topic.
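As an illustration of item 1, here is a minimal sketch of an API-based scraper. It assumes a Twitter API v2 bearer token and queries the recent-search endpoint; the query string and requested fields are illustrative only, and the corresponding folder contains the complete scrapers.

```python
import requests

# Assumption: you have a Twitter API v2 bearer token (developer.twitter.com).
BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder, not a real credential
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def search_recent_tweets(query, max_results=10):
    """Fetch recent tweets matching `query` via the v2 recent-search endpoint."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {
        "query": query,
        "max_results": max_results,         # 10-100 tweets per request
        "tweet.fields": "created_at,lang",  # extra metadata to return
    }
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()
    return response.json().get("data", [])

if __name__ == "__main__":
    for tweet in search_recent_tweets("election lang:en"):
        print(tweet["created_at"], tweet["text"])
```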
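For items 2 and 3, a common baseline (not necessarily the exact model in the folders) is a bag-of-words pipeline: TF-IDF features feeding a linear classifier. The toy texts and labels below are made up; scikit-learn's LogisticRegression handles the multi-class case (sentiments, toxicity levels) with the same code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real tweets and labels come from the scraped datasets.
texts = [
    "cut taxes and shrink government",
    "protect the second amendment",
    "expand healthcare for all families",
    "invest in climate action now",
]
labels = ["republican", "republican", "democrat", "democrat"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# TF-IDF turns raw text into weighted word counts; logistic regression learns
# a linear decision boundary over those features. With more than two label
# values, the same pipeline performs multi-class classification.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(X_train, y_train)
print(model.predict(["lower taxes for small business"]))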
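Among the advanced techniques of item 4, imbalanced classes are often handled by over-sampling the minority class. Below is a minimal sketch with the imbalanced-learn package on made-up data; passing class_weight="balanced" to the classifier is a lighter alternative.

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, heavily imbalanced toy corpus: 4 "neutral" vs 1 "toxic".
texts = ["nice post", "great point", "well said", "thanks a lot", "you are an idiot"]
labels = ["neutral", "neutral", "neutral", "neutral", "toxic"]

X = TfidfVectorizer().fit_transform(texts)

# Randomly duplicate minority-class rows until both classes are the same size.
X_resampled, y_resampled = RandomOverSampler(random_state=0).fit_resample(X, labels)
print(Counter(labels), "->", Counter(y_resampled))
# Counter({'neutral': 4, 'toxic': 1}) -> Counter({'neutral': 4, 'toxic': 4})
```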
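For item 5, the metrics usually reported are accuracy, precision, recall, and F1 (macro-averaged in the multi-class case), plus the confusion matrix. A sketch with scikit-learn on made-up predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and model predictions for a 3-class sentiment task.
y_true = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu", "pos", "neg"]

# Per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred))

# Rows = true classes, columns = predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["pos", "neu", "neg"]))
```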
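Finally, for item 6, transformer models such as BERT tend to deliver the highest classification accuracy but are heavyweight to train. The sketch below sidesteps training entirely: it assumes the Hugging Face transformers package is installed and uses its default pretrained sentiment pipeline for inference; fine-tuning on your own labels is the heavier path discussed in the slides.

```python
from transformers import pipeline

# Downloads a default pretrained sentiment model on first use
# (assumption: internet access and the `transformers` package installed).
classifier = pipeline("sentiment-analysis")

comments = ["I love this community!", "This thread is a dumpster fire."]
for comment, result in zip(comments, classifier(comments)):
    print(comment, "->", result["label"], round(result["score"], 3))
```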
Acknowledgment: The Summer Institutes in Computational Social Science 2021 (https://sicss.io/)
Any contributions, discussions, and pull requests are welcome and appreciated.