The tutorial on scraping, processing, and classification of text-based digital trace data in Natural Language Processing and Computational Social Science.

hlbao/classification_in_CSS

Classification in Computational Social Science

Created and maintained by Honglin Bao, summer 2021 @ Michigan State Department of Communication, Computational Communication Group. Contact: baohlcs@gmail.com

Computational social science research requires processing massive amounts of textual data, ranging from digital traces for social media research to publication data for Science of Science research. This GitHub repository provides an overview of the techniques most frequently used in computational social science (notably political communication) for dealing with textual data: scraping to obtain datasets, pre-processing to clean the data, and finally, automatic classification.

I cover the following subjects:

  1. Scrapers: API-based or manually constructed tools for scraping websites and social media platforms such as Twitter and YouTube (see the corresponding folder).
  2. Binary classification of Twitter posts to infer their ideology (Republican or Democrat) (see the corresponding folder).
  3. Multi-class classification of social media comments to determine their degree of toxicity or their sentiment (see the corresponding folder).
  4. Several advanced techniques for difficult situations, such as insufficient text data or text data that is imbalanced across classes (see the slides).
  5. Model evaluation: which metrics should we consider when evaluating a machine learning model? (see the slides).
  6. A brief introduction to some well-known but heavyweight deep learning models that have the potential to achieve highly accurate text classification (see the slides).
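
To make subject 2 concrete, here is a minimal sketch of binary text classification using a multinomial Naive Bayes model with add-one smoothing, written in plain Python. It is not the code from this repository's folders: the function names (`tokenize`, `train_naive_bayes`, `predict`), the choice of Naive Bayes, and the four toy posts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines need richer cleaning."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        # Log prior: share of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy posts, for illustration only (not from the repository's data).
train_texts = [
    "cut taxes and secure the border",
    "lower taxes help small business",
    "expand healthcare and protect voting rights",
    "climate action and healthcare for all",
]
train_labels = ["republican", "republican", "democrat", "democrat"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "we must cut taxes"))
print(predict(model, "healthcare is a right"))
```

With real tweets, the same structure applies, but tokenization, vocabulary pruning, and a held-out test set matter far more than in this toy setting.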

Note that subjects 1, 2, and 3 are basic operations with accompanying code and detailed comments and explanations. Subjects 4, 5, and 6 are more advanced, each with a substantial body of literature; please refer to the slides for details.
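
As a small illustration of why subjects 4 and 5 go together, the sketch below computes accuracy, per-class precision/recall/F1, and macro-averaged F1 in plain Python. The metric choice, the helper names, and the toy labels are assumptions for illustration and are not taken from the slides. The example shows a classifier that always predicts the majority class on imbalanced data: accuracy looks good while the minority class is never detected.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true))
    scores = [per_class_metrics(y_true, y_pred, label)[2] for label in labels]
    return sum(scores) / len(scores)


# Invented toy labels: 9 non-toxic comments and 1 toxic comment.
y_true = ["ok"] * 9 + ["toxic"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["ok"] * 10

print(accuracy(y_true, y_pred))                    # 0.9
print(per_class_metrics(y_true, y_pred, "toxic"))  # (0.0, 0.0, 0.0)
print(macro_f1(y_true, y_pred))                    # roughly 0.47
```

Because macro-averaged F1 weights the rare class as heavily as the common one, it exposes the failure that accuracy hides here.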

Acknowledgment: The Summer Institutes in Computational Social Science 2021 (https://sicss.io/)

Contributions, discussion, and pull requests of any kind are welcome and appreciated.
