At the end of this course, students should know how to:
- Access and leverage data stored in formats which are commonly used outside of statistics (HTML, JSON, XML, PDF, APIs) and transform these data to formats which are used for statistical analysis.
  - Scrape data from the internet and assemble them into a "tidy" format for visualization and analysis.
  - Read in structured data from record-based formats (XML, JSON) and transform these data to a table-based format.
  - Use optical character recognition and other tools to extract data from a PDF file systematically.
  - Use an API to request data from an online service.
  - Implement data cleaning and quality control measures to ensure that data are read in correctly.
- Develop skills for visualization and communication of complex data using interactive graphics. You will be able to:
  - Determine when an interactive chart is preferable to a static chart.
  - Create an interactive chart using JavaScript-based tools such as Plotly, Observable.js, or Shiny.
  - Integrate your interactive chart into a report or web page, along with supportive text describing the chart and important findings.
- Understand and leverage data management tools for storing and manipulating data, including:
  - Identifying situations where an external database is preferable to working with data in-memory.
  - Accessing data stored in an external SQL database or in Parquet or Arrow format.
  - Discussing the trade-offs between different tools for data management and different approaches to data storage.
- Design an analysis strategy for large data which does not fit into computer memory by selecting from strategies such as sampling and split-apply-combine.
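
As a rough illustration of the last objective, the sketch below shows one way a split-apply-combine strategy might look when the data are read as a stream rather than loaded all at once. It is a minimal example under stated assumptions, not course material: the function name `chunked_group_means`, the column names `site` and `temp`, and the sample values are all hypothetical, and only the Python standard library is used.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def chunked_group_means(rows, key, value, chunk_size=2):
    """Compute per-group means while holding only one chunk of rows in memory."""
    totals = defaultdict(float)  # running sum of `value` per group
    counts = defaultdict(int)    # running row count per group
    rows = iter(rows)
    while True:
        # Split: pull the next chunk_size rows from the stream.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        # Apply: accumulate partial sums and counts for this chunk.
        for row in chunk:
            totals[row[key]] += float(row[value])
            counts[row[key]] += 1
    # Combine: turn the accumulated sums and counts into group means.
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("site,temp\nA,10\nB,20\nA,30\nB,60\nA,50\n")
means = chunked_group_means(csv.DictReader(sample), key="site", value="temp")
print(means)
```

Because only running sums and counts are kept, memory use depends on the number of groups and the chunk size rather than on the number of rows. Statistics that cannot be rebuilt from partial results (a median, for example) would need a different approach, such as sampling.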
See course-schedule.xlsx
- Configured in _quarto.yml
- Week-by-week files are built automatically (code in code/gen-week-files-from-course-schedule.R, data in course-schedule.xlsx)
- Syllabus uses course-schedule.xlsx for topics, with due dates and semester dates specified in sheets in the spreadsheet.