Lecture Zoom Link: https://uncsph.zoom.us/j/96603579699?pwd=Ny9RUHFrL2lNQUUxZEJ1UTNmVFQvdz09

Lab Zoom: https://uncsph.zoom.us/j/96708005184?pwd=SFNBQXpnVHNvNnZLMktxZjJsZkVsUT09

You will need the passwords to join; these are distributed via the Slack channel. You should have received an email invite to the Slack before the course started. Email me if you need help: toups@email.unc.edu
The goals of this course:
- Familiarity with data science tools such as R, Python, Git, Make, and Docker
- Good Data Science Practices
We will be covering the entire data science project lifetime, from data ingest through quality control and analysis to reporting. An emphasis will be placed on effectively communicating correct results (even when they are negative) and on giving feedback to colleagues.
To do these tasks effectively we will also focus heavily on using Git, Make and Docker.
I have both academic and "real world" experience as a scientist, software engineer, and data scientist.
By the end of the semester each student will have produced a portfolio including:
- A complete analysis in R demonstrating data wrangling, modeling, visualization.
- An interactive Shiny Dashboard
- A hybrid analysis using R, Python, Make and Docker (or Julia).
This year in particular it will be useful for our class to communicate online using our Slack Channel.
Fall 2020 classes are unfortunately weird because of COVID-19. Classes will be held Mondays and Wednesdays from 5:20 pm to 6:35 pm. Recitation/lab time will be Tuesdays from 3:00-4:00 pm. Labs will be a chance to work with me, in person or virtually, on material covered during lecture.
BIOS 611 will be Hyflex this semester. This means that we will hold classes in person and also broadcast them live (presumably over Zoom). Maximum occupancy and social distancing requirements mean that only some students will attend any given class in person, with others tuning in over Zoom and still others watching the recorded lectures at a different time.
NB - this is the first time I've taught this course. We might deviate from this syllabus.
Day | Time | Class Type | Subject | Materials | HW |
---|---|---|---|---|---|
Aug 10 (Monday) 2020 | 5:20-6:35 pm | Lecture | Intro and Demo | ||
Aug 11 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | Compute Resources | ||
Aug 12 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Compute Resources & R 1 | ||
Aug 17 (Monday) 2020 | 5:20-6:35 pm | Lecture | Programming Languages via R (R 2) & Datasets | [HW1](https://github.com/Vincent-Toups/datasci611/blob/master/homeworks/hw1.md) | |
Aug 18 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | Project 1 Setup | ||
Aug 19 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Linux and Bash … | ||
Aug 24 (Monday) 2020 | 5:20-6:35 pm | Lecture | Docker & Make & Party | ||
Aug 25 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | Setting up Our Project | ||
Aug 26 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Tidy Data & ggplot | ||
Aug 31 (Monday) 2020 | 5:20-6:35 pm | Lecture | Tidy Data & ggplot 2 | ||
Sep 01 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Sep 02 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Git Concepts and Practices | ||
Sep 07 (Monday) 2020 | 5:20-6:35 pm | ~ | Labor Day | ||
Sep 08 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Sep 09 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Relational Data Operations | ||
Sep 14 (Monday) 2020 | 5:20-6:35 pm | Lecture | Agile Data Science? | ||
Sep 15 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Sep 16 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Classification | ||
Sep 21 (Monday) 2020 | 5:20-6:35 pm | Lecture | Parameter Fitting and Optim | ||
Sep 22 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Sep 23 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Model Val & Char | ||
Sep 28 (Monday) 2020 | 5:20-6:35 pm | Lecture | The Dark Art of Clustering | ||
Sep 29 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Sep 30 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Best in Show: Gradient Boosting Machines | ||
Oct 05 (Monday) 2020 | 5:20-6:35 pm | Lecture | GBMs in Practice | ||
Oct 06 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Oct 07 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Project Presentations | ||
Oct 12 (Monday) 2020 | 5:20-6:35 pm | Lecture | Shiny Introduction | ||
Oct 13 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Oct 14 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Docker Recap and Shiny | ||
Oct 19 (Monday) 2020 | 5:20-6:35 pm | Lecture | Programming Languages and Python | ||
Oct 20 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Oct 21 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Pandas, Dplyr, SQL 1 | ||
Oct 26 (Monday) 2020 | 5:20-6:35 pm | Lecture | Pandas, Dplyr, SQL 2 | ||
Oct 27 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Oct 28 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Scikit Learn | ||
Nov 02 (Monday) 2020 | 5:20-6:35 pm | Lecture | A Taste of Neural Networks | ||
Nov 03 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Nov 04 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Regular Expressions | ||
Nov 09 (Monday) 2020 | 5:20-6:35 pm | Lecture | Data Science Ethics | ||
Nov 10 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Nov 11 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Virtual Panel w/ Data Scientists | ||
Nov 16 (Monday) 2020 | 5:20-6:35 pm | Lecture | Presentations | ||
Nov 17 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | |||
Grades will be based primarily on projects with the following steps:
- Students will submit an initial proposal “README” file describing the project
- Students will work individually to produce a first draft and submit it on GitHub
- Each student will review a handful of project drafts and provide thoughtful feedback
- Students will rate the quality of the feedback received from their peers
- Students will submit a final project draft
- Graders will review the project for high level organization and readability
- Students will give a short presentation about their project (only projects 1 and 3)
The grade will be based on 1) the quality of feedback provided to peers, 2) the grader’s review, and 3) the presentation.
Students will give feedback on other students' projects, and that feedback will be graded. Feedback should be succinct, relevant and actionable. It should cover:
- Does the project use tidyverse functions to keep code succinct, efficient and readable? Where could a tidyverse function be added to improve the code?
- Are the plots appropriate for the data types, the hypotheses being tested, and the points being communicated?
- How can the code be organized or documented more clearly?
- Is the purpose of the project communicated clearly?
- Is the source of the data made clear?
- Is the interpretation of figures clearly explained?
- Are the purpose and interpretation of analysis steps clearly communicated?
- Are overall take-home messages clearly communicated?
The nature of data science is that our results are often uninteresting and/or negative. This is not a problem with a project or presentation. If anything, communicating negative results is even more important, in practice, than communicating positive ones.
- Project 1: A “complete” analysis in R, demonstrating data wrangling, modeling, visualization and delivery using R Markdown.
- Project 2: An interactive dashboard built with Shiny.
- Project 3: A polyglot analysis using R, Python, Make and Docker.
Projects will be graded on the following:
- A project should be easily runnable by anyone who has Docker installed and checks out the Git repository.
- Git commits should be small and cover single changes to the code base after the initial phase of the project.
- The git repository shouldn't contain non-code artifacts. All results should be buildable from code and source data alone.
- The code should be organized and easy to understand at a high level.
- For project 1, the final result should be a PDF file generated via LaTeX or R Markdown that summarizes the results. For project 2, the result is a Shiny application.
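
The rule above that the repository should contain no non-code artifacts can be checked mechanically. The following is a minimal sketch in Python, not part of the course tooling. The extension list, the `source_data` directory name and the function name `find_artifacts` are all assumptions made for illustration; adjust them to your own project layout.

```python
from pathlib import PurePosixPath

# Hypothetical set of extensions that usually indicate generated output
# (reports, figures, serialized R objects) rather than code.
ARTIFACT_SUFFIXES = {".pdf", ".png", ".jpg", ".html", ".rds"}

# Hypothetical top-level directory where raw source data may be committed.
SOURCE_DATA_DIR = "source_data"


def find_artifacts(tracked_paths):
    """Return the tracked paths that look like generated artifacts."""
    flagged = []
    for raw in tracked_paths:
        path = PurePosixPath(raw)
        # Files under the source data directory are exempt from the check.
        in_source_data = bool(path.parts) and path.parts[0] == SOURCE_DATA_DIR
        if path.suffix.lower() in ARTIFACT_SUFFIXES and not in_source_data:
            flagged.append(raw)
    return flagged


if __name__ == "__main__":
    # Illustrative file list only; these names are made up.
    example = [
        "Makefile",
        "Dockerfile",
        "analysis.R",
        "report.pdf",
        "figures/plot.png",
        "source_data/raw.csv",
    ]
    for flagged_path in find_artifacts(example):
        print(flagged_path)
```

In practice the list of tracked paths could come from the output of `git ls-files`, one path per line. Anything the check flags should be produced by a Make target rather than committed.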