sag129/datasci611

 
 


Table of Contents

  1. 2020 BIOS 611 - Intro to Data Science
    1. Course Information
    2. About your Instructor
    3. Portfolio
    4. Slack
    5. Course Schedule
      1. Hyflex
      2. Course Schedule
    6. Projects
      1. Feedback
      2. Project 1
      3. Project 2
      4. Project 3
      5. Project Grading

2020 BIOS 611 - Intro to Data Science

Lecture Zoom link: https://uncsph.zoom.us/j/96603579699?pwd=Ny9RUHFrL2lNQUUxZEJ1UTNmVFQvdz09

Lab Zoom link: https://uncsph.zoom.us/j/96708005184?pwd=SFNBQXpnVHNvNnZLMktxZjJsZkVsUT09

You will need the passwords to join; they are distributed via the Slack channel. You should have received an email invite to the Slack before the course started. Email me if you need help: toups@email.unc.edu

Course Information

The goals of this course:

  1. Familiarity with data science tools such as R, Python, git, Make, and Docker
  2. Good Data Science Practices

We will cover the entire data science project lifetime, from data ingest through quality control, analysis, and reporting. We will emphasize communicating correct results effectively (even when they are negative) and giving feedback to colleagues.

To do these tasks effectively we will also focus heavily on using Git, Make and Docker.

About your Instructor

I have both academic and "real world" experience as a scientist, software engineer, and data scientist.

Portfolio

By the end of the semester each student will have produced a portfolio including:

  1. A complete analysis in R demonstrating data wrangling, modeling, visualization.
  2. An interactive Shiny Dashboard
  3. A hybrid analysis using R, Python, Make and Docker (or Julia).

Slack

This year in particular it will be useful for our class to communicate online using our Slack channel.

Course Schedule

Fall 2020 classes are unfortunately weird because of COVID-19. Classes will be held Mondays and Wednesdays from 5:20 pm to 6:35 pm. Recitation/lab time will be Tuesdays from 3:00 pm to 4:00 pm. Labs will be a chance to work with me, in person or virtually, on material covered during lecture.

Hyflex

BIOS 611 will be Hyflex this semester. This means that we will hold classes in person but also broadcast them live (presumably over Zoom). Maximum occupancy and social distancing requirements mean that only some students will attend any given class in person, with others tuning in over Zoom and still others watching the recorded lectures at a different time.

Course Schedule

NB - this is the first time I've taught this course. We might deviate from this syllabus.

| Day | Time | Class Type | Subject | Materials | HW |
|-----|------|------------|---------|-----------|----|
| Aug 10 (Monday) 2020 | 5:20-6:35 pm | Lecture | Intro and Demo | | |
| Aug 11 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | Compute Resources | | |
| Aug 12 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Compute Resources & R 1 | | |
| Aug 17 (Monday) 2020 | 5:20-6:35 pm | Lecture | Programming Languages via R (R 2) & Datasets | | [HW1](https://github.com/Vincent-Toups/datasci611/blob/master/homeworks/hw1.md) |
| Aug 18 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | Project 1 Setup | | |
| Aug 19 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Linux and Bash … | | |
| Aug 24 (Monday) 2020 | 5:20-6:35 pm | Lecture | Docker & Make & Party | | |
| Aug 25 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | Setting up Our Project | | |
| Aug 26 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Tidy Data & ggplot | | |
| Aug 31 (Monday) 2020 | 5:20-6:35 pm | Lecture | Tidy Data & ggplot 2 | | |
| Sep 01 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Sep 02 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Git Concepts and Practices | | |
| Sep 07 (Monday) 2020 | 5:20-6:35 pm | ~ | Labor Day | | |
| Sep 08 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Sep 09 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Relational Data Operations | | |
| Sep 14 (Monday) 2020 | 5:20-6:35 pm | Lecture | Agile Data Science? | | |
| Sep 15 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Sep 16 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Classification | | |
| Sep 21 (Monday) 2020 | 5:20-6:35 pm | Lecture | Parameter Fitting and Optim | | |
| Sep 22 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Sep 23 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Model Val & Char | | |
| Sep 28 (Monday) 2020 | 5:20-6:35 pm | Lecture | The Dark Art of Clustering | | |
| Sep 29 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Sep 30 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Best in Show: Gradient Boosting Machines | | |
| Oct 05 (Monday) 2020 | 5:20-6:35 pm | Lecture | GBMs in Practice | | |
| Oct 06 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Oct 07 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Project Presentations | | |
| Oct 12 (Monday) 2020 | 5:20-6:35 pm | Lecture | Shiny Introduction | | |
| Oct 13 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Oct 14 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Docker Recap and Shiny | | |
| Oct 19 (Monday) 2020 | 5:20-6:35 pm | Lecture | Programming Languages and Python | | |
| Oct 20 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Oct 21 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Pandas, Dplyr, SQL 1 | | |
| Oct 26 (Monday) 2020 | 5:20-6:35 pm | Lecture | Pandas, Dplyr, SQL 2 | | |
| Oct 27 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Oct 28 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Scikit Learn | | |
| Nov 02 (Monday) 2020 | 5:20-6:35 pm | Lecture | A Taste of Neural Networks | | |
| Nov 03 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Nov 04 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Regular Expressions | | |
| Nov 09 (Monday) 2020 | 5:20-6:35 pm | Lecture | Data Science Ethics | | |
| Nov 10 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |
| Nov 11 (Wednesday) 2020 | 5:20-6:35 pm | Lecture | Virtual Panel w/ Data Scientists | | |
| Nov 16 (Monday) 2020 | 5:20-6:35 pm | Lecture | Presentations | | |
| Nov 17 (Tuesday) 2020 | 3:00-4:00 pm | Lab/Recitation | | | |

Projects

Grades will be based primarily on projects with the following steps:

  1. Students will submit an initial proposal “README” file describing the project
  2. Students will work individually to produce a first draft and submit it on Github
  3. Each student will review a handful of project drafts and provide thoughtful feedback
  4. Students will rate the quality of the feedback received from their peers
  5. Students will submit a final project draft
  6. Graders will review the project for high level organization and readability
  7. Students will give a short presentation about their project (only projects 1 and 3)

The grade will be based on 1) the quality of feedback provided to peers, 2) the grader’s review, and 3) the presentation.

Feedback

Students will give feedback on other students' projects, and that feedback will be graded. Feedback should be succinct, relevant and actionable. It should cover:

  1. Does the project use tidyverse functions to keep code succinct, efficient and readable? Where could a tidyverse function be added to improve the code?
  2. Are the plots appropriate for the data types, the hypotheses being tested, and the points being communicated?
  3. How can the code be organized or documented more clearly?
  4. Is the purpose of the project communicated clearly?
  5. Is the source of the data made clear?
  6. Is the interpretation of figures clearly explained?
  7. Is the purpose and interpretation of analysis steps clearly communicated?
  8. Are overall take-home messages clearly communicated?

The nature of data science is that our results are often uninteresting and/or negative. This is not a problem with a project or presentation. If anything, communicating negative results is even more important, in practice, than communicating positive ones.

Project 1

A “complete” analysis in R, demonstrating data wrangling, modeling, visualization and delivery using R Markdown.

Project 2

An interactive dashboard built with Shiny.

Project 3

A polyglot analysis using R, Python, Make and Docker.

Project Grading

Projects will be graded on the following:

  1. A project should be easily runnable by anyone who has Docker installed and checks out the git repository.
  2. Git commits should be small and cover single changes to the code base after the initial phase of the project.
  3. The git repository shouldn't contain non-code artifacts. All results should be buildable from code and source data alone.
  4. The code should be organized and easy to understand at a high level.
  5. For Project 1 the final result should be a PDF file generated via LaTeX or R Markdown that summarizes the results. For Project 2 the result is a Shiny application.
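
As a rough illustration of item 3 above, the sketch below flags tracked files that look like generated output rather than code or source data. It is not part of the course materials: the function name `find_artifacts` and the extension list are assumptions made for this example, and a real project would adjust the list to its own outputs.

```python
from pathlib import PurePosixPath

# Hypothetical set of extensions treated as generated artifacts.
# This is an assumption for illustration; adjust it per project.
ARTIFACT_EXTENSIONS = {".pdf", ".png", ".jpg", ".rds", ".pkl"}


def find_artifacts(tracked_paths):
    """Return the tracked paths whose extension marks them as generated output."""
    return [
        path
        for path in tracked_paths
        if PurePosixPath(path).suffix.lower() in ARTIFACT_EXTENSIONS
    ]


if __name__ == "__main__":
    # Example listing of tracked files, written by hand for this sketch.
    example = [
        "Makefile",
        "Dockerfile",
        "analysis.R",
        "report.Rmd",
        "report.pdf",
        "figures/plot.png",
    ]
    for path in find_artifacts(example):
        print("generated artifact tracked in git:", path)
```

In practice the list of paths could come from the output of `git ls-files`, one path per line. Files flagged this way are candidates for removal from the repository, since they should be rebuilt from code and source data.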

About

Materials for Principles of Data Science BIOS 611
