Topic: GitHub
In this assignment you will practice the basics of working with project repos using GitHub and the GitHub desktop app. Specifically, you will fork this repo, clone it to your computer, contribute to it, push changes from your local repo (your computer) to your upstream branch (your forked copy on GitHub), and, finally, submit a pull request to merge your contribution with the master repo (i.e., the professor’s copy).
Assigned: Week 1
Due: Monday, 02/03 before 10pm
- If you have not already done so, fork this repo and clone it to your computer.
- In your local copy create a personal ‘dropbox’
  - create an empty folder named “lastname_firstname” (ex. `casillas_joseph`). Put it inside `misc > students`.
  - create another README.md file and place it inside your personal folder. Include the following info:
    - Your name
    - Your email
    - Your personal website if you have one
    - A goal you have for this class
- Create another folder. Name it `summaries` and place it inside your personal dropbox folder. Next, read Wickham, 2015. Create a file called wickham_2015_summary.md and write a 4 sentence summary about the article. Save this file in `summaries` inside your personal dropbox (the lastname_firstname folder you just made).
- Read R4DS Preface - Ch. 2 (pp. ix-41). Do all examples included in the text as you read (nothing to turn in).
- Read QML Ch. 1 (pp. 1-33). Do the R examples included in the text as you read (nothing to turn in).
- Commit the changes to your upstream branch, i.e., your copy of the
repo on github.com. Check your repo on GitHub to make sure it
worked, and then submit a pull-request. It should include the
following…
- your dropbox folder (lastname_firstname)
- a README.md file
- a summaries folder (inside your dropbox)
- your Wickham (2015) summary (wickham_2015_summary.md)
This is programming assignment 1 of 4. It is worth 10 of the 40 possible points. In order to receive full credit you must complete steps 1-6 above and follow all the instructions.
Task | Points |
---|---|
Create a dropbox folder | 1 |
Include a README.md | 1 |
Create a summaries folder | 1 |
Include a summary of Wickham (2015) | 2 |
Successfully submit a pull request | 5 |
Total | 10 |
This is how the file structure currently looks:
programming_assignments
│
├── README.md
└── misc
    └── students
        ├── README.md
        └── lastname_firstname
            ├── README.md
            └── summaries
                └── wickham_2015_summary.md
Take a look inside the `lastname_firstname` folder if you need an example (this is highly recommended). Your personal dropbox should look exactly the same, but with your information, summaries, etc. In other words, you will add a folder inside `students` that looks like this (I am using my name, you will use your name):
casillas_joseph
│
├── README.md
└── summaries
    └── wickham_2015_summary.md
Remember to check the GitHub setup tutorial if you need help pushing your changes and submitting a pull request.
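If you prefer the R console to the file browser, the dropbox layout described above can be sketched in a few lines of base R. This is only an illustration: `repo_root` is a hypothetical stand-in (a temporary folder here) for wherever you cloned your fork, and `casillas_joseph` should be replaced with your own name.

```r
# Hypothetical stand-in for the folder where your fork was cloned.
# A temporary directory is used so the sketch can run anywhere.
repo_root <- file.path(tempdir(), "programming_assignments")

# Personal dropbox: misc > students > lastname_firstname
dropbox <- file.path(repo_root, "misc", "students", "casillas_joseph")

# recursive = TRUE also creates misc/ and students/ if they are missing
dir.create(file.path(dropbox, "summaries"), recursive = TRUE, showWarnings = FALSE)

# Empty placeholder files; fill them in with a text editor afterwards
file.create(file.path(dropbox, "README.md"))
file.create(file.path(dropbox, "summaries", "wickham_2015_summary.md"))

# List everything that was created, relative to the dropbox folder
list.files(dropbox, recursive = TRUE)
```

Keep in mind that Git does not track empty folders, so a folder only shows up on GitHub once it contains at least one file.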
Topics: RMarkdown, ggplot
In this assignment you will practice the basics of using .Rmd files to create dynamic, reproducible reports in .docx, .pdf, or .html format. Moreover, you will show your data visualization skills using `ggplot2`.
Assigned: Week 3
Due: Monday, 2/17 before 10pm
- Fetch the latest updates in `programming_assignments` to your forked branch and pull the changes to your local copy (i.e., the copy on your computer). Review the GitHub setup tutorial if you need a refresher.
- Create a folder, `pa2`, inside your personal dropbox (see note 1).
- In RStudio create an RMarkdown file called `README.Rmd` and put it inside the `pa2` folder.
- Use the following information for the yaml front matter:
title: "Programming assignment 2"
author: "Your name"
date: "Last update: `r Sys.time()`"
output:
  html_document:
    highlight: kate
    keep_md: yes
    theme: united
- Install and load the `languageR` package from CRAN.
- Familiarize yourself with three of the following five datasets:
  - beginningReaders
  - danish
  - dativeSimplified
  - english
  - spanishFunctionWords
- Inside your `README.Rmd` file you will generate 3 different plots using `ggplot2`. You must use 3 of the aforementioned datasets (a different dataset for each plot). All plots must include informative x- and y-axis labels and a title. The plots you must create are:
  - A bivariate scatterplot
  - A boxplot with different fill colors
  - A plot of your choice that includes a `stat_summary` and a facet.
- Commit the changes in your dropbox to your upstream branch, i.e., your forked copy on github. Check your repo on GitHub.com to make sure it worked (notice anything cool when you check the `pa2` folder?), and then submit a pull-request to the `ds4ling/programming_assignments` main repo. It should include the following…
  - your `pa2` folder
  - your `README.Rmd` RMarkdown file (and probably a few others)
  - 3 plots created using `ggplot`
This is programming assignment 2 of 4. It is worth 10 of the 40 possible points. In order to receive full credit you must complete all the steps above and follow all the instructions.
Task | Points |
---|---|
Create a pa2 folder | 0.5 |
Create a README.Rmd file | 0.5 |
Use correct yaml front matter | 1 |
Generate 3 specified plots | 6 |
Successfully submit a pull request | 2 |
Total | 10 |
Review Ch. 1 of R4DS for help with `ggplot`. DO NOT copy the plots directly from the book or the internet (I’ll know). Review the GitHub setup tutorial, especially if you are struggling with git-specific terminology. Pay special attention to file names, letter case, etc. in order to get the appropriate results.
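As a reminder of how the pieces fit together, here is a minimal sketch of a plot that combines `stat_summary` with a facet and includes axis labels and a title. It deliberately uses `mtcars`, which ships with R, instead of a `languageR` dataset, so the variable names and labels are placeholders and not a model answer.

```r
library(ggplot2)

# mtcars ships with R and stands in for a languageR dataset here
cars_df <- mtcars
cars_df$cyl <- factor(cars_df$cyl)
cars_df$am  <- factor(cars_df$am, labels = c("Automatic", "Manual"))

p <- ggplot(cars_df, aes(x = cyl, y = mpg)) +
  # Plot the mean of each group with +/- 1 standard error
  stat_summary(fun.data = mean_se, geom = "pointrange") +
  # One panel per transmission type
  facet_wrap(~ am) +
  labs(
    title = "Fuel economy by number of cylinders",
    x = "Cylinders",
    y = "Miles per gallon"
  )

p
```

Swapping `mean_se` for another summary function, or `facet_wrap()` for `facet_grid()`, follows the same pattern.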
This is more or less how your dropbox should look (Note: your pa2 folder will have a little more detail than what I have described here. That is fine):
casillas_joseph
│
├── README.md
├── summaries
│   └── r4ds_ch1_summary.md
└── pa2
    └── README.Rmd
1: Note: you might have noticed that your current copy of `programming_assignments` now includes the dropbox folders of your classmates. This is on purpose. You are encouraged to review your classmates' assignments. You can learn from them and you will notice that it is possible to solve data science problems using a variety of different methods. That being said, you should only review the work of your classmates after the assignment has been turned in.
Topics: Project management, Tidying data, GitHub Pages
In this assignment you will create your own RStudio project in which you get, tidy, transform and plot data from a publicly available dataset. You will host your project in a GitHub repo and create a project website.
Assigned: Week 5, 02/24
Due: Monday, 03/03 before 10pm
Choose any data set you want from the `languageR`, untidydata, or worldlanguages packages (it can be the same one you used last week, but if you prefer something different get permission first). To see all the options, run the following code in RStudio:
data(package = "languageR")
data(package = "untidydata")
data(package = "worldlanguages")
or check the documentation on the package website (note: you may need to install the package first).
- Create a new repo from GitHub.com called `pa3` and clone it to your desktop.
- Create a new project for your repo using RStudio.
- Inside your new project, create an RMarkdown document called `index.Rmd` (the default output format should be html).
- Load the data set of your choice and get information about its structure (remember all code needs to be inside a knitr code chunk).
- Tidy the data set (every variable gets a column, every observation occupies a single row), if necessary.
- Calculate descriptive statistics of your choice.
- Select two continuous variables and fit a model to the data (bivariate regression).
- Generate a plot that includes a regression line.
- Write up some general observations (1-2 paragraphs max)
- Commit your changes and push them to GitHub.
- Publish your repo using GitHub Pages.
- Update your fork of the `programming_assignments` repo. Next, create a new folder inside your dropbox in `programming_assignments` called `pa3`. Include a README.md file with a link to your published pa3 website. Submit a pull request to the master `programming_assignments` repo.
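The exploratory steps listed above (inspect the structure, describe, model, plot) might look roughly like the sketch below. It uses the built-in `cars` dataset as a stand-in, so the variable names are not from any of the course packages, and `cars` is already tidy, so no tidying step is shown. In your `index.Rmd` each step would go in its own commented knitr chunk.

```r
library(ggplot2)

# `cars` ships with R: two continuous variables, speed and stopping distance
str(cars)

# Descriptive statistics for both variables
desc <- data.frame(
  variable = c("speed", "dist"),
  mean     = c(mean(cars$speed), mean(cars$dist)),
  sd       = c(sd(cars$speed), sd(cars$dist))
)
desc

# Bivariate regression: stopping distance as a function of speed
mod <- lm(dist ~ speed, data = cars)
summary(mod)

# Scatterplot with the fitted regression line
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Stopping distance as a function of speed",
    x = "Speed (mph)",
    y = "Stopping distance (ft)"
  )
p
```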
This is programming assignment 3 of 4. It is worth 10 of the 40 possible points. In order to receive full credit you must complete all steps in Setup, EDA, and Share detailed above, and follow all the instructions. Moreover, steps 1-5 in EDA must be completed in separate code chunks, you must comment every step in your code, and you MUST knit your project before submitting.
Task | Points |
---|---|
Tidy data | 2 |
Descriptive stats | 0.5 |
Plot data | 1 |
Fit bivariate regression | 1 |
Publish to GitHub Pages | 5 |
Successfully submit pull request | 0.5 |
Total | 10 |
- Review the RStudio Projects tutorial to refresh your memory.
- Review the recommended readings for tips on tidying your data.
- Only submit a pull request to `programming_assignments` once everything is working properly in your repo.
- Always include a README in your repos.
- Make sure you look at the output after knitting. Is it clean? Make it look good (i.e., don’t type everything in bold!).
- Use slack to ask questions
Topics: Project management, tidying data, HTML Presentations
In this assignment you will create an RStudio project in which you get, tidy, transform, analyze and plot data from a publicly available dataset. You will host your project in a GitHub repo and create HTML slides to present your analysis.
Assigned: 03/24
Due: Monday, 03/31 before 10pm
- Create a new project in RStudio called `pa4`. Inside your new project, create three folders: `data_raw`, `data_tidy`, and `slides`.
- Download the dataset available here and store it in `data_raw`.
- Generate an HTML presentation using xaringan. Save the RMarkdown file as `index.Rmd` inside of the folder `slides`.
- Load the dataset from inside your `index.Rmd` file.
- Tidy the data and save the tidy version as a .csv file in `data_tidy`.
- Provide a table of descriptive statistics.
- Make a boxplot of center of gravity as a function of phoneme. In another slide, plot skewness as a function of phoneme. Use a statistical transformation (i.e., not a boxplot, but rather `stat_summary()`).
). - Fit a model that examines center of gravity as a function of skewness for the [s] segments (hint: you will have to transform the data). Make a table of the model summary.
- Make a scatter plot that illustrates the relationship in (8).
- Check model diagnostics (make plots).
- Write up the results (as if it were for a journal article).
- In a new slide, load the `assumptions.csv` dataset. Make a scatterplot. Explain in a few sentences why it would not be appropriate to fit a model to this data.
- Host your project in a GitHub repo called `pa4`.
- Turn the slides into a website using GitHub Pages.
- Inside your dropbox in `programming_assignments`, create a folder called `pa4` that includes a `README.md` file with a link to your slides.
- Push changes to your forked version of `programming_assignments` and submit a pull request to the master `programming_assignments` repo in ds4ling.
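The project layout and the "tidy, then save to `data_tidy`" step can be sketched as follows. Everything in the example is made up for illustration: `proj` is a temporary folder standing in for your `pa4` project, the `wide` data frame and its column names are hypothetical, and the output file name is a placeholder. The sketch assumes the `tidyr` package is installed.

```r
library(tidyr)

# Hypothetical stand-in for the pa4 project root (a temporary folder here)
proj <- file.path(tempdir(), "pa4")
for (d in c("data_raw", "data_tidy", "slides")) {
  dir.create(file.path(proj, d), recursive = TRUE, showWarnings = FALSE)
}

# Made-up wide data: one row per participant, one column per condition
wide <- data.frame(
  id     = c("p1", "p2", "p3"),
  cond_a = c(1.2, 0.8, 1.5),
  cond_b = c(2.1, 1.9, 2.4)
)

# Tidy: every variable gets a column, every observation gets a row
tidy <- pivot_longer(
  wide,
  cols      = c(cond_a, cond_b),
  names_to  = "condition",
  values_to = "score"
)

# Save the tidy version in data_tidy; the raw data stay untouched
tidy_path <- file.path(proj, "data_tidy", "example_tidy.csv")
write.csv(tidy, tidy_path, row.names = FALSE)
```

Keeping raw and tidy files in separate folders means the original download is never overwritten and the tidying step can be rerun at any time.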
This is programming assignment 4 of 4. It is worth 10 of the 40 possible points. In order to receive full credit you must complete steps 1-16 above and follow all the instructions.
Task | Points |
---|---|
Tidy data | 2 |
Descriptive stats | 0.5 |
Plot data | 1 |
Fit a model | 1 |
Assess and interpret model | 2 |
Publish HTML slides using GitHub Pages | 2 |
Complete question 12 | 1 |
Successfully submit pull request | 0.5 |
Total | 10 |
- Follow every instruction step-by-step.
- Worry about tidying your data, fitting your models, making plots, etc., before you worry about making the presentation (i.e., making each individual slide). It might be a good idea to use an R script first, as we have done in class, and then turn it into a presentation.
- Search for help when you get stuck: use Stack Overflow and Slack.
- Think of this PA as a practice exam. Use all of the skills you have developed up to this point in the class.
- Review fetching changes in GitHub (to update your programming_assignments repo) and submitting pull-requests (to ‘turn in’ your assignment).
Topics: Project management, tidying data, fitting linear models, testing hypotheses, reporting results.
In this assignment you will create an RStudio project in which you load, tidy, transform, plot, analyze and report data. You will host your project on GitHub in a personal repo and create a report to present your analysis.
Assigned: 04/24
Due: 04/28 by 12:00 pm
You will receive a `ratings` data set. You already know about the data, but I will briefly describe it anyway. The data set consists of `enjoyment` and `difficulty` ratings provided by current and former students in the ds4ling class. Each week at the beginning of class the students provided an assessment of their perceived difficulty of the material and their overall enjoyment of the class from the previous week. There are a total of 11 weeks of data from 2 sections of the class (2023, 2025). The students used a sliding rating scale that ranged from 0 to 1 (0 = no enjoyment, no difficulty; 1 = max enjoyment, max difficulty).
Your task is to explore two of the following three research questions:
- Q1: Is there a difference in overall enjoyment between the 2023 class and the 2025 class?
- Q2: How do difficulty ratings change over time (i.e., within a semester)?
- Q3: What is the nature of the relationship between perceived difficulty and enjoyment?
The data set poses several non-trivial challenges. It is incomplete in several ways. Not every student provided ratings for each week. Some students did not use unique identifiers, thus it is not always possible to determine who a given observation comes from. You will need to take these issues into account when answering the aforementioned research questions. You must explain and justify all decisions you make.
You can complete this project independently or with one other person from class. If you work with somebody, you complete one project (one repo) and share the final grade.
- Get the latest version of `programming_assignments`, i.e., fetch the newest changes to update your local repo.
- Create a new project in RStudio called `pa5`. Inside your new project, create two folders: `data_raw` and `data_clean`.
- Download the data set available here and store it in `data_raw`.
- Create a new RMarkdown file called `index.Rmd` (‘index’ is not capitalized) and save it at the root level of your project (i.e., inside `pa5`). The output of the Rmd file can be word, pdf or html. You can use the `papaja` package to generate an APA formatted manuscript if you’d like. You can use `xaringan` to create html slides. You decide, but only pick one. Be sure to give an informative title and to include your name(s).
- Load the dataset from inside your `index.Rmd` file. Pay special attention to the path. Don’t forget where the .csv file lives.
- You will need to tidy the data set as necessary to run your models and plot your data. Keep in mind the principles of tidy data. You may need to format the data in different ways depending on what you are trying to achieve. Remember to use sections (#), subsections (##), text, comments, etc. to explain in prose every step.
Recall that this research question aims to assess whether there is a difference in enjoyment ratings between the 2023 class and the 2025 class. If you choose this question, you must do the following:
- Tidy the data set as necessary and provide a table of relevant descriptive statistics. You decide what is relevant based on the variables you have and the research question. Include an explanation in prose of any observations you make from the table. Be sure to print the table in a way that will show up in your knitted document (see previous examples from class).
- Create an informative plot of the data. You only get one plot, so make it count. Keep in mind the types of variables that you have, particularly those that are relevant to this specific question. Interpret the plot (in prose).
- Decide on a model you can use to answer the research question to the best of your abilities.
- Print a summary of the model and test that the model assumptions have been met (you can use plots for this, but you are not required to include them in the final version of your assignment).
- Write up the results. You should include (1) a description of the statistical analyses you have done in one paragraph, and (2) the actual interpretation of the results in another (see class slides for examples). Don’t forget to include an overall assessment of goodness of fit (variance explained).
- Important: In a separate paragraph discuss model assumptions. Did you violate any important assumptions? What decisions did you have to make in order to arrive at an answer? What are the advantages and disadvantages of the decisions you made?
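For the assumption checks and the goodness-of-fit statement mentioned above, the following base R sketch shows one common way to look at residuals and to extract the variance explained. The data are simulated and the variable names `x` and `y` are placeholders, so nothing here reflects the actual ratings data or the model you should choose.

```r
set.seed(123)

# Simulated data with a linear relationship plus noise (made-up numbers)
sim <- data.frame(x = runif(50))
sim$y <- 0.3 + 0.4 * sim$x + rnorm(50, sd = 0.1)

mod <- lm(y ~ x, data = sim)

# Residuals vs. fitted values: look for no pattern and constant spread
plot(fitted(mod), resid(mod), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest normal residuals
qqnorm(resid(mod))
qqline(resid(mod))

# Proportion of variance explained (R-squared)
r2 <- summary(mod)$r.squared
r2
```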
This question aims to better understand if/how perceived difficulty ratings change over the course of the semester. If you choose this question, you must do the following:
- Establish an a priori hypothesis about how difficulty ratings will change over time.
- Tidy the data set as necessary and provide a table of relevant descriptive statistics. You decide what is relevant based on the variables you have and the research question. Include an explanation in prose of any observations you make from the table. Be sure to print the table in a way that will show up in your knitted document (see previous examples from class).
- Create an informative plot of the data. You only get one plot, so make it count. Keep in mind the types of variables that you have, particularly those that are relevant to your specific hypothesis. Interpret the plot (in prose).
- Decide on a model you can use to answer the research question to the best of your abilities. You should use a nested model and an inclusive model to test for the main effect of time. You will need to report the main effect using a nested model comparison and then parameter estimates from the final model.
- Print a summary of the model and test that the model assumptions have been met (you can use plots for this, but you are not required to include them in the final version of your assignment).
- Write up the results. You should include (1) a description of the statistical analyses you have done in one paragraph, and (2) the actual interpretation of the results in another (see class slides for examples). Don’t forget to include an overall assessment of goodness of fit (variance explained).
- Important: In a separate paragraph discuss model assumptions. Did you violate any important assumptions? What decisions did you have to make in order to arrive at an answer? What are the advantages and disadvantages of the decisions you made?
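The nested model comparison described above can be sketched like this. The data are simulated, and the column names `week` and `difficulty` are assumptions made for the example, so they may not match the real data set. The comparison tests the main effect of time, and the parameter estimates are then read from the summary of the inclusive model.

```r
set.seed(42)

# Simulated ratings: 10 made-up observations for each of 11 weeks
sim <- data.frame(week = rep(1:11, times = 10))
sim$difficulty <- 0.4 + 0.02 * sim$week + rnorm(nrow(sim), sd = 0.1)

# Nested (intercept-only) model and inclusive model with the time predictor
mod_null <- lm(difficulty ~ 1, data = sim)
mod_week <- lm(difficulty ~ week, data = sim)

# Nested model comparison: does adding `week` improve the fit?
cmp <- anova(mod_null, mod_week)
cmp

# Parameter estimates from the inclusive model
summary(mod_week)
```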
In Q3 you will assess the relationship between enjoyment and difficulty ratings. If you choose this question, you must do the following:
- Establish an a priori hypothesis about the relationship between difficulty and enjoyment.
- Tidy the data set as necessary and provide a table of relevant descriptive statistics. You decide what is relevant based on the variables you have and the research question. Include an explanation in prose of any observations you make from the table. Be sure to print the table in a way that will show up in your knitted document (see previous examples from class).
- Create an informative plot of the data. You only get one plot, so make it count. Keep in mind the types of variables that you have, particularly those that are relevant to your specific hypothesis. Interpret the plot (in prose).
- Decide on a model you can use to answer the research question to the best of your abilities.
- Print a summary of the model and test that the model assumptions have been met (you can use plots for this, but you are not required to include them in the final version of your assignment).
- Write up the results. You should include (1) a description of the statistical analyses you have done in one paragraph, and (2) the actual interpretation of the results in another (see class slides for examples). Don’t forget to include an overall assessment of goodness of fit (variance explained).
- Important: In a separate paragraph discuss model assumptions. Did you violate any important assumptions? What decisions did you have to make in order to arrive at an answer? What are the advantages and disadvantages of the decisions you made?
- Once you have completed two of the three research questions, make sure your document will knit successfully. Read that last sentence again.
- Host your project in a private GitHub repo (call it `pa5`), and share it with me (jvcasillas). You can share a private repo from the settings tab on github.com (essentially you add me as a contributor). You do not need to make it a website.
- Create a new folder in your `programming_assignments` dropbox folder. Include a `README.md` file with a link to your repo (just the repo, not a website).
- Celebrate. You survived. So far.
This is programming assignment 5. There are a total of 25 possible points. In order to receive full credit you must complete all of the steps described above following all instructions.
Task | Points |
---|---|
Create pa5 project with correct structure | 1.0 |
Create index.Rmd with title and author info | 1.0 |
Complete 1 of the 3 RQs | 10.0 |
Complete 2 of the 3 RQs | 10.0 |
Create private repo | 1.0 |
Successfully submit pull request | 2.0 |
Total | 25.0 |
The breakdown for the 10 points of the research questions is as follows:
Task | Points |
---|---|
Tidy the data as necessary | 1.0 |
Create table of relevant descriptive stats | 1.0 |
Generate an informative plot of the data and accurately describe it | 2.0 |
Fit model(s) | 2.0 |
Write up results for publication | 2.0 |
Discuss model assumptions | 2.0 |
Bonus points: You can earn up to two bonus points if you do something meaningful with the qualitative data (i.e., the `comments` column).
- Make use of markdown syntax. Include appropriate sections, subsections, etc.
- Comment all of your code. If you run into problems, explain what you are trying to do. If you find help on the internet, from ChatGPT, or in slides from class, include a link in your comments.
- You get lifelines. Talk to me if you get stuck.
PA5 answers will be available here at a later date.