Update README and Getting Started vignette to focus on mark_*() functions

JeffreyRStevens · JeffreyRStevens · commit 12b4ad2b9622 · 2021-10-13T17:11:04.000-05:00
diff --git a/README.Rmd b/README.Rmd
@@ -41,9 +41,9 @@ devtools::install_github("jeffreyrstevens/excluder")
 ## Verbs
 This package provides three primary verbs:
 
+* `mark` functions add a new column to the original data frame that labels the rows meeting the exclusion criteria. This is useful to label the potential exclusions for future processing without changing the original data frame.
 * `check` functions search for the exclusion criteria and output a message with the number of rows meeting the criteria and a data frame of the rows meeting the criteria. This is useful for viewing the potential exclusions.
 * `exclude` functions remove rows meeting the exclusion criteria. This is safest to do after checking the rows to ensure the exclusions are correct.
-* `mark` functions add a new column to the original data frame that labels the rows meeting the exclusion criteria. This is useful to label the potential exclusions for future processing without changing the original data frame.
 
 ## Exclusion types
 This package provides seven types of exclusions based on Qualtrics metadata. If you have ideas for other metadata exclusions, please submit them as [issues](https://github.com/jeffreyrstevens/excluder/issues). Note, the intent of this package is not to develop functions for excluding rows based on survey-specific data but on general, frequently used metadata.
@@ -59,14 +59,35 @@ This package provides seven types of exclusions based on Qualtrics metadata. If
 
 ## Usage
 
-The verbs and exclusion types combine with `_` to create the functions, such as [`check_duplicates()`](https://jeffreyrstevens.github.io/excluder/reference/check_duplicates.html), [`exclude_ip()`](https://jeffreyrstevens.github.io/excluder/reference/exclude_ip.html), and [`mark_duration()`](https://jeffreyrstevens.github.io/excluder/reference/mark_duration.html). Multiple functions can be linked together using the [`{magrittr}`](https://magrittr.tidyverse.org/) pipe `%>%`. For datasets downloaded directly from Qualtrics, use [`remove_label_rows()`](https://jeffreyrstevens.github.io/excluder/reference/remove_label_rows.html) to remove the first two rows of labels and convert date and numeric columns in the metadata and use [`deidentify()`](https://jeffreyrstevens.github.io/excluder/reference/deidentify.html) to remove standard Qualtrics columns with identifiable information.
+The verbs and exclusion types combine with `_` to create the functions, such as [`check_duplicates()`](https://jeffreyrstevens.github.io/excluder/reference/check_duplicates.html), [`exclude_ip()`](https://jeffreyrstevens.github.io/excluder/reference/exclude_ip.html), and [`mark_duration()`](https://jeffreyrstevens.github.io/excluder/reference/mark_duration.html). Multiple functions can be linked together using the [`{magrittr}`](https://magrittr.tidyverse.org/) pipe `%>%`. For datasets downloaded directly from Qualtrics, use [`remove_label_rows()`](https://jeffreyrstevens.github.io/excluder/reference/remove_label_rows.html) to remove the first two rows of labels and convert date and numeric columns in the metadata, and use [`deidentify()`](https://jeffreyrstevens.github.io/excluder/reference/deidentify.html) to remove standard Qualtrics columns with identifiable information (e.g., IP addresses, geolocation).
+
+### Marking
+The `mark_*()` functions output the original data set with a new column specifying rows that meet the exclusion criteria. These can be piped together with `%>%` for multiple exclusion types.
+
+```{r mark1}
+library(excluder)
+# Mark preview and short duration rows
+df <- qualtrics_text %>%
+  mark_preview() %>%
+  mark_duration(min_duration = 200)
+tibble::glimpse(df)
+```
+
+Use the [`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html) function to unite all of the marked columns into a single column.
+```{r mark2}
+# Collapse labels for preview and short duration rows
+df <- qualtrics_text %>%
+  mark_preview() %>%
+  mark_duration(min_duration = 200) %>%
+  unite_exclusions(exclusion_types = c("preview", "duration"))
+tibble::glimpse(df)
+```
 
 ### Checking
 
 The `check_*()` functions output messages about the number of rows that meet the exclusion criteria. Because checks return only the rows meeting the criteria, they should not be connected via pipes unless you want to subset the second check criterion within the rows that meet the first criterion.
 
 ```{r check1}
-library(excluder)
 # Check for preview rows
 qualtrics_text %>%
   check_preview()
@@ -103,26 +124,6 @@ df <- qualtrics_text %>%
   exclude_location()
 ```
 
-### Marking
-The `mark_*()` functions output the original data set with a new column specifying rows that meet the exclusion criteria. These can be piped together with `%>%` for multiple exclusion types.
-
-```{r mark1}
-# Mark preview and short duration rows
-df <- qualtrics_text %>%
-  mark_preview() %>%
-  mark_duration(min_duration = 200)
-tibble::glimpse(df)
-```
-Use the [`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html) function to unite all of the marked columns into a single column.
-```{r mark2}
-# Collapse labels for preview and short duration rows
-df <- qualtrics_text %>%
-  mark_preview() %>%
-  mark_duration(min_duration = 200) %>%
-  unite_exclusions(exclusion_types = c("preview", "duration"))
-tibble::glimpse(df)
-```
-
 ## Citing this package
 
 To cite `{excluder}`, use:
diff --git a/README.md b/README.md
@@ -48,17 +48,17 @@ devtools::install_github("jeffreyrstevens/excluder")
 
 This package provides three primary verbs:
 
+-   `mark` functions add a new column to the original data frame that
+    labels the rows meeting the exclusion criteria. This is useful to
+    label the potential exclusions for future processing without
+    changing the original data frame.
 -   `check` functions search for the exclusion criteria and output a
     message with the number of rows meeting the criteria and a data
     frame of the rows meeting the criteria. This is useful for viewing
     the potential exclusions.
 -   `exclude` functions remove rows meeting the exclusion criteria. This
     is safest to do after checking the rows to ensure the exclusions are
     correct.
--   `mark` functions add a new column to the original data frame that
-    labels the rows meeting the exclusion criteria. This is useful to
-    label the potential exclusions for future processing without
-    changing the original data frame.
 
 ## Exclusion types
 
@@ -96,9 +96,81 @@ Multiple functions can be linked together using the
 downloaded directly from Qualtrics, use
 [`remove_label_rows()`](https://jeffreyrstevens.github.io/excluder/reference/remove_label_rows.html)
 to remove the first two rows of labels and convert date and numeric
-columns in the metadata and use
+columns in the metadata, and use
 [`deidentify()`](https://jeffreyrstevens.github.io/excluder/reference/deidentify.html)
-to remove standard Qualtrics columns with identifiable information.
+to remove standard Qualtrics columns with identifiable information
+(e.g., IP addresses, geolocation).
+
+### Marking
+
+The `mark_*()` functions output the original data set with a new column
+specifying rows that meet the exclusion criteria. These can be piped
+together with `%>%` for multiple exclusion types.
+
+``` r
+library(excluder)
+# Mark preview and short duration rows
+df <- qualtrics_text %>%
+  mark_preview() %>%
+  mark_duration(min_duration = 200)
+#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
+#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
+tibble::glimpse(df)
+#> Rows: 100
+#> Columns: 18
+#> $ StartDate               <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
+#> $ EndDate                 <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
+#> $ Status                  <chr> "Survey Preview", "Survey Preview", "IP Addres…
+#> $ IPAddress               <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
+#> $ Progress                <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
+#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
+#> $ Finished                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
+#> $ RecordedDate            <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
+#> $ ResponseId              <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
+#> $ LocationLatitude        <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
+#> $ LocationLongitude       <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
+#> $ UserLanguage            <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
+#> $ Browser                 <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
+#> $ Version                 <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
+#> $ `Operating System`      <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
+#> $ Resolution              <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
+#> $ exclusion_preview       <chr> "preview", "preview", "", "", "", "", "", "", …
+#> $ exclusion_duration      <chr> "", "", "", "", "duration_quick", "", "duratio…
+```
+
+Use the
+[`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html)
+function to unite all of the marked columns into a single column.
+
+``` r
+# Collapse labels for preview and short duration rows
+df <- qualtrics_text %>%
+  mark_preview() %>%
+  mark_duration(min_duration = 200) %>%
+  unite_exclusions(exclusion_types = c("preview", "duration"))
+#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
+#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
+tibble::glimpse(df)
+#> Rows: 100
+#> Columns: 17
+#> $ StartDate               <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
+#> $ EndDate                 <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
+#> $ Status                  <chr> "Survey Preview", "Survey Preview", "IP Addres…
+#> $ IPAddress               <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
+#> $ Progress                <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
+#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
+#> $ Finished                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
+#> $ RecordedDate            <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
+#> $ ResponseId              <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
+#> $ LocationLatitude        <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
+#> $ LocationLongitude       <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
+#> $ UserLanguage            <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
+#> $ Browser                 <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
+#> $ Version                 <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
+#> $ `Operating System`      <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
+#> $ Resolution              <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
+#> $ exclusions              <chr> "preview", "preview", "", "", "duration_quick"…
+```
 
 ### Checking
 
@@ -109,7 +181,6 @@ subset the second check criterion within the rows that meet the first
 criterion.
 
 ``` r
-library(excluder)
 # Check for preview rows
 qualtrics_text %>%
   check_preview()
@@ -141,8 +212,8 @@ of rows meeting the exclusion criteria.
 df <- qualtrics_text %>%
   exclude_duration(min_duration = 100) %>%
   exclude_progress()
-#> 4 out of 100 rows of short and/or long duration were excluded, leaving 96 rows.
-#> 4 out of 96 rows with incomplete progress were excluded, leaving 92 rows.
+#> 4 out of 100 duplicate rows were excluded, leaving 96 rows.
+#> 4 out of 96 duplicate rows were excluded, leaving 92 rows.
 dim(df)
 #> [1] 92 16
 ```
@@ -152,8 +223,8 @@ dim(df)
 df <- qualtrics_text %>%
   exclude_progress() %>%
   exclude_duration(min_duration = 100)
-#> 6 out of 100 rows with incomplete progress were excluded, leaving 94 rows.
-#> 2 out of 94 rows of short and/or long duration were excluded, leaving 92 rows.
+#> 6 out of 100 duplicate rows were excluded, leaving 94 rows.
+#> 2 out of 94 duplicate rows were excluded, leaving 92 rows.
 dim(df)
 #> [1] 92 16
 ```
@@ -173,85 +244,13 @@ df <- qualtrics_text %>%
   exclude_resolution() %>%
   exclude_ip() %>%
   exclude_location()
-#> 2 out of 100 preview rows were excluded, leaving 98 rows.
-#> 6 out of 98 rows with incomplete progress were excluded, leaving 92 rows.
-#> 15 out of 92 duplicate rows were excluded, leaving 83 rows.
-#> 2 out of 83 rows of short and/or long duration were excluded, leaving 81 rows.
-#> 4 out of 81 rows with unacceptable screen resolution were excluded, leaving 77 rows.
-#> 0 out of 77 rows with IP addresses outside of the specified country were excluded, leaving 77 rows.
-#> 4 out of 77 rows outside of the US were excluded, leaving 73 rows.
-```
-
-### Marking
-
-The `mark_*()` functions output the original data set with a new column
-specifying rows that meet the exclusion criteria. These can be piped
-together with `%>%` for multiple exclusion types.
-
-``` r
-# Mark preview and short duration rows
-df <- qualtrics_text %>%
-  mark_preview() %>%
-  mark_duration(min_duration = 200)
-#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
-#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
-tibble::glimpse(df)
-#> Rows: 100
-#> Columns: 18
-#> $ StartDate               <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
-#> $ EndDate                 <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
-#> $ Status                  <chr> "Survey Preview", "Survey Preview", "IP Addres…
-#> $ IPAddress               <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
-#> $ Progress                <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
-#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
-#> $ Finished                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
-#> $ RecordedDate            <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
-#> $ ResponseId              <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
-#> $ LocationLatitude        <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
-#> $ LocationLongitude       <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
-#> $ UserLanguage            <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
-#> $ Browser                 <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
-#> $ Version                 <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
-#> $ `Operating System`      <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
-#> $ Resolution              <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
-#> $ exclusion_preview       <chr> "preview", "preview", NA, NA, NA, NA, NA, NA, …
-#> $ exclusion_duration      <chr> NA, NA, NA, NA, "duration", NA, "duration", NA…
-```
-
-Use the
-[`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html)
-function to unite all of the marked columns into a single column.
-
-``` r
-# Collapse labels for preview and short duration rows
-df <- qualtrics_text %>%
-  mark_preview() %>%
-  mark_duration(min_duration = 200) %>%
-  unite_exclusions(exclusion_types = c("preview", "duration"))
-#> [1] "exclusion_preview"  "exclusion_duration"
-#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
-#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
-#> [1] "exclusion_preview"  "exclusion_duration"
-tibble::glimpse(df)
-#> Rows: 100
-#> Columns: 17
-#> $ StartDate               <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
-#> $ EndDate                 <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
-#> $ Status                  <chr> "Survey Preview", "Survey Preview", "IP Addres…
-#> $ IPAddress               <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
-#> $ Progress                <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
-#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
-#> $ Finished                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
-#> $ RecordedDate            <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
-#> $ ResponseId              <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
-#> $ LocationLatitude        <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
-#> $ LocationLongitude       <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
-#> $ UserLanguage            <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
-#> $ Browser                 <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
-#> $ Version                 <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
-#> $ `Operating System`      <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
-#> $ Resolution              <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
-#> $ exclusions              <chr> "preview", "preview", NA, NA, "duration", NA, …
+#> 2 out of 100 duplicate rows were excluded, leaving 98 rows.
+#> 6 out of 98 duplicate rows were excluded, leaving 92 rows.
+#> 9 out of 92 duplicate rows were excluded, leaving 83 rows.
+#> 2 out of 83 duplicate rows were excluded, leaving 81 rows.
+#> 4 out of 81 duplicate rows were excluded, leaving 77 rows.
+#> 2 out of 77 duplicate rows were excluded, leaving 75 rows.
+#> 4 out of 75 duplicate rows were excluded, leaving 71 rows.
 ```
 
 ## Citing this package
diff --git a/vignettes/getting_started.Rmd b/vignettes/getting_started.Rmd