Commit 12b4ad2

Update README and Getting Started vignette to focus on mark_*() functions
1 parent 2d9f253 commit 12b4ad2

3 files changed

+199
-158
lines changed

README.Rmd

Lines changed: 24 additions & 23 deletions
@@ -41,9 +41,9 @@ devtools::install_github("jeffreyrstevens/excluder")
4141
## Verbs
4242
This package provides three primary verbs:
4343

44+
* `mark` functions add a new column to the original data frame that labels the rows meeting the exclusion criteria. This is useful to label the potential exclusions for future processing without changing the original data frame.
4445
* `check` functions search for the exclusion criteria and output a message with the number of rows meeting the criteria and a data frame of the rows meeting the criteria. This is useful for viewing the potential exclusions.
4546
* `exclude` functions remove rows meeting the exclusion criteria. This is safest to do after checking the rows to ensure the exclusions are correct.
46-
* `mark` functions add a new column to the original data frame that labels the rows meeting the exclusion criteria. This is useful to label the potential exclusions for future processing without changing the original data frame.
4747

4848
## Exclusion types
4949
This package provides seven types of exclusions based on Qualtrics metadata. If you have ideas for other metadata exclusions, please submit them as [issues](https://github.com/jeffreyrstevens/excluder/issues). Note, the intent of this package is not to develop functions for excluding rows based on survey-specific data but on general, frequently used metadata.
@@ -59,14 +59,35 @@ This package provides seven types of exclusions based on Qualtrics metadata. If
5959

6060
## Usage
6161

62-
The verbs and exclusion types combine with `_` to create the functions, such as [`check_duplicates()`](https://jeffreyrstevens.github.io/excluder/reference/check_duplicates.html), [`exclude_ip()`](https://jeffreyrstevens.github.io/excluder/reference/exclude_ip.html), and [`mark_duration()`](https://jeffreyrstevens.github.io/excluder/reference/mark_duration.html). Multiple functions can be linked together using the [`{magrittr}`](https://magrittr.tidyverse.org/) pipe `%>%`. For datasets downloaded directly from Qualtrics, use [`remove_label_rows()`](https://jeffreyrstevens.github.io/excluder/reference/remove_label_rows.html) to remove the first two rows of labels and convert date and numeric columns in the metadata and use [`deidentify()`](https://jeffreyrstevens.github.io/excluder/reference/deidentify.html) to remove standard Qualtrics columns with identifiable information.
62+
The verbs and exclusion types combine with `_` to create the functions, such as [`check_duplicates()`](https://jeffreyrstevens.github.io/excluder/reference/check_duplicates.html), [`exclude_ip()`](https://jeffreyrstevens.github.io/excluder/reference/exclude_ip.html), and [`mark_duration()`](https://jeffreyrstevens.github.io/excluder/reference/mark_duration.html). Multiple functions can be linked together using the [`{magrittr}`](https://magrittr.tidyverse.org/) pipe `%>%`. For datasets downloaded directly from Qualtrics, use [`remove_label_rows()`](https://jeffreyrstevens.github.io/excluder/reference/remove_label_rows.html) to remove the first two rows of labels and convert date and numeric columns in the metadata, and use [`deidentify()`](https://jeffreyrstevens.github.io/excluder/reference/deidentify.html) to remove standard Qualtrics columns with identifiable information (e.g., IP addresses, geolocation).
63+
64+
### Marking
65+
The `mark_*()` functions output the original data set with a new column specifying rows that meet the exclusion criteria. These can be piped together with `%>%` for multiple exclusion types.
66+
67+
```{r mark1}
68+
library(excluder)
69+
# Mark preview and short duration rows
70+
df <- qualtrics_text %>%
71+
mark_preview() %>%
72+
mark_duration(min_duration = 200)
73+
tibble::glimpse(df)
74+
```
75+
76+
Use the [`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html) function to unite all of the marked columns into a single column.
77+
```{r mark2}
78+
# Collapse labels for preview and short duration rows
79+
df <- qualtrics_text %>%
80+
mark_preview() %>%
81+
mark_duration(min_duration = 200) %>%
82+
unite_exclusions(exclusion_types = c("preview", "duration"))
83+
tibble::glimpse(df)
84+
```
6385

6486
### Checking
6587

6688
The `check_*()` functions output messages about the number of rows that meet the exclusion criteria. Because checks return only the rows meeting the criteria, they should not be connected via pipes unless you want to subset the second check criterion within the rows that meet the first criterion.
6789

6890
```{r check1}
69-
library(excluder)
7091
# Check for preview rows
7192
qualtrics_text %>%
7293
check_preview()
@@ -103,26 +124,6 @@ df <- qualtrics_text %>%
103124
exclude_location()
104125
```
105126

106-
### Marking
107-
The `mark_*()` functions output the original data set with a new column specifying rows that meet the exclusion criteria. These can be piped together with `%>%` for multiple exclusion types.
108-
109-
```{r mark1}
110-
# Mark preview and short duration rows
111-
df <- qualtrics_text %>%
112-
mark_preview() %>%
113-
mark_duration(min_duration = 200)
114-
tibble::glimpse(df)
115-
```
116-
Use the [`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html) function to unite all of the marked columns into a single column.
117-
```{r mark2}
118-
# Collapse labels for preview and short duration rows
119-
df <- qualtrics_text %>%
120-
mark_preview() %>%
121-
mark_duration(min_duration = 200) %>%
122-
unite_exclusions(exclusion_types = c("preview", "duration"))
123-
tibble::glimpse(df)
124-
```
125-
126127
## Citing this package
127128

128129
To cite `{excluder}`, use:

README.md

Lines changed: 89 additions & 90 deletions
Original file line numberDiff line numberDiff line change
@@ -48,17 +48,17 @@ devtools::install_github("jeffreyrstevens/excluder")
4848

4949
This package provides three primary verbs:
5050

51+
- `mark` functions add a new column to the original data frame that
52+
labels the rows meeting the exclusion criteria. This is useful to
53+
label the potential exclusions for future processing without
54+
changing the original data frame.
5155
- `check` functions search for the exclusion criteria and output a
5256
message with the number of rows meeting the criteria and a data
5357
frame of the rows meeting the criteria. This is useful for viewing
5458
the potential exclusions.
5559
- `exclude` functions remove rows meeting the exclusion criteria. This
5660
is safest to do after checking the rows to ensure the exclusions are
5761
correct.
58-
- `mark` functions add a new column to the original data frame that
59-
labels the rows meeting the exclusion criteria. This is useful to
60-
label the potential exclusions for future processing without
61-
changing the original data frame.
6262

6363
## Exclusion types
6464

@@ -96,9 +96,81 @@ Multiple functions can be linked together using the
9696
downloaded directly from Qualtrics, use
9797
[`remove_label_rows()`](https://jeffreyrstevens.github.io/excluder/reference/remove_label_rows.html)
9898
to remove the first two rows of labels and convert date and numeric
99-
columns in the metadata and use
99+
columns in the metadata, and use
100100
[`deidentify()`](https://jeffreyrstevens.github.io/excluder/reference/deidentify.html)
101-
to remove standard Qualtrics columns with identifiable information.
101+
to remove standard Qualtrics columns with identifiable information
102+
(e.g., IP addresses, geolocation).
103+
104+
### Marking
105+
106+
The `mark_*()` functions output the original data set with a new column
107+
specifying rows that meet the exclusion criteria. These can be piped
108+
together with `%>%` for multiple exclusion types.
109+
110+
``` r
111+
library(excluder)
112+
# Mark preview and short duration rows
113+
df <- qualtrics_text %>%
114+
mark_preview() %>%
115+
mark_duration(min_duration = 200)
116+
#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
117+
#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
118+
tibble::glimpse(df)
119+
#> Rows: 100
120+
#> Columns: 18
121+
#> $ StartDate <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
122+
#> $ EndDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
123+
#> $ Status <chr> "Survey Preview", "Survey Preview", "IP Addres…
124+
#> $ IPAddress <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
125+
#> $ Progress <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
126+
#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
127+
#> $ Finished <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
128+
#> $ RecordedDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
129+
#> $ ResponseId <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
130+
#> $ LocationLatitude <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
131+
#> $ LocationLongitude <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
132+
#> $ UserLanguage <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
133+
#> $ Browser <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
134+
#> $ Version <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
135+
#> $ `Operating System` <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
136+
#> $ Resolution <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
137+
#> $ exclusion_preview <chr> "preview", "preview", "", "", "", "", "", "", …
138+
#> $ exclusion_duration <chr> "", "", "", "", "duration_quick", "", "duratio…
139+
```
140+
141+
Use the
142+
[`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html)
143+
function to unite all of the marked columns into a single column.
144+
145+
``` r
146+
# Collapse labels for preview and short duration rows
147+
df <- qualtrics_text %>%
148+
mark_preview() %>%
149+
mark_duration(min_duration = 200) %>%
150+
unite_exclusions(exclusion_types = c("preview", "duration"))
151+
#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
152+
#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
153+
tibble::glimpse(df)
154+
#> Rows: 100
155+
#> Columns: 17
156+
#> $ StartDate <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
157+
#> $ EndDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
158+
#> $ Status <chr> "Survey Preview", "Survey Preview", "IP Addres…
159+
#> $ IPAddress <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
160+
#> $ Progress <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
161+
#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
162+
#> $ Finished <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
163+
#> $ RecordedDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
164+
#> $ ResponseId <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
165+
#> $ LocationLatitude <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
166+
#> $ LocationLongitude <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
167+
#> $ UserLanguage <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
168+
#> $ Browser <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
169+
#> $ Version <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
170+
#> $ `Operating System` <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
171+
#> $ Resolution <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
172+
#> $ exclusions <chr> "preview", "preview", "", "", "duration_quick"…
173+
```
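
The mark-then-unite pattern shown above can be sketched in base R on a toy data frame. This is only an illustration of the idea, not the package's implementation: the `responses` data, the `duration_sec` column, and the `","` separator are assumptions made for the example, while the labels `"preview"` and `"duration_quick"` mirror the output above.

``` r
# Toy stand-in for Qualtrics metadata (hypothetical values, not qualtrics_text)
responses <- data.frame(
  Status = c("Survey Preview", "IP Address", "IP Address", "IP Address"),
  duration_sec = c(465, 140, 213, 177),
  stringsAsFactors = FALSE
)

# "Mark": add one label column per criterion; "" means the row passed
responses$exclusion_preview <-
  ifelse(responses$Status == "Survey Preview", "preview", "")
responses$exclusion_duration <-
  ifelse(responses$duration_sec < 200, "duration_quick", "")

# "Unite": collapse the label columns into one, skipping empty labels
# (the "," separator is an assumption for this sketch)
label_cols <- c("exclusion_preview", "exclusion_duration")
responses$exclusions <- apply(responses[label_cols], 1, function(labels) {
  paste(labels[labels != ""], collapse = ",")
})
responses[label_cols] <- NULL

# Later processing can keep only the rows that carry no label
kept <- responses[responses$exclusions == "", ]
```

Because marking never drops rows, the full data set stays available for reporting how many responses each criterion flagged before any filtering takes place.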
102174

103175
### Checking
104176

@@ -109,7 +181,6 @@ subset the second check criterion within the rows that meet the first
109181
criterion.
110182

111183
``` r
112-
library(excluder)
113184
# Check for preview rows
114185
qualtrics_text %>%
115186
check_preview()
@@ -141,8 +212,8 @@ of rows meeting the exclusion criteria.
141212
df <- qualtrics_text %>%
142213
exclude_duration(min_duration = 100) %>%
143214
exclude_progress()
144-
#> 4 out of 100 rows of short and/or long duration were excluded, leaving 96 rows.
145-
#> 4 out of 96 rows with incomplete progress were excluded, leaving 92 rows.
215+
#> 4 out of 100 duplicate rows were excluded, leaving 96 rows.
216+
#> 4 out of 96 duplicate rows were excluded, leaving 92 rows.
146217
dim(df)
147218
#> [1] 92 16
148219
```
@@ -152,8 +223,8 @@ dim(df)
152223
df <- qualtrics_text %>%
153224
exclude_progress() %>%
154225
exclude_duration(min_duration = 100)
155-
#> 6 out of 100 rows with incomplete progress were excluded, leaving 94 rows.
156-
#> 2 out of 94 rows of short and/or long duration were excluded, leaving 92 rows.
226+
#> 6 out of 100 duplicate rows were excluded, leaving 94 rows.
227+
#> 2 out of 94 duplicate rows were excluded, leaving 92 rows.
157228
dim(df)
158229
#> [1] 92 16
159230
```
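
The two runs above drop different numbers of rows at each step yet end with the same 92 rows. A minimal base-R sketch on hypothetical data shows why: a row that fails both criteria is counted by whichever filter runs first. The helper functions and column names here are assumptions for illustration, not part of `{excluder}`.

``` r
# Hypothetical toy data: progress (%) and duration (seconds)
responses <- data.frame(
  progress = c(100, 100, 50, 100, 60, 100),
  duration_sec = c(300, 80, 90, 250, 400, 70)
)

# Illustrative filters (not excluder functions)
keep_complete <- function(df) df[df$progress == 100, ]
keep_slow_enough <- function(df, min_duration = 100) {
  df[df$duration_sec >= min_duration, ]
}

# Order A: duration first, then progress
after_duration <- keep_slow_enough(responses)
dropped_duration_first <- nrow(responses) - nrow(after_duration)
final_a <- keep_complete(after_duration)

# Order B: progress first, then duration
after_progress <- keep_complete(responses)
dropped_progress_first <- nrow(responses) - nrow(after_progress)
final_b <- keep_slow_enough(after_progress)

# The per-step counts depend on the order, but the surviving rows do not
same_result <- identical(final_a, final_b)
```

Reading the per-step messages therefore requires knowing the order of the pipeline, whereas the final data frame is the same either way.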
@@ -173,85 +244,13 @@ df <- qualtrics_text %>%
173244
exclude_resolution() %>%
174245
exclude_ip() %>%
175246
exclude_location()
176-
#> 2 out of 100 preview rows were excluded, leaving 98 rows.
177-
#> 6 out of 98 rows with incomplete progress were excluded, leaving 92 rows.
178-
#> 15 out of 92 duplicate rows were excluded, leaving 83 rows.
179-
#> 2 out of 83 rows of short and/or long duration were excluded, leaving 81 rows.
180-
#> 4 out of 81 rows with unacceptable screen resolution were excluded, leaving 77 rows.
181-
#> 0 out of 77 rows with IP addresses outside of the specified country were excluded, leaving 77 rows.
182-
#> 4 out of 77 rows outside of the US were excluded, leaving 73 rows.
183-
```
184-
185-
### Marking
186-
187-
The `mark_*()` functions output the original data set with a new column
188-
specifying rows that meet the exclusion criteria. These can be piped
189-
together with `%>%` for multiple exclusion types.
190-
191-
``` r
192-
# Mark preview and short duration rows
193-
df <- qualtrics_text %>%
194-
mark_preview() %>%
195-
mark_duration(min_duration = 200)
196-
#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
197-
#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
198-
tibble::glimpse(df)
199-
#> Rows: 100
200-
#> Columns: 18
201-
#> $ StartDate <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
202-
#> $ EndDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
203-
#> $ Status <chr> "Survey Preview", "Survey Preview", "IP Addres…
204-
#> $ IPAddress <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
205-
#> $ Progress <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
206-
#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
207-
#> $ Finished <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
208-
#> $ RecordedDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
209-
#> $ ResponseId <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
210-
#> $ LocationLatitude <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
211-
#> $ LocationLongitude <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
212-
#> $ UserLanguage <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
213-
#> $ Browser <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
214-
#> $ Version <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
215-
#> $ `Operating System` <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
216-
#> $ Resolution <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
217-
#> $ exclusion_preview <chr> "preview", "preview", NA, NA, NA, NA, NA, NA, …
218-
#> $ exclusion_duration <chr> NA, NA, NA, NA, "duration", NA, "duration", NA…
219-
```
220-
221-
Use the
222-
[`unite_exclusions()`](https://jeffreyrstevens.github.io/excluder/reference/unite_exclusions.html)
223-
function to unite all of the marked columns into a single column.
224-
225-
``` r
226-
# Collapse labels for preview and short duration rows
227-
df <- qualtrics_text %>%
228-
mark_preview() %>%
229-
mark_duration(min_duration = 200) %>%
230-
unite_exclusions(exclusion_types = c("preview", "duration"))
231-
#> [1] "exclusion_preview" "exclusion_duration"
232-
#> 2 out of 100 rows were collected as previews. It is highly recommended to exclude these rows before further checking.
233-
#> 23 out of 100 rows took less time than the minimum duration of 200 seconds.
234-
#> [1] "exclusion_preview" "exclusion_duration"
235-
tibble::glimpse(df)
236-
#> Rows: 100
237-
#> Columns: 17
238-
#> $ StartDate <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
239-
#> $ EndDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
240-
#> $ Status <chr> "Survey Preview", "Survey Preview", "IP Addres…
241-
#> $ IPAddress <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
242-
#> $ Progress <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
243-
#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
244-
#> $ Finished <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
245-
#> $ RecordedDate <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
246-
#> $ ResponseId <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
247-
#> $ LocationLatitude <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
248-
#> $ LocationLongitude <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
249-
#> $ UserLanguage <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
250-
#> $ Browser <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
251-
#> $ Version <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
252-
#> $ `Operating System` <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
253-
#> $ Resolution <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
254-
#> $ exclusions <chr> "preview", "preview", NA, NA, "duration", NA, …
247+
#> 2 out of 100 duplicate rows were excluded, leaving 98 rows.
248+
#> 6 out of 98 duplicate rows were excluded, leaving 92 rows.
249+
#> 9 out of 92 duplicate rows were excluded, leaving 83 rows.
250+
#> 2 out of 83 duplicate rows were excluded, leaving 81 rows.
251+
#> 4 out of 81 duplicate rows were excluded, leaving 77 rows.
252+
#> 2 out of 77 duplicate rows were excluded, leaving 75 rows.
253+
#> 4 out of 75 duplicate rows were excluded, leaving 71 rows.
255254
```
256255

257256
## Citing this package
