Mock data review #7

DougManuel · 2025-10-19T23:36:38Z

Our current approach to generating dummy data for CHMS cycles uses manual specifications for each variable. I identified a few issues, which prompted me to use @JuanLiOHRI MockData repo as an alternative approach to generate data directly from variable.csv and variable_details.csv. This PR focuses just on the variables. After we sort variables, we'll review variable details.

I kept the R code for generating mock data. This code required refactoring with a few additions. We'll review with the view of incorporating the refactors to Juan's original repo.

What was implemented

Parsers (R/mockdata-parsers.R):
- parse_variable_start(): Extracts cycle-specific variable names from variableStart metadata
- parse_range_notation(): Handles range syntax like [7,9], [18.5,25), else
- Implements Strategy 2b: bracket notation [varname] represents DEFAULT for all cycles not explicitly overridden
Helpers (R/mockdata-helpers.R):
- get_cycle_variables(): Filters metadata by exact cycle match (avoids "cycle1" matching "cycle1_meds")
- get_raw_variables(): Returns unique raw variables with harmonisation groupings
- get_variable_details_for_raw(): Retrieves category specifications
- get_variable_categories(): Extracts valid category codes
Generators (R/mockdata-generators.R):
- create_cat_var(): Generates categorical variables with tagged NA support
- create_con_var(): Generates continuous variables with realistic distributions

Validation tools (mockdata-tools/):

validate-metadata.R: R CMD check-style validator with 5 diagnostic checks
test-all-cycles.R: Integration test across all 12 CHMS cycles
create-comparison.R: Side-by-side comparison of parsing approaches
README.qmd: Comprehensive guide with 6 executable examples

Tests (tests/testthat/test-mockdata.R):

224 tests covering parsers, helpers, and generators
Tests Strategy 2b validation (bracket as DEFAULT, cycle-prefix as OVERRIDE)
Tests exact cycle matching logic
All tests passing with system.file() paths

Key findings

Key finding: Both approaches have valuable information

The comparison revealed 235 differences across 12 cycles, with CHMS documentation verification showing both approaches contribute accurate information:

What the hard-coded data got right ✓:

Variable naming format uses official CHMS conventions: GEN_015, LFH_016 (zero-padded)
UPPERCASE in cycle6/cycle4_meds matches SAS/Haven import format
Includes lab_alt in cycle3 (confirmed in CHMS docs)
Includes lab_bpb in cycle5 (confirmed in environmental subsample)

What the metadata got right ✓:

Correctly excludes paadtot from cycles 1-2 (doesn't exist in those cycles)
Correctly excludes lab_alt from cycle5 (excluded due to specimen freezing)

What needs alignment:

Metadata: update variable names to use zero-padding (gen_15 → gen_015)
Metadata: add lab_alt to cycle3, add lab_bpb to cycle5
Hard-coded: remove paadtot from cycles 1-2, remove lab_alt from cycle5
Case sensitivity: decide on standardization approach (both UPPERCASE and lowercase are defensible)

Context: CHMS data comes from SAS (case-insensitive, stores uppercase). Haven imports preserve uppercase. Pattern reflects different workflows - both valid.

See MOCKDATA_CRITICAL_FINDINGS.md for detailed verification results from CHMS documentation review.

Coverage comparison: MockData functions vs hard-coded approach

Total: 656 (MockData) vs 639 (hard-coded) = +17 difference
235 differences identified, but 166 involve case sensitivity (same variables, different case)
69 substantive differences to investigate together

Patterns worth investigating:

Variables in different cycles:
- paadtot in cycle1/cycle2 (hard-coded) vs cycle3-6 (metadata)
- lab_alt in cycle3/cycle5 (hard-coded) vs cycle1/2/4/6 (metadata)
- lab_bpb in cycle3/4/5 (hard-coded) vs cycle1/2 (metadata)
- Question: Which reflects actual CHMS variable availability?
Variable naming format differences:
- Hard-coded: gen_015, gen_018, gen_020 (zero-padded)
- Metadata: gen_15, gen_18, gen_20 (unpadded)
- Similar pattern for lfh_016 vs lfh_16
- Question: Which format does source CHMS data use?
Variables in one approach only:
- Hard-coded only: lbf_soc, gfvd17dy, imm_03
- Metadata only: anymed2, diab_drug2, thifimp4, img_03
- Question: Should mock data be comprehensive or minimal test set?
Medication cycle coverage:
- cycle1-2_meds: fewer variables in hard-coded approach
- cycle4-6_meds: similar coverage (accounting for case)
- Question: What's the right scope for medication cycle mock data?

Metadata quality validation:

✓ All databaseStart cycles valid
✓ All non-DerivedVar variableStart entries parse successfully
✓ Format distribution identified: bracket format (245), cycle-prefixed (12), mixed (23), DerivedVar (15), plain (8)
✓ Variable details specifications complete for categorical variables
✓ 99.4% coverage (2,156/2,169 variables across 12 cycles)

Suggested investigation workflow

This comparison provides a great opportunity to systematically verify both approaches against source CHMS data:

Step 1: Document import workflow (clarifies case pattern)

Confirm cycle6/cycle4_meds used Haven import from SAS (explains uppercase)
Document if cycles 1-5 were converted to lowercase post-import
Decide on standardization approach and document in package conventions
Verify number padding format in source data (gen_015 vs gen_15)

Step 2: Verify cycle assignments (resolves paadtot, lab_alt, lab_bpb questions)

Review CHMS documentation for variable availability by cycle
Determine which approach (metadata or hard-coded) reflects actual data
Update whichever needs correction

Step 3: Review variable coverage (clarifies scope)

Discuss purpose of mock data: comprehensive vs minimal test set
Decide on derived variables inclusion (anymed2, diab_drug2, thifimp4)
Align both approaches to agreed-upon scope

Step 4: Reconcile findings (brings approaches into alignment)

Run comparison again after updates
Verify alignment between approaches
Document any intentional differences that remain

Running validation tools:

# Check metadata quality
Rscript mockdata-tools/validate-metadata.R

# Test all cycles
Rscript mockdata-tools/test-all-cycles.R

# Regenerate comparison after fixes
Rscript mockdata-tools/create-comparison.R

Built metadata-driven approach to systematically review and validate dummy data quality across all CHMS cycles. Core functions (R/mockdata-*.R): - parse_variable_start(): Handle recodeflow variableStart formats (bracket, cycle-prefixed, mixed, DerivedVar) - parse_range_notation(): Parse categorical value ranges and continuous intervals - get_cycle_variables(): Query metadata for cycle-specific variables - get_raw_variables(): Extract unique raw variable names - get_variable_details_for_raw(): Retrieve recoding specifications - get_variable_categories(): Parse category labels and values - create_cat_var(): Generate categorical variables from metadata - create_con_var(): Generate continuous variables from metadata Testing (tests/testthat/test-mockdata.R): - 224 tests covering all 8 functions - All tests passing (542 total: 224 MockData + 318 existing) - Test coverage includes edge cases, error handling, and integration Validation tools (mockdata-tools/): - validate-metadata.R: 5 validation checks for metadata quality (cycle names, parsing, format patterns, case sensitivity, completeness) - test-all-cycles.R: Integration test for all 12 cycles (achieved 99.4% coverage: 2,156/2,169 variables) - create-comparison.R: Compare different parsing approaches - README.qmd: 6 examples + common task workflows - README.md: Rendered GitHub-friendly documentation Key findings from validation: - Confirmed Strategy 2b: bracket notation [var] is DEFAULT for databases not explicitly listed (reduces verbosity) - Exact cycle matching needed: 'cycle1' should not match 'cycle1_meds' - Cycle-specific naming: Cycle 1 often uses different names than Cycles 2-6 - Case sensitivity issues in Cycle 6 data (documented for investigation) File organization: - Grouped functions by purpose (parsers, helpers, generators) - Combined tests into single comprehensive file - Self-contained validation tools with usage examples

Copilot

Pull Request Overview

Adds a metadata-driven MockData pipeline for CHMS: parsers, helpers, generators, validation utilities, and tests to compare with the existing manual approach.

Introduces parse/generation helpers and comprehensive tests for variables and ranges
Adds CLI tools to validate metadata, generate across cycles, and create comparison reports
Documents tools and comparison findings; minor formatting tweaks in medications function signatures

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
tests/testthat/test-mockdata.R	New unit tests for parsers, helpers, and generators validating parsing, exact cycle matching, and generation behaviors
mockdata-tools/validate-metadata.R	CLI validator performing 5 checks (cycles, parsing, formats, case sensitivity, and categorical specs)
mockdata-tools/test-all-cycles.R	Integration test to generate mock data across cycles and report coverage
mockdata-tools/create-comparison.R	Script to compare manual vs metadata-driven variable coverage and emit CSV/markdown
mockdata-tools/README.qmd	Quarto-based documentation explaining tools, usage, and examples
mockdata-tools/README.md	Markdown documentation mirroring the Quarto content for GitHub display
R/mockdata-parsers.R	Parsers for variableStart and range notation supporting multiple formats
R/mockdata-helpers.R	Helpers to extract cycle/raw variables and variable details/categories
R/mockdata-generators.R	Generators for categorical and continuous mock variables
R/medications.R	Formatting change: closing parentheses moved to the next line for multiple function signatures
PR_MESSAGES.md	PR messaging and context documentation
MOCKDATA_COMPARISON.md	Generated comparison report; identifies case-sensitivity and coverage differences
MOCKDATA_COMPARISON.csv	Machine-readable comparison of current vs metadata-driven coverage

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-10-20T00:45:37Z

tests/testthat/test-mockdata.R

+    cycles <- trimws(cycles)
+
+    # If this variable is in cycle1_meds but NOT in cycle1, that's an error
+    if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) {


The negation is applied to the string literal instead of the membership test, which will error in R. Use parentheses so the negation applies to the %in% result: if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) { ... }.

Suggested change

if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) {

if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) {

Copilot · 2025-10-20T00:45:37Z

R/mockdata-generators.R

+  # Add NA values if requested
+  if (n_na > 0) {
+    if (length(na_labels) > 0 && !is.na(na_labels[1])) {
+      # Use NA codes from variable_details
+      na_values <- sample(na_labels, n_na, replace = TRUE)
+    } else {
+      # Use actual NA
+      na_values <- rep(NA, n_na)
+    }
+
+    # Combine and shuffle
+    all_values <- c(values, na_values)
+    all_values <- sample(all_values)
+  } else {
+    all_values <- values
+  }


Combining numeric continuous values with string NA codes (e.g., "NA::a") will coerce the entire vector to character, producing a non-numeric column. For continuous variables, use actual NA_real_ (or haven::tagged_na if you need tagged missing) to preserve numeric type: set na_values <- rep(NA_real_, n_na) or map any NA codes to NA_real_ before combining.

Copilot · 2025-10-20T00:45:37Z

mockdata-tools/create-comparison.R

+              "- [R/create_cat_var.R](R/create_cat_var.R)",
+              "- [R/create_con_var.R](R/create_con_var.R)",
+              "- [R/parse-range-notation.R](R/parse-range-notation.R)",
+              "- [data-raw/test-all-cycles.R](data-raw/test-all-cycles.R)",


The generated links point to files that don't exist in this repo; the implemented functions are in R/mockdata-generators.R and R/mockdata-parsers.R, and the test runner lives under mockdata-tools/test-all-cycles.R. Update these links accordingly to avoid broken references.

Suggested change

"- [R/create_cat_var.R](R/create_cat_var.R)",

"- [R/create_con_var.R](R/create_con_var.R)",

"- [R/parse-range-notation.R](R/parse-range-notation.R)",

"- [data-raw/test-all-cycles.R](data-raw/test-all-cycles.R)",

"- [R/mockdata-generators.R](R/mockdata-generators.R)",

"- [R/mockdata-parsers.R](R/mockdata-parsers.R)",

"- [mockdata-tools/test-all-cycles.R](mockdata-tools/test-all-cycles.R)",

Copilot · 2025-10-20T00:45:38Z

MOCKDATA_COMPARISON.md

+| Cycle | Current | MockData Functions | Difference | Notes |
+|-------|---------|-------------------|------------|-------|
+| cycle1 | 78 | 80 | +2 mockdata functions | |
+| cycle1_meds | 76 | 83 | +7 mockdata functions | |
+| cycle2 | 78 | 80 | +2 mockdata functions | |
+| cycle2_meds | 76 | 83 | +7 mockdata functions | |
+| cycle3 | 82 | 81 | -1 current | |
+| cycle3_meds | 3 | 3 | 0 | |
+| cycle4 | 82 | 82 | 0 | |
+| cycle4_meds | 3 | 3 | 0 | |
+| cycle5 | 78 | 78 | 0 | |
+| cycle5_meds | 3 | 3 | 0 | |
+| cycle6 | 77 | 77 | 0 | |
+| cycle6_meds | 3 | 3 | 0 | |
+| **TOTAL** | **639** | **656** | **+17** | |


Table rows contain an extra leading pipe, which breaks Markdown table rendering. Remove the extra " |" so each row begins with a single pipe, for example: "| Cycle | Current | ..." and "| cycle1 | 78 | ...".

Suggested change

| Cycle | Current | MockData Functions | Difference | Notes |

|-------|---------|-------------------|------------|-------|

| cycle1 | 78 | 80 | +2 mockdata functions | |

| cycle1_meds | 76 | 83 | +7 mockdata functions | |

| cycle2 | 78 | 80 | +2 mockdata functions | |

| cycle2_meds | 76 | 83 | +7 mockdata functions | |

| cycle3 | 82 | 81 | -1 current | |

| cycle3_meds | 3 | 3 | 0 | |

| cycle4 | 82 | 82 | 0 | |

| cycle4_meds | 3 | 3 | 0 | |

| cycle5 | 78 | 78 | 0 | |

| cycle5_meds | 3 | 3 | 0 | |

| cycle6 | 77 | 77 | 0 | |

| cycle6_meds | 3 | 3 | 0 | |

| **TOTAL** | **639** | **656** | **+17** | |

|Cycle | Current | MockData Functions | Difference | Notes |

|-------|---------|-------------------|------------|-------|

|cycle1 | 78 | 80 | +2 mockdata functions | |

|cycle1_meds | 76 | 83 | +7 mockdata functions | |

|cycle2 | 78 | 80 | +2 mockdata functions | |

|cycle2_meds | 76 | 83 | +7 mockdata functions | |

|cycle3 | 82 | 81 | -1 current | |

|cycle3_meds | 3 | 3 | 0 | |

|cycle4 | 82 | 82 | 0 | |

|cycle4_meds | 3 | 3 | 0 | |

|cycle5 | 78 | 78 | 0 | |

|cycle5_meds | 3 | 3 | 0 | |

|cycle6 | 77 | 77 | 0 | |

|cycle6_meds | 3 | 3 | 0 | |

|**TOTAL** | **639** | **656** | **+17** | |

Copilot · 2025-10-20T00:45:38Z

mockdata-tools/test-all-cycles.R

+library(dplyr)
+


[nitpick] dplyr is not used in this script; remove the unused library(dplyr) to avoid unnecessary dependencies and faster startup time.

Suggested change

library(dplyr)

Copilot · 2025-10-20T00:45:38Z

mockdata-tools/create-comparison.R

+# Date: 2025-10-17
+
+library(readr)
+library(dplyr)


[nitpick] dplyr is not used in this script; remove the unused library(dplyr) call to reduce unnecessary imports.

Suggested change

library(dplyr)

…stent according to MOCKDATA_COMPARISON (does not apply to anymed2 and diab_drug2)

rafdoodle · 2025-10-23T04:09:34Z

Hello Doug,

Using Claude, I edited the dummy data and variables.csv to address most of the differences noted in MOCKDATA_COMPARISON.csv which I then updated. anymed2 and diab_drug2 are not supposed to be in the dummy data as they themselves are duplicated from derived variables as shown in the recoding-medications.qmd. All tests for dummy data still pass.

Please let me know what you think and if you want, delete the Juan code in chmsflow that you added to create mock data.

Rafidul

…iableStart column)

…sflow into mock-data-review

rafdoodle

I've also updated variable-details.csv now. Anyways, please let me know what you think of the updated dummy data and worksheets. Feel free to remove the dummy data generation code from here back to its original repo. By the way, I cannot rename branches until the PR is closed. Thanks.

DougManuel · 2025-10-25T22:09:27Z

Looks good. Let's close and merge.

CRAN prep complete - package metadata and build configuration updated
Reproducible development - renv setup for contributors, excluded from CRAN build
Vignette solution - eval=FALSE appropriately handles time-intensive examples
Future work - medication function performance optimization is separate issue (not blocking)

DougManuel · 2025-10-26T11:52:33Z

Closed via manual merge to dev branch in commit 81f8ac5.

What was merged:

Corrected dummy data for all cycles (cycle1-6, cycle1_meds-6_meds)
Updated variables.csv and variable-details.csv
Updated prep-dummy-data.R
Vignette corrections

MockData comparison results:

Reduced discrepancies from 235 → 12
Remaining 12 differences are anymed2 and diab_drug2 (2 vars × 6 cycles)
These are correctly excluded from mock data as they are derived variable duplicates created during workflow (see recoding-medications.qmd)

MockData functions:

Validation/comparison tools remain in the MockData repo for reuse across projects
Not included in this package merge

Follow-up work:

Medication variable automation to be addressed in future issue

DougManuel added 2 commits October 19, 2025 19:28

Style code (GHA)

2f789e6

DougManuel changed the base branch from main to dev October 19, 2025 23:38

DougManuel requested review from Copilot and rafdoodle October 20, 2025 00:42

DougManuel mentioned this pull request Oct 20, 2025

missing dummy data? #5

Closed

Copilot AI reviewed Oct 20, 2025

View reviewed changes

rafdoodle and others added 2 commits October 22, 2025 23:47

Modified dummy data and variables.csv to ensure they are mostly consi…

d02d5a6

…stent according to MOCKDATA_COMPARISON (does not apply to anymed2 and diab_drug2)

Style code (GHA)

38538f2

rafdoodle added 2 commits October 24, 2025 22:47

Applied all recent variables.csv changes to variable-details.csv (var…

cee53cc

…iableStart column)

Merge branch 'mock-data-review' of http://github.com/Big-Life-Lab/chm…

2ab53cb

…sflow into mock-data-review

rafdoodle requested changes Oct 25, 2025

View reviewed changes

DougManuel closed this Oct 25, 2025

DougManuel deleted the mock-data-review branch October 25, 2025 22:09

rafdoodle restored the mock-data-review branch October 25, 2025 22:47

rafdoodle reopened this Oct 25, 2025

rafdoodle closed this Oct 25, 2025

rafdoodle deleted the mock-data-review branch October 25, 2025 23:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mock data review #7

Mock data review #7

Uh oh!

DougManuel commented Oct 19, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 20, 2025

Uh oh!

Copilot AI Oct 20, 2025

Uh oh!

Copilot AI Oct 20, 2025

Uh oh!

Copilot AI Oct 20, 2025

Uh oh!

Copilot AI Oct 20, 2025

Uh oh!

Copilot AI Oct 20, 2025

Uh oh!

rafdoodle commented Oct 23, 2025 •

edited

Loading

Uh oh!

rafdoodle left a comment

Uh oh!

DougManuel commented Oct 25, 2025

Uh oh!

DougManuel commented Oct 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) {
	if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) {

Mock data review #7

Mock data review #7

Uh oh!

Conversation

DougManuel commented Oct 19, 2025

What was implemented

Key findings

Suggested investigation workflow

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

rafdoodle commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rafdoodle left a comment

Choose a reason for hiding this comment

Uh oh!

DougManuel commented Oct 25, 2025

Uh oh!

DougManuel commented Oct 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rafdoodle commented Oct 23, 2025 •

edited

Loading