Skip to content

Conversation

@DougManuel
Copy link
Contributor

Our current approach to generating dummy data for CHMS cycles uses manual specifications for each variable. I identified a few issues, which prompted me to use @JuanLiOHRI MockData repo as an alternative approach to generate data directly from variable.csv and variable_details.csv. This PR focuses just on the variables. After we sort variables, we'll review variable details.

I kept the R code for generating mock data. This code required refactoring with a few additions. We'll review with the view of incorporating the refactors to Juan's original repo.

What was implemented

  1. Parsers (R/mockdata-parsers.R):

    • parse_variable_start(): Extracts cycle-specific variable names from variableStart metadata
    • parse_range_notation(): Handles range syntax like [7,9], [18.5,25), else
    • Implements Strategy 2b: bracket notation [varname] represents DEFAULT for all cycles not explicitly overridden
  2. Helpers (R/mockdata-helpers.R):

    • get_cycle_variables(): Filters metadata by exact cycle match (avoids "cycle1" matching "cycle1_meds")
    • get_raw_variables(): Returns unique raw variables with harmonisation groupings
    • get_variable_details_for_raw(): Retrieves category specifications
    • get_variable_categories(): Extracts valid category codes
  3. Generators (R/mockdata-generators.R):

    • create_cat_var(): Generates categorical variables with tagged NA support
    • create_con_var(): Generates continuous variables with realistic distributions

Validation tools (mockdata-tools/):

  • validate-metadata.R: R CMD check-style validator with 5 diagnostic checks
  • test-all-cycles.R: Integration test across all 12 CHMS cycles
  • create-comparison.R: Side-by-side comparison of parsing approaches
  • README.qmd: Comprehensive guide with 6 executable examples

Tests (tests/testthat/test-mockdata.R):

  • 224 tests covering parsers, helpers, and generators
  • Tests Strategy 2b validation (bracket as DEFAULT, cycle-prefix as OVERRIDE)
  • Tests exact cycle matching logic
  • All tests passing with system.file() paths

Key findings

Key finding: Both approaches have valuable information

The comparison revealed 235 differences across 12 cycles, with CHMS documentation verification showing both approaches contribute accurate information:

What the hard-coded data got right ✓:

  • Variable naming format uses official CHMS conventions: GEN_015, LFH_016 (zero-padded)
  • UPPERCASE in cycle6/cycle4_meds matches SAS/Haven import format
  • Includes lab_alt in cycle3 (confirmed in CHMS docs)
  • Includes lab_bpb in cycle5 (confirmed in environmental subsample)

What the metadata got right ✓:

  • Correctly excludes paadtot from cycles 1-2 (doesn't exist in those cycles)
  • Correctly excludes lab_alt from cycle5 (excluded due to specimen freezing)

What needs alignment:

  • Metadata: update variable names to use zero-padding (gen_15gen_015)
  • Metadata: add lab_alt to cycle3, add lab_bpb to cycle5
  • Hard-coded: remove paadtot from cycles 1-2, remove lab_alt from cycle5
  • Case sensitivity: decide on standardization approach (both UPPERCASE and lowercase are defensible)

Context: CHMS data comes from SAS (case-insensitive, stores uppercase). Haven imports preserve uppercase. Pattern reflects different workflows - both valid.

See MOCKDATA_CRITICAL_FINDINGS.md for detailed verification results from CHMS documentation review.


Coverage comparison: MockData functions vs hard-coded approach

  • Total: 656 (MockData) vs 639 (hard-coded) = +17 difference
  • 235 differences identified, but 166 involve case sensitivity (same variables, different case)
  • 69 substantive differences to investigate together

Patterns worth investigating:

  1. Variables in different cycles:

    • paadtot in cycle1/cycle2 (hard-coded) vs cycle3-6 (metadata)
    • lab_alt in cycle3/cycle5 (hard-coded) vs cycle1/2/4/6 (metadata)
    • lab_bpb in cycle3/4/5 (hard-coded) vs cycle1/2 (metadata)
    • Question: Which reflects actual CHMS variable availability?
  2. Variable naming format differences:

    • Hard-coded: gen_015, gen_018, gen_020 (zero-padded)
    • Metadata: gen_15, gen_18, gen_20 (unpadded)
    • Similar pattern for lfh_016 vs lfh_16
    • Question: Which format does source CHMS data use?
  3. Variables in one approach only:

    • Hard-coded only: lbf_soc, gfvd17dy, imm_03
    • Metadata only: anymed2, diab_drug2, thifimp4, img_03
    • Question: Should mock data be comprehensive or minimal test set?
  4. Medication cycle coverage:

    • cycle1-2_meds: fewer variables in hard-coded approach
    • cycle4-6_meds: similar coverage (accounting for case)
    • Question: What's the right scope for medication cycle mock data?

Metadata quality validation:

  • ✓ All databaseStart cycles valid
  • ✓ All non-DerivedVar variableStart entries parse successfully
  • ✓ Format distribution identified: bracket format (245), cycle-prefixed (12), mixed (23), DerivedVar (15), plain (8)
  • ✓ Variable details specifications complete for categorical variables
  • ✓ 99.4% coverage (2,156/2,169 variables across 12 cycles)

Suggested investigation workflow

This comparison provides a great opportunity to systematically verify both approaches against source CHMS data:

Step 1: Document import workflow (clarifies case pattern)

  • Confirm cycle6/cycle4_meds used Haven import from SAS (explains uppercase)
  • Document if cycles 1-5 were converted to lowercase post-import
  • Decide on standardization approach and document in package conventions
  • Verify number padding format in source data (gen_015 vs gen_15)

Step 2: Verify cycle assignments (resolves paadtot, lab_alt, lab_bpb questions)

  • Review CHMS documentation for variable availability by cycle
  • Determine which approach (metadata or hard-coded) reflects actual data
  • Update whichever needs correction

Step 3: Review variable coverage (clarifies scope)

  • Discuss purpose of mock data: comprehensive vs minimal test set
  • Decide on derived variables inclusion (anymed2, diab_drug2, thifimp4)
  • Align both approaches to agreed-upon scope

Step 4: Reconcile findings (brings approaches into alignment)

  • Run comparison again after updates
  • Verify alignment between approaches
  • Document any intentional differences that remain

Running validation tools:

# Check metadata quality
Rscript mockdata-tools/validate-metadata.R

# Test all cycles
Rscript mockdata-tools/test-all-cycles.R

# Regenerate comparison after fixes
Rscript mockdata-tools/create-comparison.R

Built metadata-driven approach to systematically review and validate
dummy data quality across all CHMS cycles.

Core functions (R/mockdata-*.R):
- parse_variable_start(): Handle recodeflow variableStart formats
  (bracket, cycle-prefixed, mixed, DerivedVar)
- parse_range_notation(): Parse categorical value ranges and
  continuous intervals
- get_cycle_variables(): Query metadata for cycle-specific variables
- get_raw_variables(): Extract unique raw variable names
- get_variable_details_for_raw(): Retrieve recoding specifications
- get_variable_categories(): Parse category labels and values
- create_cat_var(): Generate categorical variables from metadata
- create_con_var(): Generate continuous variables from metadata

Testing (tests/testthat/test-mockdata.R):
- 224 tests covering all 8 functions
- All tests passing (542 total: 224 MockData + 318 existing)
- Test coverage includes edge cases, error handling, and integration

Validation tools (mockdata-tools/):
- validate-metadata.R: 5 validation checks for metadata quality
  (cycle names, parsing, format patterns, case sensitivity, completeness)
- test-all-cycles.R: Integration test for all 12 cycles
  (achieved 99.4% coverage: 2,156/2,169 variables)
- create-comparison.R: Compare different parsing approaches
- README.qmd: 6 examples + common task workflows
- README.md: Rendered GitHub-friendly documentation

Key findings from validation:
- Confirmed Strategy 2b: bracket notation [var] is DEFAULT for databases
  not explicitly listed (reduces verbosity)
- Exact cycle matching needed: 'cycle1' should not match 'cycle1_meds'
- Cycle-specific naming: Cycle 1 often uses different names than Cycles 2-6
- Case sensitivity issues in Cycle 6 data (documented for investigation)

File organization:
- Grouped functions by purpose (parsers, helpers, generators)
- Combined tests into single comprehensive file
- Self-contained validation tools with usage examples
@DougManuel DougManuel changed the base branch from main to dev October 19, 2025 23:38
@DougManuel DougManuel mentioned this pull request Oct 20, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a metadata-driven MockData pipeline for CHMS: parsers, helpers, generators, validation utilities, and tests to compare with the existing manual approach.

  • Introduces parse/generation helpers and comprehensive tests for variables and ranges
  • Adds CLI tools to validate metadata, generate across cycles, and create comparison reports
  • Documents tools and comparison findings; minor formatting tweaks in medications function signatures

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/testthat/test-mockdata.R New unit tests for parsers, helpers, and generators validating parsing, exact cycle matching, and generation behaviors
mockdata-tools/validate-metadata.R CLI validator performing 5 checks (cycles, parsing, formats, case sensitivity, and categorical specs)
mockdata-tools/test-all-cycles.R Integration test to generate mock data across cycles and report coverage
mockdata-tools/create-comparison.R Script to compare manual vs metadata-driven variable coverage and emit CSV/markdown
mockdata-tools/README.qmd Quarto-based documentation explaining tools, usage, and examples
mockdata-tools/README.md Markdown documentation mirroring the Quarto content for GitHub display
R/mockdata-parsers.R Parsers for variableStart and range notation supporting multiple formats
R/mockdata-helpers.R Helpers to extract cycle/raw variables and variable details/categories
R/mockdata-generators.R Generators for categorical and continuous mock variables
R/medications.R Formatting change: closing parentheses moved to the next line for multiple function signatures
PR_MESSAGES.md PR messaging and context documentation
MOCKDATA_COMPARISON.md Generated comparison report; identifies case-sensitivity and coverage differences
MOCKDATA_COMPARISON.csv Machine-readable comparison of current vs metadata-driven coverage

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

cycles <- trimws(cycles)

# If this variable is in cycle1_meds but NOT in cycle1, that's an error
if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) {
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The negation is applied to the string literal instead of the membership test, which will error in R. Use parentheses so the negation applies to the %in% result: if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) { ... }.

Suggested change
if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) {
if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) {

Copilot uses AI. Check for mistakes.
Comment on lines +296 to +311
# Add NA values if requested
if (n_na > 0) {
if (length(na_labels) > 0 && !is.na(na_labels[1])) {
# Use NA codes from variable_details
na_values <- sample(na_labels, n_na, replace = TRUE)
} else {
# Use actual NA
na_values <- rep(NA, n_na)
}

# Combine and shuffle
all_values <- c(values, na_values)
all_values <- sample(all_values)
} else {
all_values <- values
}
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining numeric continuous values with string NA codes (e.g., "NA::a") will coerce the entire vector to character, producing a non-numeric column. For continuous variables, use actual NA_real_ (or haven::tagged_na if you need tagged missing) to preserve numeric type: set na_values <- rep(NA_real_, n_na) or map any NA codes to NA_real_ before combining.

Copilot uses AI. Check for mistakes.
Comment on lines +363 to +366
"- [R/create_cat_var.R](R/create_cat_var.R)",
"- [R/create_con_var.R](R/create_con_var.R)",
"- [R/parse-range-notation.R](R/parse-range-notation.R)",
"- [data-raw/test-all-cycles.R](data-raw/test-all-cycles.R)",
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated links point to files that don't exist in this repo; the implemented functions are in R/mockdata-generators.R and R/mockdata-parsers.R, and the test runner lives under mockdata-tools/test-all-cycles.R. Update these links accordingly to avoid broken references.

Suggested change
"- [R/create_cat_var.R](R/create_cat_var.R)",
"- [R/create_con_var.R](R/create_con_var.R)",
"- [R/parse-range-notation.R](R/parse-range-notation.R)",
"- [data-raw/test-all-cycles.R](data-raw/test-all-cycles.R)",
"- [R/mockdata-generators.R](R/mockdata-generators.R)",
"- [R/mockdata-parsers.R](R/mockdata-parsers.R)",
"- [mockdata-tools/test-all-cycles.R](mockdata-tools/test-all-cycles.R)",

Copilot uses AI. Check for mistakes.
Comment on lines 44 to 58
| Cycle | Current | MockData Functions | Difference | Notes |
|-------|---------|-------------------|------------|-------|
| cycle1 | 78 | 80 | +2 mockdata functions | |
| cycle1_meds | 76 | 83 | +7 mockdata functions | |
| cycle2 | 78 | 80 | +2 mockdata functions | |
| cycle2_meds | 76 | 83 | +7 mockdata functions | |
| cycle3 | 82 | 81 | -1 current | |
| cycle3_meds | 3 | 3 | 0 | |
| cycle4 | 82 | 82 | 0 | |
| cycle4_meds | 3 | 3 | 0 | |
| cycle5 | 78 | 78 | 0 | |
| cycle5_meds | 3 | 3 | 0 | |
| cycle6 | 77 | 77 | 0 | |
| cycle6_meds | 3 | 3 | 0 | |
| **TOTAL** | **639** | **656** | **+17** | |
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Table rows contain an extra leading pipe, which breaks Markdown table rendering. Remove the extra " |" so each row begins with a single pipe, for example: "| Cycle | Current | ..." and "| cycle1 | 78 | ...".

Suggested change
| Cycle | Current | MockData Functions | Difference | Notes |
|-------|---------|-------------------|------------|-------|
| cycle1 | 78 | 80 | +2 mockdata functions | |
| cycle1_meds | 76 | 83 | +7 mockdata functions | |
| cycle2 | 78 | 80 | +2 mockdata functions | |
| cycle2_meds | 76 | 83 | +7 mockdata functions | |
| cycle3 | 82 | 81 | -1 current | |
| cycle3_meds | 3 | 3 | 0 | |
| cycle4 | 82 | 82 | 0 | |
| cycle4_meds | 3 | 3 | 0 | |
| cycle5 | 78 | 78 | 0 | |
| cycle5_meds | 3 | 3 | 0 | |
| cycle6 | 77 | 77 | 0 | |
| cycle6_meds | 3 | 3 | 0 | |
| **TOTAL** | **639** | **656** | **+17** | |
|Cycle | Current | MockData Functions | Difference | Notes |
|-------|---------|-------------------|------------|-------|
|cycle1 | 78 | 80 | +2 mockdata functions | |
|cycle1_meds | 76 | 83 | +7 mockdata functions | |
|cycle2 | 78 | 80 | +2 mockdata functions | |
|cycle2_meds | 76 | 83 | +7 mockdata functions | |
|cycle3 | 82 | 81 | -1 current | |
|cycle3_meds | 3 | 3 | 0 | |
|cycle4 | 82 | 82 | 0 | |
|cycle4_meds | 3 | 3 | 0 | |
|cycle5 | 78 | 78 | 0 | |
|cycle5_meds | 3 | 3 | 0 | |
|cycle6 | 77 | 77 | 0 | |
|cycle6_meds | 3 | 3 | 0 | |
|**TOTAL** | **639** | **656** | **+17** | |

Copilot uses AI. Check for mistakes.
Comment on lines +9 to +10
library(dplyr)

Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] dplyr is not used in this script; remove the unused library(dplyr) to avoid unnecessary dependencies and faster startup time.

Suggested change
library(dplyr)

Copilot uses AI. Check for mistakes.
# Date: 2025-10-17

library(readr)
library(dplyr)
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] dplyr is not used in this script; remove the unused library(dplyr) call to reduce unnecessary imports.

Suggested change
library(dplyr)

Copilot uses AI. Check for mistakes.
rafdoodle and others added 2 commits October 22, 2025 23:47
…stent according to MOCKDATA_COMPARISON (does not apply to anymed2 and diab_drug2)
@rafdoodle
Copy link
Collaborator

rafdoodle commented Oct 23, 2025

Hello Doug,

Using Claude, I edited the dummy data and variables.csv to address most of the differences noted in MOCKDATA_COMPARISON.csv which I then updated. anymed2 and diab_drug2 are not supposed to be in the dummy data as they themselves are duplicated from derived variables as shown in the recoding-medications.qmd. All tests for dummy data still pass.

Please let me know what you think and if you want, delete the Juan code in chmsflow that you added to create mock data.

Rafidul

Copy link
Collaborator

@rafdoodle rafdoodle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also updated variable-details.csv now. Anyways, please let me know what you think of the updated dummy data and worksheets. Feel free to remove the dummy data generation code from here back to its original repo. By the way, I cannot rename branches until the PR is closed. Thanks.

@DougManuel
Copy link
Contributor Author

Looks good. Let's close and merge.

  1. CRAN prep complete - package metadata and build configuration updated
  2. Reproducible development - renv setup for contributors, excluded from CRAN build
  3. Vignette solution - eval=FALSE appropriately handles time-intensive examples
  4. Future work - medication function performance optimization is separate issue (not blocking)

@DougManuel DougManuel closed this Oct 25, 2025
@DougManuel DougManuel deleted the mock-data-review branch October 25, 2025 22:09
@rafdoodle rafdoodle restored the mock-data-review branch October 25, 2025 22:47
@rafdoodle rafdoodle reopened this Oct 25, 2025
@rafdoodle rafdoodle closed this Oct 25, 2025
@rafdoodle rafdoodle deleted the mock-data-review branch October 25, 2025 23:02
@DougManuel
Copy link
Contributor Author

Closed via manual merge to dev branch in commit 81f8ac5.

What was merged:

  • Corrected dummy data for all cycles (cycle1-6, cycle1_meds-6_meds)
  • Updated variables.csv and variable-details.csv
  • Updated prep-dummy-data.R
  • Vignette corrections

MockData comparison results:

  • Reduced discrepancies from 235 → 12
  • Remaining 12 differences are anymed2 and diab_drug2 (2 vars × 6 cycles)
  • These are correctly excluded from mock data as they are derived variable duplicates created during workflow (see recoding-medications.qmd)

MockData functions:

  • Validation/comparison tools remain in the MockData repo for reuse across projects
  • Not included in this package merge

Follow-up work:

  • Medication variable automation to be addressed in future issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants