-
Notifications
You must be signed in to change notification settings - Fork 0
Mock data review #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Built metadata-driven approach to systematically review and validate dummy data quality across all CHMS cycles. Core functions (R/mockdata-*.R): - parse_variable_start(): Handle recodeflow variableStart formats (bracket, cycle-prefixed, mixed, DerivedVar) - parse_range_notation(): Parse categorical value ranges and continuous intervals - get_cycle_variables(): Query metadata for cycle-specific variables - get_raw_variables(): Extract unique raw variable names - get_variable_details_for_raw(): Retrieve recoding specifications - get_variable_categories(): Parse category labels and values - create_cat_var(): Generate categorical variables from metadata - create_con_var(): Generate continuous variables from metadata Testing (tests/testthat/test-mockdata.R): - 224 tests covering all 8 functions - All tests passing (542 total: 224 MockData + 318 existing) - Test coverage includes edge cases, error handling, and integration Validation tools (mockdata-tools/): - validate-metadata.R: 5 validation checks for metadata quality (cycle names, parsing, format patterns, case sensitivity, completeness) - test-all-cycles.R: Integration test for all 12 cycles (achieved 99.4% coverage: 2,156/2,169 variables) - create-comparison.R: Compare different parsing approaches - README.qmd: 6 examples + common task workflows - README.md: Rendered GitHub-friendly documentation Key findings from validation: - Confirmed Strategy 2b: bracket notation [var] is DEFAULT for databases not explicitly listed (reduces verbosity) - Exact cycle matching needed: 'cycle1' should not match 'cycle1_meds' - Cycle-specific naming: Cycle 1 often uses different names than Cycles 2-6 - Case sensitivity issues in Cycle 6 data (documented for investigation) File organization: - Grouped functions by purpose (parsers, helpers, generators) - Combined tests into single comprehensive file - Self-contained validation tools with usage examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds a metadata-driven MockData pipeline for CHMS: parsers, helpers, generators, validation utilities, and tests to compare with the existing manual approach.
- Introduces parse/generation helpers and comprehensive tests for variables and ranges
- Adds CLI tools to validate metadata, generate across cycles, and create comparison reports
- Documents tools and comparison findings; minor formatting tweaks in medications function signatures
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/testthat/test-mockdata.R | New unit tests for parsers, helpers, and generators validating parsing, exact cycle matching, and generation behaviors |
| mockdata-tools/validate-metadata.R | CLI validator performing 5 checks (cycles, parsing, formats, case sensitivity, and categorical specs) |
| mockdata-tools/test-all-cycles.R | Integration test to generate mock data across cycles and report coverage |
| mockdata-tools/create-comparison.R | Script to compare manual vs metadata-driven variable coverage and emit CSV/markdown |
| mockdata-tools/README.qmd | Quarto-based documentation explaining tools, usage, and examples |
| mockdata-tools/README.md | Markdown documentation mirroring the Quarto content for GitHub display |
| R/mockdata-parsers.R | Parsers for variableStart and range notation supporting multiple formats |
| R/mockdata-helpers.R | Helpers to extract cycle/raw variables and variable details/categories |
| R/mockdata-generators.R | Generators for categorical and continuous mock variables |
| R/medications.R | Formatting change: closing parentheses moved to the next line for multiple function signatures |
| PR_MESSAGES.md | PR messaging and context documentation |
| MOCKDATA_COMPARISON.md | Generated comparison report; identifies case-sensitivity and coverage differences |
| MOCKDATA_COMPARISON.csv | Machine-readable comparison of current vs metadata-driven coverage |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| cycles <- trimws(cycles) | ||
|
|
||
| # If this variable is in cycle1_meds but NOT in cycle1, that's an error | ||
| if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) { |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The negation is applied to the string literal instead of the membership test, which will error in R. Use parentheses so the negation applies to the %in% result: if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) { ... }.
| if ("cycle1_meds" %in% cycles && !"cycle1" %in% cycles) { | |
| if ("cycle1_meds" %in% cycles && !("cycle1" %in% cycles)) { |
| # Add NA values if requested | ||
| if (n_na > 0) { | ||
| if (length(na_labels) > 0 && !is.na(na_labels[1])) { | ||
| # Use NA codes from variable_details | ||
| na_values <- sample(na_labels, n_na, replace = TRUE) | ||
| } else { | ||
| # Use actual NA | ||
| na_values <- rep(NA, n_na) | ||
| } | ||
|
|
||
| # Combine and shuffle | ||
| all_values <- c(values, na_values) | ||
| all_values <- sample(all_values) | ||
| } else { | ||
| all_values <- values | ||
| } |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combining numeric continuous values with string NA codes (e.g., "NA::a") will coerce the entire vector to character, producing a non-numeric column. For continuous variables, use actual NA_real_ (or haven::tagged_na if you need tagged missing) to preserve numeric type: set na_values <- rep(NA_real_, n_na) or map any NA codes to NA_real_ before combining.
| "- [R/create_cat_var.R](R/create_cat_var.R)", | ||
| "- [R/create_con_var.R](R/create_con_var.R)", | ||
| "- [R/parse-range-notation.R](R/parse-range-notation.R)", | ||
| "- [data-raw/test-all-cycles.R](data-raw/test-all-cycles.R)", |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generated links point to files that don't exist in this repo; the implemented functions are in R/mockdata-generators.R and R/mockdata-parsers.R, and the test runner lives under mockdata-tools/test-all-cycles.R. Update these links accordingly to avoid broken references.
| "- [R/create_cat_var.R](R/create_cat_var.R)", | |
| "- [R/create_con_var.R](R/create_con_var.R)", | |
| "- [R/parse-range-notation.R](R/parse-range-notation.R)", | |
| "- [data-raw/test-all-cycles.R](data-raw/test-all-cycles.R)", | |
| "- [R/mockdata-generators.R](R/mockdata-generators.R)", | |
| "- [R/mockdata-parsers.R](R/mockdata-parsers.R)", | |
| "- [mockdata-tools/test-all-cycles.R](mockdata-tools/test-all-cycles.R)", |
MOCKDATA_COMPARISON.md
Outdated
| | Cycle | Current | MockData Functions | Difference | Notes | | ||
| |-------|---------|-------------------|------------|-------| | ||
| | cycle1 | 78 | 80 | +2 mockdata functions | | | ||
| | cycle1_meds | 76 | 83 | +7 mockdata functions | | | ||
| | cycle2 | 78 | 80 | +2 mockdata functions | | | ||
| | cycle2_meds | 76 | 83 | +7 mockdata functions | | | ||
| | cycle3 | 82 | 81 | -1 current | | | ||
| | cycle3_meds | 3 | 3 | 0 | | | ||
| | cycle4 | 82 | 82 | 0 | | | ||
| | cycle4_meds | 3 | 3 | 0 | | | ||
| | cycle5 | 78 | 78 | 0 | | | ||
| | cycle5_meds | 3 | 3 | 0 | | | ||
| | cycle6 | 77 | 77 | 0 | | | ||
| | cycle6_meds | 3 | 3 | 0 | | | ||
| | **TOTAL** | **639** | **656** | **+17** | | |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Table rows contain an extra leading pipe, which breaks Markdown table rendering. Remove the extra " |" so each row begins with a single pipe, for example: "| Cycle | Current | ..." and "| cycle1 | 78 | ...".
| | Cycle | Current | MockData Functions | Difference | Notes | | |
| |-------|---------|-------------------|------------|-------| | |
| | cycle1 | 78 | 80 | +2 mockdata functions | | | |
| | cycle1_meds | 76 | 83 | +7 mockdata functions | | | |
| | cycle2 | 78 | 80 | +2 mockdata functions | | | |
| | cycle2_meds | 76 | 83 | +7 mockdata functions | | | |
| | cycle3 | 82 | 81 | -1 current | | | |
| | cycle3_meds | 3 | 3 | 0 | | | |
| | cycle4 | 82 | 82 | 0 | | | |
| | cycle4_meds | 3 | 3 | 0 | | | |
| | cycle5 | 78 | 78 | 0 | | | |
| | cycle5_meds | 3 | 3 | 0 | | | |
| | cycle6 | 77 | 77 | 0 | | | |
| | cycle6_meds | 3 | 3 | 0 | | | |
| | **TOTAL** | **639** | **656** | **+17** | | | |
| |Cycle | Current | MockData Functions | Difference | Notes | | |
| |-------|---------|-------------------|------------|-------| | |
| |cycle1 | 78 | 80 | +2 mockdata functions | | | |
| |cycle1_meds | 76 | 83 | +7 mockdata functions | | | |
| |cycle2 | 78 | 80 | +2 mockdata functions | | | |
| |cycle2_meds | 76 | 83 | +7 mockdata functions | | | |
| |cycle3 | 82 | 81 | -1 current | | | |
| |cycle3_meds | 3 | 3 | 0 | | | |
| |cycle4 | 82 | 82 | 0 | | | |
| |cycle4_meds | 3 | 3 | 0 | | | |
| |cycle5 | 78 | 78 | 0 | | | |
| |cycle5_meds | 3 | 3 | 0 | | | |
| |cycle6 | 77 | 77 | 0 | | | |
| |cycle6_meds | 3 | 3 | 0 | | | |
| |**TOTAL** | **639** | **656** | **+17** | | |
| library(dplyr) | ||
|
|
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] dplyr is not used in this script; remove the unused library(dplyr) to avoid unnecessary dependencies and faster startup time.
| library(dplyr) |
| # Date: 2025-10-17 | ||
|
|
||
| library(readr) | ||
| library(dplyr) |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] dplyr is not used in this script; remove the unused library(dplyr) call to reduce unnecessary imports.
| library(dplyr) |
…stent according to MOCKDATA_COMPARISON (does not apply to anymed2 and diab_drug2)
|
Hello Doug, Using Claude, I edited the dummy data and Please let me know what you think and if you want, delete the Juan code in chmsflow that you added to create mock data. Rafidul |
…iableStart column)
…sflow into mock-data-review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also updated variable-details.csv now. Anyways, please let me know what you think of the updated dummy data and worksheets. Feel free to remove the dummy data generation code from here back to its original repo. By the way, I cannot rename branches until the PR is closed. Thanks.
|
Looks good. Let's close and merge.
|
|
Closed via manual merge to dev branch in commit 81f8ac5. What was merged:
MockData comparison results:
MockData functions:
Follow-up work:
|
Our current approach to generating dummy data for CHMS cycles uses manual specifications for each variable. I identified a few issues, which prompted me to use @JuanLiOHRI MockData repo as an alternative approach to generate data directly from variable.csv and variable_details.csv. This PR focuses just on the variables. After we sort variables, we'll review variable details.
I kept the R code for generating mock data. This code required refactoring with a few additions. We'll review with the view of incorporating the refactors to Juan's original repo.
What was implemented
Parsers (R/mockdata-parsers.R):
parse_variable_start(): Extracts cycle-specific variable names fromvariableStartmetadataparse_range_notation(): Handles range syntax like[7,9],[18.5,25),else[varname]represents DEFAULT for all cycles not explicitly overriddenHelpers (R/mockdata-helpers.R):
get_cycle_variables(): Filters metadata by exact cycle match (avoids "cycle1" matching "cycle1_meds")get_raw_variables(): Returns unique raw variables with harmonisation groupingsget_variable_details_for_raw(): Retrieves category specificationsget_variable_categories(): Extracts valid category codesGenerators (R/mockdata-generators.R):
create_cat_var(): Generates categorical variables with tagged NA supportcreate_con_var(): Generates continuous variables with realistic distributionsValidation tools (mockdata-tools/):
validate-metadata.R: R CMD check-style validator with 5 diagnostic checkstest-all-cycles.R: Integration test across all 12 CHMS cyclescreate-comparison.R: Side-by-side comparison of parsing approachesREADME.qmd: Comprehensive guide with 6 executable examplesTests (tests/testthat/test-mockdata.R):
Key findings
Key finding: Both approaches have valuable information
The comparison revealed 235 differences across 12 cycles, with CHMS documentation verification showing both approaches contribute accurate information:
What the hard-coded data got right ✓:
GEN_015,LFH_016(zero-padded)lab_altin cycle3 (confirmed in CHMS docs)lab_bpbin cycle5 (confirmed in environmental subsample)What the metadata got right ✓:
paadtotfrom cycles 1-2 (doesn't exist in those cycles)lab_altfrom cycle5 (excluded due to specimen freezing)What needs alignment:
gen_15→gen_015)lab_altto cycle3, addlab_bpbto cycle5paadtotfrom cycles 1-2, removelab_altfrom cycle5Context: CHMS data comes from SAS (case-insensitive, stores uppercase). Haven imports preserve uppercase. Pattern reflects different workflows - both valid.
See MOCKDATA_CRITICAL_FINDINGS.md for detailed verification results from CHMS documentation review.
Coverage comparison: MockData functions vs hard-coded approach
Patterns worth investigating:
Variables in different cycles:
paadtotin cycle1/cycle2 (hard-coded) vs cycle3-6 (metadata)lab_altin cycle3/cycle5 (hard-coded) vs cycle1/2/4/6 (metadata)lab_bpbin cycle3/4/5 (hard-coded) vs cycle1/2 (metadata)Variable naming format differences:
gen_015,gen_018,gen_020(zero-padded)gen_15,gen_18,gen_20(unpadded)lfh_016vslfh_16Variables in one approach only:
lbf_soc,gfvd17dy,imm_03anymed2,diab_drug2,thifimp4,img_03Medication cycle coverage:
Metadata quality validation:
databaseStartcycles validvariableStartentries parse successfullySuggested investigation workflow
This comparison provides a great opportunity to systematically verify both approaches against source CHMS data:
Step 1: Document import workflow (clarifies case pattern)
gen_015vsgen_15)Step 2: Verify cycle assignments (resolves paadtot, lab_alt, lab_bpb questions)
Step 3: Review variable coverage (clarifies scope)
anymed2,diab_drug2,thifimp4)Step 4: Reconcile findings (brings approaches into alignment)
Running validation tools: