Description
Describe the bug
When handling a report, MegaQC loops over each data value and checks whether a matching `SampleDataType` already exists. However, it only checks on the basis of `data_id` and ignores `data_section`. Therefore, if multiple report types (data sections) reuse the same `data_id`, MegaQC currently reuses the existing `SampleDataType` even if its `data_section` is wrong for the incoming report.
This becomes problematic if you want to query for historic results based on data_section.
This is due to this code, which
1. Checks `sample_data_type` to see if the field's name has been seen before
2. If it has NOT been seen before, creates a new entry with `data_key = "{}__{}".format(section, d_key)`.
But in step (1) it will reuse any key matching `d_key`, even if `section` does not match.
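The two steps above can be sketched with a plain Python list standing in for the `sample_data_type` table. This is not MegaQC's actual code: the function names `find_or_create_by_id` and `find_or_create_by_id_and_section` are hypothetical, and only the `data_key` format string is taken from the report. The first function matches on `data_id` alone, as described; the second also requires `data_section` to match.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]  # hypothetical stand-in for one sample_data_type row


def _create(table: List[Row], section: str, d_key: str) -> int:
    # Same data_key format as quoted in the report.
    table.append({
        "data_id": d_key,
        "data_section": section,
        "data_key": "{}__{}".format(section, d_key),
    })
    return len(table) - 1


def find_or_create_by_id(table: List[Row], section: str, d_key: str) -> int:
    """Lookup as described in the report: match on data_id only."""
    for idx, row in enumerate(table):
        if row["data_id"] == d_key:
            return idx  # reused even when row["data_section"] != section
    return _create(table, section, d_key)


def find_or_create_by_id_and_section(table: List[Row], section: str, d_key: str) -> int:
    """One possible fix (an assumption): match on data_id AND data_section."""
    for idx, row in enumerate(table):
        if row["data_id"] == d_key and row["data_section"] == section:
            return idx
    return _create(table, section, d_key)


# The two reports from the reproduction, in submission order.
REPORTS = [
    ("Pipeline_A_Result-plot", ["patient_id", "variant_count"]),
    ("Pipeline_B_Result-plot", ["patient_id", "pvalue"]),
]


def ingest(lookup: Callable[[List[Row], str, str], int]) -> Tuple[List[Row], List[int]]:
    """Run every field of every report through the given lookup."""
    table: List[Row] = []
    type_ids = [lookup(table, section, key) for section, keys in REPORTS for key in keys]
    return table, type_ids


buggy_table, buggy_ids = ingest(find_or_create_by_id)
fixed_table, fixed_ids = ingest(find_or_create_by_id_and_section)
print(buggy_ids)  # type ids assigned with the data_id-only lookup
print(fixed_ids)  # type ids assigned when data_section must also match
```

With the `data_id`-only lookup, the `patient_id` value from Pipeline B is assigned the type row that was created for Pipeline A; with the stricter lookup it gets its own row.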
To Reproduce
Here is a barebones multiqc_config and set of report files that can reveal the issue.
multiqc_config.yaml
custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"
A_report.csv (generated by Pipeline A)
sample_id,patient_id,variant_count
sample_1,patient_1,10
B_report.csv (generated by Pipeline B)
sample_id,patient_id,pvalue
sample_2,patient_2,0.0001
Steps:
- Run pipeline A and submit its data to MegaQC
- Run pipeline B and submit its data to MegaQC
MegaQC erroneously records every `patient_id` value as coming from `Pipeline_A_Result`, even though one of them comes from `Pipeline_B_Result`.
Specifically, the `sample_data` and `sample_data_type` tables will look like this:
sample_data_type
sample_data_type_id | data_id | data_section | data_key | schema |
---|---|---|---|---|
0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
sample_data
sample_data_id | report_id | sample_data_type_id | sample_id | value |
---|---|---|---|---|
0 | 0 | 0 | 0 | patient_1 |
1 | 0 | 1 | 0 | 10 |
2 | 1 | 0 (*) | 1 | patient_2 |
3 | 1 | 2 | 1 | 0.0001 |
* NOTE: `sample_data_type_id=0` refers to `data_section=Pipeline_A_Result-plot`, even though this value actually came from `Pipeline_B`.
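To illustrate why this matters for queries on `data_section`, here is a minimal sketch that loads the two tables above into an in-memory SQLite database and selects values per section. The schema is a simplified assumption (the `schema` column is omitted and all values are stored as text), and `values_for_section` is a hypothetical helper, not part of MegaQC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema mirroring the columns shown in the tables above.
conn.executescript("""
CREATE TABLE sample_data_type (
    sample_data_type_id INTEGER PRIMARY KEY,
    data_id TEXT,
    data_section TEXT,
    data_key TEXT
);
CREATE TABLE sample_data (
    sample_data_id INTEGER PRIMARY KEY,
    report_id INTEGER,
    sample_data_type_id INTEGER,
    sample_id INTEGER,
    value TEXT
);
""")

# Rows exactly as they appear in the observed (buggy) state.
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),  # Pipeline B value attached to a Pipeline A type
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(section):
    """Return (data_id, value) pairs whose type row has the given data_section."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON d.sample_data_type_id = t.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section("Pipeline_A_Result-plot")
section_b = values_for_section("Pipeline_B_Result-plot")
print(section_a)
print(section_b)
```

In this state, the query for `Pipeline_A_Result-plot` also returns `patient_2`, and the query for `Pipeline_B_Result-plot` returns no `patient_id` at all.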
Expected behavior
`data_id='patient_id'` will appear in two separate `sample_data_type` rows, once with `data_section='Pipeline_A_Result-plot'` and once with `data_section='Pipeline_B_Result-plot'`:
sample_data_type_id | data_id | data_section | data_key | schema |
---|---|---|---|---|
0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
System
- MegaQC: 0.3.0