Skip to content

questions regarding raw signal analysis #212

@jorisbalc

Description

@jorisbalc

Hi Marcus,

I've got a couple of questions regarding the api usage and raw signal analysis. Sorry in advance if some of the questions are super basic as I'm not too familiar with the remora api yet.

  1. The region_basecalls = io_read.extract_ref_reg(ref_reg = ref_region, signal_type="norm") returns an object, sufficient for mapping raw signal data as it contains the seq_to_sig array. I'm wondering how these values are transformed as I've analzyed the pA values from sample to sample and my modified sample has roughly ~2pA higher values on average, comapred to non-modified sample (in non-modified sites, such as explicitly selected large kmers of 50-60 bases in length without any CG sites). Is it possible for such difference is caused by the samples being run at different times or even cells or would this strictly be caused by the DNA sample being modified with bulky mods at every CG site? I also wonder what's the best properly transform the norm signal values to represent something like the printout of remora analyze command.
  2. I've encountered some reads, where the seq_to_sig array doesn't make any sense to me. The only logical conclusion I've come to is that somehow the strand "skips" and goes through way faster than intended? I'm posting a single example that gets returned by <io_read.extract_ref_reg>. In this particular case, I'm looking at 31-mers centered at CG sites. Would you perhaps help in understanding how such a read happens? Also, what exactly does sig_start denote?
read = io_read.extract_ref_reg()
ReadRefReg(
    read_id='784882a1-b632-4512-8322-6a7227f6183e',
    norm_signal=array([
        55827.85090193, 55757.59090376, 55944.95089888, 55804.43090254,
        55921.53089949, 55687.33090559, 55593.65090803, 55289.19091596,
        55289.19091596, 55476.55091108, 55804.43090254, 56272.83089034,
        55476.55091108, 54446.07093792, 55429.7109123 , 55218.93091779,
        55687.33090559, 55640.49090681, 55195.5109184 , 55008.15092328,
        55781.01090315, 55382.87091352, 55851.27090132, 56249.41089095,
        56296.25088973, 56553.87088302, 59504.79080615, 60160.55078907,
        60137.13078968, 59715.57080066, 59434.53080798, 59668.73080188,
        59738.99080005, 59715.57080066, 59879.51079639, 60020.03079273,
        59668.73080188, 58614.83082933, 57279.89086411, 56577.29088241,
        56577.29088241, 56670.97087997, 56319.67088912, 59434.53080798,
        59200.33081408, 59668.73080188, 59738.99080005, 59762.41079944,
        59481.37080676, 59668.73080188, 59106.65081652, 59528.21080554,
        59387.6908092 , 60160.55078907, 59457.95080737, 59973.19079395,
        59598.47080371, 59598.47080371, 59317.43081103, 59340.85081042,
        59247.17081286, 59926.35079517, 59856.090797  , 60020.03079273,
        59645.31080249, 59598.47080371, 59130.07081591, 59645.31080249,
        59996.61079334, 59949.77079456, 60301.07078541, 59926.35079517,
        59856.090797  , 60066.87079151, 59598.47080371, 60020.03079273,
        59996.61079334, 59528.21080554, 59575.05080432, 59692.15080127,
        59387.6908092 , 59645.31080249, 59528.21080554, 59902.93079578,
        59364.27080981, 59270.59081225, 57537.5108574 , 55429.7109123 ,
        55336.03091474, 55593.65090803, 56108.89089461, 56670.97087997,
        54422.65093853, 53579.53096049, 54399.23093914, 55172.09091901,
        55289.19091596, 55031.57092267, 55195.5109184 , 55218.93091779,
        55499.97091047, 54797.37092877, 55078.41092145, 55265.77091657,
        54961.3109245 , 54820.79092816, 55054.99092206, 55617.07090742,
        55593.65090803, 55734.17090437, 55921.53089949, 55663.9109062 ,
        55734.17090437, 55804.43090254, 55546.81090925, 55781.01090315,
        55382.87091352, 55406.29091291, 55359.45091413, 55944.95089888,
        55382.87091352, 55663.9109062 , 58029.33084459, 63579.87070001,
        63533.03070123
    ]),
    seq='AGTTAATGAGTGTGTCGAAAAACCTGAAATC',
    seq_to_sig_map=array([
        0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
        6, 7, 7, 8, 8, 9, 9, 10, 11, 23, 29, 41, 53,
        83, 89, 95, 113, 119, 125
    ]),
    ref_reg=RefRegion(ctg='JB0001.0', strand='+', start=17711, end=17742),
    sig_start=22687
)
  1. We noticed some modified reads have 20-50% more <.norm_signal> entries. I assume this correlates to dwell time? In any case, is dwell time information stored in the <io_read> object?
  2. Is the <ReadRefReg.seq> the basecalled sequence or the ref? I assume it's the basecalled seq.

As always, I appreciate any help on this

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions