Chunk data to an algorithm #3

knoepfel · 2024-12-16T22:19:49Z

knoepfel
Dec 16, 2024
Maintainer

DUNE US S&C R&D item 101

Data chunking is intended to process a logical data product that is too large to fit in memory at once. This demonstrator requires several things:

Ask DUNE for specific example
Establish interface for algorithms that want to take advantage of chunking
Start with vector of numbers and transform them to something else
Combine chunked algorithm results with a fold operation
Understand ramifications for DDL system and the IO system
Input arguments (e.g. std::span<T> vs. std::vector<T>) could imply that some data can be chunked for an algorithm and some cannot.
What about Python algorithms? Annotations or decorators?

To produce a demonstrator we are introducing the concept of a chunk-able data product (e.g. a sequence of waveforms). In general, a chunk-able data product will be a sequence of something.

Rule for defining data products that are chunk-able: we need to understand the nature of this virtual data product. What does it mean to be a sequence? Is the size known at the beginning, or only once the sequence is complete?
- Produce code (certainly C++, but possibly both C++ and Python?)
- Need to be able to represent in memory
The IO system will need the concept of a virtual data product that can be both read and written in chunks. The size of the chunks is under the control of the writer of the data product.
- Will need a mockup IO system
- Will not need the ability to read/write these in this demonstrator; we need to know enough to convey the requirements to the IO group/team
What the user will write
- A C++ function that expects a span of waveforms (std::span) and performs a partial accumulation; its output is an accumulator that is input to a fold (whose output is also an accumulator).
- In Python, we need to take care to avoid copying data (zero-copy for Python)
What is the plugin that the framework will load
- Declare a framework module whose input is a chunk-able sequence of waveforms, and state which reduction product is its output
The framework will
- Produce a compilation error if a data product is not chunk-able. This will not work with Python code, so we need some runtime support (maybe look into some boilerplate)
- Some runtime checks will be needed (e.g. when the specified chunk size is too big, or something else may not work)
- Put together a program that uses a flow graph and tasks to process the virtual data product in chunks.
Think about parallel processing of chunks (maybe pipelined)
How much of a scheduler will be needed? If we are using Meld, then we will already be relying on its scheduler. For standalone use, we need to figure this out.

sabasehrish · 2025-02-05T14:53:39Z

sabasehrish
Feb 5, 2025
Maintainer

Subsystems

Required

Registration
Task management
Mock I/O

Nice to have

Algorithm description
I/O
Configuration
Plugin management
Logging

knoepfel · 2025-02-21T22:09:40Z

knoepfel
Feb 21, 2025
Maintainer Author

Related to:

knoepfel · 2025-02-24T15:18:19Z

knoepfel
Feb 24, 2025
Maintainer Author

Responsible developers: @marcpaterno and @sabasehrish.

brettviren · 2025-04-30T18:46:04Z

brettviren
Apr 30, 2025

Here are some types of chunking needed for DUNE and what implications chunking may have. It has a focus on FD charge data and Wire-Cell implementations so is definitely not comprehensive.

Terms

TPC : a contiguous sub-detector unit with a single "face" of electrodes in which drifting ionization electrons induce current which is then measured. Ionization electron signal originating in a given TPC induces current only in that TPC's face.
APA : Back-to-back TPCs with electrodes from both faces providing induced current to a single electronics channel.
TR : DUNE trigger record, which comes in two basic time durations (nominal 3-5ms and the 100s long "extended" supernova neutrino burst (SNB) candidate) and a variety of spatial extents (providing data from anywhere between 1 and 150 APAs for DUNE FD HD).

Wire-Cell charge waveform simulation

Basic transformation:

(depos) -> [sim] -> (ADC waveforms)

The [sim] (as implemented by Wire-Cell Toolkit) is itself a (data flow) graph that includes these major nodes:

drifting of depos
convolution of drifted charge distribution with detector response
addition of noise
digitization

There are several types of chunking relevant to this simulation:

time : Input (depos) may be chunked into time bins, each group fed into [sim] and the resulting (ADC waveforms) regrouped. The time duration of the output waveforms is longer than the time duration of the input depos by an amount governed by the detector response, which can be O(1ms). Thus, overlaps between neighboring waveform chunks are formed and must be summed. This chunking is only required for "extended" FD trigger records (eg SNB candidates).
space : The [sim] operates independently on each TPC (in principle, on each plane of a TPC face). The output (ADC waveforms) are naturally chunked by TPC or APA. The input (depos) can likewise be pre-chunked; however, [sim] will only consume the subset of depos relevant to a given TPC (or APA). There is no concern about overlap across APAs, but combining waveforms from two TPCs in an APA requires a sum that is aware of an electrode-channel mapping.
source : Here, "source" means the physical process (and the implementing code/job) that produced a set of (depos). Some jobs may have a single source of depos (eg, just neutrino interactions) and some may need to properly "mix" different sources (eg neutrinos + cosmic muons + radiologicals). The mixing can be done to produce the input (depos), or multiple sets of (depos) can be input to [sim], which can then properly do the mixing. If mixing is done prior, it poses no "source chunking" issue to [sim]. If [sim] does the mixing, chunking is handled in a "streaming" algorithm, which means internal buffering and that feeding input (depos) sets must not "starve" any stream.
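The summing of overlaps described under "time" can be sketched for a single channel as an overlap-add. This is only an illustration under stated assumptions: WaveformChunk and overlap_add are hypothetical names, samples are plain doubles, and each chunk is assumed to record the tick at which it starts in the full readout. It is not taken from the Wire-Cell Toolkit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-channel output chunk: the tick where it starts in the
// full readout, plus its samples (nominal duration plus response tail).
struct WaveformChunk {
  std::size_t start_tick;
  std::vector<double> samples;
};

// Regroup the chunks into one waveform, summing samples wherever
// neighboring chunks overlap in time.
std::vector<double> overlap_add(std::vector<WaveformChunk> const& chunks) {
  std::size_t total = 0;
  for (auto const& c : chunks) {
    total = std::max(total, c.start_tick + c.samples.size());
  }
  std::vector<double> out(total, 0.0);
  for (auto const& c : chunks) {
    for (std::size_t i = 0; i < c.samples.size(); ++i) {
      out[c.start_tick + i] += c.samples[i];
    }
  }
  return out;
}
```

This version needs every output chunk in memory at once, which is what time chunking is meant to avoid for extended trigger records; it is shown only to make the overlap arithmetic explicit.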

Wire-Cell charge waveform signal processing

Basic transformation:

(ADC waveforms) -> [sigproc] -> (signal waveforms)

Signal waveforms represent a reconstruction of the distribution of drifted ionization charge in (transverse) space vs time dimensions of each tomographic wire-plane view. The samples of a signal waveform are in units of number of (drifted) ionization electrons per tick per channel. The signal waveforms are highly sparse and can be represented in a space-efficient way either with sparse arrays or as compressed dense arrays (zero padding the sparse regions).

There are two types of chunking that are relevant:

time : ADC waveform blocks (from one TPC or APA) that are longer than about 10ms become uncomfortably large for signal processing. The nominal DUNE FD TR is 3-5ms. However the "extended" (SNB candidate) TRs are 100s and must be chunked in time. Like [sim], the [sigproc] transformation produces output chunks that have longer duration in time than the input chunks and combining the output must take into account summing the overlap.
space : In contrast to [sim], which operates per TPC, the [sigproc] transformation operates independently at the level of one APA (not one TPC).

Wire-Cell charge sim+sigproc

As a special case, when both simulation and signal processing are needed, it is desired (at least for large scale production) to NOT expose the (ADC waveform) data tier to any persistence (file or memory), and so a combined transformation is:

(depos) -> [sim+sigproc] -> (signal waveforms)

Wire-Cell 3D charge imaging

(signal waveforms) -> [img] -> (blobs)

This process reconstructs, with coarse resolution, locations in space/time likely to contain ionization electron signal. It is a per-APA transformation and essentially a streaming algorithm. Thus, it is robust against space-chunking at the APA level and against any reasonable time-chunking.

Wire-Cell charge cluster stitching

(blobs) -> [clus] -> (clusters)

WC (and other) reconstruction chains form "clusters" of some type that represent high resolution reconstruction of ionization locations.

In WC and for the case of compact (nominal, not extended) data, clusters are constructed first on a per-TPC basis. They are then "stitched" across the two TPCs of one APA and then across neighboring APAs. Each type of stitching requires assembly of any chunk-level clusters such that the boundaries are spanned. This can be pair-wise at the 2TPC->APA stitching and then all APA level clusters can be assembled for the cross-APA stitching.

Finding clusters in extended data poses a problem in the face of chunking, because a set of blobs that should become a single cluster may land on a chunk boundary. Some possible solutions:

Chunk and hope. Clusters on either side of the chunk boundary may be formed from the surviving blobs. This formation will suffer some inefficiency in some cases. Stitching across chunk boundary is needed.
Streaming cluster algorithm. Rework clustering reco to consume blobs in a time-ordered stream, buffering as needed and making smart determination. Blob time chunks can be of arbitrary size.

Wire-Cell Charge-Light matching

Charge clusters and "flashes" reconstructed from the optical detection system must be matched in space and time in order to absolutely locate the cluster.

(clusters) -+
            |
            +-> [q-l match] -> (clusters)
            |
(flashes)  -+

The DUNE FD design does not include optical boundaries at the TPC or APA level, and so the matching is done with whole-detector charge and light information. Any prior chunking of these data must be such as to allow the required assembly.

Like with clustering, chunking in time may be required for input clusters and/or flashes and similar solutions can be considered ("chunk and hope" vs "streaming alg").

Cross-chain merging

DUNE has multiple, independent reco chains. Eg, Wire-Cell and Pandora both split off after signal processing in order to implement different strategies. It is necessary to allow data products from one chain to "cross over" to another. This is needed for performing comparisons and so that one chain can simply take results from the other as input to form a subsequent hybrid chain. Each consumer at the merge will impose some requirements related to the chunk boundaries of the data products from each stream. Even in the unlikely case that identical chunk boundaries exist on both streams, the node consuming the two streams may have special needs. Eg, it may need to consume a FIFO queue of some depth of data products from each stream.

3 replies

marcpaterno May 20, 2025
Maintainer

Here are some types of chunking needed for DUNE and what implications chunking may have. It has a focus on FD charge data and Wire-Cell implementations so is definitely not comprehensive.

Terms

TPC : a contiguous sub-detector unit with a single "face" of electrodes in which drifting ionization electrons induce current which is then measured. Ionization electron signal originating in a given TPC induces current only in that TPC's face.

APA : Back-to-back TPCs with electrodes from both faces providing induced current to a single electronics channel.

TR : DUNE trigger record, which comes in two basic time durations (nominal 3-5ms and the 100s long "extended" supernova neutrino burst (SNB) candidate) and a variety of spatial extents (providing data from anywhere between 1 and 150 APAs for DUNE FD HD).

Wire-Cell charge waveform simulation

Basic transformation:
(depos) -> [sim] -> (ADC waveforms)
The [sim] (as implemented by Wire-Cell Toolkit) is itself a (data flow) graph that includes these major nodes:

drifting of depos

convolution of drifted charge distribution with detector response

addition of noise

digitization

There are several types of chunking relevant to this simulation:

time : Input (depos) may be chunked into time bins, each group fed into [sim] and the resulting (ADC waveforms) regrouped. The time duration of the output waveforms is longer than the time duration of the input depos by an amount governed by the detector response, which can be O(1ms). Thus, overlaps between neighboring waveform chunks are formed and must be summed. This chunking is only required for "extended" FD trigger records (eg SNB candidates).

In the attached document, we have tried to describe this part of a possible workflow with the terminology used in Phlex.
wirecell-charge-waveform-sim-doc.pdf
We would like to use this document to improve our understanding of the workflow and to determine what kinds of higher order functions need to be provided by Phlex. We have two main questions:

Does the document capture the general idea of the workflow correctly?
If it does, does the fold at the end of this workflow suffice? Or do you need something we'd call a "windowed fold", which would present two consecutive NoisyConvolvedDepos to the digitize algorithm at a time (perhaps first NCD1 and NCD2, then NCD2 and NCD3, etc.) so that "edge effects" between time bins can be mitigated?
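To make the "windowed fold" idea concrete, here is a minimal generic sketch. It is not a Phlex interface: the name windowed_fold, its signature, and the use of std::vector as the sequence type are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical windowed fold: like an ordinary fold, but the operation sees
// two consecutive elements at a time: (x0, x1), then (x1, x2), and so on.
// A sequence with fewer than two elements yields the initial accumulator.
template <typename T, typename Acc, typename Op>
Acc windowed_fold(std::vector<T> const& items, Acc acc, Op op) {
  for (std::size_t i = 0; i + 1 < items.size(); ++i) {
    acc = op(std::move(acc), items[i], items[i + 1]);
  }
  return acc;
}
```

With N input elements the operation is invoked N-1 times, and every interior element is seen twice, once as the right-hand and once as the left-hand member of a window. An operation that must treat the boundary between two time bins specially therefore always has both neighbors available.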

brettviren May 20, 2025

Hi @marcpaterno

There are a few things that are a little off. Hopefully the following helps:

I know this is just an exercise to get a feel but I think it remains to be understood how granular the WC flow graphs should appear to phlex's flow graph. My feeling is that my ASCII diagram above is the right granularity - ie, the WC sim is a single "black box" to phlex. And really, we may want sim+sigproc to be combined to avoid exposing the large intermediate waveform blocks (more on that below).
The data product types of an input depo and a drifted depo are identical. The size of the set of drifted depos will be no larger than the size of the set of input depos. The drifted set may be smaller as any input depos that are not in the active volume of a detector are dropped.
The drifting can (and does) handle a "chunk streamed" data flow so one need not have all depos in memory at once. To handle the causality of drifting, the drifter must buffer some depos long enough to know when some subset is safe to be released as output. This requires input depos to be provided in time order. Output depos are also time ordered (which is a different order compared to input because of varying drift distance).
Figure 2 is not sufficient because the convolution of one chunk of DriftedDepos leads to the ConvolvedDepos becoming extended in time such that there is "overlap" with the next ConvolvedDepos in time. The time duration of the overlap is at least as long as the duration of the convolution kernel. If a really long RC response is relevant, it is possible for the overlap duration to be even longer than a nominal choice for the chunk duration. The overlap should be snipped off its ConvolvedDepos and added to the next one prior to adding noise. Thus something must sit between "ConvolvedDepos sequence" and "NoisyConvolvedDepos sequence" which is allowed to process the input sequence, err, sequentially. I note that this sequencing requirement is similar to the one for the drifter.
In most types of production jobs we will want to NOT write out DigitizedWaveforms but instead follow WC sim immediately with WC sigproc. And, DUNE uses this sim+sigproc mode now. It is used so that we write out a much smaller data product (signals - reconstructed ionization electron distribution). This is also the data product (not ADCs) that begins ALL branches of downstream reconstruction. Some special purpose, smaller scale, jobs may need to write out sim DigitizedWaveforms. This all doesn't qualitatively change the picture you paint. It just lengthens the graph.
The term "time bin" is not clear to me. If it means the chunk duration (say, 1-10 ms), no worries. If it means "tick" (~500ns sampling period) then big worries.
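The "snip the overlap off and add it to the next chunk" step in the Figure 2 item can be sketched as a small stateful node that must see chunks in time order. This is a single-channel toy with hypothetical names (OverlapCarrier, next, pending); it is not Wire-Cell or Phlex code, and it assumes each convolved chunk starts exactly chunk_ticks after the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential node sitting between convolution and noise.
class OverlapCarrier {
public:
  explicit OverlapCarrier(std::size_t chunk_ticks) : chunk_ticks_(chunk_ticks) {}

  // Input: convolved samples for one chunk (nominal length plus a tail).
  // Output: exactly chunk_ticks samples, including the tail carried over
  // from earlier chunks. The new tail is kept for the next call.
  std::vector<double> next(std::vector<double> convolved) {
    assert(convolved.size() >= chunk_ticks_);
    // The carried tail may be longer than this chunk (long RC response).
    if (carry_.size() > convolved.size()) {
      convolved.resize(carry_.size(), 0.0);
    }
    for (std::size_t i = 0; i < carry_.size(); ++i) {
      convolved[i] += carry_[i];
    }
    auto const tail_begin =
      convolved.begin() + static_cast<std::ptrdiff_t>(chunk_ticks_);
    carry_.assign(tail_begin, convolved.end());
    convolved.resize(chunk_ticks_);
    return convolved;
  }

  // Tail still waiting to be added to a future chunk.
  std::vector<double> const& pending() const { return carry_; }

private:
  std::size_t chunk_ticks_;
  std::vector<double> carry_;
};
```

Because the carried tail is state shared between consecutive calls, this node cannot process chunks in parallel, which mirrors the sequencing requirement noted for the drifter. At the end of the stream, anything left in pending() still has to be emitted or deliberately dropped.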

sabasehrish Jun 25, 2025
Maintainer

Thanks @brettviren. @marcpaterno and I have updated the document based on your feedback; please see the attached updated copy. Does the updated document capture the general idea of the workflow correctly?

wirecell-charge-waveform-sim-doc.pdf

knoepfel · 2025-05-01T15:55:52Z

knoepfel
May 1, 2025
Maintainer Author

To provide some context, we discussed these slides at yesterday's meeting to start the discussion.

marcpaterno · 2025-07-28T14:27:34Z

marcpaterno
Jul 28, 2025
Maintainer

Hi @brettviren. Saba and I are just getting back to looking at the window function after weeks of finalizing the document for the design review and then a week of design team retreat. Have you had a chance to look at the updated workflow document we posted above? We believe it is closer to what you need, and are interested in what additional adjustments might be necessary.

