Commit a538220

docs built + file introspection

1 parent ff826de commit a538220

47 files changed: +5960 -12 lines

Binary build artifacts not shown (33.3 KB, 24.7 KB, 8.3 KB, 8.11 KB, 21.5 KB), including docs/_build/doctrees/index.doctree.

docs/_build/html/.buildinfo

Lines changed: 4 additions & 0 deletions

# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: ca3626de684797951ef03443c45ac66f
tags: 645f666f9bcd5a90fca523b33c5a78b7
Lines changed: 167 additions & 0 deletions

# Remote File Introspection

The `get_structure()` function allows users to query and inspect the internal structure of datasets available through ServiceX. This is useful for determining which branches exist in a given dataset before running a full transformation with the correct branch labelling and typing.

It is also useful for lightweight exploration when only metadata or structure information is required, without fetching event-level data.

---

## Overview

The function internally issues a ServiceX request, using the Python-function backend, for the specified dataset(s) and returns a simplified summary of the file structure, such as branches and types, in a string formatted for readability.

It supports both programmatic and command-line usage, with configurable return types.

---

## Function

```python
get_structure(datasets, array_out=False, **kwargs)
```

**Parameters:**

- `datasets` (`dict`, `str`, or `list[str]`): One or more datasets to inspect, intended for Rucio DIDs. If a dictionary is used, its keys are used as labels for each dataset in the output string.
- `array_out` (`bool`): If `True`, empty awkward arrays are reconstructed from the structure information and the function returns a dictionary of `ak.Array.type` objects. This allows programmatic access to the dataset structure, which can be further manipulated.
- `**kwargs`: Additional arguments forwarded to the helper function `print_structure_from_str`, such as `filter_branch` to filter the displayed branches, `do_print` to print the output during the function call, or `save_to_txt` to save the output to `samples_structure.txt`.

**Returns:**

- `str`: The formatted file-structure string.
- `None`: If `do_print` or `save_to_txt` is `True`, the function prints or saves the output instead.
- `dict`: If `array_out=True`, keys are sample names and values are `ak.Array.type` objects mirroring the dataset structure.
---

## Command-Line Usage

The function is also available as a CLI tool:

```bash
$ servicex-get-structure "scope:dataset-rucio-id" --filter-branch "el_"
```

This dumps to the shell a summary of the dataset structure, filtered to branches whose names contain `"el_"`.

```bash
$ servicex-get-structure "scope:dataset-rucio-id1" "scope:dataset-rucio-id2" --filter-branch "el_"
```

This outputs a combined summary of both datasets with the same filter.
---

### Practical Output Example

Command:

```bash
$ servicex-get-structure user.mtost:user.mtost.all.Mar11 --filter-branch el_pt
```

Shell output:

```
File structure of all samples with branch filter 'el_pt':

---------------------------
📁 Sample: user.mtost:user.mtost.all.Mar11
---------------------------

🌳 Tree: EventLoop_FileExecuted
├── Branches:

🌳 Tree: EventLoop_JobStats
├── Branches:

🌳 Tree: reco
├── Branches:
│   ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│   ├── el_pt_EG_RESOLUTION_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│   ├── el_pt_EG_RESOLUTION_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│   ├── el_pt_EG_SCALE_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│   ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
```

The output lists all trees and branch names matching the specified filter pattern for each requested dataset. It also shows the branch data-type information as interpreted by `uproot`, including the vector nesting level (jagged arrays) and the base type (e.g., `>f4` for big-endian 32-bit floats).

#### JSON input

A JSON file can be used as input to simplify the command for multiple samples:

```bash
$ servicex-get-structure "path/to/datasets.json"
```

With `datasets.json` containing:

```json
{
  "Signal": "mc23_13TeV:signal-dataset-rucio-id",
  "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1",
  "Background Z+jets": "mc23_13TeV:background-dataset-rucio-id2",
  "Background Drell-Yan": "mc23_13TeV:background-dataset-rucio-id3"
}
```
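The same labelled mapping can also be prepared programmatically with the standard library and passed to `get_structure()`, whose dictionary keys become the sample labels. A minimal sketch (the `get_structure` call itself needs a live ServiceX endpoint, so it is left as a comment; the file contents are the example above, shortened):

```python
import json
from pathlib import Path

# Hypothetical file mirroring the datasets.json example above (shortened)
Path("datasets.json").write_text(
    '{"Signal": "mc23_13TeV:signal-dataset-rucio-id",'
    ' "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1"}'
)

# Load the label -> DID mapping
datasets = json.loads(Path("datasets.json").read_text())

# The keys label each dataset in the output:
# get_structure(datasets, filter_branch="el_")
print(sorted(datasets))  # → ['Background W+jets', 'Signal']
```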
---

## Programmatic Example

As with the CLI, the output string containing the dataset structure can be retrieved programmatically:

```python
from servicex_analysis_utils import get_structure

# Retrieve the structure of a specific dataset
file_structure = get_structure("mc23_13TeV:some-dataset-rucio-id")
```
### Other options

With `do_print` and `save_to_txt`, the dataset-structure string can instead be routed to stdout or to a text file in the running path.

```python
from servicex_analysis_utils import get_structure

# Directly dump the structure to stdout
get_structure("mc23_13TeV:some-dataset-rucio-id", do_print=True)
# Save to samples_structure.txt
get_structure("mc23_13TeV:some-dataset-rucio-id", save_to_txt=True)
```

#### Return awkward array type

If `array_out` is set to `True`, the function reconstructs dummy arrays with the correct structure and returns their `ak.Array.type` objects.

```python
from servicex_analysis_utils import get_structure

DS = {"sample1": "user.mtost:user.mtost.all.Mar11"}
ak_type = get_structure(DS, array_out=True)

rec = ak_type["sample1"].content  # get the RecordType

# Find the index of the reco tree and the runNumber branch
reco_idx = rec.fields.index("reco")
branch_idx = rec.contents[reco_idx].fields.index("runNumber")

print("Type for branch 'runNumber':", rec.contents[reco_idx].contents[branch_idx])
```

Output:

```
Type for branch 'runNumber': var * int64
```

---

## Notes

- The function does not retrieve event data, only structure/metadata.
- CLI output is printed directly to stdout but can be redirected to a file with ` > structure_summary.txt`.
- Many types will show as `None` or unknown when they cannot be interpreted by `uproot` or fail to be reconstructed into awkward arrays.
Lines changed: 39 additions & 0 deletions

# ServiceX Analysis Utilities

This package provides tools for interacting with [ServiceX](https://github.com/ssl-hep/ServiceX_frontend), a data extraction, transformation, and delivery system built for ATLAS and CMS analyses on large datasets. The Analysis Utils package offers helper functions that streamline the usage of ServiceX and simplify its integration into workflows. It also contains tools for specific use cases that benefit from the service.

---

## Installation

Install the package from PyPI:

```bash
pip install servicex-analysis-utils
```

More information can be found in [Installation and requirements](installation.md).

## Documentation Contents

```{toctree}
:maxdepth: 2
:caption: Documentation Contents:

installation
materialization
file_introspecting
```

---

## Utility Functions

### `to_awk()`

Load an Awkward Array from ServiceX output easily.

See detailed usage here: [Materialization documentation](materialization.md)

### `get_structure()`

Create and send ServiceX requests to retrieve file structures, with a CLI implementation.

See detailed usage here: [File introspection documentation](file_introspecting.md)
Lines changed: 42 additions & 0 deletions

# Installation

This section provides instructions for installing the ServiceX Analysis Utilities package.

## Prerequisites

Before installing, ensure the following requirements are satisfied:

- Python 3.9 or higher is installed.
- `pip` is updated to the latest version (`pip install --upgrade pip`).
- Access to a ServiceX endpoint is granted.
- A valid `servicex.yaml` configuration file is on your local machine.

For instructions on setting up ServiceX, refer to the [ServiceX Installation Guide](https://servicex-frontend.readthedocs.io/en/stable/connect_servicex.html).

## Installation from PyPI

The package is available on PyPI and can be installed via:

```bash
pip install servicex-analysis-utils
```

## Installation from Source

Alternatively, the package can be installed from the GitHub repository:

```bash
git clone https://github.com/ssl-hep/ServiceX_analysis_utils.git
cd ServiceX_analysis_utils
pip install .
```

## Verifying the Installation

After installation, you can verify that the package is accessible by running:

```bash
python -c "import servicex_analysis_utils"
```

No output indicates a successful installation.
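If the import succeeds silently, the installed version can additionally be checked with the standard library, assuming the PyPI distribution name `servicex-analysis-utils` used above:

```python
from importlib.metadata import version, PackageNotFoundError

# Report the installed distribution version, if any
try:
    print(version("servicex-analysis-utils"))
except PackageNotFoundError:
    print("servicex-analysis-utils is not installed")
```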
Lines changed: 136 additions & 0 deletions

# Materialization of delivered data

The `to_awk()` function provides a streamlined method to materialize the output of a ServiceX `deliver()` call into Awkward Arrays, Dask arrays, or iterators.

This simplifies workflows by allowing easy manipulation of the retrieved data in various analysis pipelines, as in the examples below.

---

## Overview

The `to_awk()` function loads data from the deliver output dictionary, supporting both ROOT (`.root`) and Parquet (`.parquet` or `.pq`) file formats.

It provides flexible options for:

- Direct loading into Awkward Arrays.
- Lazy loading using Dask for scalable operations.
- Returning iterator objects for manual control over file streaming.

## Function

```python
to_awk(deliver_dict, dask=False, iterator=False, **kwargs)
```

**Parameters:**

- `deliver_dict` (dict): Dictionary returned by `servicex.deliver()`. Keys are sample names, values are file paths or URLs.
- `dask` (bool, optional): If True, loads files lazily using Dask. Default is False.
- `iterator` (bool, optional): If True and not using Dask, returns iterators instead of materialized arrays. Default is False.
- `**kwargs`: Additional keyword arguments passed to `uproot.dask`, `uproot.iterate`, `dak.from_parquet`, or `awkward.from_parquet`.

**Returns:**

- `dict`: A dictionary where keys are sample names and values are either Awkward Arrays, Dask arrays, or iterators. It keeps the same structure as the `deliver` output dict.

---
## Usage Examples

### Simple Materialization

Load ServiceX deliver results directly into Awkward Arrays:

```python
from servicex_analysis_utils import to_awk
from servicex import query, dataset, deliver

spec = {
    "Sample": [
        {
            "Name": "simple_transform",
            "Dataset": dataset.FileList(
                ["root://eospublic.cern.ch//eos/opendata/atlas/rucio/data16_13TeV/DAOD_PHYSLITE.37019878._000001.pool.root.1"]  # noqa: E501
            ),
            "Query": query.FuncADL_Uproot()
            .FromTree("CollectionTree")
            .Select(lambda e: {"el_pt": e["AnalysisElectronsAuxDyn.pt"]}),
        }
    ]
}

arrays = to_awk(deliver(spec))
```

### Lazy Loading with Dask

Load results lazily for large datasets using Dask task graphs. This enables parallel execution across multiple workers.

```python
import dask_awkward as dak

dask_arrays = to_awk(deliver(spec), dask=True)
el_pt_array = dask_arrays["simple_transform"]["el_pt"]
mean_el_pt = dak.mean(el_pt_array).compute()
```

### Using Iterators

Return iterators instead of materialized arrays to avoid loading too much data into memory. Requires `dask=False` (the default). Example with loading 10,000 events per chunk:

```python
iterables = to_awk(deliver(spec), iterator=True, step_size=10000)
```

You can then manually loop over the data chunks:

```python
for chunk in iterables['simple_transform']:
    # process a small chunk (~10k events)
    analyse(chunk)  # some user-defined function acting on el_pt
```

All events can also be loaded at once using:

```python
import awkward as ak
arrays = ak.concatenate(list(iterables['simple_transform']))
```

---

## Multiple samples

ServiceX queries allow multiple sample transformations. `to_awk` makes manipulating such requests straightforward: after all samples pass through the same `deliver()` transformation, each one can be handled separately, allowing seamless integration with analysis frameworks.

```python
from servicex_analysis_utils import to_awk
import awkward as ak

# Given a ServiceX deliver return
deliver_result = {
    "Signal": ["path/to/signal_file1.root", "path/to/signal_file2.root"],
    "Background": ["path/to/background_file.root"]
}

arrays = to_awk(deliver_result)

signal_el_pt = arrays["Signal"]["el_pt"]
background_el_pt = arrays["Background"]["el_pt"]

mean_signal = ak.mean(signal_el_pt)
mean_background = ak.mean(background_el_pt)

print(f"Mean electron pT (Signal): {mean_signal:.2f} GeV")
print(f"Mean electron pT (Background): {mean_background:.2f} GeV")
```

## Notes

- **Multiple samples:** For transformations delivering multiple samples, the `dask` and `iterator` options are applied homogeneously to all of them.
- **Error handling:** If a sample fails to load, its value in the returned dictionary is `None`.
- **Supported formats:** A custom (non-ServiceX) dict can be passed in, but the paths must point to files in either ROOT or Parquet format.
- **Branch filtering and other options:** Additional `**kwargs` allow specifying branch selections or other loading options supported by `uproot`, `awkward`, and `dask_awkward`.
