Commit ff826de

Adding modules for doc config and materialization
1 parent 142e4b4 commit ff826de

6 files changed: +323 additions, -0 deletions

docs/Makefile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
# Minimal makefile for Sphinx documentation
2+
#
3+
4+
# You can set these variables from the command line, and also
5+
# from the environment for the first two.
6+
SPHINXOPTS ?=
7+
SPHINXBUILD ?= sphinx-build
8+
SOURCEDIR = .
9+
BUILDDIR = _build
10+
11+
# Put it first so that "make" without argument is like "make help".
12+
help:
13+
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14+
15+
.PHONY: help Makefile
16+
17+
# Catch-all target: route all unknown targets to Sphinx using the new
18+
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19+
%: Makefile
20+
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/conf.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
# Configuration file for the Sphinx documentation builder.
2+
#
3+
# For the full list of built-in configuration values, see the documentation:
4+
# https://www.sphinx-doc.org/en/master/usage/configuration.html
5+
6+
# -- Project information -----------------------------------------------------
7+
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
8+
9+
import servicex_analysis_utils
10+
11+
project = "ServiceX Analysis Utils"
12+
copyright = "2025 Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP)" # NOQA 501
13+
author = "Artur Cordeiro Oudot Choi & the ServiceX team"
14+
release = servicex_analysis_utils.__version__
15+
16+
# -- General configuration ---------------------------------------------------
17+
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
18+
19+
extensions = [
20+
"sphinx.ext.napoleon",
21+
"sphinx.ext.intersphinx",
22+
"sphinx.ext.viewcode",
23+
"myst_parser",
24+
"sphinx_copybutton",
25+
]
26+
27+
source_suffix = {
28+
".rst": "restructuredtext",
29+
".md": "markdown",
30+
}
31+
32+
templates_path = ["_templates"]
33+
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
34+
35+
36+
# -- Options for HTML output -------------------------------------------------
37+
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
38+
39+
40+
html_theme = "furo"
41+
html_static_path = ["_static"]
42+
43+
copybutton_prompt_text = r">>> |\.\.\. |\$ "
44+
copybutton_prompt_is_regexp = True
45+
copybutton_here_doc_delimiter = "EOF"
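
The `copybutton_prompt_text` pattern above tells sphinx-copybutton which leading prompts to drop when a reader copies a code block. The sketch below only approximates that behaviour in Python to show what the regular expression matches; the real stripping happens in the extension's JavaScript, and `strip_prompt` is a hypothetical helper, not part of any library.

```python
import re

# Same pattern as copybutton_prompt_text in conf.py.
PROMPT_PATTERN = re.compile(r">>> |\.\.\. |\$ ")


def strip_prompt(line: str) -> str:
    """Remove a leading prompt if the line starts with one (hypothetical helper)."""
    match = PROMPT_PATTERN.match(line)  # re.match anchors at the start of the line
    return line[match.end():] if match else line


for example in [">>> import awkward as ak", "... x = 1", "$ pip install furo", "plain output"]:
    print(repr(strip_prompt(example)))
```

Lines that start with `>>> `, `... ` or `$ ` lose the prompt; anything else is returned unchanged.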

docs/index.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
# ServiceX Analysis Utilities
2+
3+
This package provides tools for interacting with [ServiceX](https://github.com/ssl-hep/ServiceX_frontend), a data extraction, transformation and delivery system built for ATLAS and CMS analyses on large datasets. The Analysis Utils package offers helper functions that streamline the use of ServiceX and simplify its integration into workflows. It also contains tools for specific use cases that benefit from the service.
4+
5+
It offers functions for:
6+
- Materializing `servicex.deliver()` results.
7+
- Retrieving and displaying file structures from datasets remotely with branch filtering.
8+
9+
---
10+
11+
## Installation
12+
13+
Install the package from PyPI:
14+
15+
```bash
16+
pip install servicex-analysis-utils
17+
```
18+
More information can be found in [Installation and requirements](installation.md).
19+
20+
## Documentation Contents
21+
22+
```{toctree}
23+
:maxdepth: 2
24+
:caption: Documentation Contents:
25+
26+
installation
27+
usage
28+
materialization
29+
file_introspecting
30+
```
31+
32+
---
33+
34+
## Utility Functions
35+
36+
### `to_awk()`
37+
Load an Awkward Array from ServiceX output easily.
38+
39+
See detailed usage here: [Materialization documentation](materialization.md)
40+
41+
### `get_structure()`
42+
Create and send ServiceX requests to retrieve file structures with a CLI implementation.
43+
44+
See detailed usage here: [File introspection documentation](file_introspecting.md)

docs/installation.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
# Installation
2+
3+
This section provides instructions for installing the ServiceX Analysis Utilities package.
4+
5+
## Prerequisites
6+
7+
Before installing, ensure the following requirements are satisfied:
8+
9+
- Python 3.9 or higher is installed.
10+
- `pip` is updated to the latest version (`pip install --upgrade pip`).
11+
- Access to a ServiceX endpoint is granted.
12+
- A valid `servicex.yaml` configuration file is on your local machine.
13+
14+
For instructions on setting up ServiceX, refer to the [ServiceX Installation Guide](https://servicex-frontend.readthedocs.io/en/stable/connect_servicex.html).
15+
16+
## Installation from PyPI
17+
18+
The package is available on PyPI and can be installed via:
19+
20+
```bash
21+
pip install servicex-analysis-utils
22+
```
23+
24+
## Installation from Source
25+
26+
Alternatively, the package can be installed from the GitHub repository:
27+
28+
```bash
29+
git clone https://github.com/ssl-hep/ServiceX_analysis_utils.git
30+
cd ServiceX_analysis_utils
31+
pip install .
32+
```
33+
34+
## Verifying the Installation
35+
36+
After installation, you can verify that the package is accessible by running:
37+
38+
```bash
39+
python -c "import servicex_analysis_utils"
40+
```
41+
42+
No output indicates a successful installation.
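
For a check that reports more than the absence of an error, a small script can look the package up and print its version. This is a minimal sketch: `package_version` is a hypothetical helper, and it relies on the package exposing `__version__`, which `docs/conf.py` in this commit also reads.

```python
import importlib
import importlib.util


def package_version(name: str):
    """Return the module's __version__, "unknown" if it has none, or None if not installed."""
    if importlib.util.find_spec(name) is None:
        return None  # the top-level package cannot be found on this interpreter's path
    module = importlib.import_module(name)
    return getattr(module, "__version__", "unknown")


print(package_version("servicex_analysis_utils"))
```

A printed `None` means the package is not visible to the Python interpreter that ran the script, which often points to a different virtual environment than the one `pip` installed into.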

docs/make.bat

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
@ECHO OFF
2+
3+
pushd %~dp0
4+
5+
REM Command file for Sphinx documentation
6+
7+
if "%SPHINXBUILD%" == "" (
8+
set SPHINXBUILD=sphinx-build
9+
)
10+
set SOURCEDIR=.
11+
set BUILDDIR=_build
12+
13+
%SPHINXBUILD% >NUL 2>NUL
14+
if errorlevel 9009 (
15+
echo.
16+
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
17+
echo.installed, then set the SPHINXBUILD environment variable to point
18+
echo.to the full path of the 'sphinx-build' executable. Alternatively you
19+
echo.may add the Sphinx directory to PATH.
20+
echo.
21+
echo.If you don't have Sphinx installed, grab it from
22+
echo.https://www.sphinx-doc.org/
23+
exit /b 1
24+
)
25+
26+
if "%1" == "" goto help
27+
28+
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29+
goto end
30+
31+
:help
32+
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33+
34+
:end
35+
popd

docs/materialization.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
# Materialization of ServiceX Deliver Results
2+
3+
The `to_awk()` function provides a streamlined method to materialize the output of a ServiceX `deliver()` call into Awkward Arrays, Dask arrays, or iterators.
4+
5+
This simplifies workflows by making the retrieved data easy to manipulate in a variety of analysis pipelines, as the examples below show.
6+
7+
---
8+
9+
## Functionality Overview
10+
11+
The `to_awk()` function loads data from the deliver output dictionary, supporting both ROOT (`.root`) and Parquet (`.parquet` or `.pq`) file formats.
12+
13+
It provides flexible options for:
14+
15+
- Direct loading into Awkward Arrays.
16+
- Lazy loading using Dask for scalable operations.
17+
- Returning iterator objects for manual control over file streaming.
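
The format support described above depends on telling ROOT files from Parquet files. The sketch below shows one way to do that from the file extension; it is an illustration only, not the package's implementation, and `detect_format` is a hypothetical name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

ROOT_SUFFIXES = {".root"}
PARQUET_SUFFIXES = {".parquet", ".pq"}


def detect_format(path_or_url: str) -> str:
    """Classify a file as 'root' or 'parquet' from its extension (hypothetical helper)."""
    # urlparse keeps plain local paths in .path and separates out any URL query string.
    suffix = PurePosixPath(urlparse(path_or_url).path).suffix.lower()
    if suffix in ROOT_SUFFIXES:
        return "root"
    if suffix in PARQUET_SUFFIXES:
        return "parquet"
    raise ValueError(f"Unsupported file format: {path_or_url!r}")


print(detect_format("path/to/signal_file1.root"))
print(detect_format("https://example.org/data/out.pq?token=abc"))
```

The example URL is made up for illustration. Names with a trailing part number, such as `.root.1`, would need extra handling that this sketch does not attempt.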
18+
19+
## Function Signature
20+
21+
```python
22+
to_awk(deliver_dict, dask=False, iterator=False, **kwargs)
23+
```
24+
25+
**Parameters:**
26+
27+
- `deliver_dict` (dict): Dictionary returned by `servicex.deliver()`. Keys are sample names, values are file paths or URLs.
28+
- `dask` (bool, optional): If True, loads files lazily using Dask. Default is False.
29+
- `iterator` (bool, optional): If True and not using Dask, returns iterators instead of materialized arrays. Default is False.
30+
- `**kwargs`: Additional keyword arguments passed to `uproot.dask`, `uproot.iterate`, `dak.from_parquet`, or `awkward.from_parquet`.
31+
32+
**Returns:**
33+
34+
- `dict`: A dictionary where keys are sample names and values are either Awkward Arrays, Dask Arrays, or iterators. The dictionary keeps the same structure as the `deliver` output.
35+
36+
---
37+
38+
## Usage Examples
39+
40+
### Simple Materialization
41+
42+
Load ServiceX deliver results directly into Awkward Arrays:
43+
44+
```python
45+
from servicex_analysis_utils import to_awk
46+
from servicex import query, dataset, deliver
47+
48+
spec = {
49+
"Sample": [
50+
{
51+
"Name": "simple_transform",
52+
"Dataset": dataset.FileList(
53+
["root://eospublic.cern.ch//eos/opendata/atlas/rucio/data16_13TeV/DAOD_PHYSLITE.37019878._000001.pool.root.1"] # noqa: E501
54+
),
55+
"Query": query.FuncADL_Uproot()
56+
.FromTree("CollectionTree")
57+
.Select(lambda e: {"el_pt": e["AnalysisElectronsAuxDyn.pt"]}),
58+
}
59+
]
60+
}
61+
62+
arrays = to_awk(deliver(spec))
63+
```
64+
65+
### Lazy Loading with Dask
66+
67+
Load results lazily for large datasets using Dask task graphs. This enables parallel execution across multiple workers.
68+
69+
```python
70+
import dask_awkward as dak
71+
72+
dask_arrays = to_awk(deliver(spec), dask=True)
73+
el_pt_array = dask_arrays["simple_transform"]["el_pt"]
74+
mean_el_pt = dak.mean(el_pt_array).compute()
75+
```
76+
77+
### Using Iterators
78+
79+
Return iterators instead of materialized arrays to avoid loading too much data into memory. This requires `dask=False` (the default). The example below loads 10,000 events per chunk:
80+
81+
```python
82+
iterables = to_awk(deliver(spec), iterator=True, step_size=10000)
83+
```
84+
85+
You can then manually loop over the data chunks:
86+
87+
```python
88+
for chunk in iterables['simple_transform']:
89+
    # process a small chunk (~10k events)
90+
    analyse(chunk)  # placeholder for a user-defined function acting on el_pt
91+
```
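
The chunked loop above can feed an accumulator so that only one chunk is held in memory at a time. The sketch below computes a running mean of `el_pt`; it uses plain dicts of nested lists as stand-ins for the Awkward Array chunks, and `running_mean` is a hypothetical helper rather than part of the package.

```python
def running_mean(chunks, field="el_pt"):
    """Mean of a jagged field, accumulated chunk by chunk (hypothetical helper)."""
    total, count = 0.0, 0
    for chunk in chunks:
        for event in chunk[field]:  # one list of values per event
            total += sum(event)
            count += len(event)
    return total / count if count else float("nan")


# Stand-in chunks: plain dicts of nested lists instead of Awkward Arrays.
fake_chunks = iter([
    {"el_pt": [[10.0, 20.0], []]},
    {"el_pt": [[30.0]]},
])
print(running_mean(fake_chunks))
```

With real Awkward chunks, replacing the inner Python loop with `ak.sum` and `ak.count` on `chunk[field]` would likely be much faster.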
92+
93+
All events can also be loaded by using:
94+
95+
```python
96+
import awkward as ak
97+
arrays = ak.concatenate(list(iterables['simple_transform']))
98+
```
99+
100+
---
101+
102+
103+
## Multiple Samples
104+
105+
ServiceX queries allow multiple sample transformations, and `to_awk()` handles such requests directly. Samples that pass through the same `deliver()` call can be manipulated separately afterwards, which eases integration with analysis frameworks.
106+
107+
```python
108+
from servicex_analysis_utils import to_awk
109+
import awkward as ak
110+
111+
# Given a ServiceX deliver return
112+
deliver_result = {
113+
"Signal": ["path/to/signal_file1.root", "path/to/signal_file2.root"],
114+
"Background": ["path/to/background_file.root"]
115+
}
116+
117+
arrays = to_awk(deliver_result)
118+
119+
signal_el_pt = arrays["Signal"]["el_pt"]
120+
background_el_pt = arrays["Background"]["el_pt"]
121+
122+
mean_signal = ak.mean(signal_el_pt)
123+
mean_background = ak.mean(background_el_pt)
124+
125+
print(f"Mean electron pT (Signal): {mean_signal:.2f} GeV")
126+
print(f"Mean electron pT (Background): {mean_background:.2f} GeV")
127+
```
128+
129+
130+
## Notes
131+
132+
- **Multiple samples:** For transformations delivering multiple samples, the `dask` and `iterator` options are applied homogeneously to all of them.
133+
- **Error Handling:** In case of loading errors, the affected sample will have `None` as its value in the returned dictionary.
134+
- **Supported Formats:** A custom (non-ServiceX) dict can be passed as input, but its paths must point to files in either ROOT or Parquet format.
135+
- **Branch Filtering and Other Options:** Additional `**kwargs` allow specifying branch selections or other loading options supported by `uproot`, `awkward` and `dask_awkward`.
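
Because a failed sample is reported as `None` rather than raising, downstream code may want to separate usable samples from failed ones before indexing into them. A minimal sketch, assuming only the documented `None` convention; `split_loaded` is a hypothetical helper and the sample values are placeholders.

```python
def split_loaded(arrays: dict):
    """Separate successfully loaded samples from those reported as None."""
    loaded = {name: value for name, value in arrays.items() if value is not None}
    failed = [name for name, value in arrays.items() if value is None]
    return loaded, failed


# Placeholder result: real values would be Awkward Arrays, Dask arrays or iterators.
result = {"Signal": [1.0, 2.0], "Background": None}
loaded, failed = split_loaded(result)
print(sorted(loaded), failed)
```

Checking `failed` right after the `to_awk()` call avoids a later `TypeError` when subscripting `None`.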
136+
137+
---
