Add Docs to main #11

Merged
merged 3 commits into from
May 2, 2025
33 changes: 33 additions & 0 deletions .readthedocs.yml
@@ -0,0 +1,33 @@
# Read the Docs configuration file for Sphinx projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py
  # You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
  builder: html
  # Fail on all warnings to avoid broken references
  # fail_on_warning: true

# Optionally build your docs in additional formats such as PDF and ePub
formats:
  - pdf

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file added docs/_build/doctrees/file_introspecting.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/index.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/installation.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/materialization.doctree
Binary file not shown.
4 changes: 4 additions & 0 deletions docs/_build/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: ca3626de684797951ef03443c45ac66f
tags: 645f666f9bcd5a90fca523b33c5a78b7
167 changes: 167 additions & 0 deletions docs/_build/html/_sources/file_introspecting.md.txt
@@ -0,0 +1,167 @@
# Remote File Introspecting

The `get_structure()` function allows users to query and inspect the internal structure of datasets available through ServiceX. This is useful for determining which branches exist in a given dataset before running a full transformation with the correct branch labelling and typing.

It is useful for any lightweight exploration when only metadata or structure information is required without fetching event-level data.

---

## Overview

The function internally issues a ServiceX request, using the Python function backend, for the specified dataset(s) and returns a simplified summary of the file structure, such as branches and their types, as a string formatted for readability.

It can be used programmatically or from the command line, and its return type depends on the arguments passed.

---

## Function

```python
get_structure(datasets, array_out=False, **kwargs)
```

**Parameters:**

- `datasets` (`dict`, `str`, or `list[str]`): One or more datasets to inspect. Made for Rucio DIDs. If a dictionary is used, keys will be used as labels for each dataset in the output string.
- `array_out` (`bool`): If `True`, empty Awkward Arrays are reconstructed from the structure information and the function returns a dictionary of `ak.Array.type` objects. This gives programmatic access to the dataset structure, which can be manipulated further.
- `**kwargs`: Additional arguments forwarded to the helper function `print_structure_from_str`, such as `filter_branch` to apply a filter to displayed branches, `do_print` to print the output during the function call, or `save_to_txt` to save the output to `samples_structure.txt`.

**Returns:**
- `str`: The formatted file structure string (default).
- `None`: If `do_print` or `save_to_txt` is `True`; the output is printed or saved to a file instead of being returned.
- `dict`: If `array_out` is `True`; keys are sample names and values are `ak.Array.type` objects with the same dataset structure.

---

## Command-Line Usage

The function is also available as a CLI tool:

```bash
$ servicex-get-structure "scope:dataset-rucio-id" --filter-branch "el_"
```

This prints a summary of the dataset structure to the shell, filtered to branches whose names contain `"el_"`.

```bash
$ servicex-get-structure "scope:dataset-rucio-id1" "scope:dataset-rucio-id2" --filter-branch "el_"
```

This will output a combined summary of both datasets with the same filter.

---

### Practical Output Example

Command:

```bash
$ servicex-get-structure user.mtost:user.mtost.all.Mar11 --filter-branch el_pt
```

Output on shell:

```bash
File structure of all samples with branch filter 'el_pt':

---------------------------
📁 Sample: user.mtost:user.mtost.all.Mar11
---------------------------

🌳 Tree: EventLoop_FileExecuted
├── Branches:

🌳 Tree: EventLoop_JobStats
├── Branches:

🌳 Tree: reco
├── Branches:
│ ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_RESOLUTION_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_RESOLUTION_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_SCALE_ALL__1down ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
│ ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
```

The output lists all trees and branch names matching the specified filter pattern for each requested dataset.
It shows the branch data type information as interpreted by `uproot`. This includes the vector nesting level (jagged arrays) and the base type (e.g., f4 for 32-bit floats).
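
When the printed summary needs to be post-processed, for instance to collect the branch names of one tree, the text can be parsed with a few regular expressions. The sketch below is not part of the package: `parse_structure` is a hypothetical helper, and it assumes the line layout shown in the example output above (`Tree: <name>` lines followed by `<branch> ; dtype: <type>` lines).

```python
import re

# Assumed line shapes, taken from the example output above.
TREE_RE = re.compile(r"Tree:\s*(\S+)")
BRANCH_RE = re.compile(r"[├└]──\s*(\S+)\s*;\s*dtype:\s*(.+)$")


def parse_structure(text):
    """Map each tree name to a {branch: dtype} dict (hypothetical helper)."""
    trees = {}
    current = None
    for line in text.splitlines():
        tree = TREE_RE.search(line)
        if tree:
            # Start a new, initially empty, tree entry.
            current = tree.group(1)
            trees[current] = {}
            continue
        branch = BRANCH_RE.search(line)
        if branch and current is not None:
            trees[current][branch.group(1)] = branch.group(2).strip()
    return trees


# Shortened copy of the example output above.
sample = """\
🌳 Tree: EventLoop_JobStats
   ├── Branches:

🌳 Tree: reco
   ├── Branches:
   │   ├── el_pt_NOSYS ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
   │   ├── el_pt_EG_SCALE_ALL__1up ; dtype: AsJagged(AsDtype('>f4'), header_bytes=10)
"""

structure = parse_structure(sample)
print(list(structure["reco"]))
```

Trees without matching branches, such as `EventLoop_JobStats` here, end up with an empty dictionary. If the layout of the summary changes, the two patterns would need to be adjusted.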


#### JSON input

A JSON file can be used as input to simplify the command for multiple samples.

```bash
$ servicex-get-structure "path/to/datasets.json"
```

With `datasets.json` containing:

```
{
    "Signal": "mc23_13TeV:signal-dataset-rucio-id",
    "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1",
    "Background Z+jets": "mc23_13TeV:background-dataset-rucio-id2",
    "Background Drell-Yan": "mc23_13TeV:background-dataset-rucio-id3"
}
```
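
Such a file can also be written from Python rather than by hand, which avoids slips like trailing commas that strict JSON parsers reject. This is a minimal sketch using only the standard library; the labels and DIDs are the placeholders from the example above, and the file name `datasets.json` is simply the one used in this section.

```python
import json
from pathlib import Path

# Placeholder labels and DIDs, as in the example above.
datasets = {
    "Signal": "mc23_13TeV:signal-dataset-rucio-id",
    "Background W+jets": "mc23_13TeV:background-dataset-rucio-id1",
}

# Serialize with json.dumps so the file is guaranteed to be valid JSON.
path = Path("datasets.json")
path.write_text(json.dumps(datasets, indent=2) + "\n", encoding="utf-8")

# Round-trip check: the file parses and the labels are preserved.
loaded = json.loads(path.read_text(encoding="utf-8"))
print(list(loaded))
```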

---

## Programmatic Example

As with the CLI, the string containing the dataset structure can be retrieved programmatically:

```python
from servicex_analysis_utils import get_structure

# Retrieve structure of a specific dataset
file_structure = get_structure("mc23_13TeV:some-dataset-rucio-id")
```

### Other options

With `do_print` and `save_to_txt`, the dataset-structure string can instead be routed to stdout or to a text file in the current working directory.

```python
from servicex_analysis_utils import get_structure

# Directly dump structure to stdout
get_structure("mc23_13TeV:some-dataset-rucio-id", do_print=True)
# Save to samples_structure.txt
get_structure("mc23_13TeV:some-dataset-rucio-id", save_to_txt=True)
```


#### Return awkward array type


If `array_out` is set to `True`, the function reconstructs dummy arrays with the correct structure and returns their `ak.Array.type` objects.

```python
from servicex_analysis_utils import get_structure

DS = {"sample1": "user.mtost:user.mtost.all.Mar11"}
ak_type = get_structure(DS, array_out=True)

rec = ak_type["sample1"].content  # get RecordType

# Find index of reco tree and runNumber branch
reco_idx = rec.fields.index("reco")
branch_idx = rec.contents[reco_idx].fields.index("runNumber")

print("Type for branch 'runNumber':", rec.contents[reco_idx].contents[branch_idx])
```
Output:

```bash
Type for branch 'runNumber': var * int64
```
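
The two index lookups above can be wrapped in a small function when several branches need to be inspected. The sketch below is not part of the package: `branch_type` is a hypothetical helper that only relies on the parallel `fields` and `contents` attributes used in the example. Stand-in objects replace the real record type so that it runs without a ServiceX request.

```python
from types import SimpleNamespace


def branch_type(record, tree, branch):
    """Return the type stored for `branch` inside `tree` (hypothetical helper).

    `record` only needs parallel `fields` and `contents` sequences,
    as used in the example above.
    """
    tree_type = record.contents[record.fields.index(tree)]
    return tree_type.contents[tree_type.fields.index(branch)]


# Stand-ins for the record type; the values are plain placeholder strings.
reco = SimpleNamespace(
    fields=["runNumber", "other_branch"],
    contents=["var * int64", "placeholder-type"],
)
rec = SimpleNamespace(fields=["reco"], contents=[reco])

print("Type for branch 'runNumber':", branch_type(rec, "reco", "runNumber"))
```

With real output, `rec` would be the `RecordType` obtained from `ak_type["sample1"].content`. A tree or branch name that is not present raises `ValueError`, because the lookup uses `list.index`.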

---

## Notes

- The function does not retrieve event data — only structure/metadata.
- CLI output is printed directly to stdout but can be redirected to a file with ` > structure_summary.txt`.
- Many types will show as None or unknown when they cannot be interpreted by `uproot` or fail to be reconstructed as Awkward Arrays.
39 changes: 39 additions & 0 deletions docs/_build/html/_sources/index.md.txt
@@ -0,0 +1,39 @@
# ServiceX Analysis Utilities

This package provides tools for interacting with [ServiceX](https://github.com/ssl-hep/ServiceX_frontend), a data extraction, transformation and delivery system built for ATLAS and CMS analyses on large datasets. The Analysis Utils package offers helper functions that streamline the usage of ServiceX and simplify its integration into workflows. It also contains tools for specific use cases that benefit from the service.

---

## Installation

Install the package from PyPI:

```bash
pip install servicex-analysis-utils
```
More information can be found in [Installation and requirements](installation.md).

## Documentation Contents

```{toctree}
:maxdepth: 2
:caption: Documentation Contents:

installation
materialization
file_introspecting
```

---

## Utility Functions

### `to_awk()`
Load an Awkward Array from ServiceX output easily.

See detailed usage here: [Materialization documentation](materialization.md)

### `get_structure()`
Create and send ServiceX requests to retrieve file structures with a CLI implementation.

See detailed usage here: [File introspection documentation](file_introspecting.md)
42 changes: 42 additions & 0 deletions docs/_build/html/_sources/installation.md.txt
@@ -0,0 +1,42 @@
# Installation

This section provides instructions for installing the ServiceX Analysis Utilities package.

## Prerequisites

Before installing, ensure the following requirements are satisfied:

- Python 3.9 or higher is installed.
- `pip` is updated to the latest version (`pip install --upgrade pip`).
- Access to a ServiceX endpoint is granted.
- A valid `servicex.yaml` configuration file is on your local machine.

For instructions on setting up ServiceX, refer to the [ServiceX Installation Guide](https://servicex-frontend.readthedocs.io/en/stable/connect_servicex.html).

## Installation from PyPI

The package is available on PyPI and can be installed via:

```bash
pip install servicex-analysis-utils
```

## Installation from Source

Alternatively, the package can be installed from the GitHub repository:

```bash
git clone https://github.com/ssl-hep/ServiceX_analysis_utils.git
cd ServiceX_analysis_utils
pip install .
```

## Verifying the Installation

After installation, you can verify that the package is accessible by running:

```bash
python -c "import servicex_analysis_utils"
```

No output indicates a successful installation.
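
For a check that also reports the installed version, the standard library's `importlib` can be used. This is a minimal sketch, not part of the package: `check_install` is a hypothetical helper, and it assumes the distribution name matches the PyPI name used in the `pip install` command above.

```python
from importlib import metadata, util


def check_install(module_name, dist_name):
    """Return (importable, version); version is None if no metadata is found."""
    # find_spec returns None for a missing top-level module instead of raising.
    importable = util.find_spec(module_name) is not None
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


print(check_install("servicex_analysis_utils", "servicex-analysis-utils"))
```

A result with `True` and a version string indicates that the module can be imported and that `pip` recorded the distribution.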