
Commit 77562e1

Merge pull request #88 from ssl-hep:develop
Version 2.0 release

- Works with ServiceX RC2 (will also work with RC1, but move to RC2!)
- Supports caching on the local system
- Big re-work of the API
- Brings back errors from ServiceX (bad DID, C++ failures, etc.)
- Substantial internal code rework to enable modularity and testing
2 parents b6185d1 + a73d71e commit 77562e1

28 files changed: +4329 −545 lines changed
Lines changed: 1 addition & 0 deletions

@@ -1,2 +1,3 @@
 [flake8]
 max-line-length = 99
+ignore = W503

.github/workflows/ci.yaml

Lines changed: 7 additions & 4 deletions

@@ -13,7 +13,7 @@ jobs:
     strategy:
       matrix:
         platform: [ubuntu-latest, macOS-latest, windows-latest]
-        python-version: [3.6, 3.7]
+        python-version: [3.6, 3.7, 3.8]
     runs-on: ${{ matrix.platform }}

     steps:
@@ -30,11 +30,14 @@ jobs:
     - name: Lint with Flake8
       if: matrix.python-version == 3.7 && matrix.platform == 'ubuntu-latest'
       run: |
-        flake8 --exclude=tests/* --ignore=E501
+        flake8 --exclude=tests/* --ignore=E501,W503
     - name: Test with pytest
       run: |
        python -m pytest
     - name: Report coverage with Codecov
       if: github.event_name == 'push' && matrix.python-version == 3.7 && matrix.platform == 'ubuntu-latest'
-      run: |
-        codecov --token=${{ secrets.CODECOV_TOKEN }}
+      uses: codecov/codecov-action@v1
+      with:
+        token: ${{ secrets.CODECOV_TOKEN }}
+        file: ./coverage.xml # optional
+        flags: unittests # optional

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -123,3 +123,5 @@ dmypy.json

 # Pyre type checker
 .pyre/
+
+.idea/

.vscode/settings.json

Lines changed: 63 additions & 0 deletions

@@ -5,6 +5,69 @@
     "python.analysis.logLevel": "Information",
     "python.analysis.memory.keepLibraryAst": true,
     "python.linting.flake8Enabled": true,
+    "cSpell.words": [
+        "AOD's",
+        "AZNLOCTEQ",
+        "Comming",
+        "DAOD",
+        "Minio",
+        "NOQA",
+        "Powheg",
+        "Reconstructor",
+        "STDM",
+        "SXPASS",
+        "SXUSER",
+        "Servivce",
+        "Topo",
+        "accesskey",
+        "aenter",
+        "aexit",
+        "aiohttp",
+        "asyncio",
+        "cacheme",
+        "codecov",
+        "dcache",
+        "desy",
+        "dont",
+        "fget",
+        "fname",
+        "inmem",
+        "jupyter",
+        "jupyterlab",
+        "leftfoot",
+        "linq",
+        "localds",
+        "minio",
+        "miniouser",
+        "mino",
+        "ncols",
+        "ndarray",
+        "nqueries",
+        "ntuples",
+        "numpy",
+        "pathlib",
+        "pnfs",
+        "protomolecule",
+        "ptetaphi",
+        "pytest",
+        "qastle",
+        "qsize",
+        "rootfiles",
+        "rucio",
+        "secretkey",
+        "servicex",
+        "servicexabc",
+        "setuptools",
+        "slateci",
+        "sslhep",
+        "stfc",
+        "tcut",
+        "tqdm",
+        "unittests",
+        "xaod",
+        "xrootd"
+    ],
+    "python.analysis.typeCheckingMode": "basic",
     "python.testing.pytestArgs": [
         "--no-cov"
     ]

README.md

Lines changed: 152 additions & 32 deletions

@@ -1,44 +1,70 @@
 # ServiceX_frontend
+
 Client access library for ServiceX

-[![GitHub Actions Status](https://github.com/ssl-hep/ServiceX_frontend/workflows/CI/CD/badge.svg)](https://github.com/ssl-hep/ServiceX_frontend/actions)
+[![GitHub Actions Status](https://github.com/ssl-hep/ServiceX_frontend/workflows/CI/CD/badge.svg?branch=master)](https://github.com/ssl-hep/ServiceX_frontend/actions)
 [![Code Coverage](https://codecov.io/gh/ssl-hep/ServiceX_frontend/graph/badge.svg)](https://codecov.io/gh/ssl-hep/ServiceX_frontend)

 [![PyPI version](https://badge.fury.io/py/servicex.svg)](https://badge.fury.io/py/servicex)
 [![Supported Python versions](https://img.shields.io/pypi/pyversions/servicex.svg)](https://pypi.org/project/servicex/)

-# Introduction
+## Introduction

-Given you have a selection string, this library will manage submitting it to a ServiceX instance and retreiving the data locally for you.
+Given you have a selection string, this library will manage submitting it to a ServiceX instance and retrieving the data locally for you.
 The selection string is often generated by another front-end library, for example:

-- func_adl.xAOD (for ATLAS xAOD's)
-- func_adl.XXX (for flat ntuples)
-- xxx for columns
+- [func_adl.xAOD](https://github.com/iris-hep/func_adl_xAOD) (for ATLAS xAOD's)
+- [func_adl.uproot](https://github.com/iris-hep/func_adl.uproot) (for flat ntuples)
+- [tcut_to_castle](https://pypi.org/project/tcut-to-qastle/) (translates `TCut` like syntax into a `servicex` query - should work for both)

-These libraries are just coming up now, so this list is just an outline.
+## Prerequisites

-# Prerequisites
+Before you can use this library you'll need:

-Before you install this library you'll need:
+- An environment based on python 3.6 or later
+- A `ServiceX` end-point. For example, `http://localhost:5000/servicex`, if `ServiceX` is running on a local `k8` cluster and the proper ports are open, or the public servicex instance (contact IRIS-HEP at xxx if you are part of the LHC to request an account, or with help setting up an instance).

-- An environment based on python 3.7 or later
-- A ServiceX end-point. For example, `http://localhost:5000/servicex`.
+### How to access your endpoint

-# Usage
+The `servicex` library searches for configuration information in several locations to determine what end-point it should connect to, in the following order:

-The following lines will return a `pandas.DataFrame` containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
+1. A `.servicex` file in the current working directory
+1. A `.servicex` file in the user's home directory (`$HOME` on Linux and Mac, and your profile
+   directory on Windows).
+1. The `config_defaults.yaml` file distributed with the `servicex` package.
+
+If no endpoint is specified, then the library defaults to the developer endpoint, which is `http://localhost:5000` for the web-service API, and `localhost:9000` for the `minio` endpoint. No passwords are required.
+
+Create a `.servicex` file, in the `yaml` format, in the appropriate place for your work that contains the following:

+```yaml
+api_endpoint:
+  endpoint: <your-endpoint>
+  email: <api-email>
+  password: <api-password>
 ```
-import servicex
+
+All strings are expanded using python's [os.path.expand](https://docs.python.org/3/library/os.path.html#os.path.expandvars) method - so `$NAME` and `${NAME}` will work to expand existing environment variables.
+
+Finally, you can create the objects `ServiceXAdaptor` and `MinioAdaptor` by hand in your code, passing them as arguments to `ServiceXDataset` to inject custom endpoints and credentials, avoiding the configuration system. This is probably only useful for advanced users.
+
+## Usage
+
+The following lines will return a `pandas.DataFrame` containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
+
+```python
+from servicex import ServiceXDataset
 query = "(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')"
 dataset = "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
-r = servicex.get_data(query , dataset, servicex_endpoint=endpoint)
+ds = ServiceXDataset(dataset)
+r = ds.get_data_pandas_df(query)
 print(r)
 ```
+
 And the output in a terminal window from running the above script (takes about 1-2 minutes to complete):
-```
-python scripts\run_test.py http://localhost:5000/servicex
+
+```bash
+python scripts/run_test.py http://localhost:5000/servicex
 JetPt
 entry
 0 38.065707
@@ -56,44 +82,138 @@ entry
 [11355980 rows x 1 columns]
 ```

-If your query is badly formed or there is an other problem with the backend, an exception will be thrown.
+If your query is badly formed or there is another problem with the backend, an exception will be thrown with information about the error.
+
+If you'd like to be able to submit multiple queries and have them run on the `ServiceX` back end in parallel, it is best to use the `asyncio` interface, which has the identical signature, but is called `get_data_pandas_df_async`.

-If you'd like to be able to submit multiple queries and have them run on the ServiceX back end in parallel, it may be best to use the `asyncio` interface, which has the identical signature, but is called `get_data_async`.
+For documentation of `get_data` and `get_data_async` see the `servicex.py` source file.

-# Features
+## Configuration
+
+As mentioned above, the `.servicex` file is read to pull a configuration. The search path for this file:
+
+1. Your current working directory
+2. Your home directory
+
+The file can contain an `api_endpoint` as mentioned above. In addition, the following other settings can be put in:
+
+- `cache_path`: Location where queries, data, and a record of queries are written. This should be an absolute path the person running the library has r/w access to. On windows, make sure to escape `\` - and best to follow standard `yaml` conventions and put the path in quotes - especially if it contains a space. Top level yaml item (don't indent it accidentally!). Defaults to `/tmp/servicex` (with the temp directory as appropriate for your platform). Examples:
+  - Windows: `cache_path: "C:\\Users\\gordo\\Desktop\\cacheme"`
+  - Linux: `cache_path: "/home/servicex-cache"`
+
+- `minio_endpoint`, `minio_username`, `minio_password` - these are only interesting if you are using a pre-RC2 release of `servicex` - when the `minio` information wasn't part of the API exchange. This feature is deprecated and will be removed around the time `servicex` moves to RC3.
+
+All strings are expanded using python's [os.path.expand](https://docs.python.org/3/library/os.path.html#os.path.expandvars) method - so `$NAME` and `${NAME}` will work to expand existing environment variables.
+
+## Features

 Implemented:

 - Accepts a `qastle` formatted query
 - Exceptions are used to report back errors of all sorts from the service to the user's code.
-- Data is return as a `pandas.DataFrame` or a `awkward` array (see the `data_type` parameter)
+- Data is returned in the following forms:
+  - `pandas.DataFrame` an in-process DataFrame of all the data requested
+  - `awkward` an in-process `JaggedArray` or dictionary of `JaggedArray`s
+  - A list of root files that can be opened with `uproot` and used as desired.
+  - Not all output formats are compatible with all transformations.
 - Complete returned data must fit in the process' memory
-- Run in an async or a non-async environment and non-async methods will accomodate automatically (including `jupyter` notebooks).
-- Support up to 100 simultanious queries from a laptop-like front end without overwhelming the local machine (hopefully ServiceX will be overwhelmed!)
+- Run in an async or a non-async environment and non-async methods will accommodate automatically (including `jupyter` notebooks).
+- Support up to 100 simultaneous queries from a laptop-like front end without overwhelming the local machine (hopefully ServiceX will be overwhelmed!)
 - Start downloading files as soon as they are ready (before ServiceX is done with the complete transform).
+- It has been tested to run against 100 datasets with multiple simultaneous queries.
+- It supports local caching of query data
+- It will provide feedback on progress.
+- Configuration files supported so that user identification information does not have to be checked
+  into repositories.

-Comming:
-
-- Data is returned as a list of ROOT files located in a specified directory
-- Make it easy to submit the same query for 100 different datasets
-
-# Testing
+## Testing

 This code has been tested in several environments:

 - Windows, Linux, MacOS
 - Python 3.6, 3.7, 3.8
-  - 3.8.0 and 3.8.1 only. Unfortunately, 3.8.2 has caused `nest_asyncio` to fail. Until that package is updated we are stuck at 3.8.1.
 - Jupyter Notebooks (not automated), regular python command-line invoked source files

-# Development
+## API
+
+Everything is based around the `ServiceXDataset` object. Below is the documentation for the most common parameters.
+
+```python
+ServiceXDataset(dataset: str,
+                image: str = 'sslhep/servicex_func_adl_xaod_transformer:v0.4',
+                max_workers: int = 20,
+                servicex_adaptor: ServiceXAdaptor = None,
+                minio_adaptor: MinioAdaptor = None,
+                cache_adaptor: Optional[Cache] = None,
+                status_callback_factory: Optional[StatusUpdateFactory] = _run_default_wrapper,
+                local_log: log_adaptor = None,
+                session_generator: Callable[[], Awaitable[aiohttp.ClientSession]] = None,
+                config_adaptor: ConfigView = None)
+Create and configure a ServiceX object for a dataset.
+
+Arguments
+
+    dataset                  Name of a dataset from which queries will be selected.
+    image                    Name of transformer image to use to transform the data
+    max_workers              Maximum number of transformers to run simultaneously on
+                             ServiceX.
+    servicex_adaptor         Object to control communication with the servicex instance
+                             at a particular ip address with certain login credentials.
+                             Will be configured via the `.servicex` file by default.
+    minio_adaptor            Object to control communication with the minio servicex
+                             instance. By default configured with values from the
+                             `.servicex` file.
+    cache_adaptor            Runs the caching for data and queries that are sent up and
+                             down.
+    status_callback_factory  Factory to create a status notification callback for each
+                             query. One is created per query.
+    local_log                Log adaptor for logging.
+    session_generator        If you want to control the `ClientSession` object that
+                             is used for callbacks. Otherwise a single one for all
+                             `servicex` queries is used.
+    config_adaptor           Control how configuration options are read from the
+                             `.servicex` file.
+
+Notes:
+
+- The `status_callback` argument, by default, uses the `tqdm` library to render
+  progress bars in a terminal window or a graphic in a Jupyter notebook (with proper
+  jupyter extensions installed). If `status_callback` is specified as None, no
+  updates will be rendered. A custom callback function can also be specified which
+  takes `(total_files, transformed, downloaded, skipped)` as an argument. The
+  `total_files` parameter may be `None` until the system knows how many files need to
+  be processed (and some files can even be completed before that is known).
+```
+
+To get the data use one of the `get_data` methods. They all have the same API, differing only by what they return.
+
+```python
+| get_data_awkward_async(self, selection_query: str) -> Dict[bytes, Union[awkward.array.jagged.JaggedArray, numpy.ndarray]]
+|     Fetch query data from ServiceX matching `selection_query` and return it as
+|     dictionary of awkward arrays, an entry for each column. The data is uniquely
+|     ordered (the same query will always return the same order).
+|
+| get_data_awkward(self, selection_query: str) -> Dict[bytes, Union[awkward.array.jagged.JaggedArray, numpy.ndarray]]
+|     Fetch query data from ServiceX matching `selection_query` and return it as
+|     dictionary of awkward arrays, an entry for each column. The data is uniquely
+|     ordered (the same query will always return the same order).
+```
+
+Each data type comes in a pair - an `async` version and a synchronous version.
+
+- `get_data_awkward_async, get_data_awkward` - Returns a dictionary of the requested data as `numpy` or `JaggedArray` objects.
+- `get_data_rootfiles`, `get_data_rootfiles_async` - Returns a list of locally downloaded files (as `pathlib.Path` objects) containing the requested data. Suitable for opening with [`ROOT::TFile`](https://root.cern.ch/doc/master/classTFile.html) or [`uproot`](https://github.com/scikit-hep/uproot).
+- `get_data_pandas_df`, `get_data_pandas_df_async` - Returns the data as a `pandas` `DataFrame`. This will fail if the data you've requested has any structure (e.g. is hierarchical, like a single entry for each event, and each event may have some number of jets).
+- `get_data_parquet`, `get_data_parquet_async` - Returns a list of files locally downloaded that can be read by any parquet tools.
+
+## Development

 For any changes please feel free to submit pull requests!

 To do development please setup your environment with the following steps:

 1. A python 3.7 development environment
-1. Pull down this package, XX
+1. Fork/Pull down this package, XX
 1. `python -m pip install -e .[test]`
 1. Run the tests to make sure everything is good: `pytest`.
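
Editor's aside: the new README describes an async interface for running several queries in parallel but only shows the synchronous call. Below is a minimal sketch of that pattern using `get_data_pandas_df_async`, as named in the README; the dataset names are placeholders and the snippet is illustrative only, not part of this commit.

```python
# Minimal sketch: fetch the same query from several datasets in parallel,
# using the async interface described in the README above.
# Dataset names are placeholders; asyncio.run requires Python 3.7+.
import asyncio

from servicex import ServiceXDataset

# Re-use the qastle selection string from the Usage example above.
query = "(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')"

datasets = [
    "mc15_13TeV:placeholder_sample_one",
    "mc15_13TeV:placeholder_sample_two",
]


async def fetch_all():
    # One ServiceXDataset per dataset; all queries are submitted together and
    # ServiceX works on them in parallel while results download as they finish.
    sources = [ServiceXDataset(name) for name in datasets]
    frames = await asyncio.gather(
        *(ds.get_data_pandas_df_async(query) for ds in sources))
    return dict(zip(datasets, frames))


results = asyncio.run(fetch_all())
for name, df in results.items():
    print(name, len(df))
```

The same pattern should work for the other `_async` variants the README lists (rootfiles, parquet, awkward).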

scripts/run_test.py

Lines changed: 19 additions & 6 deletions

@@ -1,16 +1,29 @@
 # Tests against a ServiceX instance. This is not meant to be a complete
 # test of every code path. Just a sanity check. The unit tests are meant to
 # do that sort of testing.
-# An example endpoint (pass as arg to this script): http://localhost:5000/servicex
+# An example endpoint (pass as arg to this script):
+# http://localhost:5000
 import sys
-import servicex
+from servicex import ServiceXDataset
+from servicex.servicex_adaptor import ServiceXAdaptor
+from typing import Optional


-def run_query(endpoint: str) -> None:
-    r = servicex.get_data("(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')", "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00", servicex_endpoint=endpoint)
+def run_query(endpoint: Optional[ServiceXAdaptor]) -> None:
+    ds = ServiceXDataset(
+        "mc16_13TeV:mc16_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.deriv.DAOD_STDM3.e3601_e5984_s3126_r10201_r10210_p3975_tid20425969_00",  # NOQA
+        max_workers=100,
+        servicex_adaptor=endpoint)
+
+    r = ds.get_data_rootfiles("(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')")  # NOQA
     print(r)


 if __name__ == '__main__':
-    assert len(sys.argv) == 2
-    run_query(sys.argv[1])
+    # import logging
+    # ch = logging.StreamHandler()
+    # ch.setLevel(logging.DEBUG)
+    # logging.getLogger('servicex').setLevel(logging.DEBUG)
+    # logging.getLogger('servicex').addHandler(ch)
+    servicex_adaptor = ServiceXAdaptor(sys.argv[1]) if len(sys.argv) >= 2 else None
+    run_query(servicex_adaptor)
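
A short follow-on sketch of reading the files the updated script downloads, assuming the uproot 3 API of the era; the dataset name is a placeholder, and the `analysis` tree and `JetPt` branch come from the query in the script above. Illustrative only, not part of this commit.

```python
# Minimal sketch: open the ROOT files returned by get_data_rootfiles()
# with uproot (uproot 3 API assumed). The dataset name is a placeholder.
import uproot

from servicex import ServiceXDataset

query = "(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')"

ds = ServiceXDataset("mc16_13TeV:placeholder_dataset_name")
files = ds.get_data_rootfiles(query)  # list of locally downloaded pathlib.Path objects

for path in files:
    tree = uproot.open(str(path))["analysis"]  # tree name set by the query
    jet_pt = tree.array("JetPt")               # jet pT values in GeV
    print(path.name, len(jet_pt))
```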
