Version 2.0 release
- Works with ServiceX RC2 (will also work with RC1, but move to RC2!)
- Supports caching on the local system
- Big re-work of the API
- Brings back errors from servicex (DID bad, C++ fails, etc.)
- Substantial internal code rework to enable modularity and testing
 - [tcut_to_qastle](https://pypi.org/project/tcut-to-qastle/) (translates `TCut`-like syntax into a `servicex` query - should work for both)

-These libraries are just coming up now, so this list is just an outline.
+## Prerequisites

-# Prerequisites
+Before you can use this library you'll need:

-Before you install this library you'll need:
+- An environment based on Python 3.6 or later
+- A `ServiceX` end-point. For example, `http://localhost:5000/servicex`, if `ServiceX` is running on a local `k8s` cluster and the proper ports are open, or the public servicex instance (contact IRIS-HEP at xxx if you are part of the LHC to request an account, or for help setting up an instance).

-- An environment based on python 3.7 or later
-- A ServiceX end-point. For example, `http://localhost:5000/servicex`.
+### How to access your endpoint

-# Usage
+The `servicex` library searches for configuration information in several locations to determine what end-point it should connect to, in the following order:

-The following lines will return a `pandas.DataFrame` containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
+1. A `.servicex` file in the current working directory
+1. A `.servicex` file in the user's home directory (`$HOME` on Linux and Mac, and your profile
+   directory on Windows).
+1. The `config_defaults.yaml` file distributed with the `servicex` package.
+
+If no endpoint is specified, then the library defaults to the developer endpoint, which is `http://localhost:5000` for the web-service API, and `localhost:9000` for the `minio` endpoint. No passwords are required.
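The lookup order described above can be illustrated with a small sketch. This is not the library's own code: the helper names `candidate_config_files` and `find_user_config` are hypothetical, and the sketch only shows which `.servicex` file would be found first. Whether the real library stops at the first file or layers the files on top of each other is not stated here, so treat that part as an assumption.

```python
from pathlib import Path
from typing import List, Optional


def candidate_config_files(cwd: Path, home: Path) -> List[Path]:
    """Return the `.servicex` locations in the documented search order."""
    return [cwd / ".servicex", home / ".servicex"]


def find_user_config(
    cwd: Optional[Path] = None, home: Optional[Path] = None
) -> Optional[Path]:
    """Return the first `.servicex` file that exists, or None.

    None means no user file was found, so a caller would fall back to the
    packaged defaults (`config_defaults.yaml` in the list above).
    """
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    for candidate in candidate_config_files(cwd, home):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_user_config()
    print(found if found is not None else "no .servicex file; using packaged defaults")
```

Passing `cwd` and `home` explicitly makes the order easy to check against temporary directories without touching your real home directory.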
+
+Create a `.servicex` file, in the `yaml` format, in the appropriate place for your work that contains the following:

+```yaml
+api_endpoint:
+  endpoint: <your-endpoint>
+  email: <api-email>
+  password: <api-password>
 ```
-import servicex
+
+All strings are expanded using Python's [`os.path.expandvars`](https://docs.python.org/3/library/os.path.html#os.path.expandvars) function, so `$NAME` and `${NAME}` will work to expand existing environment variables.
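To make the expansion rule concrete, here is a short sketch of what `os.path.expandvars` does to configuration-like strings. The variable name `SERVICEX_DEMO_USER` and the values in `raw_values` are made up for illustration; only the `os.path.expandvars` call itself comes from the text above.

```python
import os

# Hypothetical environment variable, set here only for the demonstration.
os.environ["SERVICEX_DEMO_USER"] = "alice"
# Make sure this one is NOT defined, to show what happens to unknown names.
os.environ.pop("SERVICEX_DEMO_UNSET_VARIABLE", None)

raw_values = {
    "email": "$SERVICEX_DEMO_USER@example.org",
    "cache_path": "/data/${SERVICEX_DEMO_USER}/servicex-cache",
    "password": "$SERVICEX_DEMO_UNSET_VARIABLE",
}

# Both $NAME and ${NAME} are replaced; undefined variables are left untouched.
expanded = {key: os.path.expandvars(value) for key, value in raw_values.items()}

for key, value in expanded.items():
    print(f"{key}: {value}")
```

Note that a reference to an undefined variable is left in place rather than replaced by an empty string, so a typo in a variable name would reach the service as literal text.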
+
+Finally, you can create the objects `ServiceXAdaptor` and `MinioAdaptor` by hand in your code and pass them as arguments to `ServiceXDataset` to inject custom endpoints and credentials, bypassing the configuration system. This is probably only useful for advanced users.
+
+## Usage
+
+The following lines will return a `pandas.DataFrame` containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
-If your query is badly formed or there is an other problem with the backend, an exception will be thrown.
+If your query is badly formed or there is another problem with the backend, an exception will be thrown with information about the error.
+
+If you'd like to be able to submit multiple queries and have them run on the `ServiceX` back end in parallel, it is best to use the `asyncio` interface, which has the identical signature, but is called `get_data_pandas_df_async`.

-If you'd like to be able to submit multiple queries and have them run on the ServiceX back end in parallel, it may be best to use the `asyncio` interface, which has the identical signature, but is called `get_data_async`.
+For documentation of `get_data` and `get_data_async` see the `servicex.py` source file.
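The pattern for running several queries at once with the `asyncio` interface might look like the sketch below. It does not call ServiceX: `fake_get_data_async` is a hypothetical stand-in for an async method such as `get_data_pandas_df_async`, so that the example runs without an end-point. It uses `asyncio.run`, which needs Python 3.7 or later.

```python
import asyncio
from typing import Dict, List


async def fake_get_data_async(dataset: str, delay: float) -> Dict[str, object]:
    """Hypothetical stand-in for an async ServiceX request.

    It only sleeps to imitate waiting on the back end, then returns a
    small record describing the dataset it was asked about.
    """
    await asyncio.sleep(delay)
    return {"dataset": dataset, "rows": len(dataset)}


async def fetch_all(datasets: List[str]) -> List[Dict[str, object]]:
    # Create every request first so they can all be in flight together.
    requests = [fake_get_data_async(name, 0.05) for name in datasets]
    # gather() waits for all of them and keeps the results in input order.
    return await asyncio.gather(*requests)


results = asyncio.run(fetch_all(["ds_a", "ds_bb", "ds_ccc"]))
for record in results:
    print(record)
```

Because the requests overlap, the total wait is close to the slowest single request rather than the sum of all of them.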

-# Features
+## Configuration
+
+As mentioned above, the `.servicex` file is read to pull a configuration. The search path for this file is:
+
+1. Your current working directory
+2. Your home directory
+
+The file can contain an `api_endpoint` as mentioned above. In addition, the following items can be set:
+
+- `cache_path`: Location where queries, data, and a record of queries are written. This should be an absolute path that the person running the library has read/write access to. On Windows, make sure to escape `\`; it is best to follow standard `yaml` conventions and put the path in quotes, especially if it contains a space. This is a top-level yaml item (don't indent it accidentally!). Defaults to `/tmp/servicex` (with the temp directory as appropriate for your platform). Examples:
+- `minio_endpoint`, `minio_username`, `minio_password`: these are only interesting if you are using a pre-RC2 release of `servicex`, when the `minio` information wasn't part of the API exchange. This feature is deprecated and will be removed around the time `servicex` moves to RC3.
+
+All strings are expanded using Python's [`os.path.expandvars`](https://docs.python.org/3/library/os.path.html#os.path.expandvars) function, so `$NAME` and `${NAME}` will work to expand existing environment variables.
+
+## Features

 Implemented:

 - Accepts a `qastle` formatted query
 - Exceptions are used to report back errors of all sorts from the service to the user's code.
-- Data is return as a `pandas.DataFrame` or a `awkward` array (see the `data_type` parameter)
+- Data is returned in the following forms:
+  - `pandas.DataFrame`: an in-process DataFrame of all the data requested
+  - `awkward`: an in-process `JaggedArray` or dictionary of `JaggedArray`s
+  - A list of ROOT files that can be opened with `uproot` and used as desired.
+  - Not all output formats are compatible with all transformations.
 - Complete returned data must fit in the process' memory
-- Run in an async or a non-async environment and non-async methods will accomodate automatically (including `jupyter` notebooks).
-- Support up to 100 simultanious queries from a laptop-like front end without overwhelming the local machine (hopefully ServiceX will be overwhelmed!)
+- Run in an async or a non-async environment and non-async methods will accommodate automatically (including `jupyter` notebooks).
+- Support up to 100 simultaneous queries from a laptop-like front end without overwhelming the local machine (hopefully ServiceX will be overwhelmed!)
 - Start downloading files as soon as they are ready (before ServiceX is done with the complete transform).
+- It has been tested to run against 100 datasets with multiple simultaneous queries.
+- It supports local caching of query data
+- It will provide feedback on progress.
+- Configuration files are supported so that user identification information does not have to be checked
+  into repositories.

-Comming:
-
-- Data is returned as a list of ROOT files located in a specified directory
-- Make it easy to submit the same query for 100 different datasets
-
-# Testing
+## Testing

 This code has been tested in several environments:

 - Windows, Linux, MacOS
 - Python 3.6, 3.7, 3.8
-  - 3.8.0 and 3.8.1 only. Unfortunately, 3.8.2 has caused `nest_asyncio` to fail. Until that package is updated we are stuck at 3.8.1.
+| Fetch query data from ServiceX matching `selection_query` and return it as
+| dictionary of awkward arrays, an entry for each column. The data is uniquely
+| ordered (the same query will always return the same order).
+```
+
+Each data type comes in a pair - an `async` version and a synchronous version.
+
+- `get_data_awkward`, `get_data_awkward_async` - Returns a dictionary of the requested data as `numpy` or `JaggedArray` objects.
+- `get_data_rootfiles`, `get_data_rootfiles_async` - Returns a list of locally downloaded files (as `pathlib.Path` objects) containing the requested data. Suitable for opening with [`ROOT::TFile`](https://root.cern.ch/doc/master/classTFile.html) or [`uproot`](https://github.com/scikit-hep/uproot).
+- `get_data_pandas_df`, `get_data_pandas_df_async` - Returns the data as a `pandas` `DataFrame`. This will fail if the data you've requested has any structure (e.g. is hierarchical, like a single entry for each event, and each event may have some number of jets).
+- `get_data_parquet`, `get_data_parquet_async` - Returns a list of locally downloaded files that can be read by any parquet tools.
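One way to picture the sync/async pairing is a synchronous method that drives its async twin to completion. The sketch below is an assumption about the general pattern, not the library's implementation: `DemoDataset`, `get_data_async`, and `get_data` are hypothetical names, and the simple `asyncio.run` call used here does not work when an event loop is already running (for example inside a notebook), a case the real library says it handles.

```python
import asyncio
from typing import List


class DemoDataset:
    """Hypothetical stand-in used for illustration; not `ServiceXDataset`."""

    def __init__(self, rows: List[float]):
        self._rows = rows

    async def get_data_async(self, scale: float) -> List[float]:
        # The async variant does the actual work and can be awaited
        # alongside other requests.
        await asyncio.sleep(0)  # placeholder for network I/O
        return [value * scale for value in self._rows]

    def get_data(self, scale: float) -> List[float]:
        # The synchronous variant keeps the same signature and runs the
        # async one to completion. asyncio.run() requires that no event
        # loop is already running in the current thread.
        return asyncio.run(self.get_data_async(scale))


dataset = DemoDataset([1.0, 2.0, 3.5])
print(dataset.get_data(2.0))
```

Keeping the two signatures identical means code can move from the blocking call to the awaitable one by changing only the method name and adding `await`.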
+
+## Development

 For any changes please feel free to submit pull requests!

 To do development please set up your environment with the following steps:

 1. A python 3.7 development environment
-1. Pull down this package, XX
+1. Fork/Pull down this package, XX
 1. `python -m pip install -e .[test]`
 1. Run the tests to make sure everything is good: `pytest`.