Commit d342cc5

Re-organize contents
1 parent 41137a0 commit d342cc5

File tree

4 files changed

+162
-140
lines changed

docs/connect_servicex.rst

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
Connecting to ServiceX
2+
======================
3+
4+
You need a `ServiceX endpoint <select-endpoint_>`_ where the transformation runs and
5+
a `client library <client-installation_>`_ to submit a transformation request.
6+
7+
.. _select-endpoint:
8+
Selecting a ServiceX endpoint
9+
-----------------------------
10+
11+
ServiceX is a hosted service. Each ServiceX instance is deployed on a server
12+
and dedicated to a specific experiment, so the instances you can connect to
13+
depend on which experiment you work in. Some can be reached from
14+
the outside world, while others are accessible only from a Jupyter notebook running
15+
inside the analysis facility.
16+
17+
.. list-table::
18+
    :widths: 20 40 40
19+
    :header-rows: 1
20+
21+
    * - Collaboration
22+
      - Name
23+
      - URL
24+
    * - ATLAS
25+
      - Chicago Analysis Facility
26+
      - `<https://servicex.af.uchicago.edu/>`_
27+
    * - CMS
28+
      - Coffea-Casa Nebraska
29+
      - `<https://coffea.casa/hub>`_
30+
    * - CMS
31+
      - FNAL Elastic Analysis Facility
32+
      - `<https://servicex.apps.okddev.fnal.gov>`_
33+
34+
35+
For ServiceX endpoints that can be reached from the outside, e.g. the ATLAS Chicago
36+
Analysis Facility, follow the steps below to download a ServiceX access file.
37+
38+
Click on the **Sign-in** button in the upper right-hand corner. You will be asked
39+
to authenticate via GlobusAuth and complete a registration form. Once this form is submitted,
40+
it will be reviewed by SSL staff. You will receive an email upon approval.
41+
42+
At this point you may return to the ServiceX page. Click on your name in the
43+
upper right-hand corner and then select the **Profile** tab. Click on the download
44+
button to have a ``servicex.yaml`` file generated with your access token and
45+
downloaded to your computer.
46+
47+
.. image:: img/download-servicex-yaml.jpg
48+
:alt: Download button
49+
50+
51+
ServiceX Access File
52+
~~~~~~~~~~~~~
53+
54+
The client relies on a ``servicex.yaml`` file to obtain the URLs of different
55+
ServiceX deployments, as well as tokens to authenticate with the
56+
service. The format of this file is as follows:
57+
58+
.. code:: yaml
59+
60+
    api_endpoints:
      - endpoint: https://servicex.af.uchicago.edu
61+
        name: servicex-uc-af
62+
        token: <YOUR TOKEN>
63+
64+
    cache_path: /tmp/ServiceX_Client/cache-dir
65+
    shortened_downloaded_filename: true
66+
67+
The cache database and downloaded files will be stored in the directory
68+
specified by ``cache_path``.
69+
70+
The ``shortened_downloaded_filename`` property controls whether
71+
downloaded files will have their names shortened for convenience.
72+
Setting it to ``false`` preserves the full filename from the dataset.
73+
74+
The client library will search for this file in the current working directory
75+
and then start looking in parent directories until a file is found.
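The search order described above can be sketched in a few lines of Python. This is only an illustration of the lookup rule (current directory first, then each parent in turn), not the client library's actual implementation; the function name `find_config` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "servicex.yaml") -> Optional[Path]:
    """Return the nearest `filename` at or above `start`, or None if absent.

    Hypothetical helper: it mirrors the documented search order only.
    """
    start = start.resolve()
    # Check the starting directory first, then walk up through every parent.
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_config(Path.cwd())` returns the first matching file found on the way up to the filesystem root, which is why a ``servicex.yaml`` placed in a project root is also picked up from its subdirectories.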
76+
77+
78+
.. _client-installation:
79+
ServiceX Client Installation
80+
----------------------------
81+
The ServiceX client is a Python library that lets users communicate
82+
with the ServiceX backend (or server) to make transformation requests and handle
83+
the outputs.
84+
85+
86+
Prerequisites
87+
~~~~~~~~~~~~~
88+
89+
- Python 3.8 or above
90+
- Access to a ServiceX endpoint (for members of the ATLAS or CMS collaborations)
91+
92+
Installation
93+
~~~~~~~~~~~~
94+
95+
.. code-block:: bash
96+
97+
    pip install servicex
98+
99+
You're all set to make your ServiceX transformation request!

docs/index.rst

Lines changed: 5 additions & 118 deletions
Original file line numberDiff line numberDiff line change
@@ -11,51 +11,11 @@ The High Luminosity Large Hadron Collider (HL-LHC) faces enormous computational
1111
structure due to high pileup conditions. The ATLAS and CMS experiments will record ~ 10 times as
1212
much data from ~ 100 times as many collisions as were used to discover the Higgs boson.
1313

14-
15-
Columnar data delivery
16-
----------------------
17-
18-
ServiceX seeks to enable on-demand data delivery of columnar data in a variety of formats for
19-
physics analyses. It provides a uniform backend to data storage services, ensuring the user doesn't
20-
have to know how or where the data is stored, and is capable of on-the-fly data transformations
21-
into a variety of formats (ROOT files, Arrow arrays, Parquet files, ...) The service offers
22-
preprocessing functionality via an analysis description language called
23-
`func-adl <https://pypi.org/project/func-adl/>`_ that allows users to filter events, request columns,
24-
and even compute new variables. This enables the user to start from any format and extract only the
25-
data needed for an analysis.
14+
ServiceX is a scalable data extraction, transformation and delivery system deployed in a Kubernetes cluster.
2615

2716
.. image:: img/organize2.png
28-
:alt: Organization
29-
30-
ServiceX is designed to feed columns to a user running an analysis (e.g. via
31-
`Awkward <https://github.com/scikit-hep/awkward-array>`_ or
32-
`Coffea <https://github.com/CoffeaTeam/coffea>`_ tools) based on the results of a query designed by
33-
the user.
34-
35-
Connecting to ServiceX
36-
----------------------
37-
ServiceX is a hosted service. Depending on which experiment you work in, there are different
38-
instances you can connect to. Some can be connected to from the outside world, while others are
39-
accessible only from a Jupyter notebook running inside the analysis facility.
40-
41-
.. list-table::
42-
:widths: 20 40 40
43-
:header-rows: 1
44-
45-
* - Collaboration
46-
- Name
47-
- URL
48-
* - ATLAS
49-
- Chicago Analysis Facility
50-
- `<https://servicex.af.uchicago.edu/>`_
51-
* - CMS
52-
- Coffea-Casa Nebraska
53-
- `<https://coffea.casa/hub>`_
54-
* - CMS
55-
- FNAL Elastic Analysis Facility
56-
- `<https://servicex.apps.okddev.fnal.gov>`_
57-
58-
Follow the links to learn how to enable an account and launch a Jupyter notebook.
17+
:alt: organize
18+
5919

6020
Concepts
6121
--------
@@ -95,91 +55,18 @@ Local Cache
9555
ServiceX maintains a local cache of the results of queries. This cache can be used to avoid
9656
re-running queries that have already been executed.
9757

98-
Specify a Request
99-
-----------------
100-
Transform requests are specified with a General section, one or more Sample specifications, and
101-
optionally one or more definitions which are substituted into the Sample specifications.
102-
103-
These requests can be defined as:
104-
105-
1. A YAML file
106-
2. A Python dictionary
107-
3. Typed python objects
108-
109-
Regardless of how the request is specified, the request is submitted to ServiceX using the
110-
``deliver`` function, which returns either a list of URLs or a list of local file paths.
111-
112-
The General Section
113-
^^^^^^^^^^^^^^^^^^^
114-
The General section of the request includes the following fields:
115-
116-
* OutputFormat: Can be ``root-ttree`` or ``parquet``
117-
* Delivery: Can be ``URLs`` or ``LocalCache``
118-
119-
The Sample Sections
120-
^^^^^^^^^^^^^^^^^^^
121-
Each Sample section represents a single query to be executed. It includes the following fields:
122-
123-
* Name: A title for this sample.
124-
* RucioDID: A Rucio Dataset Identifier
125-
* XRootDFiles: A list of files to be processed without using Rucio. You must use either ``RucioDID`` or ``XRootDFiles`` but not both.
126-
* NFiles: An optional limit on the number of files to process
127-
* Query: The query to be executed. This can be a func-adl query, a Python function, or a dictionary of uproot selections.
128-
* IgnoreLocalCache: If set to true, don't use a local cache for this sample and always submit to ServiceX
129-
130-
The Definitions Sections
131-
^^^^^^^^^^^^^^^^^^^^^^^^
132-
The Definitions section is a dictionary of values that can be substituted into fields in the Sample
133-
sections. This is useful for defining common values that are used in multiple samples.
134-
135-
136-
Configuration
137-
-------------
138-
139-
The client relies on a YAML file to obtain the URLs of different
140-
servicex deployments, as well as tokens to authenticate with the
141-
service. The file should be named ``.servicex`` and the format of this
142-
file is as follows:
143-
144-
.. code:: yaml
145-
146-
api_endpoints:
147-
- endpoint: http://localhost:5000
148-
name: localhost
149-
150-
- endpoint: https://servicex-release-testing-4.servicex.ssl-hep.org
151-
name: testing4
152-
token: ...
153-
154-
default_endpoint: testing4
155-
156-
cache_path: /tmp/ServiceX_Client/cache-dir
157-
shortened_downloaded_filename: true
158-
159-
The ``default_endpoint`` will be used if otherwise not specified. The
160-
cache database and downloaded files will be stored in the directory
161-
specified by ``cache_path``.
162-
163-
The ``shortened_downloaded_filename`` property controls whether
164-
downloaded files will have their names shortened for convenience.
165-
Setting to false preserves the full filename from the dataset. \`
16658

167-
The library will search for this file in the current working directory
168-
and then start looking in parent directories until a file is found.
16959

17060
.. toctree::
17161
:maxdepth: 2
17262
:caption: Contents:
17363

174-
installation
64+
connect_servicex
17565
query_types
66+
transform_request
17667
examples
177-
databinder
17868
command_line
179-
getting_started
180-
transformer_matrix
18169
contribute
182-
troubleshoot
18370
about
18471
modules
18572
Github <https://github.com/ssl-hep/ServiceX_frontend>

docs/query_types.rst

Lines changed: 18 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,11 @@
11
Query Types
22
===========
33

4-
ServiceX supports several ways to specify the data to be extracted from a dataset.
5-
The choice of query type depends on dataset format, and the complexity of the selection criteria.
6-
7-
First a note about input data formats. The easiest data type to work with are flat root files that
8-
can be processed by the `uproot library <https://uproot.readthedocs.io/en/latest/index.html#documentation>`_.
9-
Examples of this file format would be CMS NanoAOD, ATLAS PHYSLITE, and group n-tuple files.
10-
11-
Other datasets require the experiment's C++ framework to make sense of the data. For example, ATLAS
12-
xAOD, and CMS MiniAOD. For these datasets, ServiceX converts the query into C++ script that is acutally
13-
executed by the experiment framework.
4+
ServiceX queries can be expressed using a number of query languages.
5+
The queries are translated to actual code in the ServiceX *codegens*.
6+
Not all query languages support all potential input data formats,
7+
so once you have determined what input data you need to manipulate,
8+
you can decide what query language to express your query in.
149

1510
This table summarizes the query types supported by ServiceX and the data formats they can be used with.
1611

@@ -109,6 +104,19 @@ Each dictionary either has a ``treename`` key (indicating that it is a query on
109104

110105
.. _TTree.arrays(): https://uproot.readthedocs.io/en/latest/uproot.behaviors.TTree.TTree.html#arrays
111106

107+
108+
FuncADL Query Type
109+
------------------
110+
The FuncADL Query type is very powerful. It is based on functional programming concepts and allows
111+
the user to specify complex queries in a very compact form. The query is written in a functional
112+
style, with a series of functions that are applied to the data in sequence. The query is written
113+
as a string or as typed Python objects. Depending on the source file format, the query is translated
114+
into C++ `EventLoop <https://atlassoftwaredocs.web.cern.ch/analysis-software/AnalysisTools/el_intro/>`_
115+
code or uproot Python code.
116+
117+
Full documentation on the func-adl query language can be found at this `JupyterBook <https://gordonwatts.github.io/xaod_usage/intro.html>`_.
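To make the "series of functions applied in sequence" idea concrete, here is a small plain-Python sketch. The `Stream` class is a hypothetical stand-in written for this illustration and is not the func-adl API; it only borrows the `Where`/`Select` naming style to show how a filter step and a projection step chain together.

```python
from typing import Callable, Iterable


class Stream:
    """Hypothetical stand-in for a query object; NOT part of func-adl."""

    def __init__(self, items: Iterable):
        self._items = items

    def Where(self, predicate: Callable) -> "Stream":
        # Keep only the items for which the predicate is true.
        return Stream(x for x in self._items if predicate(x))

    def Select(self, transform: Callable) -> "Stream":
        # Replace every item with the result of the transform.
        return Stream(transform(x) for x in self._items)

    def value(self) -> list:
        # Evaluate the lazily built chain and collect the results.
        return list(self._items)


# Made-up events for illustration only.
events = [{"el_pt": 12.0}, {"el_pt": 35.5}, {"el_pt": 48.1}]

selected = (
    Stream(events)
    .Where(lambda e: e["el_pt"] > 30)   # filter events
    .Select(lambda e: e["el_pt"])       # request a single column
    .value()
)
```

Each call returns a new object, so the query reads top to bottom as a pipeline. In ServiceX the equivalent chain is not evaluated locally but translated into C++ or uproot code, as described above.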
118+
119+
112120
Python Function Query Type
113121
--------------------------
114122
This query type is the most flexible for extracting data from an uproot compatible dataset.
@@ -126,15 +134,3 @@ for each array. If a single awkward array is returned, it is stored in the tree
126134
        with uproot.open({input_filenames: "reco"}) as o:
127135
            br = o.arrays("el_pt_NOSYS")
128136
        return br
129-
130-
131-
FuncADL Query Type
132-
------------------
133-
The FuncADL Query type is very powerful. It is based on functional programming concepts and allows
134-
the user to specify complex queries in a very compact form. The query is written in a functional
135-
style, with a series of functions that are applied to the data in sequence. The query is written
136-
in a string or as typed python objects. Depending on the source file format, the query is translated
137-
into C++ `EventLoop <https://atlassoftwaredocs.web.cern.ch/analysis-software/AnalysisTools/el_intro/>`_
138-
code, or uproot python code.
139-
140-
Full documentation on the func-adl query language can be found at this `JupyterBook <https://gordonwatts.github.io/xaod_usage/intro.html>`_.

docs/transform_request.rst

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
Transformation Request
2+
======================
3+
4+
Specify a Request
5+
-----------------
6+
Transform requests are specified with one or more Sample specifications, and
7+
optionally a General section and one or more definitions which are substituted
8+
into the Sample specifications.
9+
10+
These requests can be defined as:
11+
12+
1. A YAML file
13+
2. A Python dictionary
14+
3. Typed Python objects
15+
16+
Regardless of how the request is specified, the request is submitted to ServiceX using the
17+
``deliver`` function, which returns either a list of URLs or a list of local file paths.
18+
19+
20+
The Sample Sections
21+
^^^^^^^^^^^^^^^^^^^
22+
Each Sample section represents a single query to be executed. It includes the following fields:
23+
24+
* ``Name``: A title for this sample.
25+
* ``Dataset``: A Rucio dataset or a list of files accessed via XRootD
26+
* ``Query``: The query to be executed. This can be a func-adl query, a Python function, or a dictionary of uproot selections.
27+
* (Optional) ``NFiles``: Limit on the number of files to process
28+
* (Optional) ``IgnoreLocalCache``: If set to true, don't use a local cache for this sample and always submit to ServiceX
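As a rough sketch of the Python-dictionary form, a Sample entry could be laid out as below. The field names come from the list above; the values are placeholders, and `check_sample` is a hypothetical helper written for this example (it is not part of the ServiceX client) that only verifies the three required fields are present.

```python
REQUIRED_SAMPLE_FIELDS = ("Name", "Dataset", "Query")


def check_sample(sample: dict) -> list:
    """Return a list of problems in one Sample entry (empty if none).

    Hypothetical helper: it only checks the required fields listed above.
    """
    return [
        f"missing required field: {field}"
        for field in REQUIRED_SAMPLE_FIELDS
        if field not in sample
    ]


sample = {
    "Name": "example_sample",          # placeholder title
    "Dataset": "placeholder-dataset",  # placeholder, not a real dataset
    "Query": "placeholder-query",      # placeholder, not a real query
    "NFiles": 1,                       # optional: limit the files processed
}

problems = check_sample(sample)
```

A check like this runs locally before anything is submitted, so a missing ``Dataset`` or ``Query`` is caught without contacting the server.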
29+
30+
The General Section
31+
^^^^^^^^^^^^^^^^^^^
32+
The General section of the request includes the following fields:
33+
34+
* (Optional) ``OutputFormat``: Can be ``root-ttree`` (default) or ``parquet``
35+
* (Optional) ``Delivery``: Can be ``URLs`` or ``LocalCache`` (default)
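The defaults listed above can be illustrated with a short sketch that fills in missing General fields and rejects values outside the documented options. `resolve_general` is a hypothetical helper for illustration only; the client library handles this internally and may do so differently.

```python
from typing import Optional

# Defaults and allowed values as documented in the list above.
GENERAL_DEFAULTS = {"OutputFormat": "root-ttree", "Delivery": "LocalCache"}
ALLOWED_VALUES = {
    "OutputFormat": {"root-ttree", "parquet"},
    "Delivery": {"URLs", "LocalCache"},
}


def resolve_general(general: Optional[dict] = None) -> dict:
    """Merge user-supplied General fields over the documented defaults."""
    resolved = {**GENERAL_DEFAULTS, **(general or {})}
    for key, allowed in ALLOWED_VALUES.items():
        if resolved[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    return resolved
```

For example, `resolve_general({"OutputFormat": "parquet"})` keeps the default ``LocalCache`` delivery while switching the output format, and an omitted General section resolves to both defaults.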
36+
37+
The Definitions Sections
38+
^^^^^^^^^^^^^^^^^^^^^^^^
39+
The Definitions section is a dictionary of values that can be substituted into fields in the Sample
40+
sections. This is useful for defining common values that are used in multiple samples.
