
Commit ca77c4c

V2.14-email-notifications (#459)
* get-jobs-example
* default-storage-class-sample
* default-storage-class-sample
* sso-changes
* researcher-auth-for-sso
* [RUN-11536] update whats new with version content
* [RUN-11536] add RUN-10575 description.
* add email messaging
* [RUN-10087] add email notifications

---------

Co-authored-by: Yaron <yaron@run.ai>
1 parent 912eb7d commit ca77c4c


8 files changed

+307
-43
lines changed

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
1+
# Email notifications
2+
3+
The notifications service listens to events on the Kubernetes cluster and sends notifications about those events by email. The service can be configured to send the notifications to one or more preconfigured email addresses, or to the email address of the user who submitted the workload.
4+
5+
Note: To send notifications dynamically to the user who submitted the workload, that user must be logged in to the Run:ai UI or CLI.
6+
7+
The service can also be configured using a regular expression to send notifications only for specific namespaces on the cluster. This enables notification only for specific Run:ai projects. The default configuration sends notifications for all the namespaces starting with `runai-`.
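The namespace filter described above can be illustrated with a minimal Python sketch. It assumes the pattern must match the whole namespace name; the service's actual matching semantics are not stated here, and the function name `namespace_is_relevant` is hypothetical.

```python
import re

# Default pattern quoted in the text above.
DEFAULT_NAMESPACE_PATTERN = r"runai-.*"


def namespace_is_relevant(namespace: str,
                          pattern: str = DEFAULT_NAMESPACE_PATTERN) -> bool:
    """Return True when the whole namespace name matches the pattern.

    Assumption: full-string matching. The real service may anchor the
    regular expression differently.
    """
    return re.fullmatch(pattern, namespace) is not None


# Restricting notifications to two specific projects (hypothetical names).
TEAM_PATTERN = r"runai-(team-a|team-b)"
```

With the default pattern, `runai-team-a` is accepted and `kube-system` is rejected. Passing a narrower pattern such as `TEAM_PATTERN` limits notifications to the listed projects.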
8+
9+
## Prerequisites
10+
11+
1. The service must be installed on each cluster used with Run:ai. It is installed separately from the Run:ai cluster installation, using a separate Helm chart.
12+
2. As part of the installation, you must provide your SMTP server address and its credentials.
13+
14+
## Available notifications
15+
16+
Configure the notifications service to send events using the relevant `kind` and event `reason`.
17+
The following Run:ai notifications are available:
18+
19+
|Event|Kind|Reason|Description|Additional info|
20+
|:----|:----|:----|:----|:----|
21+
|Pod scheduled|`Pod`|`Scheduled`|a pod was scheduled on a node|Pod, Job, Project, Namespace, User|
22+
|Pod evicted|`PodGroup`|`Evict`|a pod was evicted to make room for another pod with higher priority, or to reclaim resources that belong to another project or department|Pod, Job, Project, Namespace, User|
23+
|Pod unschedulable|`Pod`|`Unschedulable`|a pod was determined as unschedulable and couldn't be scheduled on any node in the cluster| Pod, Job, Project, Namespace, User|
24+
|Failed scheduling pod|`Pod`|`FailedScheduling`|binding a pod to a node failed| Pod, Job, Project, Namespace, User|
25+
26+
!!! Tip
27+
You can configure the notifications service to send event messages about additional Kubernetes events using the relevant `kind` and event `reason`.
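As a rough illustration of how a `kind` and `reason` pair selects events, the following Python sketch filters events against an allowlist. The `Event` type and `should_notify` function are hypothetical names introduced for this example; they are not part of the notifications service.

```python
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical, simplified view of a Kubernetes event."""
    kind: str
    reason: str
    namespace: str


# Mirrors the kind/reason pairs listed in the table above.
RELEVANT_OBJECTS: Dict[str, List[str]] = {
    "Pod": ["Scheduled", "Unschedulable", "FailedScheduling"],
    "PodGroup": ["Evict"],
}


def should_notify(event: Event,
                  relevant: Dict[str, List[str]] = RELEVANT_OBJECTS) -> bool:
    """Return True when the event's reason is allowed for its kind."""
    return event.reason in relevant.get(event.kind, [])
```

Adding another Kubernetes event would then amount to adding its `kind` and `reason` to the mapping.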
28+
<!--
29+
The following table shows the expected messages for each event:
30+
31+
|Event| Message |
32+
|--|--|
33+
| Pod scheduled | Successfully assigned `namespace`/`pod` to `node`.|
34+
| Pod evicted | Examples of messages explaining why the pod was evicted: <br /><br />Eviction due to priority within same namespace:<br /> Job `namespace`/`pod` was preempted by a job `namespace`/`pod` which has higher priority.<br /><br />Eviction due to reclaim from queue which is over-quota:<br />Job `namespace`/`pod` was reclaimed by job `namespace`/`podGroup`. The reclaimed project uses `x` GPUs with a quota of `y` GPUs. <br /><br />Eviction for consolidation:<br /> Pod `namespace`/`pod` was removed for bin packing. |
35+
| Pod unschedulable |Message explaining different reasons for scheduler not being able to schedule on different nodes. <br /> (for example "All nodes are unavailable: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: test}. 2 node(s) didn't have enough resource: GPUs. 2 node(s) didn't have enough resource: MilliCPUs.")|
36+
| Failed scheduling pod | The error returned from Kubernetes API server, which usually indicates an error in the scheduler or in the cluster. |
37+
-->
38+
39+
## Installation
40+
41+
Install the notification service using the following commands:
42+
43+
1. Set the helm repo to point to the notification service package using the following command:
44+
45+
```
46+
helm repo add runai-notifications-service https://storage.googleapis.com/runai-notifications-service/
47+
48+
helm repo update
49+
```
50+
51+
2. Check for the latest version using the following command:
52+
53+
```
54+
helm search repo runai-notifications-service
55+
56+
```
57+
58+
3. Install the latest version using the following command:
59+
60+
```
61+
helm install runai-notifications-service/notifications-service --version 0.0.0
62+
```
63+
64+
## Configuration
65+
66+
The notification service is configured using a `configmap` file. Each of the tables below describes one section of that file, and a complete example appears under "Example `configmap` file".
67+
68+
<!-- Need to better understand this.
69+
!!! Note:
70+
You can change the service configuration values after deployment. Edit the config map and then rerun the `helm install` command above with the `-f` flag.
71+
-->
72+
73+
### `service` configuration
74+
75+
This section controls how many events the service handles in parallel and how many it queues. Use the following table to configure options in the `service` section of the `configmap` file.
76+
77+
|Component|Field|Description|Default|
78+
|:----|:----|:----|:----|
79+
|`service`|`service.concurrent_limit`|maximum number of events the service will handle in parallel|50|
80+
|`service`|`service.cached_events`|queue size for events before blocking the listener|1000|
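The interaction between the two settings can be pictured with a small Python sketch: a bounded queue plays the role of `cached_events`, and a fixed number of worker threads plays the role of `concurrent_limit`. This is only an illustration of the described behavior under those assumptions, not the service's implementation. `process_events` and `handle` are hypothetical names.

```python
import queue
import threading
from typing import Callable, Iterable


def process_events(events: Iterable[str],
                   handle: Callable[[str], None],
                   concurrent_limit: int = 50,
                   cached_events: int = 1000) -> None:
    """Handle events with bounded buffering and bounded parallelism."""
    # The buffer holds at most `cached_events` items.
    buffer: queue.Queue = queue.Queue(maxsize=cached_events)
    sentinel = object()  # tells a worker to stop

    def worker() -> None:
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            handle(item)

    # At most `concurrent_limit` events are handled at the same time.
    workers = [threading.Thread(target=worker) for _ in range(concurrent_limit)]
    for thread in workers:
        thread.start()

    for event in events:
        # put() blocks the producer (the "listener") while the buffer is full.
        buffer.put(event)

    for _ in workers:
        buffer.put(sentinel)
    for thread in workers:
        thread.join()
```

A small `cached_events` value makes the listener wait sooner when handlers are slow, while a larger `concurrent_limit` drains the buffer faster at the cost of more parallel work.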
81+
82+
### `listener` configuration
83+
84+
This section defines the objects and events that the service will listen to and send as notifications. Use the following table to configure options in the `listener` section of the `configmap` file.
85+
86+
| Component | Field | Description | Default |
87+
| --- | --- | --- | --- |
88+
| `kubelistener` | `listener.relevant_objects` | white list of Kubernetes components for notifications | `kind: Pod` <br> `reasons: Unschedulable, Scheduled` <br><br> `kind: PodGroup` <br> `reasons: Evict` |
89+
| `kubelistener` | `listener.relevant_namespaces` | white list of namespaces that the service should listen to for events (regex) | `runai-.*` |
90+
91+
### `enrich` configuration
92+
93+
!!! Note
94+
This section of the `configmap` is for internal use only. Keep the default values.
95+
96+
| Component | Field | Default |
97+
| --- | --- | --- |
98+
|`KubeMetadata`|`enrich.kubeMetadata.labels`| `release: workloadDisplayName` <br><br>`training.kubeflow.org/job-name: workloadDisplayName` <br><br>`serving.knative.dev/service: workloadDisplayName` <br><br>`project: project`|
99+
|`KubeMetadata`|`enricher.kubeMetadata.annotations`|`"user": "user"`|
100+
101+
### `notify` configuration
102+
103+
This section defines the notification configuration of the service and contains the details for the SMTP server and the recipients list.
104+
Use the following table to configure options in the `notify` section of the `configmap` file.
105+
106+
|Component|Field|Description|Default|
107+
|--- |--- |--- |--- |
108+
|`Email`|`notify.email.smtp_host` (M)|SMTP server host address|Empty|
109+
|`Email`|`notify.email.smtp_port` (M)|SMTP server port|587|
110+
|`Email`|`notify.email.from_display_name` (M)|email's "From" display name|Run:ai|
111+
|`Email`|`notify.email.from` (M)|a valid domain source email address|<test@run.ai>|
112+
|`Email`|`notify.email.user` (M)|SMTP server user login|user|
113+
|`Email`|`notify.email.password` (M)|SMTP server user's password |password|
114+
|`Email`|`notify.email.direct_notifications` (together with Recipients)|when set to true, email notifications will be sent dynamically to the user who submitted the workload|false|
115+
|`Email`|`notify.email.recipients` (together with Direct Notifications)|additional email address recipients list for all the events - broadcast|Empty list|
116+
117+
**(M)** = mandatory to include in the `configmap` file.
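To make the relationship between `recipients` and `direct_notifications` concrete, here is a minimal Python sketch using the standard library. It assumes the two options are additive (the broadcast list plus the submitting user) and that the SMTP server accepts STARTTLS on the configured port. Both are assumptions, and `resolve_recipients`, `build_message`, and `send` are hypothetical helper names.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formataddr
from typing import Dict, List, Optional


def resolve_recipients(email_cfg: Dict, submitter: Optional[str]) -> List[str]:
    """Combine the broadcast list with the submitting user, if enabled."""
    recipients = list(email_cfg.get("recipients", []))
    if email_cfg.get("direct_notifications", False) and submitter:
        if submitter not in recipients:
            recipients.append(submitter)
    return recipients


def build_message(email_cfg: Dict, subject: str, body: str,
                  submitter: Optional[str] = None) -> Optional[EmailMessage]:
    """Build the email, or return None when nobody should receive it."""
    to = resolve_recipients(email_cfg, submitter)
    if not to:
        return None
    msg = EmailMessage()
    msg["From"] = formataddr((email_cfg["from_display_name"], email_cfg["from"]))
    msg["To"] = ", ".join(to)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(email_cfg: Dict, msg: EmailMessage) -> None:
    """Send through the configured SMTP server (assumes STARTTLS support)."""
    with smtplib.SMTP(email_cfg["smtp_host"], email_cfg["smtp_port"]) as smtp:
        smtp.starttls()
        smtp.login(email_cfg["user"], email_cfg["password"])
        smtp.send_message(msg)
```

With `direct_notifications` set to `false` and an empty `recipients` list, `build_message` returns `None`, which corresponds to a configuration that sends nothing.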
118+
119+
### Example `configmap` file
120+
121+
The following file is an example of a configmap file for the notification service.
122+
123+
```
service:
  concurrent_limit: 50
  cached_events: 1000
listener:
  relevant_namespaces:
    - runai.*
  relevant_objects:
    - kind: Pod
      reasons:
        - Unschedulable
        - Scheduled
    - kind: PodGroup
      reasons:
        - Evict
enrich:
  kubeMetadata:
    labels:
      "release": "workloadDisplayName"
      "training.kubeflow.org/job-name": "workloadDisplayName"
      "serving.knative.dev/service": "workloadDisplayName"
      "project": "project"
    annotations:
      "user": "user"
notify:
  email:
    template_path: path/email.html # Internal use only.
    from: my@mail.com
    user: smtp_user
    password: smtp_password
    smtp_host: smtp.mail.com
    smtp_port: 587
    from_display_name: Company Name
    direct_notifications: true
    recipients:
      - some@mail.com
```
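Before installing, it can help to check that every field marked **(M)** in the `notify` table is present. The following Python sketch performs that check on a configuration that has already been loaded into a dictionary. Loading the YAML itself is left out, and `missing_mandatory_fields` is a hypothetical helper name.

```python
from typing import Dict, List

# Fields marked (M) in the `notify` table.
MANDATORY_EMAIL_FIELDS = [
    "smtp_host",
    "smtp_port",
    "from_display_name",
    "from",
    "user",
    "password",
]


def missing_mandatory_fields(config: Dict) -> List[str]:
    """Return the dotted names of mandatory fields that are absent or empty."""
    email_cfg = config.get("notify", {}).get("email", {})
    return [
        "notify.email." + name
        for name in MANDATORY_EMAIL_FIELDS
        if email_cfg.get(name) in (None, "")
    ]


# The `notify` section of the example file, expressed as a dictionary.
EXAMPLE_CONFIG = {
    "notify": {
        "email": {
            "from": "my@mail.com",
            "user": "smtp_user",
            "password": "smtp_password",
            "smtp_host": "smtp.mail.com",
            "smtp_port": 587,
            "from_display_name": "Company Name",
            "direct_notifications": True,
            "recipients": ["some@mail.com"],
        }
    }
}
```

An empty result means all mandatory fields are set. Any returned names point at the entries to fill in.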

docs/admin/runai-setup/authentication/researcher-authentication.md

Lines changed: 16 additions & 6 deletions
@@ -55,8 +55,7 @@ Modifying the API Server configuration differs between Kubernetes distributions:
5555
always_pull_images: false
5656
extra_args:
5757
oidc-client-id: runai # (1)
58-
oidc-issuer-url: https://example.com/auth
59-
oidc-username-prefix: "-"
58+
...
6059
6160
```
6261

@@ -69,10 +68,11 @@ Modifying the API Server configuration differs between Kubernetes distributions:
6968

7069
``` YAML title="/etc/rancher/rke2/config.yaml"
7170
kube-apiserver-arg:
72-
- "oidc-client-id=<CLIENT-ID>"
73-
- "oidc-issuer-url=<URL>"
74-
- "oidc-username-prefix=-"
71+
- "oidc-client-id=runai" # (1)
72+
...
7573
```
74+
75+
1. These are example parameters. Copy the actual parameters from `Settings | General | Researcher Authentication` as described above.
7676

7777
If working via the Rancher UI, add the flag as part of the cluster provisioning.
7878

@@ -88,9 +88,19 @@ Modifying the API Server configuration differs between Kubernetes distributions:
8888

8989
Install the [yq](https://github.com/mikefarah/yq){target=_blank} utility and run:
9090

91+
For username-password authentication, run:
92+
93+
```
94+
kubectl get clientconfig default -n kube-public -o yaml > login-config.yaml
95+
yq -i e ".spec +={\"authentication\":[{\"name\":\"oidc\",\"oidc\":{\"clientID\":\"runai\",\"issuerURI\":\"$OIDC_ISSUER_URL\",\"kubectlRedirectURI\":\"http://localhost:8000/callback\",\"userClaim\":\"sub\",\"userPrefix\":\"-\"}}]}" login-config.yaml
96+
kubectl apply -f login-config.yaml
97+
```
98+
99+
For single-sign-on, run:
100+
91101
```
92102
kubectl get clientconfig default -n kube-public -o yaml > login-config.yaml
93-
yq -i e ".spec +={\"authentication\":[{\"name\":\"oidc\",\"oidc\":{\"clientID\":\"$OIDC_CLIENT_ID\",\"issuerURI\":\"$OIDC_ISSUER_URL\",\"kubectlRedirectURI\":\"http://localhost:8000/callback\",\"userClaim\":\"sub\",\"userPrefix\":\"$OIDC_USERNAME_PREFIX\"}}]}" login-config.yaml
103+
yq -i e ".spec +={\"authentication\":[{\"name\":\"oidc\",\"oidc\":{\"clientID\":\"runai\",\"issuerURI\":\"$OIDC_ISSUER_URL\",\"groupClaim\":\"groups\",\"kubectlRedirectURI\":\"http://localhost:8000/callback\",\"userClaim\":\"email\",\"userPrefix\":\"-\"}}]}" login-config.yaml
94104
kubectl apply -f login-config.yaml
95105
```
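The two `yq` expressions differ only in a few fields. The Python sketch below builds the same `authentication` entry for each mode so the difference is easy to see. It only constructs the data structure and does not touch the cluster. The function name `oidc_authentication` is hypothetical.

```python
from typing import Dict, List


def oidc_authentication(issuer_url: str, sso: bool) -> List[Dict]:
    """Build the `authentication` list added to the ClientConfig spec."""
    oidc = {
        "clientID": "runai",
        "issuerURI": issuer_url,
        "kubectlRedirectURI": "http://localhost:8000/callback",
        # Single sign-on identifies users by email; otherwise by subject.
        "userClaim": "email" if sso else "sub",
        "userPrefix": "-",
    }
    if sso:
        # Only the single-sign-on variant maps a group claim.
        oidc["groupClaim"] = "groups"
    return [{"name": "oidc", "oidc": oidc}]
```

Comparing the two results shows that single sign-on changes `userClaim` to `email` and adds `groupClaim`, while everything else stays the same.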
96106

docs/admin/runai-setup/authentication/sso.md

Lines changed: 17 additions & 26 deletions
@@ -2,7 +2,7 @@
22

33
Single Sign-On (SSO) is an authentication scheme that allows a user to log in with a single ID to other, independent, software systems. SSO solves security issues involving multiple user/password data entries, multiple compliance schemes, etc.
44

5-
Run:ai supports SSO using the [SAML 2.0](https://en.wikipedia.org/wiki/Security_Assertion_Markup_Language){target=_blank} protocol and Open ID Connect (OIDC).
5+
Run:ai supports SSO using the [SAML 2.0](https://en.wikipedia.org/wiki/Security_Assertion_Markup_Language){target=_blank} protocol and OpenID Connect ([OIDC](https://openid.net/developers/how-connect-works/){target=_blank}).
66

77
!!! Caution
88
Single sign-on is only available with SaaS installations where the tenant has been created post-January 2022 or any Self-hosted installation of release 2.0.58 or later. If you are using single sign-on with older versions of Run:ai, please contact Run:ai customer support.
@@ -13,8 +13,7 @@ Run:ai supports SSO using the [SAML 2.0](https://en.wikipedia.org/wiki/Security_
1313

1414
## SAML Prerequisites
1515

16-
* **XML Metadata**&mdash;you must have an *XML Metadata file* retrieved from your IdP. Upload the file to a web server such that you will have a URL to the file. The URL must have the *XML* file extension. For example, to connect using Google, you must create a custom SAML App [here](https://admin.google.com/ac/apps/unified){target=_blank}, download the Metadata file, and upload it to a web server.
17-
* **Organization Name**&mdash;you must have a Run:ai *Organization Name*. This is the name that appears on the top right of the Run:ai user interface.
16+
**XML Metadata**&mdash;you must have an *XML Metadata file* retrieved from your IdP. Upload the file to a web server such that you will have a URL to the file. The URL must have the *XML* file extension. For example, to connect using Google, you must create a custom SAML App [here](https://admin.google.com/ac/apps/unified){target=_blank}, download the Metadata file, and upload it to a web server.
1817

1918
## OIDC Prerequisites
2019

@@ -26,15 +25,15 @@ Run:ai supports SSO using the [SAML 2.0](https://en.wikipedia.org/wiki/Security_
2625

2726
You can configure your IdP to map several IdP attributes:
2827

29-
| IdP attribute | Run:ai required name | Description |
28+
| IdP attribute | Default Run:ai name | Description |
3029
|--|--|--|
31-
| User email | email | **(Mandatory)** `e-mail` is the user identifier with Run:ai. |
32-
| User role groups | GROUPS | (Optional) If exists, allows assigning Run:ai role groups via the IdP. The IdP attribute must be of a type of list of strings. See more below |
33-
| Linux User ID | UID (configurable) | (Optional) If exists in IdP, allows Researcher containers to start with the Linux User `UID`. Used to map access to network resources such as file systems to users. The IdP attribute must be of integer type. |
34-
| Linux Group ID | GID (configurable) | (Optional) If exists in IdP, allows Researcher containers to start with the Linux Group `GID`. The IdP attribute must be of integer type. |
35-
| Linux Supplementary Groups | SUPPLEMENTARYGROUPS (configurable) | (Optional) If exists in IdP, allows Researcher containers to start with the relevant Linux supplementary groups. The IdP attribute must be of a type of list of integers. |
36-
| User first name | firstName (configurable)| (Optional) Used as the first name showing in the Run:ai user interface. |
37-
| User last name | lastName (configurable)| (Optional) Used as the last name showing in the Run:ai user interface |
30+
| User email | email (cannot be changed) | **(Mandatory)** `e-mail` is the user identifier with Run:ai. |
31+
| User role groups | GROUPS | (Optional) If exists, allows assigning Run:ai role groups via the IdP. The IdP attribute must be of a type of list of strings. See more below |
32+
| Linux User ID | UID | (Optional) If exists in IdP, allows Researcher containers to start with the Linux User `UID`. Used to map access to network resources such as file systems to users. The IdP attribute must be of integer type. |
33+
| Linux Group ID | GID | (Optional) If exists in IdP, allows Researcher containers to start with the Linux Group `GID`. The IdP attribute must be of integer type. |
34+
| Linux Supplementary Groups | SUPPLEMENTARYGROUPS | (Optional) If exists in IdP, allows Researcher containers to start with the relevant Linux supplementary groups. The IdP attribute must be of a type of list of integers. |
35+
| User first name | firstName | (Optional) Used as the first name showing in the Run:ai user interface. |
36+
| User last name | lastName | (Optional) Used as the last name showing in the Run:ai user interface |
3837

3938
### Example attribute mapping for Google Suite
4039

@@ -54,12 +53,9 @@ You can configure your IdP to map several IdP attributes:
5453
For `Saml 2`:
5554

5655
1. In the `Metadata XML Url` field, enter the URL to the XML Metadata file.
57-
2. In the `GID` field, enter the GID.
58-
3. In the `GROUPS` field, enter the groups.
59-
4. In the `SUPPLEMENTARYGROUPS` field, enter the supplementary groups.
60-
5. In the `UID` field, enter the UID.
61-
6. In the `Logout uri` field, enter the desired URL logout page. If left empty, you will be redirected to the Run:ai portal.
62-
7. Press `Save`.
56+
2. Find your identity provider's attribute names for `GID`, `GROUPS`, `SUPPLEMENTARYGROUPS` and `UID`. If they are not in line with the Run:ai defaults described in the table above, you can change them here.
57+
3. In the `Logout uri` field, enter the desired URL logout page. If left empty, you will be redirected to the Run:ai portal.
58+
4. Press `Save`.
6359

6460
For `Open ID Connect`:
6561

@@ -68,12 +64,9 @@ For `Open ID Connect`:
6864
1. In the `Discovery Document URL` field, enter the URL to the discovery document.
6965
2. In the `Client ID` field, enter the client ID.
7066
3. In the `Client Secret` field, enter the client secret.
71-
4. In the `GID` field, enter the GID.
72-
5. In the `GROUPS` field, enter the groups.
73-
6. In the `SUPPLEMENTARYGROUPS` field, enter the supplementary groups.
74-
7. In the `UID` field, enter the UID.
75-
8. In the `Logout uri` field, enter the desired URL logout page. If left empty, you will be redirected to the Run:ai portal.
76-
9. Press `Save`.
67+
4. Find your identity provider's attribute names for `GID`, `GROUPS`, `SUPPLEMENTARYGROUPS` and `UID`. If they are not in line with the Run:ai defaults described in the table above, you can change them here.
68+
5. In the `Logout uri` field, enter the desired URL logout page. If left empty, you will be redirected to the Run:ai portal.
69+
6. Press `Save`.
7770

7871
Once you press `Save` you will receive a `Redirect URI` and an `Entity ID`. Both values must be set on the IdP side.
7972

@@ -86,12 +79,10 @@ Test Connectivity to Administration User Interface:
8679

8780
* Using an incognito browser tab, open the Run:ai user interface.
8881
* Select the `Login with SSO` button.
89-
* Provide the `Organization name` obtained above.
9082
* You will be redirected to the IdP login page. Use the previously entered *Administrator email* to log in.
9183

9284
### Troubleshooting
93-
94-
The SSO log in can be separated into two parts:
85+
The SSO login can be separated into two parts:
9586

9687
1. Run:ai redirects to the IdP (for example, Google) for login using a *SAML Request*.
9788
2. Upon successful login, IdP redirects back to Run:ai with a *SAML Response*.

docs/admin/runai-setup/self-hosted/k8s/prerequisites.md

Lines changed: 11 additions & 1 deletion
@@ -41,7 +41,17 @@ See Run:ai Cluster prerequisites [Kubernetes](../../cluster-setup/cluster-prereq
4141

4242
The Run:ai control plane operating system prerequisites are identical.
4343

44-
The Run:ai control-plane requires a default storage class to create persistent volume claims for Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior: whether the Run:ai persistent data is saved or deleted when the Run:ai control plane is deleted.
44+
The Run:ai control-plane requires a __default storage class__ to create persistent volume claims for Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior: whether the Run:ai persistent data is saved or deleted when the Run:ai control plane is deleted.
45+
46+
47+
!!! Note
48+
For a simple (nonproduction) storage class example see [Kubernetes Local Storage Class](https://kubernetes.io/docs/concepts/storage/storage-classes/#local){target=_blank}. The storage class will set the directory `/opt/local-path-provisioner` to be used across all nodes as the path for provisioning persistent volumes.
49+
50+
Then set the new storage class as default:
51+
52+
```
53+
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
54+
```
4555

4656
### NVIDIA Prerequisites
4757

docs/developer/rest-auth.md

Lines changed: 14 additions & 10 deletions
@@ -46,16 +46,20 @@ Replace `<COMPANY-URL>` below with `app.run.ai` for SaaS installations (not `<c
4646

4747
=== "Python"
4848
``` python
49-
import http.client
50-
51-
conn = http.client.HTTPSConnection("")
52-
payload = "grant_type=client_credentials&client_id=<APPLICATION-NAME>&client_secret=<CLIENT_SECRET>"
53-
headers = { 'content-type': "application/x-www-form-urlencoded" }
54-
conn.request("POST", "/<COMPANY-URL>/auth/realms/<REALM>/protocol/openid-connect/token", payload, headers)
55-
56-
res = conn.getresponse()
57-
data = res.read()
58-
print(data.decode("utf-8"))
49+
url = <COMPANY-URL> + "/auth/realms/" + <REALM> + "/protocol/openid-connect/token"
50+
51+
payload = 'grant_type=client_credentials&scope=openid&response_type=id_token&client_id=' + <APPLICATION-NAME> + '&client_secret=' + <CLIENT-SECRET>
52+
headers = {
53+
'Content-Type': 'application/x-www-form-urlencoded'
54+
}
55+
56+
response = requests.request("POST", url, headers=headers, data=payload)
57+
if response.status_code //100 == 2:
58+
j = json.loads(response.text)
59+
return j["access_token"]
60+
else:
61+
print(json.dumps(json.loads(response.text), sort_keys=True, indent=4, separators=(",", ": ")))
62+
return
5963
```
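The snippet in the diff is a fragment: it uses `return` outside a function and relies on `requests` and `json` without importing them. A self-contained variant might look like the sketch below. It swaps `requests` for the standard library's `urllib` so that it has no third-party dependency, and it assumes `<COMPANY-URL>` is a bare host name that should be prefixed with `https://`. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Dict, Optional, Tuple


def build_token_request(company_url: str, realm: str, app_name: str,
                        client_secret: str) -> Tuple[str, bytes, Dict[str, str]]:
    """Assemble the URL, form-encoded body, and headers for the token call."""
    # Assumption: company_url is a host such as "app.run.ai", without a scheme.
    url = ("https://" + company_url + "/auth/realms/" + realm
           + "/protocol/openid-connect/token")
    payload = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "scope": "openid",
        "response_type": "id_token",
        "client_id": app_name,
        "client_secret": client_secret,
    }).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return url, payload, headers


def get_access_token(company_url: str, realm: str, app_name: str,
                     client_secret: str) -> Optional[str]:
    """Return the access token, or None if the server rejects the request."""
    url, payload, headers = build_token_request(
        company_url, realm, app_name, client_secret)
    request = urllib.request.Request(
        url, data=payload, headers=headers, method="POST")
    try:
        with urllib.request.urlopen(request) as response:
            body = json.loads(response.read().decode("utf-8"))
            return body["access_token"]
    except urllib.error.HTTPError as err:
        # Print the server's error body to help with troubleshooting.
        print(err.code, err.read().decode("utf-8", errors="replace"))
        return None
```

Using `urlencode` also escapes special characters in the client secret, which plain string concatenation does not.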
6064

6165
### Response
