modules/rosa-policy-incident.adoc
Lines changed: 7 additions & 3 deletions
@@ -102,20 +102,24 @@ Platform audit logs are securely forwarded to a centralized security information
 
 [id="rosa-policy-incident-management_{context}"]
 == Incident management
-An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
+An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services, and can affect service-level agreements (SLAs).
+
+Customers and Customer Experience and Engagement (CEE) members can raise an incident through a support case. The centralized monitoring and alerting system and members of the SRE team can also raise an incident directly.
 
 Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].
 
+Red{nbsp}Hat either sends out cluster notifications to affected individual clusters or changes the status at link:https://status.redhat.com[status.redhat.com] to reflect a wider incident. Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by SRE.
+
 When managing a new incident, Red{nbsp}Hat uses the following general workflow:
 
 . An SRE first responder is alerted to a new incident and begins an initial investigation.
 . After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
-. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
+. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates. If the status of a service changes or if Red{nbsp}Hat has a significant update on the progress, then the incident lead sends out an updated cluster notification.
 . The incident is recovered.
 . The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
 . An RCA draft document will be shared with the customer within 7 business days of the incident.
 
-Red{nbsp}Hat also assists with customer incidents raised through support cases.
+Red{nbsp}Hat also assists with customer incidents raised through support cases.
 Red{nbsp}Hat can assist with activities including but not limited to:
 
 * Forensic gathering, including isolating virtual compute