Commit a734c25

Merge pull request #90613 from AedinC/OSDOCS-13195
[OSDOCS#13195]:Updated incident management section of RACI doc with more info.
2 parents a8be1d8 + f004b5f commit a734c25

2 files changed, +12 -3 lines changed

modules/rosa-policy-incident.adoc

Lines changed: 7 additions & 3 deletions
@@ -102,20 +102,24 @@ Platform audit logs are securely forwarded to a centralized security information

 [id="rosa-policy-incident-management_{context}"]
 == Incident management
-An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
+An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services, and can affect service-level agreements (SLAs).
+
+Customers and Customer Experience and Engagement (CEE) members can raise an incident through a support case. The centralized monitoring and alerting system and members of the SRE team can also raise an incident directly.

 Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].

+Red{nbsp}Hat either sends out cluster notifications to affected individual clusters or changes the status at link:https://status.redhat.com[status.redhat.com] to reflect a wider incident. Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by SRE.
+
 When managing a new incident, Red{nbsp}Hat uses the following general workflow:

 . An SRE first responder is alerted to a new incident and begins an initial investigation.
 . After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
-. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
+. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates. If the status of a service changes or if Red{nbsp}Hat has a significant update on the progress, then the incident lead sends out an updated cluster notification.
 . The incident is recovered.
 . The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
 . An RCA draft document will be shared with the customer within 7 business days of the incident.

-Red{nbsp}Hat also assists with customer incidents raised through support cases.
+Red{nbsp}Hat also assists with customer incidents raised through support cases.
 Red{nbsp}Hat can assist with activities including but not limited to:

 * Forensic gathering, including isolating virtual compute

rosa_architecture/rosa_policy_service_definition/rosa-policy-responsibility-matrix.adoc

Lines changed: 5 additions & 0 deletions
@@ -33,6 +33,11 @@ include::modules/managed-cluster-notification-policy.adoc[leveloffset=+2]
 //---

 include::modules/rosa-policy-incident.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../../rosa_cluster_admin/rosa-cluster-notifications.adoc#rosa-cluster-notifications[Cluster notifications]
+
 include::modules/rosa-policy-change-management.adoc[leveloffset=+1]

 [role="_additional-resources"]
