Commit 4a43937

Merge pull request #92709 from AedinC/OSDOCS-14468
[OSDOCS-14468]:Updated content for Incident Management in ROSA and OSD docs
2 parents 1755ddd + 84efa4e commit 4a43937

2 files changed (+28 -15 lines changed)

modules/policy-incident.adoc

Lines changed: 19 additions & 7 deletions
@@ -5,28 +5,40 @@
 [id="policy-incident_{context}"]
 = Incident and operations management
 
-This documentation details the Red Hat responsibilities for the {product-title} managed service.
+This documentation details the Red{nbsp}Hat responsibilities for the {product-title} managed service.
 The cloud provider is responsible for protecting the hardware infrastructure that runs the services offered by the cloud provider.
 The customer is responsible for incident and operations management of customer application data and any custom networking the customer has configured for the cluster network or virtual network.
 
 [id="platform-monitoring_{context}"]
 == Platform monitoring
-A Red Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all {product-title} cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Monitoring) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.
+A Red{nbsp}Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all {product-title} cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Monitoring) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.
 
 [id="incident-management_{context}"]
 == Incident management
-An incident is an event that results in a degradation or outage of one or more Red Hat services. An incident can be raised by a customer or Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
+An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services.
+
+An incident can be raised by a customer or Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
 
 Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].
 
-The general workflow of how a new incident is managed by Red Hat:
+When managing a new incident, Red{nbsp}Hat uses the following general workflow:
 
 . An SRE first responder is alerted to a new incident, and begins an initial investigation.
 . After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
 . The incident lead manages all communication and coordination around recovery, including any relevant notifications or support case updates.
-. The incident is recovered.
-. The incident is documented and a root cause analysis is performed within 5 business days of the incident.
-. A root cause analysis (RCA) draft document is shared with the customer within 7 business days of the incident.
+. When the incident is resolved a brief summary of the incident and resolution are provided in the customer-initiated support ticket. This summary helps the customers understand the incident and its resolution in more detail.
+
+If customers require more information in addition to what is provided in the support ticket, they can request the following workflow:
+
+. The customer must make a request for the additional information within 5 business days of the incident resolution.
+. Depending on the severity of the incident, Red{nbsp}Hat may provide customers with a root cause summary, or a root cause analysis (RCA) in the support ticket. The additional information will be provided within 7 business days for root cause summary and 30 business days for root cause analysis from the incident resolution.
+
+Red{nbsp}Hat also assists with customer incidents raised through support cases.
+Red{nbsp}Hat can assist with activities including but not limited to:
+
+* Forensic gathering, including isolating virtual compute
+* Guiding compute image collection
+* Providing collected audit logs
 
 [id="backup-recovery_{context}"]
 == Backup and recovery

modules/rosa-policy-incident.adoc

Lines changed: 9 additions & 8 deletions
@@ -102,22 +102,23 @@ Platform audit logs are securely forwarded to a centralized security information
 
 [id="rosa-policy-incident-management_{context}"]
 == Incident management
-An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services, and can affect service-level agreements (SLAs).
+An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services.
 
-Customers and Customer Experience and Engagement (CEE) members can raise an incident through a support case. The centralized monitoring and alerting system and members of the SRE team can also raise an incident directly.
+An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
 
 Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].
 
-Red{nbsp}Hat either sends out cluster notifications to affected individual clusters or changes the status at link:https://status.redhat.com[status.redhat.com] to reflect a wider incident. Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by SRE.
-
 When managing a new incident, Red{nbsp}Hat uses the following general workflow:
 
 . An SRE first responder is alerted to a new incident and begins an initial investigation.
 . After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
-. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates. If the status of a service changes or if Red{nbsp}Hat has a significant update on the progress, then the incident lead sends out an updated cluster notification.
-. The incident is recovered.
-. The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
-. An RCA draft document will be shared with the customer within 7 business days of the incident.
+. The incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
+. When the incident is resolved a brief summary of the incident and resolution are provided in the customer-initiated support ticket. This summary helps the customers understand the incident and its resolution in more detail.
+
+If customers require more information in addition to what is provided in the support ticket, they can request the following workflow:
+
+. The customer must make a request for the additional information within 5 business days of the incident resolution.
+. Depending on the severity of the incident, Red{nbsp}Hat may provide customers with a root cause summary, or a root cause analysis (RCA) in the support ticket. The additional information will be provided within 7 business days for root cause summary and 30 business days for root cause analysis from the incident resolution.
 
 Red{nbsp}Hat also assists with customer incidents raised through support cases.
 Red{nbsp}Hat can assist with activities including but not limited to:
