|
5 | 5 | [id="policy-incident_{context}"]
|
6 | 6 | = Incident and operations management
|
7 | 7 |
|
8 |
| -This documentation details the Red Hat responsibilities for the {product-title} managed service. |
| 8 | +This documentation details the Red{nbsp}Hat responsibilities for the {product-title} managed service. |
9 | 9 | The cloud provider is responsible for protecting the hardware infrastructure that runs the services offered by the cloud provider.
|
10 | 10 | The customer is responsible for incident and operations management of customer application data and any custom networking the customer has configured for the cluster network or virtual network.
|
11 | 11 |
|
12 | 12 | [id="platform-monitoring_{context}"]
|
13 | 13 | == Platform monitoring
|
14 |
| -A Red Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all {product-title} cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Monitoring) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted. |
| 14 | +A Red{nbsp}Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all {product-title} cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Management) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted. |
15 | 15 |
|
16 | 16 | [id="incident-management_{context}"]
|
17 | 17 | == Incident management
|
18 |
| -An incident is an event that results in a degradation or outage of one or more Red Hat services. An incident can be raised by a customer or Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team. |
| 18 | +An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services. |
| 19 | + |
| 20 | +An incident can be raised by a customer or Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team. |
19 | 21 |
|
20 | 22 | Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].
|
21 | 23 |
|
22 |
| -The general workflow of how a new incident is managed by Red Hat: |
| 24 | +When managing a new incident, Red{nbsp}Hat uses the following general workflow: |
23 | 25 |
|
24 | 26 | . An SRE first responder is alerted to a new incident, and begins an initial investigation.
|
25 | 27 | . After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
|
26 | 28 | . The incident lead manages all communication and coordination around recovery, including any relevant notifications or support case updates.
|
27 |
| -. The incident is recovered. |
28 |
| -. The incident is documented and a root cause analysis is performed within 5 business days of the incident. |
29 |
| -. A root cause analysis (RCA) draft document is shared with the customer within 7 business days of the incident. |
| 29 | +. When the incident is resolved, a brief summary of the incident and its resolution is provided in the customer-initiated support ticket. This summary helps customers understand the incident and its resolution in more detail. |
| 30 | + |
| 31 | +If customers require more information than the support ticket provides, they can request it through the following workflow: |
| 32 | + |
| 33 | +. The customer must make a request for the additional information within 5 business days of the incident resolution. |
| 34 | +. Depending on the severity of the incident, Red{nbsp}Hat might provide customers with a root cause summary or a root cause analysis (RCA) in the support ticket. The additional information is provided within 7 business days of the incident resolution for a root cause summary and within 30 business days of the incident resolution for an RCA. |
| 35 | + |
| 36 | +Red{nbsp}Hat also assists with customer incidents raised through support cases. |
| 37 | +Red{nbsp}Hat can assist with activities including but not limited to: |
| 38 | + |
| 39 | +* Forensic gathering, including isolating virtual compute |
| 40 | +* Guiding compute image collection |
| 41 | +* Providing collected audit logs |
30 | 42 |
|
31 | 43 | [id="backup-recovery_{context}"]
|
32 | 44 | == Backup and recovery
|
|