Commit 787e1ab

TELCODOCS-2005: Add topics to Telco Day 2 Troubleshooting and Maintenance
1 parent 4c5df75 commit 787e1ab

20 files changed

+483
-8
lines changed

_topic_maps/_topic_map.yml

Lines changed: 8 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -3444,14 +3444,14 @@ Topics:
34443444
File: telco-troubleshooting-general-troubleshooting
34453445
- Name: Cluster maintenance
34463446
File: telco-troubleshooting-cluster-maintenance
3447-
# - Name: Security
3448-
# File: telco-troubleshooting-security
3449-
# - Name: Certificate maintenance
3450-
# File: telco-troubleshooting-cert-maintenance
3451-
# - Name: Machine Config Operator
3452-
# File: telco-troubleshooting-mco
3453-
# - Name: Bare-metal node maintenance
3454-
# File: telco-troubleshooting-bmn-maintenance
3447+
- Name: Security
3448+
File: telco-troubleshooting-security
3449+
- Name: Certificate maintenance
3450+
File: telco-troubleshooting-cert-maintenance
3451+
- Name: Machine Config Operator
3452+
File: telco-troubleshooting-mco
3453+
- Name: Bare-metal node maintenance
3454+
File: telco-troubleshooting-bmn-maintenance
34553455
---
34563456
Name: Specialized hardware and driver enablement
34573457
Dir: hardware_enablement
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
:_mod-docs-content-type: ASSEMBLY
2+
[id="telco-troubleshooting-bmn-maintenance"]
3+
= Bare-metal node maintenance
4+
include::_attributes/common-attributes.adoc[]
5+
:context: telco-troubleshooting-bmn-maintenance
6+
7+
toc::[]
8+
9+
You can connect to a node for general troubleshooting.
10+
However, in some cases, you need to perform troubleshooting or maintenance tasks on certain hardware components.
11+
This section describes the tasks that you need to perform for that hardware maintenance.
12+
13+
include::modules/telco-troubleshooting-bmn-connect-to-node.adoc[leveloffset=+1]
14+
include::modules/telco-troubleshooting-bmn-move-apps-to-pods.adoc[leveloffset=+1]
15+
16+
[role="_additional-resources"]
17+
.Additional resources
18+
19+
* xref:../../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working_nodes-nodes-working[Working with nodes]
20+
21+
include::modules/telco-troubleshooting-bmn-replace-dimm.adoc[leveloffset=+1]
22+
23+
[role="_additional-resources"]
24+
.Additional resources
25+
26+
* xref:../../../storage/index.adoc#storage-overview_storage-overview[{product-title} storage overview]
27+
28+
include::modules/telco-troubleshooting-bmn-replace-disk.adoc[leveloffset=+1]
29+
include::modules/telco-troubleshooting-bmn-replace-nw-card.adoc[leveloffset=+1]
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
:_mod-docs-content-type: ASSEMBLY
2+
[id="telco-troubleshooting-cert-maintenance"]
3+
= Certificate maintenance
4+
include::_attributes/common-attributes.adoc[]
5+
:context: telco-troubleshooting-cert-maintenance
6+
7+
toc::[]
8+
9+
Certificate maintenance is required for continuous cluster authentication.
10+
As a cluster administrator, you must manually renew certain certificates, while others are automatically renewed by the cluster.
11+
12+
Learn about certificates in {product-title} and how to maintain them by using the following resources:
13+
14+
* link:https://access.redhat.com/solutions/5018231[Which OpenShift certificates do rotate automatically and which do not in Openshift 4.x?]
15+
* link:https://access.redhat.com/solutions/7000968[Checking etcd certificate expiry in OpenShift 4]
16+
17+
include::modules/telco-troubleshooting-certs-manual.adoc[leveloffset=+1]
18+
include::modules/telco-troubleshooting-certs-manual-proxy.adoc[leveloffset=+2]
19+
20+
[role="_additional-resources"]
21+
.Additional resources
22+
23+
* xref:../../../security/certificate_types_descriptions/proxy-certificates.adoc#cert-types-proxy-certificates[Proxy certificates]
24+
25+
include::modules/telco-troubleshooting-certs-manual-user-provisioned.adoc[leveloffset=+2]
26+
27+
[role="_additional-resources"]
28+
.Additional resources
29+
30+
* xref:../../../security/certificate_types_descriptions/user-provided-certificates-for-api-server.adoc#cert-types-user-provided-certificates-for-the-api-server[User-provisioned certificates for the API server]
31+
32+
include::modules/telco-troubleshooting-certs-auto.adoc[leveloffset=+1]
33+
34+
[role="_additional-resources"]
35+
.Additional resources
36+
37+
* xref:../../../security/certificate_types_descriptions/service-ca-certificates.adoc#cert-types-service-ca-certificates_cert-types-service-ca-certificates[Service CA certificates]
38+
* xref:../../../security/certificate_types_descriptions/node-certificates.adoc#cert-types-node-certificates_cert-types-node-certificates[Node certificates]
39+
* xref:../../../security/certificate_types_descriptions/bootstrap-certificates.adoc#cert-types-bootstrap-certificates_cert-types-bootstrap-certificates[Bootstrap certificates]
40+
* xref:../../../security/certificate_types_descriptions/etcd-certificates.adoc#cert-types-etcd-certificates_cert-types-etcd-certificates[etcd certificates]
41+
* xref:../../../security/certificate_types_descriptions/olm-certificates.adoc#cert-types-olm-certificates_cert-types-olm-certificates[OLM certificates]
42+
* xref:../../../security/certificate_types_descriptions/machine-config-operator-certificates.adoc#cert-types-machine-config-operator-certificates_cert-types-machine-config-operator-certificates[Machine Config Operator certificates]
43+
* xref:../../../security/certificate_types_descriptions/monitoring-and-cluster-logging-operator-component-certificates.adoc#cert-types-monitoring-and-cluster-logging-operator-component-certificates_cert-types-monitoring-and-cluster-logging-operator-component-certificates[Monitoring and cluster logging Operator component certificates]
44+
* xref:../../../security/certificate_types_descriptions/control-plane-certificates.adoc#cert-types-control-plane-certificates_cert-types-control-plane-certificates[Control plane certificates]
45+
* xref:../../../security/certificate_types_descriptions/ingress-certificates.adoc#cert-types-ingress-certificates_cert-types-ingress-certificates[Ingress certificates]
46+
47+
include::modules/telco-troubleshooting-certs-auto-etcd.adoc[leveloffset=+2]
48+
49+
[role="_additional-resources"]
50+
.Additional resources
51+
52+
* xref:../../../security/certificate_types_descriptions/etcd-certificates.adoc#cert-types-etcd-certificates_cert-types-etcd-certificates[etcd certificates]
53+
54+
include::modules/telco-troubleshooting-certs-auto-node.adoc[leveloffset=+2]
55+
56+
[role="_additional-resources"]
57+
.Additional resources
58+
59+
* xref:../../../security/certificate_types_descriptions/node-certificates.adoc#cert-types-node-certificates_cert-types-node-certificates[Node certificates]
60+
61+
include::modules/telco-troubleshooting-certs-auto-service-ca.adoc[leveloffset=+2]
62+
63+
[role="_additional-resources"]
64+
.Additional resources
65+
66+
* xref:../../../security/certificate_types_descriptions/service-ca-certificates.adoc#cert-types-service-ca-certificates_cert-types-service-ca-certificates[Service CA certificates]
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
:_mod-docs-content-type: ASSEMBLY
2+
[id="telco-troubleshooting-mco"]
3+
= Machine Config Operator
4+
include::_attributes/common-attributes.adoc[]
5+
:context: telco-troubleshooting-mco
6+
7+
toc::[]
8+
9+
The Machine Config Operator provides useful information to cluster administrators and controls what is running directly on the bare-metal host.
10+
11+
The Machine Config Operator differentiates between different groups of nodes in the cluster, allowing control plane nodes and worker nodes to run with different configurations.
12+
These groups of nodes, which run worker or application pods, are called `MachineConfigPool` (`mcp`) groups.
13+
You can apply the same machine config to all nodes or to only one MCP in the cluster.
14+
15+
For more information about how and why to apply MCPs in a telco core cluster, see xref:../../../edge_computing/day_2_core_cnf_clusters/updating/telco-update-ocp-update-prep.adoc#telco-update-applying-mcp-labels-to-nodes-before-the-update_ocp-update-prep[Applying MachineConfigPool labels to nodes before the update].
16+
17+
For more information about the Machine Config Operator, see xref:../../../operators/operator-reference.adoc#machine-config-operator_cluster-operators-ref[Machine Config Operator].
18+
19+
include::modules/telco-troubleshooting-mco-purpose.adoc[leveloffset=+1]
20+
include::modules/telco-troubleshooting-mco-apply-several-mcs.adoc[leveloffset=+1]
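To make the relationship between machine config pools, machine configs, and nodes more concrete, the following is a minimal sketch rather than part of this commit. The `run` and `show_mcp_state` helpers and the pool name are hypothetical. The sketch also assumes that nodes in a pool carry a `node-role.kubernetes.io/<pool>` label, which is the usual convention but might not match a custom node selector. It prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default (DRY_RUN=1); set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper, not part of oc.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List the pools, the machine configs, and the nodes labeled for one pool.
# Assumption: the pool name matches a node-role.kubernetes.io/<pool> label.
show_mcp_state() {
  local pool="$1"
  run oc get machineconfigpools
  run oc get machineconfigs
  run oc get nodes -l "node-role.kubernetes.io/$pool"
}
```

For example, `show_mcp_state worker` prints the three commands for the default worker pool.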
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
:_mod-docs-content-type: ASSEMBLY
2+
[id="telco-troubleshooting-security"]
3+
= Security
4+
include::_attributes/common-attributes.adoc[]
5+
:context: telco-troubleshooting-security
6+
7+
toc::[]
8+
9+
Implementing a robust cluster security profile is important for building resilient telco networks.
10+
11+
include::modules/telco-troubleshooting-security-authentication.adoc[leveloffset=+1]
12+
13+
[role="_additional-resources"]
14+
.Additional resources
15+
16+
* xref:../../../authentication/understanding-identity-provider.adoc#supported-identity-providers[Supported identity providers]
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="telco-troubleshooting-bmn-connect-to-node_{context}"]
7+
= Connecting to a bare-metal node in your cluster
8+
9+
You can connect to bare-metal cluster nodes for general maintenance tasks.
10+
11+
[NOTE]
12+
====
13+
Configuring the cluster node from the host operating system is not recommended or supported.
14+
====
15+
16+
To troubleshoot your nodes, you can do the following tasks:
17+
18+
* Retrieve logs from the node
19+
* Use debugging
20+
* Use SSH to connect to the node
21+
22+
[IMPORTANT]
23+
====
24+
Use SSH only if you cannot connect to the node with the `oc debug` command.
25+
====
26+
27+
.Procedure
28+
29+
. Retrieve the logs from a node by running the following command:
30+
+
31+
[source,terminal]
32+
----
33+
$ oc adm node-logs <node_name> -u crio
34+
----
35+
36+
. Use debugging by running the following command:
37+
+
38+
[source,terminal]
39+
----
40+
$ oc debug node/<node_name>
41+
----
42+
43+
. Set `/host` as the root directory within the debug shell. The debug pod mounts the host’s root file system in `/host` within the pod. By changing the root directory to `/host`, you can run binaries contained in the host’s executable paths:
44+
+
45+
--
46+
[source,terminal]
47+
----
48+
# chroot /host
49+
----
50+
51+
.Output
52+
[source,terminal]
53+
----
54+
You are now logged in as root on the node
55+
----
56+
--
57+
58+
. Optional: Use SSH to connect to the node by running the following command:
59+
+
60+
[source,terminal]
61+
----
62+
$ ssh core@<node_name>
63+
----
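The log retrieval and debug steps above can also be combined into one non-interactive check. The following is a sketch under stated assumptions: `run` and `inspect_node` are hypothetical helper names, `uptime` stands in for whatever host command you want to run, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default (DRY_RUN=1); set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper, not part of oc.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Collect CRI-O logs, then run a single command on the host through a debug pod.
# Passing a command after "--" avoids opening an interactive debug shell.
inspect_node() {
  local node="$1"
  run oc adm node-logs "$node" -u crio
  run oc debug "node/$node" -- chroot /host uptime
}
```

For example, `inspect_node worker-0` prints both commands for a node named `worker-0`, which is a placeholder name.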
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="telco-troubleshooting-bmn-move-apps-to-pods_{context}"]
7+
= Moving applications to pods within the cluster
8+
9+
For scheduled hardware maintenance, you need to consider how to move your application pods to other nodes within the cluster without affecting the pod workload.
10+
11+
.Procedure
12+
13+
* Mark the node as unschedulable by running the following command:
14+
+
15+
[source,terminal]
16+
----
17+
$ oc adm cordon <node_name>
18+
----
19+
20+
When the node is unschedulable, no new pods can be scheduled on the node.
21+
For more information, see "Working with nodes".
22+
23+
[NOTE]
24+
====
25+
When moving CNF applications, you might need to verify ahead of time that the cluster has enough additional worker nodes to satisfy anti-affinity rules and pod disruption budgets.
26+
====
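The module covers only the cordon step. A possible fuller maintenance flow is sketched below. It is an illustration, not the procedure from this commit: the `run`, `drain_for_maintenance`, and `return_to_service` helpers are hypothetical, and the drain flags are assumptions that you must check against what your workloads tolerate. The script prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default (DRY_RUN=1); set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper, not part of oc.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Review pod disruption budgets, stop new scheduling, then evict the pods.
# Assumption: the workloads tolerate these drain flags.
drain_for_maintenance() {
  local node="$1"
  run oc get poddisruptionbudgets --all-namespaces
  run oc adm cordon "$node"
  run oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Make the node schedulable again after the maintenance window.
return_to_service() {
  run oc adm uncordon "$1"
}
```

Reviewing the pod disruption budgets first matches the note above: a drain can block if evicting a pod would violate a budget and no other node can host the replacement.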
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="telco-troubleshooting-bmn-replace-dimm_{context}"]
7+
= DIMM memory replacement
8+
9+
Dual in-line memory module (DIMM) problems sometimes only appear after a server reboots.
10+
You can check the log files for these problems.
11+
12+
When you perform a standard reboot and the server does not start, the console can display a message that reports a faulty DIMM.
13+
In that case, you can acknowledge the faulty DIMM and continue rebooting if the remaining memory is sufficient.
14+
Then, you can schedule a maintenance window to replace the faulty DIMM.
15+
16+
Sometimes, a message in the event logs indicates a bad memory module.
17+
In these cases, you can schedule the memory replacement before the server is rebooted.
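One way to look for such messages is to filter the kernel log of the node. The following is a rough sketch: `filter_memory_errors` is a hypothetical helper, and the search patterns are assumptions based on common Linux kernel messages. Your hardware vendor's tooling or the baseboard management controller event log might report the fault differently.

```shell
#!/usr/bin/env bash
# Keep only the kernel log lines on stdin that look like memory faults.
# Assumption: EDAC and machine-check messages are the relevant markers.
filter_memory_errors() {
  grep -i -E 'EDAC|Hardware Error|Machine check' || true
}
```

For example, with `node` set to the node name, you might feed it the kernel journal from a debug pod: `oc debug "node/$node" -- chroot /host journalctl -k --no-pager | filter_memory_errors`. An empty result does not prove that the memory is healthy.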
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="telco-troubleshooting-bmn-replace-disk_{context}"]
7+
= Disk replacement
8+
9+
If you do not have disk redundancy configured on your node through hardware or software redundant array of independent disks (RAID), you need to check the following:
10+
11+
* Does the disk contain running pod images?
12+
* Does the disk contain persistent data for pods?
13+
14+
For more information, see "{product-title} storage overview" in _Storage_.
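The two questions above can be approached with a few read-only commands. This sketch is illustrative only: `run` and `check_disk_usage` are hypothetical helpers, and it assumes that container image storage sits under `/var/lib/containers`, which is the usual CRI-O location but might differ on your nodes. The script prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands by default (DRY_RUN=1); set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper, not part of oc.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Show which block device backs which mount point, which pods run on the node,
# and which persistent volumes exist in the cluster.
check_disk_usage() {
  local node="$1"
  run oc debug "node/$node" -- chroot /host lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
  run oc get pods --all-namespaces --field-selector "spec.nodeName=$node" -o wide
  run oc get pv
}
```

Compare the mount points in the `lsblk` output with the disk that you plan to replace. If that disk backs `/var/lib/containers` or a local persistent volume, plan for image re-pulls or data migration.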
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="telco-troubleshooting-bmn-replace-nw-card_{context}"]
7+
= Cluster network card replacement
8+
9+
When you replace a network card, the MAC address changes.
10+
The MAC address can be part of the DHCP or SR-IOV Operator configuration, router configuration, firewall rules, or application Cloud-native Network Function (CNF) configuration.
11+
Before you bring a node back online after replacing a network card, you must verify that these configurations are up to date.
12+
13+
[IMPORTANT]
14+
====
15+
If you do not have specific procedures for MAC address changes within the network, contact your network administrator or network hardware vendor.
16+
====
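To see exactly which MAC addresses changed, you can record them before and after the replacement. The following is a minimal sketch: `list_macs` is a hypothetical helper that expects the one-line-per-interface output of `ip -o link show` on standard input.

```shell
#!/usr/bin/env bash
# Print "<interface> <mac>" for every Ethernet interface in `ip -o link show` output.
list_macs() {
  awk '{
    name = $2
    sub(/:$/, "", name)            # drop the trailing colon from the interface name
    for (i = 3; i < NF; i++) {
      if ($i == "link/ether") print name, $(i + 1)
    }
  }'
}
```

For example, with `node` set to the node name, run `oc debug "node/$node" -- chroot /host ip -o link show | list_macs > macs-before.txt` before the maintenance window and write to a second file afterward. A `diff` of the two files then lists the addresses to update in DHCP, SR-IOV, router, firewall, or CNF configuration.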
