TELCODOCS-2171#Generalize Day2Ops Troubleshooting #96100

Open: wants to merge 5 commits into base: main
14 changes: 7 additions & 7 deletions _topic_maps/_topic_map.yml
Original file line number Diff line number Diff line change
@@ -3517,19 +3517,19 @@ Topics:
Dir: troubleshooting
Topics:
- Name: Troubleshooting and maintaining telco core CNF clusters
File: telco-troubleshooting-intro
File: troubleshooting-intro
- Name: General troubleshooting
File: telco-troubleshooting-general-troubleshooting
File: troubleshooting-general-troubleshooting
- Name: Cluster maintenance
File: telco-troubleshooting-cluster-maintenance
File: troubleshooting-cluster-maintenance
- Name: Security
File: telco-troubleshooting-security
File: troubleshooting-security
- Name: Certificate maintenance
File: telco-troubleshooting-cert-maintenance
File: troubleshooting-cert-maintenance
- Name: Machine Config Operator
File: telco-troubleshooting-mco
File: troubleshooting-mco
- Name: Bare-metal node maintenance
File: telco-troubleshooting-bmn-maintenance
File: troubleshooting-bmn-maintenance
- Name: Observability
Dir: observability
Topics:

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,29 +1,29 @@
:_mod-docs-content-type: ASSEMBLY
[id="telco-troubleshooting-bmn-maintenance"]
[id="troubleshooting-bmn-maintenance"]
= Bare-metal node maintenance
include::_attributes/common-attributes.adoc[]
:context: telco-troubleshooting-bmn-maintenance
:context: troubleshooting-bmn-maintenance

toc::[]

You can connect to a node for general troubleshooting.
However, in some cases, you need to perform troubleshooting or maintenance tasks on certain hardware components.
This section discusses topics that you need to perform that hardware maintenance.
This section discusses topics that you need to perform for hardware maintenance.

include::modules/telco-troubleshooting-bmn-connect-to-node.adoc[leveloffset=+1]
include::modules/telco-troubleshooting-bmn-move-apps-to-pods.adoc[leveloffset=+1]
include::modules/troubleshooting-bmn-connect-to-node.adoc[leveloffset=+1]
include::modules/troubleshooting-bmn-move-apps-to-pods.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working_nodes-nodes-working[Working with nodes]

include::modules/telco-troubleshooting-bmn-replace-dimm.adoc[leveloffset=+1]
include::modules/troubleshooting-bmn-replace-dimm.adoc[leveloffset=+1]
include::modules/troubleshooting-bmn-replace-disk.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../storage/index.adoc#storage-overview_storage-overview[{product-title} storage overview]

include::modules/telco-troubleshooting-bmn-replace-disk.adoc[leveloffset=+1]
include::modules/telco-troubleshooting-bmn-replace-nw-card.adoc[leveloffset=+1]
include::modules/troubleshooting-bmn-replace-nw-card.adoc[leveloffset=+1]
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
:_mod-docs-content-type: ASSEMBLY
[id="telco-troubleshooting-cert-maintenance"]
[id="troubleshooting-cert-maintenance"]
= Certificate maintenance
include::_attributes/common-attributes.adoc[]
:context: telco-troubleshooting-cert-maintenance
:context: troubleshooting-cert-maintenance

toc::[]

@@ -14,22 +14,22 @@ Learn about certificates in {product-title} and how to maintain them by using th
* link:https://access.redhat.com/solutions/5018231[Which OpenShift certificates do rotate automatically and which do not in Openshift 4.x?]
* link:https://access.redhat.com/solutions/7000968[Checking etcd certificate expiry in OpenShift 4]

include::modules/telco-troubleshooting-certs-manual.adoc[leveloffset=+1]
include::modules/telco-troubleshooting-certs-manual-proxy.adoc[leveloffset=+2]
include::modules/troubleshooting-certs-manual.adoc[leveloffset=+1]
include::modules/troubleshooting-certs-manual-proxy.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* xref:../../../security/certificate_types_descriptions/proxy-certificates.adoc#cert-types-proxy-certificates[Proxy certificates]

include::modules/telco-troubleshooting-certs-manual-user-provisioned.adoc[leveloffset=+2]
include::modules/troubleshooting-certs-manual-user-provisioned.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* xref:../../../security/certificate_types_descriptions/user-provided-certificates-for-api-server.adoc#cert-types-user-provided-certificates-for-the-api-server[User-provisioned certificates for the API server]

include::modules/telco-troubleshooting-certs-auto.adoc[leveloffset=+1]
include::modules/troubleshooting-certs-auto.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
@@ -44,21 +44,21 @@ include::modules/telco-troubleshooting-certs-auto.adoc[leveloffset=+1]
* xref:../../../security/certificate_types_descriptions/control-plane-certificates.adoc#cert-types-control-plane-certificates_cert-types-control-plane-certificates[Control plane certificates]
* xref:../../../security/certificate_types_descriptions/ingress-certificates.adoc#cert-types-ingress-certificates_cert-types-ingress-certificates[Ingress certificates]

include::modules/telco-troubleshooting-certs-auto-etcd.adoc[leveloffset=+2]
include::modules/troubleshooting-certs-auto-etcd.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* xref:../../../security/certificate_types_descriptions/etcd-certificates.adoc#cert-types-etcd-certificates_cert-types-etcd-certificates[etcd certificates]

include::modules/telco-troubleshooting-certs-auto-node.adoc[leveloffset=+2]
include::modules/troubleshooting-certs-auto-node.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* xref:../../../security/certificate_types_descriptions/node-certificates.adoc#cert-types-node-certificates_cert-types-node-certificates[Node certificates]

include::modules/telco-troubleshooting-certs-auto-service-ca.adoc[leveloffset=+2]
include::modules/troubleshooting-certs-auto-service-ca.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
:_mod-docs-content-type: ASSEMBLY
[id="troubleshooting-cluster-maintenance"]
= Cluster maintenance
include::_attributes/common-attributes.adoc[]
:context: troubleshooting-cluster-maintenance

toc::[]

When using bare-metal deployments on {product-title}, you must pay more attention to certain configurations which can have a significant impact on cluster stability.
Review comment (Collaborator): 🤖 [error] RedHat.TermsErrors: Use 'bare metal' rather than 'bare-metal'. For more information, see RedHat.TermsErrors.

Reply (Contributor Author): Using hyphen here because it's an adjective

You can troubleshoot more effectively by completing these tasks:

* Monitor for failed or failing hardware components
* Periodically check the status of the cluster Operators

[NOTE]
====
For hardware monitoring, contact your hardware vendor to find the appropriate logging tool for your specific hardware.
====

include::modules/troubleshooting-clusters-check-cluster-operators.adoc[leveloffset=+1]
include::modules/troubleshooting-clusters-check-for-failed-pods.adoc[leveloffset=+1]
Original file line number Diff line number Diff line change
@@ -1,28 +1,28 @@
:_mod-docs-content-type: ASSEMBLY
[id="telco-troubleshooting-general-troubleshooting"]
[id="troubleshooting-general-troubleshooting"]
= General troubleshooting
include::_attributes/common-attributes.adoc[]
:context: telco-troubleshooting-general-troubleshooting
:context: troubleshooting-general-troubleshooting

toc::[]

When you encounter a problem, the first step is to find the specific area where the issue is happening.
To narrow down the potential problematic areas, complete one or more tasks:
To narrow down the potential problematic areas, complete one or more of the following tasks:

* Query your cluster
* Check your pod logs
* Debug a pod
* Review events

include::modules/telco-troubleshooting-general-query-cluster.adoc[leveloffset=+1]
include::modules/troubleshooting-general-query-cluster.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../cli_reference/openshift_cli/developer-cli-commands.adoc#oc-get[oc get]
* xref:../../../support/troubleshooting/investigating-pod-issues.adoc#reviewing-pod-status_investigating-pod-issues[Reviewing pod status]

include::modules/telco-troubleshooting-general-check-logs.adoc[leveloffset=+1]
include::modules/troubleshooting-general-check-logs.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
@@ -32,37 +32,37 @@ include::modules/telco-troubleshooting-general-check-logs.adoc[leveloffset=+1]
* xref:../../../support/troubleshooting/investigating-pod-issues.adoc#inspecting-pod-and-container-logs_investigating-pod-issues[Inspecting pod and container logs]


include::modules/telco-troubleshooting-general-describe-pod.adoc[leveloffset=+1]
include::modules/troubleshooting-general-describe-pod.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../cli_reference/openshift_cli/developer-cli-commands.adoc#oc-describe[oc describe]

include::modules/telco-troubleshooting-general-review-events.adoc[leveloffset=+1]
include::modules/troubleshooting-general-review-events.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../security/container_security/security-monitoring.adoc#security-monitoring-events_security-monitoring[Watching cluster events]

include::modules/telco-troubleshooting-general-connect-to-pod.adoc[leveloffset=+1]
include::modules/troubleshooting-general-connect-to-pod.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../cli_reference/openshift_cli/developer-cli-commands.adoc#oc-rsh[oc rsh]
* xref:../../../support/troubleshooting/investigating-pod-issues.adoc#accessing-running-pods_investigating-pod-issues[Accessing running pods]

include::modules/telco-troubleshooting-general-debug-pod.adoc[leveloffset=+1]
include::modules/troubleshooting-general-debug-pod.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../../../cli_reference/openshift_cli/developer-cli-commands.adoc#oc-debug[oc debug]
* xref:../../../support/troubleshooting/investigating-pod-issues.adoc#starting-debug-pods-with-root-access_investigating-pod-issues[Starting debug pods with root access]

include::modules/telco-troubleshooting-general-run-command-on-pod.adoc[leveloffset=+1]
include::modules/troubleshooting-general-run-command-on-pod.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
Original file line number Diff line number Diff line change
@@ -1,24 +1,23 @@
:_mod-docs-content-type: ASSEMBLY
[id="telco-troubleshooting-intro"]
= Troubleshooting and maintaining telco core CNF clusters
[id="troubleshooting-intro"]
= Troubleshooting and maintaining OpenShift clusters
include::_attributes/common-attributes.adoc[]
:context: telco-troubleshooting-intro
:context: troubleshooting-intro

toc::[]

Troubleshooting and maintenance are weekly tasks that can be a challenge if you do not have the tools to reach your goal, whether you want to update a component or investigate an issue.
Part of the challenge is knowing where and how to search for tools and answers.

To maintain and troubleshoot a bare-metal environment where high-bandwidth network throughput is required, see the following procedures.
To maintain and troubleshoot a bare-metal environment with high performance requirements, see the following procedures.

[IMPORTANT]
====
This troubleshooting information is not a reference for configuring {product-title} or developing Cloud-native Network Function (CNF) applications.
This troubleshooting information is not a reference for configuring {product-title} or developing cloud-native applications.

For information about developing CNF applications for telco, see link:https://redhat-best-practices-for-k8s.github.io/guide/[Red Hat Best Practices for Kubernetes].
For information about developing cloud-native applications on {product-title}, see link:https://redhat-best-practices-for-k8s.github.io/guide/[Red Hat Best Practices for Kubernetes].
====

include::modules/telco-troubleshooting-cnfs.adoc[leveloffset=+1]
include::modules/support-getting-support.adoc[leveloffset=+1]
include::modules/support-knowledgebase-about.adoc[leveloffset=+2]
include::modules/support-knowledgebase-search.adoc[leveloffset=+2]
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
:_mod-docs-content-type: ASSEMBLY
[id="telco-troubleshooting-mco"]
[id="troubleshooting-mco"]
= Machine Config Operator
include::_attributes/common-attributes.adoc[]
:context: telco-troubleshooting-mco
:context: troubleshooting-mco

toc::[]

@@ -12,9 +12,7 @@ The Machine Config Operator differentiates between different groups of nodes in
These groups of nodes run worker or application pods, which are called `MachineConfigPool` (`mcp`) groups.
The same machine config is applied on all nodes or only on one MCP in the cluster.

For more information about how and why to apply MCPs in a telco core cluster, see xref:../../../edge_computing/day_2_core_cnf_clusters/updating/telco-update-ocp-update-prep.adoc#telco-update-applying-mcp-labels-to-nodes-before-the-update_ocp-update-prep[Applying MachineConfigPool labels to nodes before the update].

For more information about the Machine Config Operator, see xref:../../../operators/operator-reference.adoc#machine-config-operator_cluster-operators-ref[Machine Config Operator].

include::modules/telco-troubleshooting-mco-purpose.adoc[leveloffset=+1]
include::modules/telco-troubleshooting-mco-apply-several-mcs.adoc[leveloffset=+1]
include::modules/troubleshooting-mco-purpose.adoc[leveloffset=+1]
include::modules/troubleshooting-mco-apply-several-mcs.adoc[leveloffset=+1]
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
:_mod-docs-content-type: ASSEMBLY
[id="telco-troubleshooting-security"]
[id="troubleshooting-security"]
= Security
include::_attributes/common-attributes.adoc[]
:context: telco-troubleshooting-security
:context: troubleshooting-security

toc::[]

Implementing a robust cluster security profile is important for building resilient telco networks.
Implementing a robust cluster security profile is important for building resilient environments.

include::modules/telco-troubleshooting-security-authentication.adoc[leveloffset=+1]
include::modules/troubleshooting-security-authentication.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
11 changes: 0 additions & 11 deletions modules/telco-troubleshooting-cnfs.adoc

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/troubleshooting-bmn-maintenance.adoc

:_mod-docs-content-type: PROCEDURE
[id="telco-troubleshooting-bmn-connect-to-node_{context}"]
[id="troubleshooting-bmn-connect-to-node_{context}"]
= Connecting to a bare-metal node in your cluster

You can connect to bare-metal cluster nodes for general maintenance tasks.
@@ -15,9 +15,9 @@ Configuring the cluster node from the host operating system is not recommended o

To troubleshoot your nodes, you can do the following tasks:

* Retrieve logs from node
* Retrieve logs from a node
* Use debugging
* Use SSH to connect to the node
* Use SSH to connect to a node

[IMPORTANT]
====
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/troubleshooting-bmn-maintenance.adoc

:_mod-docs-content-type: PROCEDURE
[id="telco-troubleshooting-bmn-move-apps-to-pods_{context}"]
[id="troubleshooting-bmn-move-apps-to-pods_{context}"]
= Moving applications to pods within the cluster

For scheduled hardware maintenance, you need to consider how to move your application pods to other nodes within the cluster without affecting the pod workload.
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/troubleshooting-bmn-maintenance.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-troubleshooting-bmn-replace-dimm_{context}"]
[id="troubleshooting-bmn-replace-dimm_{context}"]
= DIMM memory replacement

Dual in-line memory module (DIMM) problems sometimes only appear after a server reboots.
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
// Module included in the following assemblies:
//
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/telco-troubleshooting-bmn-maintenance.adoc
// * edge_computing/day_2_core_cnf_clusters/troubleshooting/troubleshooting-bmn-maintenance.adoc

:_mod-docs-content-type: CONCEPT
[id="telco-troubleshooting-bmn-replace-disk_{context}"]
[id="troubleshooting-bmn-replace-disk_{context}"]
= Disk replacement

If you do not have disk redundancy configured on your node through hardware or software redundant array of independent disks (RAID), you need to check the following: